Hierarchical Hindsight Credit Assignment

A three-level decomposition framework for long-horizon agentic reasoning that achieves state-of-the-art credit assignment accuracy across diverse action hierarchies.

ArXiv: AI | 200 Trajectories | 4 Action Types | 3 Methods Compared
Key Metrics
HHCA framework performance highlights across all experiments
Pearson Correlation: 0.4507 (HHCA credit accuracy, r)
Improvement: 78.3% (over the Outcome-Only baseline)
Transfer Gap: 0.0011 (near-zero cross-task gap)
Trajectories: 200 (across 4 horizon bins)
Key Findings
Five principal discoveries from the HHCA evaluation
Finding 01
Superior Credit Accuracy
HHCA achieves r=0.4507 Pearson and 0.5588 Spearman correlation, substantially outperforming both Outcome-Only (r=0.2526) and Attention-Rollout (r=0.1955) baselines.
Finding 02
Horizon Robustness
Unlike baselines that degrade with longer horizons, HHCA maintains stable Pearson correlation (0.44-0.46) across all horizon bins from 10-step to 100-step trajectories.
Finding 03
Action-Type Sensitivity
HHCA excels at skill-selection credit (r=0.4462) and tool-call credit (r=0.4398), correctly attributing higher importance to strategic decisions over token-level actions.
Finding 04
Cross-Task Generalization
Transfer gap of only 0.0011 between train (r=0.4631) and test (r=0.4620) tasks demonstrates robust generalization without task-specific tuning.
Finding 05
Accuracy-Cost Tradeoff
While HHCA incurs higher computational overhead (5.1 ms per action vs. 0.003 ms for the Outcome-Only baseline), the 78.3% accuracy improvement justifies the cost for applications requiring precise credit assignment.
Credit Accuracy Comparison
Pearson, Spearman, Precision@K, and Recall@K across three methods
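Precision@K and Recall@K are presumably computed from the overlap between the top-K actions ranked by predicted credit and the top-K by ground-truth credit; a minimal sketch under that assumption (the helper name and toy values are illustrative, not from the paper):

```python
import numpy as np

def precision_recall_at_k(predicted, ground_truth, k):
    """Overlap between the top-k actions by predicted vs. ground-truth credit.

    With equal set sizes, precision@k and recall@k coincide; both are
    returned for clarity.
    """
    pred_top = set(np.argsort(predicted)[-k:])      # indices of k largest predicted credits
    true_top = set(np.argsort(ground_truth)[-k:])   # indices of k largest true credits
    overlap = len(pred_top & true_top)
    return overlap / k, overlap / len(true_top)

# Toy example: both rankings agree on the two most important actions
pred = np.array([0.9, 0.1, 0.4, 0.8, 0.2])
truth = np.array([1.0, 0.0, 0.3, 0.7, 0.1])
p, r = precision_recall_at_k(pred, truth, k=2)
# top-2 predicted {0, 3} equals top-2 true {0, 3}, so both scores are 1.0
```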
Horizon Robustness Analysis
Pearson correlation by trajectory length bin
Action Type Analysis
Pearson correlation by action type for each method
Cross-Task Transfer
Train vs. test correlations and transfer gap per method
Method              Train Pearson   Test Pearson   Train P@K   Test P@K   Gap
Outcome-Only        0.2561          0.2463         0.2514      0.2219     0.0098
Attention-Rollout   0.1856          0.2029         0.2466      0.2412     -0.0173
HHCA                0.4631          0.4620         0.3929      0.4086     0.0011

Gap = Train Pearson - Test Pearson; a negative value indicates higher correlation on test tasks.
Scalability Comparison
Computational overhead per action across methods
Method              Mean Time (ms)   Std Dev (ms)   Total Time (ms)   Overhead Ratio   Relative Cost
Outcome-Only        0.0026           0.0015         0.128             1.0x             Baseline
Attention-Rollout   0.1182           0.0469         5.910             45.5x            Low
HHCA                5.1032           34.6112        255.160           1962.8x          High
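The overhead ratios follow from dividing each method's mean per-action time by the Outcome-Only baseline's:

```python
# Mean per-action times from the scalability table (ms)
mean_ms = {"Outcome-Only": 0.0026, "Attention-Rollout": 0.1182, "HHCA": 5.1032}

baseline = mean_ms["Outcome-Only"]
ratios = {method: t / baseline for method, t in mean_ms.items()}
# HHCA: 5.1032 / 0.0026 ≈ 1962.8x; Attention-Rollout: ≈ 45.5x
```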
Methodology
The HHCA framework operates through a four-stage pipeline
01
Trajectory Collection
Generate 200 agentic reasoning trajectories across multi-step tasks with 10-100 action horizons. Each trajectory records token-level, tool-call, skill-selection, and memory operations with ground-truth credit annotations.
02
Hierarchical Decomposition
Decompose each trajectory into three levels: macro-level (skill selection), meso-level (tool calls and memory ops), and micro-level (token generation). Credit flows top-down through the hierarchy via hindsight analysis.
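The three levels can be modeled as nested records; the dataclass names below are illustrative sketches, not the framework's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MicroStep:      # micro level: token generation
    token: str
    credit: float = 0.0

@dataclass
class MesoStep:       # meso level: tool call or memory operation
    op: str           # e.g. "tool_call" or "memory_write" (hypothetical labels)
    micro: List[MicroStep] = field(default_factory=list)
    credit: float = 0.0

@dataclass
class MacroStep:      # macro level: skill selection
    skill: str
    meso: List[MesoStep] = field(default_factory=list)
    credit: float = 0.0

# A trajectory is a sequence of macro-level skill selections, each expanding
# into meso-level operations and micro-level tokens.
trajectory: List[MacroStep] = [
    MacroStep("search", [MesoStep("tool_call", [MicroStep("query")])]),
]
```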
03
Hindsight Credit Propagation
After observing outcomes, propagate credit backwards through the hierarchy. Each level receives credit proportional to its counterfactual contribution, using temporal difference decomposition within levels.
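A minimal sketch of the top-down split, assuming the counterfactual contribution weights at each level are already estimated (the `propagate_credit` helper and the specific weights are hypothetical; the paper's TD decomposition within levels is not reproduced here):

```python
def propagate_credit(outcome, weights):
    """Split `outcome` credit over one level's steps, in proportion to
    their (assumed precomputed) counterfactual contribution weights."""
    total = sum(weights) or 1.0
    return [outcome * w / total for w in weights]

# Macro level: outcome credit split over three skill selections
macro_credit = propagate_credit(1.0, [0.5, 0.3, 0.2])
# Meso level: the second skill's credit split over its two tool calls
meso_credit = propagate_credit(macro_credit[1], [0.6, 0.4])
```

Each call conserves credit: a level's steps sum back to the credit that level received from above.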
04
Evaluation and Transfer
Evaluate credit accuracy via Pearson/Spearman correlation against ground truth. Test horizon robustness, action-type sensitivity, cross-task transfer, and computational scalability across all methods.
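Credit accuracy against ground truth can be checked with standard linear and rank correlation; a self-contained sketch using NumPy, on toy values rather than the paper's data (Spearman is computed as Pearson on ranks, ignoring ties for simplicity):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation via the 2x2 correlation matrix."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on the ranks (no-tie case)."""
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

predicted = np.array([0.9, 0.1, 0.4, 0.8, 0.2])      # per-action credit from a method
ground_truth = np.array([1.0, 0.0, 0.3, 0.7, 0.1])   # annotated ground-truth credit

r = pearson(predicted, ground_truth)
rho = spearman(predicted, ground_truth)
# Here the two rankings agree exactly, so rho is 1.0 while r stays below 1.0
```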