Hierarchical Hindsight Credit Assignment

A three-level decomposition framework for long-horizon agentic reasoning that achieves state-of-the-art credit assignment accuracy across diverse action hierarchies.

ArXiv: AI | 200 Trajectories | 4 Action Types | 3 Methods Compared
Key Metrics
HHCA framework performance highlights across all experiments
Pearson Correlation: 0.4507 (HHCA credit accuracy, r)
Improvement: 78.3% (over the Outcome-Only baseline)
Transfer Gap: 0.0011 (near-zero cross-task gap)
Trajectories: 200 (across 4 horizon bins)
Key Findings
Five principal discoveries from the HHCA evaluation
Finding 01
Superior Credit Accuracy
HHCA achieves r=0.4507 Pearson and 0.5588 Spearman correlation, substantially outperforming both Outcome-Only (r=0.2526) and Attention-Rollout (r=0.1955) baselines.
Finding 02
Horizon Robustness
Unlike baselines that degrade with longer horizons, HHCA maintains stable Pearson correlation (0.44-0.46) across all horizon bins from 10-step to 100-step trajectories.
Finding 03
Action-Type Sensitivity
HHCA excels at skill-selection credit (r=0.4462) and tool-call credit (r=0.4398), correctly attributing higher importance to strategic decisions over token-level actions.
Finding 04
Cross-Task Generalization
Transfer gap of only 0.0011 between train (r=0.4631) and test (r=0.4620) tasks demonstrates robust generalization without task-specific tuning.
Finding 05
Accuracy-Cost Tradeoff
While HHCA incurs higher computational overhead (5.1 ms per action vs. 0.003 ms for the Outcome-Only baseline), the 78.3% accuracy improvement justifies the cost for applications requiring precise credit assignment.
Credit Accuracy Comparison
Pearson, Spearman, Precision@K, and Recall@K across three methods
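Precision@K and Recall@K are presumably computed from the overlap between the top-K actions ranked by predicted credit and the top-K by ground-truth credit; a minimal sketch under that assumption (the helper name and toy values are illustrative, not from the paper):

```python
import numpy as np

def precision_recall_at_k(predicted, ground_truth, k):
    """Overlap between the top-k actions by predicted vs. ground-truth credit.

    With equal set sizes, precision@k and recall@k coincide; both are
    returned for clarity.
    """
    pred_top = set(np.argsort(predicted)[-k:])      # indices of k largest predicted credits
    true_top = set(np.argsort(ground_truth)[-k:])   # indices of k largest true credits
    overlap = len(pred_top & true_top)
    return overlap / k, overlap / len(true_top)

# Toy example: both rankings agree on the two most important actions
pred = np.array([0.9, 0.1, 0.4, 0.8, 0.2])
truth = np.array([1.0, 0.0, 0.3, 0.7, 0.1])
p, r = precision_recall_at_k(pred, truth, k=2)
# top-2 predicted {0, 3} equals top-2 true {0, 3}, so both scores are 1.0
```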
Horizon Robustness Analysis
Pearson correlation by trajectory length bin
Action Type Analysis
Pearson correlation by action type for each method
Cross-Task Transfer
Train vs. test correlations and transfer gap per method
Method              Train Pearson   Test Pearson   Train P@K   Test P@K   Gap
Outcome-Only        0.2561          0.2463         0.2514      0.2219     0.0098
Attention-Rollout   0.1856          0.2029         0.2466      0.2412     -0.0173
HHCA                0.4631          0.4620         0.3929      0.4086     0.0011

Gap = Train Pearson - Test Pearson; a negative value indicates higher correlation on test tasks.
Scalability Comparison
Computational overhead per action across methods
Method              Mean Time (ms)   Std Dev (ms)   Total Time (ms)   Overhead Ratio   Relative Cost
Outcome-Only        0.0026           0.0015         0.128             1.0x             Baseline
Attention-Rollout   0.1182           0.0469         5.910             45.5x            Low
HHCA                5.1032           34.6112        255.160           1962.8x          High
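The overhead ratios follow from dividing each method's mean per-action time by the Outcome-Only baseline's:

```python
# Mean per-action times from the scalability table (ms)
mean_ms = {"Outcome-Only": 0.0026, "Attention-Rollout": 0.1182, "HHCA": 5.1032}

baseline = mean_ms["Outcome-Only"]
ratios = {method: t / baseline for method, t in mean_ms.items()}
# HHCA: 5.1032 / 0.0026 ≈ 1962.8x; Attention-Rollout: ≈ 45.5x
```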
Methodology
The HHCA framework operates through a four-stage pipeline
01
Trajectory Collection
Generate 200 agentic reasoning trajectories across multi-step tasks with 10-100 action horizons. Each trajectory records token-level, tool-call, skill-selection, and memory operations with ground-truth credit annotations.
02
Hierarchical Decomposition
Decompose each trajectory into three levels: macro-level (skill selection), meso-level (tool calls and memory ops), and micro-level (token generation). Credit flows top-down through the hierarchy via hindsight analysis.
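The three levels can be modeled as nested records; the dataclass names below are illustrative sketches, not the framework's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MicroStep:      # micro level: token generation
    token: str
    credit: float = 0.0

@dataclass
class MesoStep:       # meso level: tool call or memory operation
    op: str           # e.g. "tool_call" or "memory_write" (hypothetical labels)
    micro: List[MicroStep] = field(default_factory=list)
    credit: float = 0.0

@dataclass
class MacroStep:      # macro level: skill selection
    skill: str
    meso: List[MesoStep] = field(default_factory=list)
    credit: float = 0.0

# A trajectory is a sequence of macro-level skill selections, each expanding
# into meso-level operations and micro-level tokens.
trajectory: List[MacroStep] = [
    MacroStep("search", [MesoStep("tool_call", [MicroStep("query")])]),
]
```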
03
Hindsight Credit Propagation
After observing outcomes, propagate credit backwards through the hierarchy. Each level receives credit proportional to its counterfactual contribution, using temporal difference decomposition within levels.
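A minimal sketch of the top-down split, assuming the counterfactual contribution weights at each level are already estimated (the `propagate_credit` helper and the specific weights are hypothetical; the paper's TD decomposition within levels is not reproduced here):

```python
def propagate_credit(outcome, weights):
    """Split `outcome` credit over one level's steps, in proportion to
    their (assumed precomputed) counterfactual contribution weights."""
    total = sum(weights) or 1.0
    return [outcome * w / total for w in weights]

# Macro level: outcome credit split over three skill selections
macro_credit = propagate_credit(1.0, [0.5, 0.3, 0.2])
# Meso level: the second skill's credit split over its two tool calls
meso_credit = propagate_credit(macro_credit[1], [0.6, 0.4])
```

Each call conserves credit: a level's steps sum back to the credit that level received from above.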
04
Evaluation and Transfer
Evaluate credit accuracy via Pearson/Spearman correlation against ground truth. Test horizon robustness, action-type sensitivity, cross-task transfer, and computational scalability across all methods.
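Credit accuracy against ground truth can be checked with standard linear and rank correlation; a self-contained sketch using NumPy, on toy values rather than the paper's data (Spearman is computed as Pearson on ranks, ignoring ties for simplicity):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation via the 2x2 correlation matrix."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on the ranks (no-tie case)."""
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

predicted = np.array([0.9, 0.1, 0.4, 0.8, 0.2])      # per-action credit from a method
ground_truth = np.array([1.0, 0.0, 0.3, 0.7, 0.1])   # annotated ground-truth credit

r = pearson(predicted, ground_truth)
rho = spearman(predicted, ground_truth)
# Here the two rankings agree exactly, so rho is 1.0 while r stays below 1.0
```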