ExpSeek as Rollout Augmentation for Agentic RL

Can experience-seeking strategies improve RL training convergence and sampling quality for web agents?

cs.CL · Agentic RL · 4 Strategies · 150 Epochs

Headline metrics:
- Success Rate (ExpSeek+BoN): 89.5%
- Relative Improvement vs Standard: 65.2%
- Mean Return (ExpSeek+BoN): 0.917
- Rollout Diversity: 0.956

Summary Comparison

| Strategy    | Success Rate | Mean Return | Coverage | Diversity |
|-------------|--------------|-------------|----------|-----------|
| Standard    | 0.542        | 0.578       | 0.854    | 0.954     |
| ExpSeek     | 0.527        | 0.563       | 0.849    | 0.954     |
| Best-of-N   | 0.872        | 0.897       | 0.799    | 0.956     |
| ExpSeek+BoN | 0.895        | 0.917       | 0.810    | 0.956     |
The hybrid ExpSeek+BoN strategy achieves the highest success rate (89.5%), demonstrating that experience-seeking mechanisms synergize with Best-of-N selection to improve RL training quality.
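To make the hybrid concrete, the sketch below shows how a Best-of-N selector can sit on top of any rollout sampler. The `sample_rollout` and `score` callables and the trajectory type are illustrative assumptions, not the experiment's actual interfaces.

```python
from typing import Callable, List, Tuple

# Hypothetical trajectory type: a list of (state, action, reward) steps.
Rollout = List[Tuple[int, int, float]]


def best_of_n(
    sample_rollout: Callable[[], Rollout],  # e.g. an ExpSeek-augmented sampler
    score: Callable[[Rollout], float],      # scores a complete trajectory
    n: int = 4,                             # N=4, matching the setup below
) -> Rollout:
    """Sample n candidate rollouts and keep the highest-scoring one."""
    candidates = [sample_rollout() for _ in range(n)]
    return max(candidates, key=score)
```

Under this reading, the Best-of-N row of the table uses the plain policy sampler as the candidate generator, while ExpSeek+BoN swaps in the experience-seeking sampler; the selection step is identical.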

Figures: Success Rate Comparison · Mean Return Comparison · Training Convergence · Coverage vs Success Rate · Diversity Metrics

Standard rollouts achieve the highest coverage (0.854) but, together with ExpSeek, the lowest success rates. BoN methods sacrifice coverage for quality through selection pressure. ExpSeek+BoN partially recovers coverage (0.810 vs 0.799 for plain Best-of-N) while maintaining the highest success rate.
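The report does not spell out how coverage and diversity are computed. One plausible reading, sketched below, treats coverage as the fraction of the 20 environment states visited across a rollout batch and diversity as the mean pairwise dissimilarity of visited-state sets; both definitions are assumptions.

```python
from itertools import combinations
from typing import List, Set


def state_coverage(rollouts: List[List[int]], num_states: int = 20) -> float:
    """Fraction of all environment states visited at least once in a rollout batch."""
    visited: Set[int] = {s for rollout in rollouts for s in rollout}
    return len(visited) / num_states


def pairwise_diversity(rollouts: List[List[int]]) -> float:
    """Mean pairwise dissimilarity between rollouts (1 - Jaccard overlap of visited states)."""
    pairs = list(combinations(rollouts, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        sa, sb = set(a), set(b)
        union = sa | sb
        total += 1.0 - (len(sa & sb) / len(union) if union else 1.0)
    return total / len(pairs)
```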

Experiment Setup

States: 20
Actions per State: 5
Episode Length: 10
Task Configurations: 8
Training Epochs: 150
Rollouts per Epoch: 32
ExpSeek Confidence Threshold: 0.3
Max Backtracks: 3
Best-of-N: N=4
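For reference, the setup above can be captured in a small config object; the class and field names below are illustrative, not taken from the experiment code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpSeekConfig:
    """Experiment parameters from the setup table (field names are illustrative)."""
    num_states: int = 20
    actions_per_state: int = 5
    episode_length: int = 10
    task_configurations: int = 8
    training_epochs: int = 150
    rollouts_per_epoch: int = 32
    expseek_confidence_threshold: float = 0.3
    max_backtracks: int = 3
    best_of_n: int = 4
```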

Key Insight

ExpSeek alone does not consistently improve over standard rollouts (52.7% vs 54.2% success), but when combined with Best-of-N selection, it provides higher-quality diverse candidates for selection. This suggests ExpSeek is best understood as a sampling quality enhancer rather than a standalone training improvement -- its targeted exploration of decision-critical states creates better candidates for reward-based selection.