Can experience-seeking strategies improve RL training convergence and sampling quality for web agents?
| Strategy | Success Rate | Mean Return | Coverage | Diversity |
|---|---|---|---|---|
| Standard | 0.542 | 0.578 | 0.854 | 0.954 |
| ExpSeek | 0.527 | 0.563 | 0.849 | 0.954 |
| Best-of-N | 0.872 | 0.897 | 0.799 | 0.956 |
| ExpSeek+BoN | 0.895 | 0.917 | 0.810 | 0.956 |

| Parameter | Value |
|---|---|
| States | 20 |
| Actions per State | 5 |
| Episode Length | 10 |
| Task Configurations | 8 |
| Training Epochs | 150 |
| Rollouts per Epoch | 32 |
| ExpSeek Confidence Threshold | 0.3 |
| Max Backtracks | 3 |
| Best-of-N | N = 4 |
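
For concreteness, here is a minimal sketch of how an ExpSeek rollout might use the confidence threshold and backtrack budget from the table above. The environment interface (`reset`, `step`, `snapshot`, `restore`), the `policy` callable returning action probabilities, and the probe-then-backtrack heuristic are all illustrative assumptions, not the actual implementation:

```python
import random

def expseek_rollout(env, policy, conf_threshold=0.3, max_backtracks=3, horizon=10):
    """Roll out one episode, probing alternatives at low-confidence states.

    Assumed (hypothetical) interface: env.reset() -> state,
    env.step(a) -> (state, reward, done), env.snapshot()/env.restore(s)
    for backtracking, and policy(state) -> dict mapping actions to probs.
    """
    state = env.reset()
    trajectory, total_return = [], 0.0
    backtracks_left = max_backtracks

    for _ in range(horizon):
        probs = policy(state)                       # action -> probability
        greedy = max(probs, key=probs.get)

        if probs[greedy] < conf_threshold and backtracks_left > 0:
            # Decision-critical state: checkpoint, probe a non-greedy action,
            # and back off to the greedy one if the probe earns no reward.
            checkpoint = env.snapshot()
            probe = random.choice([a for a in probs if a != greedy])
            next_state, reward, done = env.step(probe)
            if reward <= 0.0:
                env.restore(checkpoint)             # backtrack
                backtracks_left -= 1
                action = greedy
                next_state, reward, done = env.step(greedy)
            else:
                action = probe
        else:
            action = greedy
            next_state, reward, done = env.step(greedy)

        trajectory.append((state, action, reward))
        total_return += reward
        state = next_state
        if done:
            break
    return trajectory, total_return
```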
ExpSeek alone does not consistently improve over standard rollouts (52.7% vs. 54.2% success), but combined with Best-of-N selection it yields the best results (89.5% vs. 87.2% for Best-of-N alone), supplying higher-quality diverse candidates for selection. This suggests ExpSeek is best understood as a sampling-quality enhancer rather than a standalone training improvement: its targeted exploration of decision-critical states creates better candidates for reward-based selection.
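
Read this way, the ExpSeek+BoN row amounts to: draw N candidate rollouts with ExpSeek and keep the best one. A minimal sketch, reusing the hypothetical `expseek_rollout` above; selecting by realized return is an assumption, since the source could equally score candidates with a learned reward model:

```python
def best_of_n(env, policy, n=4, **rollout_kwargs):
    """Draw n ExpSeek rollouts and keep the highest-return trajectory."""
    candidates = [expseek_rollout(env, policy, **rollout_kwargs) for _ in range(n)]
    # Each candidate is (trajectory, total_return); select by return.
    return max(candidates, key=lambda cand: cand[1])
```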