Avoiding Diversity Collapse in RL with Execution Rewards

This post compares three algorithms -- QD-GRPO, MaxEnt-GRPO, and Population-GRPO -- designed to prevent diversity collapse when training open-ended research idea generation with GRPO on execution rewards.
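The common ingredient across these variants is shaping the group-relative advantage so that samples which differ from the rest of their group are rewarded. The sketch below is a minimal, hypothetical illustration of that idea (the function name, the `beta` weight, and the cosine-distance bonus are assumptions for illustration; the actual QD/MaxEnt/Pop variants each shape the objective differently):

```python
import numpy as np

def grpo_advantages_with_diversity(rewards, embeddings, beta=0.1):
    """Group-relative advantages with an added diversity bonus (illustrative).

    rewards:    (G,) execution rewards for one prompt's group of G samples
    embeddings: (G, d) unit-norm embeddings of the sampled ideas
    beta:       weight of the diversity bonus (hypothetical hyperparameter)
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    G = len(rewards)
    # Diversity bonus: each sample's mean cosine distance to the rest of its group.
    sims = emb @ emb.T                      # pairwise cosine similarities
    mean_dist = (1.0 - sims).sum(axis=1) / (G - 1)
    shaped = rewards + beta * mean_dist
    # Standard GRPO normalization: center and scale within the group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# Example: three samples, two of which are near-duplicates.
adv = grpo_advantages_with_diversity(
    rewards=[1.0, 0.0, 0.5],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
)
```

A sample that ties on reward but sits farther from its group-mates gets a slightly larger advantage, which is the mechanism that counteracts collapse onto a single high-reward mode.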

Headline numbers:

- Best max reward: 0.959 (Pop-GRPO)
- Pop-GRPO diversity: 0.997
- Baseline max reward: 0.841
- Baseline diversity: 0.968

[Figures: method comparison -- max reward vs. diversity, performance metrics, reward-diversity tradeoff, and complexity (thinking trace length)]

Detailed Results

Main Comparison (Last 10 Epochs)

| Method | Mean Reward | Max Reward | Diversity | Complexity |
|---|---|---|---|---|
| GRPO (Baseline) | 0.072 | 0.841 | 0.968 | 3.43 |
| QD-GRPO | 0.081 | 0.812 | 0.943 | 3.43 |
| MaxEnt-GRPO | 0.008 | 0.442 | 0.994 | 28.86 |
| Pop-GRPO | 0.087 | 0.959 | 0.997 | 3.75 |