Avoiding Diversity Collapse in RL with Execution Rewards

This post compares three algorithms -- QD-GRPO, MaxEnt-GRPO, and Population-GRPO -- designed to prevent diversity collapse when training open-ended research idea generation with GRPO on execution rewards.
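The common ingredient across these variants is shaping the group-relative advantage so that samples which differ from the rest of their group are rewarded. The sketch below is a minimal, hypothetical illustration of that idea (the function name, the `beta` weight, and the cosine-distance bonus are assumptions for illustration; the actual QD/MaxEnt/Pop variants each shape the objective differently):

```python
import numpy as np

def grpo_advantages_with_diversity(rewards, embeddings, beta=0.1):
    """Group-relative advantages with an added diversity bonus (illustrative).

    rewards:    (G,) execution rewards for one prompt's group of G samples
    embeddings: (G, d) unit-norm embeddings of the sampled ideas
    beta:       weight of the diversity bonus (hypothetical hyperparameter)
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    G = len(rewards)
    # Diversity bonus: each sample's mean cosine distance to the rest of its group.
    sims = emb @ emb.T                      # pairwise cosine similarities
    mean_dist = (1.0 - sims).sum(axis=1) / (G - 1)
    shaped = rewards + beta * mean_dist
    # Standard GRPO normalization: center and scale within the group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# Example: three samples, two of which are near-duplicates.
adv = grpo_advantages_with_diversity(
    rewards=[1.0, 0.0, 0.5],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
)
```

A sample that ties on reward but sits farther from its group-mates gets a slightly larger advantage, which is the mechanism that counteracts collapse onto a single high-reward mode.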

Headline numbers:

- Best max reward: 0.959 (Pop-GRPO)
- Pop-GRPO diversity: 0.997
- Baseline max reward: 0.841
- Baseline diversity: 0.968

[Figures: method comparison -- max reward vs. diversity, performance metrics, reward-diversity tradeoff, and complexity (thinking trace length)]

Detailed Results

Main Comparison (Last 10 Epochs)

| Method | Mean Reward | Max Reward | Diversity | Complexity |
|---|---|---|---|---|
| GRPO (Baseline) | 0.072 | 0.841 | 0.968 | 3.43 |
| QD-GRPO | 0.081 | 0.812 | 0.943 | 3.43 |
| MaxEnt-GRPO | 0.008 | 0.442 | 0.994 | 28.86 |
| Pop-GRPO | 0.087 | 0.959 | 0.997 | 3.75 |