Three algorithms (QD-GRPO, MaxEnt-GRPO, and Population-GRPO, abbreviated Pop-GRPO below) to prevent convergence collapse in open-ended research idea generation with GRPO.

| Method | Mean Reward | Max Reward | Diversity | Complexity |
|---|---|---|---|---|
| GRPO (Baseline) | 0.072 | 0.841 | 0.968 | 3.43 |
| QD-GRPO | 0.081 | 0.812 | 0.943 | 3.43 |
| MaxEnt-GRPO | 0.008 | 0.442 | 0.994 | 28.86 |
| Pop-GRPO | 0.087 | 0.959 | 0.997 | 3.75 |
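
To make the table easier to read against the baseline, the sketch below computes each method's percentage change relative to the GRPO (Baseline) row. The numbers are copied from the table as-is; the names `METRICS`, `RESULTS`, and `relative_change` are illustrative and not part of the original code. The sketch does not assume which direction is better for any metric, since the table does not say so for Complexity.

```python
# Values copied verbatim from the results table; nothing here is re-measured.
METRICS = ("Mean Reward", "Max Reward", "Diversity", "Complexity")
RESULTS = {
    "GRPO (Baseline)": (0.072, 0.841, 0.968, 3.43),
    "QD-GRPO": (0.081, 0.812, 0.943, 3.43),
    "MaxEnt-GRPO": (0.008, 0.442, 0.994, 28.86),
    "Pop-GRPO": (0.087, 0.959, 0.997, 3.75),
}


def relative_change(results, baseline):
    """Return the percentage change of every non-baseline method per metric."""
    base = results[baseline]
    changes = {}
    for method, values in results.items():
        if method == baseline:
            continue
        changes[method] = {
            metric: 100.0 * (value - ref) / ref
            for metric, value, ref in zip(METRICS, values, base)
        }
    return changes


# Hypothetical usage: print signed percentage changes against the baseline row.
rel = relative_change(RESULTS, "GRPO (Baseline)")
for method, per_metric in rel.items():
    summary = ", ".join(f"{m}: {c:+.1f}%" for m, c in per_metric.items())
    print(f"{method}: {summary}")
```

On these numbers, Pop-GRPO's mean reward is roughly 21% above the baseline, while MaxEnt-GRPO's is roughly 89% below it.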