Boundary-Aware Policy Optimization (BAPO) keeps a consistent advantage over the baselines at every scale from 1.5B to 72B parameters on multi-hop QA benchmarks: its F1 exceeds the best baseline (DAPO at every scale) by 0.09 to 0.11.

| Scale | BAPO F1 | Best Baseline F1 | Gap |
|---|---|---|---|
| 1.5B | 0.5791 | 0.4869 (DAPO) | +0.0922 |
| 3B | 0.6013 | 0.5043 (DAPO) | +0.0970 |
| 7B | 0.6313 | 0.5373 (DAPO) | +0.0941 |
| 14B | 0.6569 | 0.5511 (DAPO) | +0.1058 |
| 32B | 0.6791 | 0.5857 (DAPO) | +0.0933 |
| 72B | 0.7030 | 0.5951 (DAPO) | +0.1079 |
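
The Gap column can be checked directly from the two F1 columns. The sketch below recomputes it from the rounded values in the table and also reports the gap relative to the baseline; the variable names are illustrative and not taken from the source. The recomputed gaps at 7B and 32B differ from the reported ones by 0.0001, which would be consistent with the reported gaps having been computed before rounding (an assumption, not something the table states).

```python
# Rows copied from the scale table:
# (scale, BAPO F1, best-baseline F1, reported gap).
rows = [
    ("1.5B", 0.5791, 0.4869, 0.0922),
    ("3B",   0.6013, 0.5043, 0.0970),
    ("7B",   0.6313, 0.5373, 0.0941),
    ("14B",  0.6569, 0.5511, 0.1058),
    ("32B",  0.6791, 0.5857, 0.0933),
    ("72B",  0.7030, 0.5951, 0.1079),
]

gaps = []
max_diff = 0.0  # largest |recomputed gap - reported gap| over all rows
for scale, bapo_f1, base_f1, reported_gap in rows:
    gap = bapo_f1 - base_f1      # absolute F1 gap
    relative = gap / base_f1     # gap as a fraction of the baseline F1
    gaps.append(gap)
    max_diff = max(max_diff, abs(gap - reported_gap))
    print(f"{scale:>4}  gap={gap:.4f}  reported={reported_gap:.4f}  "
          f"relative={relative:.1%}")

print(f"smallest gap={min(gaps):.4f}, largest gap={max(gaps):.4f}")
```
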

| Method | Acc. Slope | Acc. R² | F1 Slope | F1 R² |
|---|---|---|---|---|
| SFT | 0.0662 | 0.920 | 0.0484 | 0.975 |
| GRPO | 0.0848 | 0.901 | 0.0666 | 0.977 |
| PPO | 0.0716 | 0.907 | 0.0549 | 0.928 |
| DAPO | 0.1028 | 0.975 | 0.0678 | 0.985 |
| BAPO | 0.0898 | 0.990 | 0.0744 | 0.997 |
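
The table does not say what the slopes are fitted against. A minimal sketch, assuming an ordinary least-squares fit of the metric on log10 of the parameter count (in billions): under that assumption, the F1 values from the scale table give a slope and R² close to the reported BAPO entries (0.0744 and 0.997) and DAPO entries (0.0678 and 0.985). The helper `fit_line` is a hypothetical name, and the DAPO series is taken from the Best Baseline F1 column, where DAPO is the best baseline at every scale.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares y = slope * x + intercept; returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared

# ASSUMPTION: the regressor is log10(parameters in billions). Changing the
# unit only shifts x by a constant, so the slope and R^2 are unaffected.
params_b = [1.5, 3, 7, 14, 32, 72]
log_params = [math.log10(p) for p in params_b]

# F1 values copied from the scale table.
f1 = {
    "BAPO": [0.5791, 0.6013, 0.6313, 0.6569, 0.6791, 0.7030],
    "DAPO": [0.4869, 0.5043, 0.5373, 0.5511, 0.5857, 0.5951],
}

fits = {name: fit_line(log_params, ys) for name, ys in f1.items()}
for name, (slope, intercept, r2) in fits.items():
    print(f"{name}: F1 slope per decade of parameters={slope:.4f}, R^2={r2:.3f}")
```

Read this way, a slope of 0.0744 means roughly 0.074 F1 gained for every tenfold increase in parameters.
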

| Method | IDK Rate | Error Rate | Cal. Error | IDK-Err Corr. |
|---|---|---|---|---|
| SFT | 0.0238 | 0.5181 | 0.4943 | 0.0758 |
| GRPO | 0.0379 | 0.4491 | 0.4113 | 0.1511 |
| PPO | 0.0342 | 0.4685 | 0.4343 | 0.4888 |
| DAPO | 0.0502 | 0.4279 | 0.3777 | 0.2149 |
| BAPO | 0.1203 | 0.3948 | 0.2745 | 0.2344 |
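
The table does not define Cal. Error, but the numbers are consistent with it being the absolute difference between Error Rate and IDK Rate. The sketch below checks that reading against every row; the formula is inferred from the values, not stated in the source, and the variable names are illustrative.

```python
# Rows copied from the table: (method, IDK rate, error rate, reported Cal. Error).
rows = [
    ("SFT",  0.0238, 0.5181, 0.4943),
    ("GRPO", 0.0379, 0.4491, 0.4113),
    ("PPO",  0.0342, 0.4685, 0.4343),
    ("DAPO", 0.0502, 0.4279, 0.3777),
    ("BAPO", 0.1203, 0.3948, 0.2745),
]

# ASSUMPTION (inferred from the numbers): Cal. Error = |Error Rate - IDK Rate|.
cal_errors = {}
max_diff = 0.0  # largest |recomputed - reported| over all rows
for method, idk_rate, error_rate, reported in rows:
    cal_error = abs(error_rate - idk_rate)
    cal_errors[method] = cal_error
    max_diff = max(max_diff, abs(cal_error - reported))
    print(f"{method:>4}  recomputed={cal_error:.4f}  reported={reported:.4f}")

# Method with the smallest recomputed calibration error.
best_method = min(cal_errors, key=cal_errors.get)
print(f"lowest Cal. Error: {best_method}")
```

Under this reading, a method lowers its Cal. Error either by making fewer errors or by answering IDK more often; BAPO does both relative to every baseline in the table.
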