Quantifying the impact of standardized trace collection on agent evaluation reliability

| Regime | Reproducibility | Completeness | Effective Leakage | Schema Compliance |
|---|---|---|---|---|
| No Protocol | 0.393 | 0.297 | 0.091 | 0.000 |
| Partial Logging | 0.507 | 0.603 | 0.100 | 0.000 |
| Full Logging | 0.750 | 0.949 | 0.104 | 0.720 |
| Full + Sanitized | 0.939 | 0.948 | 0.099 | 0.735 |
| Full Protocol | 0.981 | 0.979 | 0.010 | 0.890 |
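
The comparison across regimes can be made explicit with a little arithmetic. The following is a minimal sketch rather than part of the original evaluation: the numbers are copied from the table, while the dictionary keys (`reproducibility`, `leakage`, and so on) and the helper names `delta_vs_baseline` and `stepwise_gain` are illustrative assumptions introduced here.

```python
# Scores copied from the table above. The key names and helper functions
# are hypothetical labels for illustration, not part of any published tool.
RESULTS = {
    "No Protocol":      {"reproducibility": 0.393, "completeness": 0.297,
                         "leakage": 0.091, "schema_compliance": 0.000},
    "Partial Logging":  {"reproducibility": 0.507, "completeness": 0.603,
                         "leakage": 0.100, "schema_compliance": 0.000},
    "Full Logging":     {"reproducibility": 0.750, "completeness": 0.949,
                         "leakage": 0.104, "schema_compliance": 0.720},
    "Full + Sanitized": {"reproducibility": 0.939, "completeness": 0.948,
                         "leakage": 0.099, "schema_compliance": 0.735},
    "Full Protocol":    {"reproducibility": 0.981, "completeness": 0.979,
                         "leakage": 0.010, "schema_compliance": 0.890},
}


def delta_vs_baseline(results, metric, baseline="No Protocol"):
    """Absolute change in `metric` for each regime relative to `baseline`."""
    base = results[baseline][metric]
    return {regime: round(scores[metric] - base, 3)
            for regime, scores in results.items()}


def stepwise_gain(results, metric):
    """Absolute change in `metric` between consecutive regimes, in table order."""
    regimes = list(results)  # dicts preserve insertion order (Python 3.7+)
    return {f"{prev} -> {curr}": round(results[curr][metric] - results[prev][metric], 3)
            for prev, curr in zip(regimes, regimes[1:])}


if __name__ == "__main__":
    for metric in ("reproducibility", "completeness", "leakage", "schema_compliance"):
        print(metric)
        for step, gain in stepwise_gain(RESULTS, metric).items():
            print(f"  {step}: {gain:+.3f}")
```

Read this way, reproducibility rises by 0.588 in absolute terms from No Protocol to Full Protocol, while the leakage column stays between 0.091 and 0.104 for the first four regimes and falls to 0.010 only at the final step.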