Calibrated Stop/Continue Criteria Under Distribution Shift

Comparing stopping strategies across retrievers, corpora, and LLM backbones

Best ECE: 0.103 (Bayesian)
Best Accuracy: 0.481 (Fixed-5)
Configurations: 36

Charts: ECE Comparison Across Criteria; Calibration Diagram; Noise Sensitivity (ECE); Accuracy by Hop Depth