Principled Mitigation of Spurious Linguistic Artifacts in SDFT

Explore how counterfactual token weighting compares to heuristic masking for preventing student models from inheriting teacher-conditioned artifacts.
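The contrast between the two strategies can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the method behind this page: the "boost" of a token is taken to be its teacher log-probability with the conditioning context minus its log-probability without it, counterfactual weighting is assumed to shrink the loss weight of tokens whose boost exceeds the threshold tau, and Mask-k is assumed to mean zeroing the loss on the first k tokens. The function names cf_weights and mask_k_weights are hypothetical.

```python
import math

def cf_weights(cond_logprobs, uncond_logprobs, tau=1.0):
    """Per-token loss weights from a counterfactual comparison (assumed form).

    A token whose log-prob rises by more than tau when the teacher sees its
    conditioning context is treated as a likely artifact and down-weighted
    exponentially in the excess boost. Other tokens keep weight 1.0.
    """
    weights = []
    for lp_cond, lp_uncond in zip(cond_logprobs, uncond_logprobs):
        boost = lp_cond - lp_uncond
        if boost <= tau:
            weights.append(1.0)
        else:
            weights.append(math.exp(-(boost - tau)))
    return weights

def mask_k_weights(n_tokens, k):
    """Heuristic baseline (assumed form): ignore the first k tokens entirely."""
    return [0.0 if i < k else 1.0 for i in range(n_tokens)]

# Toy example: the first token is boosted by 3.0 log-prob units under conditioning.
cond = [-0.5, -1.0, -0.2]
uncond = [-3.5, -1.2, -0.3]
weights = cf_weights(cond, uncond, tau=1.0)
mask = mask_k_weights(len(cond), k=1)
```

In this toy case both schemes suppress the first token, but the counterfactual weights depend on the measured boost rather than on position, so an artifact appearing later in the sequence would still be down-weighted while the positional mask would miss it.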

Controls:
Artifact Boost (log-prob): 3.0
CF Threshold tau: 1.0
Mask-k (heuristic)
Training Epochs

Reported metrics:
Naive Artifact Rate
Mask-k Artifact Rate
CF Weighting Artifact Rate
CF Task Performance