Productionizing Activation Capping & Preventative Training-Time Steering

Practical methods for stabilizing language model personas: inference-time activation capping along the Assistant Axis, and preventative steering applied at training time.
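The core inference-time mechanism can be sketched as follows. This is an illustrative sketch, not the deployed implementation: the function name `cap_along_axis`, the toy vectors, and the cap value are all assumptions; the idea is that the activation's scalar projection onto a unit persona axis is clamped to a threshold while the orthogonal component is left untouched.

```python
import numpy as np

def cap_along_axis(activation, axis, cap):
    """Cap the component of `activation` along a unit persona axis.

    If the projection onto `axis` exceeds `cap`, the excess along the
    axis is subtracted; everything orthogonal to the axis is preserved.
    """
    axis = axis / np.linalg.norm(axis)   # ensure unit norm
    proj = float(activation @ axis)      # scalar projection onto the axis
    if proj > cap:
        activation = activation - (proj - cap) * axis
    return activation

# Toy example: a 4-d activation with an oversized component on the axis.
axis = np.array([1.0, 0.0, 0.0, 0.0])
h = np.array([5.0, 1.0, -2.0, 0.5])
capped = cap_along_axis(h, axis, cap=2.0)
# The on-axis component is clamped to the cap; the rest is unchanged.
```

Because only the excess along one direction is removed, the intervention is minimal, which is consistent with the high capability-preservation figures reported below.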

- Optimal F1 Score (threshold = 0.4): 0.985
- Harm Reduction at Optimal: 100%
- Capability Preserved: 96.97%
- Best Drift (Combined Aux+Reg): 0.709
- Overhead at 175B Scale: 0.0007%

Experiment 1: Axis Estimation Methods

Axis Alignment vs. Noise Level

Axis Alignment vs. Sample Size
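One standard way to estimate such an axis, and the one this sketch assumes (the experiment may use a different estimator), is a difference of means between activations from on-persona and off-persona prompts. The toy setup below shows the two trends the charts track: alignment with the true axis degrades with noise and improves with sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)

def estimate_axis(n_samples, noise):
    """Difference-of-means estimate: mean(on-persona) - mean(off-persona)."""
    pos = true_axis + noise * rng.normal(size=(n_samples, d))
    neg = -true_axis + noise * rng.normal(size=(n_samples, d))
    est = pos.mean(axis=0) - neg.mean(axis=0)
    return est / np.linalg.norm(est)

def alignment(est):
    """|cosine similarity| between the estimate and the true axis."""
    return abs(float(est @ true_axis))

# Few noisy samples vs. many cleaner samples.
low = alignment(estimate_axis(n_samples=8, noise=6.0))
high = alignment(estimate_axis(n_samples=512, noise=1.0))
```

Averaging shrinks the noise term roughly as 1/sqrt(n_samples), so `high` ends up far closer to 1 than `low`.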

Experiment 2: Capping Threshold Optimization

Harm Reduction & Capability vs. Threshold

Calibration Sensitivity
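The threshold search itself amounts to a sweep over candidate caps, scoring each by F1 against labeled prompts. The sketch below uses synthetic stand-in data (the Gaussian parameters and sample counts are invented for illustration, not the experiment's data); with well-separated benign and harmful projection distributions, the optimum lands near their midpoint, mirroring the reported optimum at threshold 0.4.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in data: projections onto the persona axis, with harmful
# prompts sitting higher along the axis than benign ones.
benign = rng.normal(loc=0.0, scale=0.2, size=500)
harmful = rng.normal(loc=0.8, scale=0.2, size=500)
proj = np.concatenate([benign, harmful])
label = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = harmful

def f1_at(threshold):
    """F1 of flagging 'projection > threshold' as harmful."""
    pred = (proj > threshold).astype(float)
    tp = np.sum((pred == 1) & (label == 1))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

thresholds = np.linspace(0.0, 1.0, 21)
scores = [f1_at(t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
```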

Experiment 3: Training-Time Steering

Persona Drift During Training (200 Epochs)

Steering Method Comparison

| Method | Final Drift | Defense Score | ASR |
| --- | --- | --- | --- |
| No Steering | 0.952 | 0.8 | 0.2 |
| Auxiliary Loss | 0.727 | 1.0 | 0.0 |
| Activation Reg. | 0.945 | 0.8 | 0.2 |
| Contrastive Grad. | 0.953 | 0.8 | 0.2 |
| Combined (Aux+Reg) | 0.709 | 1.0 | 0.0 |
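The best single method above, the auxiliary loss, adds a persona-drift penalty to the task loss during training. A minimal sketch of that penalty term, assuming a squared-projection formulation (the function name, `target`, and `weight` are illustrative, not the experiment's exact hyperparameters):

```python
import numpy as np

def auxiliary_steering_loss(activations, persona_axis, target=0.0, weight=0.1):
    """Penalize drift of hidden activations along the persona axis.

    Added to the task loss during training, steering the model away
    from persona drift preventatively rather than patching activations
    at inference time. `target` is the desired mean projection.
    """
    axis = persona_axis / np.linalg.norm(persona_axis)
    proj = activations @ axis  # per-example projections onto the axis
    return weight * float(np.mean((proj - target) ** 2))

# Activations drifted along the axis incur a larger penalty than
# activations centered at the target.
rng = np.random.default_rng(2)
axis = rng.normal(size=16)
unit = axis / np.linalg.norm(axis)
drifted = rng.normal(size=(32, 16)) + 3.0 * unit
centered = rng.normal(size=(32, 16))
loss_drifted = auxiliary_steering_loss(drifted, axis)
loss_centered = auxiliary_steering_loss(centered, axis)
```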

Experiment 4: Scalability Analysis

Capping Overhead vs. Model Size

Scalability Details

| Model Size | Overhead (%) | Cap Latency (µs) | Memory (KB) |
| --- | --- | --- | --- |
| 125M | 0.0074 | 0.13 | 36 |
| 350M | 0.0070 | 0.34 | 96 |
| 1.3B | 0.0038 | 0.67 | 192 |
| 6.7B | 0.0020 | 1.79 | 512 |
| 13B | 0.0016 | 2.80 | 800 |
| 70B | 0.0009 | 8.95 | 2,560 |
| 175B | 0.0007 | 16.11 | 4,608 |

Log-linear scaling: slope = -0.349, R² = 0.990, p < 10⁻⁵
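The reported fit can be reproduced directly from the table: regress log overhead on log parameter count. The negative slope means the relative cost of capping shrinks as the model grows, since the capping work is fixed per layer while forward-pass cost grows with model size.

```python
import numpy as np

# Parameter counts and overhead percentages from the scalability table.
params = np.array([125e6, 350e6, 1.3e9, 6.7e9, 13e9, 70e9, 175e9])
overhead = np.array([0.0074, 0.0070, 0.0038, 0.0020, 0.0016, 0.0009, 0.0007])

# Ordinary least squares on log-log axes.
x, y = np.log10(params), np.log10(overhead)
slope, intercept = np.polyfit(x, y, 1)
r_squared = np.corrcoef(x, y)[0, 1] ** 2
# Recovers slope ~ -0.349 and R² ~ 0.990, as reported above.
```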

Experiment 5: Hyperparameter Sensitivity

Auxiliary Loss Steering Strength Sensitivity

Combined Strategy Performance
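The combined strategy's objective can be sketched as the task loss plus two weighted penalties: the auxiliary drift term along the persona axis and a plain L2 activation regularizer. The function name and the default weights here are illustrative; the sensitivity analysis above is precisely about how performance varies as such strengths change.

```python
import numpy as np

def combined_loss(task_loss, activations, persona_axis,
                  aux_weight=0.1, reg_weight=0.01):
    """Combined 'Aux+Reg' objective: task loss + auxiliary persona
    penalty + L2 activation regularization."""
    axis = persona_axis / np.linalg.norm(persona_axis)
    aux = np.mean((activations @ axis) ** 2)  # drift along the persona axis
    reg = np.mean(activations ** 2)           # overall activation magnitude
    return task_loss + aux_weight * aux + reg_weight * reg
```

With zero activations both penalties vanish and the objective reduces to the task loss alone; sweeping `aux_weight` and `reg_weight` is the hyperparameter search this section summarizes.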