Practical methods for stabilizing language model personas: inference-time activation capping along the Assistant Axis, and training-time preventative steering.

| Method | Final Drift | Defense Score | ASR |
|---|---|---|---|
| No Steering | 0.952 | 0.8 | 0.2 |
| Auxiliary Loss | 0.727 | 1.0 | 0.0 |
| Activation Reg. | 0.945 | 0.8 | 0.2 |
| Contrastive Grad. | 0.953 | 0.8 | 0.2 |
| Combined (Aux+Reg) | 0.709 | 1.0 | 0.0 |

| Model Size | Overhead (%) | Capping Latency (µs) | Memory (KB) |
|---|---|---|---|
| 125M | 0.0074 | 0.13 | 36 |
| 350M | 0.0070 | 0.34 | 96 |
| 1.3B | 0.0038 | 0.67 | 192 |
| 6.7B | 0.0020 | 1.79 | 512 |
| 13B | 0.0016 | 2.80 | 800 |
| 70B | 0.0009 | 8.95 | 2,560 |
| 175B | 0.0007 | 16.11 | 4,608 |
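
For orientation, the capping step whose cost is tabulated above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation that was measured: it reads "capping" as clamping the projection of a hidden state onto a unit axis direction to a floor value, and the names `cap_along_axis`, `floor`, and `axis_memory_kb` are hypothetical. The memory helper assumes one float32 axis vector per layer, with hidden sizes and layer counts taken from common GPT-style configurations (for example 768 x 12 for 125M); none of this is stated in the source, although the assumption does reproduce the Memory column.

```python
import math


def cap_along_axis(hidden, axis, floor):
    """Clamp the projection of `hidden` onto `axis` so it is at least `floor`.

    Assumption: capping is a one-sided clamp along a single direction.
    Components orthogonal to the axis are left untouched.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(hidden, unit))
    # Negative when the projection is below the floor, zero otherwise.
    deficit = min(projection - floor, 0.0)
    return [h - deficit * u for h, u in zip(hidden, unit)]


def axis_memory_kb(hidden_size, num_layers, bytes_per_value=4):
    """Storage for one axis vector per layer (assumed float32), in KB."""
    return hidden_size * num_layers * bytes_per_value / 1024


# A state whose projection (1.0) is below the floor (3.0) is moved up to it.
capped = cap_along_axis([1.0, 2.0], [1.0, 0.0], floor=3.0)

# A state already above the floor is returned unchanged.
untouched = cap_along_axis([5.0, 2.0], [1.0, 0.0], floor=3.0)

# Assumed shapes: 768 x 12 (125M) and 12288 x 96 (175B).
small_kb = axis_memory_kb(768, 12)
large_kb = axis_memory_kb(12288, 96)
```

Under this reading the per-token work is one dot product and one scaled vector subtraction per capped layer, which would be consistent with the small relative overhead reported in the table.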

Log-linear scaling of overhead with model size (linear fit of log overhead against log parameter count): slope = -0.349, R-squared = 0.990, p < 10^-5
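
The reported slope and R-squared can be checked directly from the overhead table. The sketch below is an ordinary least-squares fit of log10(overhead %) against log10(parameter count), written with the standard library only. It assumes the model-size labels are nominal parameter counts (125M = 125e6, and so on); the function name `loglog_fit` is illustrative.

```python
import math

# (nominal parameter count, overhead in percent), copied from the table.
rows = [
    (125e6, 0.0074),
    (350e6, 0.0070),
    (1.3e9, 0.0038),
    (6.7e9, 0.0020),
    (13e9, 0.0016),
    (70e9, 0.0009),
    (175e9, 0.0007),
]


def loglog_fit(points):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, R^2."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = loglog_fit(rows)
print(f"slope = {slope:.3f}, R-squared = {r_squared:.3f}")
# slope = -0.349, R-squared = 0.990
```

A slope of about -0.35 in log-log space means relative overhead falls roughly as (parameter count)^-0.35, so each tenfold increase in model size cuts the overhead percentage by a bit more than half.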