A computational analysis of attack-defense dynamics in memorization, extraction, and safety measures across production language model configurations.

| Model | Size | Standard Rate | Jailbreak Rate | Defense Effectiveness | Memorization | Jailbreak Uplift |
|---|---|---|---|---|---|---|
| Model-A | 175B | 0.1326 | 0.3396 | 0.8265 | 0.780 | 0.207 |
| Model-B | 540B | 0.1434 | 0.3782 | 0.8454 | 0.946 | 0.235 |
| Model-C | 65B | 0.0780 | 0.1998 | 0.8307 | 0.432 | 0.122 |
| Model-D | 1000B | 0.1488 | 0.3826 | 0.8482 | 0.974 | 0.234 |
| Average | --- | 0.1257 | 0.3251 | 0.8377 | 0.783 | 0.199 |

| Configuration | Effectiveness | False-Positive Rate | Quality Loss | Jailbreak Vulnerability |
|---|---|---|---|---|
| No defense | 0.1577 | 0.069 | 0.000 | 0.100 |
| Output filter | 0.7069 | 0.120 | 0.000 | 0.100 |
| Activation cap | 0.3365 | 0.069 | 0.016 | 0.100 |
| RLHF alignment | 0.8110 | 0.069 | 0.000 | 0.456 |
| Refusal training | 0.7732 | 0.241 | 0.000 | 0.061 |
| Filter + RLHF | 0.9016 | 0.120 | 0.000 | 0.456 |
| Filter + refusal | 0.8885 | 0.283 | 0.000 | 0.061 |
| RLHF + refusal | 0.8709 | 0.241 | 0.000 | 0.279 |
| Full stack | 0.8427 | 0.283 | 0.016 | 0.279 |

| Model 1 | Model 2 | Rate 1 | Rate 2 | z-stat | p-value | Significant | Cohen's h |
|---|---|---|---|---|---|---|---|
| Model-A | Model-B | 0.1326 | 0.1434 | -0.700 | 0.484 | No | 0.031 |
| Model-A | Model-C | 0.1326 | 0.0780 | 3.978 | <0.001 | Yes | 0.179 |
| Model-A | Model-D | 0.1326 | 0.1488 | -1.042 | 0.298 | No | 0.047 |
| Model-B | Model-C | 0.1434 | 0.0780 | 4.661 | <0.001 | Yes | 0.211 |
| Model-B | Model-D | 0.1434 | 0.1488 | -0.342 | 0.732 | No | 0.015 |
| Model-C | Model-D | 0.0780 | 0.1488 | -4.993 | <0.001 | Yes | 0.226 |
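
The pairwise comparison table above can be recomputed with a pooled two-proportion z-test and Cohen's h. The sketch below is one plausible way to do that, not necessarily the procedure used to produce the table. The sample size `N_ASSUMED = 1000` trials per model and the significance threshold `alpha = 0.05` are assumptions, since neither is stated in the text; under those assumptions the results appear to agree with the reported z-statistics and effect sizes to about three decimals. The function names are illustrative.

```python
import itertools
import math


def two_proportion_z_test(p1, p2, n1, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def cohens_h(p1, p2):
    """Absolute Cohen's h: difference of arcsine-transformed proportions."""
    return abs(2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2)))


# Per-model rates copied from the comparison table (Rate 1 / Rate 2 columns).
RATES = {
    "Model-A": 0.1326,
    "Model-B": 0.1434,
    "Model-C": 0.0780,
    "Model-D": 0.1488,
}

N_ASSUMED = 1000  # ASSUMPTION: trials per model; not given in the source.


def compare_all(rates, n, alpha=0.05):
    """Return one result row per unordered model pair.

    alpha=0.05 is an ASSUMED threshold; the source does not state one.
    """
    rows = []
    for a, b in itertools.combinations(rates, 2):
        z, p = two_proportion_z_test(rates[a], rates[b], n, n)
        rows.append({
            "model_1": a,
            "model_2": b,
            "z": z,
            "p_value": p,
            "significant": p < alpha,
            "cohens_h": cohens_h(rates[a], rates[b]),
        })
    return rows


if __name__ == "__main__":
    for row in compare_all(RATES, N_ASSUMED):
        print(
            f"{row['model_1']} vs {row['model_2']}: "
            f"z={row['z']:.3f}, p={row['p_value']:.3f}, "
            f"significant={row['significant']}, h={row['cohens_h']:.3f}"
        )
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), every effect size in the table is small at most, so the statistically significant differences involving Model-C are modest in magnitude. With six pairwise tests, a multiple-comparison correction such as Bonferroni (alpha / 6) may also be worth applying; the three p-values reported as <0.001 would remain significant under it.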