Can an LLM's operation be captured by a non-transformer effective model? We address this open question with a systematic distillation competition across four student architectures -- Transformer, SSM, GRU, and TCN -- on five sequence tasks of varying complexity.
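The excerpt does not state the distillation objective, so the sketch below assumes a standard soft-target setup in which each student is trained to match the teacher's output distribution under a KL loss; `teacher`, `student`, `batch`, and the temperature are placeholders rather than the paper's protocol.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, temperature=1.0):
    """One distillation step: the student matches the teacher's soft targets.

    Hypothetical sketch -- the excerpt does not specify the objective;
    a KL(teacher || student) loss on output distributions is assumed.
    """
    tokens = batch["tokens"]                      # (batch, seq_len) token ids
    with torch.no_grad():
        t_logits = teacher(tokens)                # (batch, num_classes)
    s_logits = student(tokens)

    t_probs = F.softmax(t_logits / temperature, dim=-1)
    s_logp = F.log_softmax(s_logits / temperature, dim=-1)
    loss = F.kl_div(s_logp, t_probs, reduction="batchmean") * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```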
Transformer (baseline): single-layer, 2-head causal self-attention with residual connections and a 2-layer FFN; the baseline architecture from effective model theory.
SSM: diagonal state-transition matrix A with tanh stability, input/output projections B and C, a skip connection D, and selective gating.
GRU: single-layer Gated Recurrent Unit with update gate z, reset gate r, and a candidate hidden state for sequential processing.
TCN: two-layer dilated causal convolution with kernel size 3, dilation factors {1, 2}, ReLU activations, and residual connections.
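To make the most-discussed student concrete, the block below is a minimal PyTorch sketch of the diagonal SSM, assuming a single recurrent layer, a tanh-bounded diagonal transition, and a sigmoid selective gate on the input; the layer sizes, gating form, and readout from the final state are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Minimal diagonal state-space student (sketch, not the reference code).

    h_t = tanh(a) * h_{t-1} + g_t * (B x_t)   -- diagonal recurrence with gated input
    y_t = C h_t + D x_t                       -- output projection plus skip connection
    """

    def __init__(self, d_in, d_state, d_out):
        super().__init__()
        self.a_raw = nn.Parameter(torch.randn(d_state) * 0.1)  # tanh keeps |A| < 1
        self.B = nn.Linear(d_in, d_state, bias=False)
        self.C = nn.Linear(d_state, d_out, bias=False)
        self.D = nn.Linear(d_in, d_out, bias=False)            # skip connection
        self.gate = nn.Linear(d_in, d_state)                   # selective gating (assumed form)

    def forward(self, x):                     # x: (batch, seq_len, d_in)
        a = torch.tanh(self.a_raw)            # stable diagonal transition
        h = x.new_zeros(x.size(0), a.size(0))
        for t in range(x.size(1)):
            u = torch.sigmoid(self.gate(x[:, t])) * self.B(x[:, t])
            h = a * h + u
        return self.C(h) + self.D(x[:, -1])   # classify from final state plus skip
```

The GRU and TCN students can be written analogously with `nn.GRU` and stacked causal `nn.Conv1d` layers.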
| Metric | Notation | Description | Ideal |
|---|---|---|---|
| Behavioral Consistency | BC | Fraction of inputs where teacher and student agree on argmax prediction | 1.0 |
| KL Divergence | D_KL | Mean KL divergence D_KL(p_teacher ‖ p_student) between teacher and student output distributions | 0.0 |
| Total Variation | TV | Mean total variation distance, i.e., half the L1 distance between teacher and student distributions | 0.0 |
| Calibration Error | ECE | Expected calibration error with 10 confidence bins | 0.0 |
| Error Correlation | r_err | Pearson correlation of teacher and student error indicators | 1.0 |
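For reference, these metrics can be computed from paired teacher/student probability arrays as sketched below; the NumPy implementation, the equal-width ECE bins, and the choice to score ECE and error indicators against ground-truth labels (rather than teacher predictions) are assumptions.

```python
import numpy as np

def fidelity_metrics(p_t, p_s, labels, n_bins=10, eps=1e-12):
    """Teacher/student agreement metrics (sketch).

    p_t, p_s : (N, C) teacher / student class probabilities
    labels   : (N,) ground-truth class indices
    """
    yhat_t, yhat_s = p_t.argmax(1), p_s.argmax(1)

    bc = np.mean(yhat_t == yhat_s)                                # behavioral consistency
    kl = np.mean(np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=1))
    tv = np.mean(0.5 * np.abs(p_t - p_s).sum(axis=1))             # total variation

    # ECE of the student with 10 equal-width confidence bins
    # (target of calibration -- labels vs. teacher -- is assumed here).
    conf = p_s.max(1)
    correct = (yhat_s == labels).astype(float)
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())

    # Pearson correlation of teacher/student error indicators.
    err_t = (yhat_t != labels).astype(float)
    err_s = (yhat_s != labels).astype(float)
    r_err = np.corrcoef(err_t, err_s)[0, 1]

    return {"BC": bc, "KL": kl, "TV": tv, "ECE": ece, "r_err": r_err}
```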
| Architecture | BC (d=4) | BC (d=8) | BC (d=16) | BC (d=32) | BC (d=64) | Params (d=16) |
|---|---|---|---|---|---|---|
| Transformer | 0.4219 | 0.4844 | 0.3906 | 0.4844 | 0.5625 | 3,200 |
| SSM | 0.5000 | 0.5000 | 0.5156 | 0.7500 | 0.7500 | 928 |
| GRU | 0.4688 | 0.4844 | 0.5781 | 0.5625 | 0.5938 | 1,664 |
| TCN | 0.3594 | 0.3281 | 0.2656 | 0.3438 | 0.4062 | 1,664 |
| Task | B Frob. Ratio | C Frob. Ratio | B Rank-1 Var. | C Rank-1 Var. | Spectral Radius | B Eff. Dim. | C Eff. Dim. |
|---|---|---|---|---|---|---|---|
| Copy-Last | 0.9009 | 0.8820 | 0.1884 | 0.2221 | 0.1920 | 11.92 | 11.34 |
| Copy-First | 0.8860 | 0.8742 | 0.2151 | 0.2358 | 0.2138 | 11.11 | 11.52 |
| Majority | 0.8939 | 0.8940 | 0.2010 | 0.2008 | 0.2642 | 11.21 | 11.60 |
| Reverse-Sum | 0.9041 | 0.9162 | 0.1825 | 0.1605 | 0.3529 | 11.84 | 12.36 |
| Pattern-Detect | 0.8797 | 0.8851 | 0.2261 | 0.2166 | 0.2125 | 11.55 | 11.38 |
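The excerpt does not define these structural statistics, so the sketch below shows one conventional reading of three of them: rank-1 variance as the top singular value's share of the squared singular values, effective dimension as the participation ratio of the squared spectrum, and spectral radius as the largest |tanh(a_i)| of the diagonal transition. The Frobenius perturbation ratio is omitted because its definition is not recoverable from the text.

```python
import numpy as np

def ssm_structure_stats(W, a_raw):
    """Structural statistics for an SSM projection matrix W (e.g., B or C)
    and raw diagonal transition parameters a_raw.

    Definitions are assumptions (the excerpt does not define them):
      rank-1 variance : sigma_1^2 / sum_i sigma_i^2
      effective dim   : (sum_i sigma_i^2)^2 / sum_i sigma_i^4   (participation ratio)
      spectral radius : max_i |tanh(a_raw_i)| of the diagonal transition matrix A
    """
    s = np.linalg.svd(W, compute_uv=False)      # singular values, descending
    s2 = s ** 2
    rank1_var = s2[0] / s2.sum()
    eff_dim = s2.sum() ** 2 / (s2 ** 2).sum()
    spectral_radius = np.abs(np.tanh(a_raw)).max()
    return {"rank1_var": rank1_var, "eff_dim": eff_dim,
            "spectral_radius": spectral_radius}
```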
| Task | H(Y) (nats) | H(Y\|X) (nats) | I(X;Y) (nats) | Eff. Classes | Best Non-Transformer BC | Best NT Architecture | Transformer BC |
|---|---|---|---|---|---|---|---|
| Copy-Last | 1.3863 | 0.2519 | 1.1344 | 4.00 | 0.8125 | SSM | 0.4844 |
| Copy-First | 1.3863 | 0.2505 | 1.1358 | 4.00 | 0.3906 | GRU | 0.3438 |
| Majority | 1.3009 | 0.2493 | 1.0516 | 3.67 | 0.5000 | SSM | 0.5312 |
| Reverse-Sum | 1.3863 | 0.2515 | 1.1348 | 4.00 | 0.2969 | GRU/TCN | 0.3125 |
| Pattern-Detect | 0.8318 | 0.2513 | 0.5805 | 2.30 | 0.5625 | TCN | 0.5625 |
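These quantities follow from plug-in entropy estimates in nats (consistent with the 1.13 nats quoted below), with effective classes defined as exp(H(Y)); the estimator in the sketch, which conditions on exact input identity, is an assumption.

```python
import numpy as np
from collections import Counter

def entropy_nats(labels):
    """Plug-in entropy H(Y) in nats from an iterable of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def task_information(xs, ys):
    """H(Y), H(Y|X), I(X;Y) and effective class count for a labeled task.

    xs : hashable input keys (e.g., token tuples); ys : class labels.
    Plug-in estimates -- the excerpt does not specify its estimator.
    """
    h_y = entropy_nats(ys)

    # H(Y|X) = sum_x p(x) * H(Y | X = x)
    by_x = {}
    for x, y in zip(xs, ys):
        by_x.setdefault(x, []).append(y)
    n = len(ys)
    h_y_given_x = sum(len(g) / n * entropy_nats(g) for g in by_x.values())

    mi = h_y - h_y_given_x
    eff_classes = float(np.exp(h_y))          # matches the "Eff. Classes" column
    return {"H(Y)": h_y, "H(Y|X)": h_y_given_x,
            "I(X;Y)": mi, "Eff. Classes": eff_classes}
```

For a balanced 4-class task this yields H(Y) = ln 4 ≈ 1.3863 nats and 4.00 effective classes, matching the Copy-Last row.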
The SSM achieves 0.8125 behavioral consistency on Copy-Last with only 928 parameters -- less than one-third of the Transformer baseline's 3,200 parameters -- while the Transformer reaches only 0.4844 BC. Non-transformer architectures are viable for memory-access tasks.
All architectures struggle on Reverse-Sum (best BC: 0.3125). Tasks requiring arithmetic composition may need attention-like mechanisms that none of the small effective models fully provide.
SSM Frobenius perturbation ratios of 0.88--0.92 show a distributed, not low-rank, structure. Effective dimensions of 11.1--12.4 (out of 16) indicate near-uniform utilization, suggesting a different perturbation theory than the transformer case.
Copy-Last and Reverse-Sum have near-identical mutual information (1.13 nats) but very different architecture rankings. Task nature -- memory access vs. arithmetic composition -- is the key determinant, not just information-theoretic complexity.
SSM spectral radius increases from 0.192 (Copy-Last) to 0.353 (Reverse-Sum), providing a natural complexity measure: harder tasks require longer recurrent memory reflected in larger spectral radii.
TCN matches the Transformer at 0.5625 BC on Pattern-Detect, consistent with the task's local pattern structure aligning with convolutional receptive fields: this is a case where architectural inductive bias matches task structure.