Research Track · LG

Non-Transformer Effective Seq2Seq Models for Capturing LLM Operation

Can an LLM's operation be captured by a non-transformer effective model? We address this open question through a systematic multi-architecture distillation competition across Transformer, SSM, GRU, and TCN architectures on five sequence tasks of varying complexity.

4 Architectures · 5 Sequence Tasks · 0.8125 Best SSM Consistency · 928 SSM Parameters
Extending Effective Model Theory Beyond Transformers
Raju et al. (2026) model an LLM's operation on a fixed prompt as a small effective transformer with perturbed parameters, but leave open whether non-transformer architectures can capture LLM behavior equally well. We address this question through systematic experimentation.
Simulated LLM Teacher Setup
Vocabulary size V = 4, sequence length T = 3, V^T = 64 distinct inputs, noise level epsilon = 0.05
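As a concrete reading of this setup, the sketch below enumerates all V^T = 64 inputs and builds a noisy teacher distribution that places mass 1 - epsilon on a task-defined target token and spreads the remaining epsilon uniformly over the vocabulary; the exact noise model of the simulated teacher is an assumption on our part.

```python
import itertools
import numpy as np

V, T, EPS = 4, 3, 0.05  # vocabulary size, sequence length, noise level

def all_inputs():
    """Enumerate all V**T = 64 distinct input sequences."""
    return [np.array(x) for x in itertools.product(range(V), repeat=T)]

def teacher_distribution(target_token):
    """Hypothetical noisy teacher: mass 1 - EPS on the task's target token,
    the remaining EPS spread uniformly over the V tokens."""
    p = np.full(V, EPS / V)
    p[target_token] += 1.0 - EPS
    return p

inputs = all_inputs()
assert len(inputs) == 64
```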
Copy-Last (Memory Access)
Copy-First (Memory Retention)
Majority (Aggregation)
Reverse-Sum (Composition)
Pattern-Detect (Local Pattern)
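The target functions below are inferred from the task names and categories; the Reverse-Sum and Pattern-Detect rules in particular are assumptions, and the paper's exact specifications may differ.

```python
import numpy as np

V = 4  # vocabulary size

# Hypothetical target functions mapping an input sequence x to a single token.
TASKS = {
    "copy_last":      lambda x: int(x[-1]),                                 # memory access
    "copy_first":     lambda x: int(x[0]),                                  # memory retention
    "majority":       lambda x: int(np.bincount(x, minlength=V).argmax()),  # aggregation
    "reverse_sum":    lambda x: int(np.dot(np.arange(1, len(x) + 1), x[::-1])) % V,  # composition
    "pattern_detect": lambda x: int(x[0] == x[1]),                          # local pattern
}
```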
Candidate Architecture Families
Four architecture families compete as candidate effective models, each with distinct inductive biases for sequence processing.
Transformer (3,200 parameters): Single-layer, 2-head causal self-attention with residual connections and a 2-layer FFN. The baseline architecture from effective model theory.

SSM, Mamba-style (928 parameters): Diagonal state-transition matrix A with tanh stability, input/output projections B and C, a skip connection D, and selective gating. A minimal forward-pass sketch appears after this list.

GRU (1,664 parameters): Single-layer gated recurrent unit with update gate z, reset gate r, and a candidate hidden state for sequential processing.

TCN (1,664 parameters): Two-layer dilated causal convolution with kernel size 3, dilation factors {1, 2}, ReLU activations, and residual connections.
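The sketch below is one plausible NumPy reading of the Mamba-style student: a diagonal recurrence stabilized through tanh, input/output projections B and C, a skip connection D, and an input-dependent (selective) gate. The shapes are assumptions; at d = 16 and V = 4 they happen to total 928 parameters, matching the reported count, but the paper's exact parameterization may differ.

```python
import numpy as np

class TinySSM:
    """Minimal sketch of a Mamba-style effective model; shapes are assumed."""

    def __init__(self, vocab=4, d=16, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(scale=0.1, size=(vocab, d))    # token embedding
        self.a = rng.normal(scale=0.1, size=d)             # diagonal of A (pre-tanh)
        self.B = rng.normal(scale=0.1, size=(d, d))        # input projection
        self.C = rng.normal(scale=0.1, size=(d, d))        # output projection
        self.D = rng.normal(scale=0.1, size=d)             # skip connection
        self.Wg = rng.normal(scale=0.1, size=(d, d))       # selective gate
        self.Wo = rng.normal(scale=0.1, size=(d, vocab))   # readout

    def forward(self, tokens):
        A = np.tanh(self.a)                    # tanh keeps each |A_ii| < 1 (stability)
        h = np.zeros_like(self.a)
        for t in tokens:
            u = self.E[t]
            g = 1.0 / (1.0 + np.exp(-(u @ self.Wg)))   # input-dependent gate
            h = A * h + g * (u @ self.B)               # diagonal state recurrence
            y = h @ self.C + self.D * u                # output projection + skip
        logits = y @ self.Wo
        p = np.exp(logits - logits.max())
        return p / p.sum()                             # next-token distribution
```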

Agreement Metrics
Five metrics measure multi-dimensional agreement between teacher and student over all 64 inputs
Metric | Notation | Description | Ideal
Behavioral Consistency | BC | Fraction of inputs where teacher and student agree on argmax prediction | 1.0
KL Divergence | D_KL | Mean KL divergence of teacher from student distributions | 0.0
Total Variation | TV | Mean half L1-norm between teacher and student distributions | 0.0
Calibration Error | ECE | Expected calibration error with 10 confidence bins | 0.0
Error Correlation | r_err | Pearson correlation of teacher and student error indicators | 1.0
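A sketch of how these five metrics could be computed from the teacher's and student's full distribution tables is shown below; the ECE reference (student confidence vs. agreement with the teacher's argmax) and the error indicators (disagreement with the task label) are assumed readings of details the table leaves open.

```python
import numpy as np

def agreement_metrics(p_teacher, p_student, targets, n_bins=10):
    """Sketch of the five agreement metrics over the 64-input table.
    p_teacher, p_student: (N, V) probability arrays; targets: (N,) task labels,
    used only for the error indicators (an assumed reading of r_err)."""
    t_pred, s_pred = p_teacher.argmax(1), p_student.argmax(1)
    pt = np.clip(p_teacher, 1e-12, None)
    ps = np.clip(p_student, 1e-12, None)

    bc = float(np.mean(t_pred == s_pred))                            # behavioral consistency
    kl = float(np.mean(np.sum(pt * np.log(pt / ps), axis=1)))        # KL(teacher || student)
    tv = float(np.mean(0.5 * np.abs(p_teacher - p_student).sum(1)))  # total variation

    # ECE with 10 confidence bins: student confidence vs. agreement with the
    # teacher's argmax (the choice of reference here is an assumption).
    conf, hit = p_student.max(1), (s_pred == t_pred).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = sum(np.abs(hit[bins == b].mean() - conf[bins == b].mean()) * np.mean(bins == b)
              for b in range(n_bins) if np.any(bins == b))

    # Pearson correlation of teacher/student error indicators w.r.t. task labels.
    t_err, s_err = (t_pred != targets).astype(float), (s_pred != targets).astype(float)
    r_err = float(np.corrcoef(t_err, s_err)[0, 1])

    return {"BC": bc, "KL": kl, "TV": tv, "ECE": ece, "r_err": r_err}
```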
Multi-Architecture Distillation Competition
Behavioral consistency and distributional agreement across architectures on five tasks.
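As a rough sketch of the protocol behind these results (assuming gradient-based distillation that minimizes KL(teacher || student) over the full 64-input table; the optimizer, learning rate, and epoch count are illustrative assumptions), one distillation run might look like this:

```python
import itertools
import torch
import torch.nn.functional as F

def distill(student, teacher_probs, V=4, T=3, epochs=2000, lr=1e-2):
    """Fit a student to the teacher's full (64, V) output table by minimizing
    KL(teacher || student). `student` is any torch.nn.Module mapping a (N, T)
    LongTensor of tokens to (N, V) logits."""
    inputs = torch.tensor(list(itertools.product(range(V), repeat=T)))
    p = torch.as_tensor(teacher_probs, dtype=torch.float32)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        log_q = F.log_softmax(student(inputs), dim=-1)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i), averaged over inputs
        loss = (p * (p.clamp_min(1e-12).log() - log_q)).sum(-1).mean()
        loss.backward()
        opt.step()
    return student
```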
Behavioral Consistency by Task
Higher is better. SSM leads on Copy-Last (0.8125) with only 928 parameters.
KL Divergence by Task
Lower is better. SSM and GRU consistently outperform Transformer.
Full Distillation Results
Scaling Analysis
How does behavioral consistency scale with model size? SSMs show the steepest scaling on Copy-Last.
Scaling on Copy-Last Task
Behavioral consistency vs. hidden dimension d in {4, 8, 16, 32, 64}
Parameter Efficiency
Behavioral consistency vs. number of parameters (log scale)
Scaling Data: Copy-Last Task
Architecture | d=4 | d=8 | d=16 | d=32 | d=64 | Params (d=16)
Transformer | 0.4219 | 0.4844 | 0.3906 | 0.4844 | 0.5625 | 3,200
SSM | 0.5000 | 0.5000 | 0.5156 | 0.7500 | 0.7500 | 928
GRU | 0.4688 | 0.4844 | 0.5781 | 0.5625 | 0.5938 | 1,664
TCN | 0.3594 | 0.3281 | 0.2656 | 0.3438 | 0.4062 | 1,664
SSM Perturbation Structure
SSM parameters decompose as M = M_ideal + Delta_M. Frobenius ratios of 0.88--0.92 indicate distributed, not low-rank, structure.
Frobenius Perturbation Ratios
|Delta M|_F / |M|_F for input (B) and output (C) projections
Spectral Radius vs. Behavioral Consistency
A higher spectral radius reflects the need for longer recurrent memory
Full Perturbation Analysis
SSM parameter decomposition across all five tasks
Task | B Frob. Ratio | C Frob. Ratio | B Rank-1 Var. | C Rank-1 Var. | Spectral Radius | B Eff. Dim. | C Eff. Dim.
Copy-Last | 0.9009 | 0.8820 | 0.1884 | 0.2221 | 0.1920 | 11.92 | 11.34
Copy-First | 0.8860 | 0.8742 | 0.2151 | 0.2358 | 0.2138 | 11.11 | 11.52
Majority | 0.8939 | 0.8940 | 0.2010 | 0.2008 | 0.2642 | 11.21 | 11.60
Reverse-Sum | 0.9041 | 0.9162 | 0.1825 | 0.1605 | 0.3529 | 11.84 | 12.36
Pattern-Detect | 0.8797 | 0.8851 | 0.2261 | 0.2166 | 0.2125 | 11.55 | 11.38
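The sketch below shows one way the table's quantities could be computed for a single projection matrix M (B or C), its idealized counterpart M_ideal, and the diagonal of A. The rank-1 variance (top singular value's share of the squared spectrum), the effective dimension (participation ratio of singular values), and taking the spectral radius from tanh of the diagonal are plausible readings of the column names, not confirmed definitions.

```python
import numpy as np

def ssm_perturbation_stats(M, M_ideal, a_diag):
    """Sketch of the perturbation quantities for one SSM projection matrix."""
    dM = M - M_ideal
    frob_ratio = np.linalg.norm(dM) / np.linalg.norm(M)        # |Delta M|_F / |M|_F

    s = np.linalg.svd(dM, compute_uv=False)                    # singular values of Delta M
    rank1_var = float(s[0] ** 2 / np.sum(s ** 2))              # variance captured by rank-1
    eff_dim = float(np.sum(s) ** 2 / np.sum(s ** 2))           # participation ratio

    spectral_radius = float(np.max(np.abs(np.tanh(a_diag))))   # of the diagonal A
    return frob_ratio, rank1_var, eff_dim, spectral_radius
```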
Task Complexity Taxonomy
Information-theoretic measures reveal why some tasks are harder. Mutual information I(X;Y) ranges from 0.58 to 1.14 nats.
Information-Theoretic Task Profile
Output entropy H(Y), conditional entropy H(Y|X), and mutual information I(X;Y)
Mutual Information vs. Best Non-Transformer BC
Task complexity alone does not predict architecture success
Task Complexity and Architecture Suitability
Task | H(Y) | H(Y|X) | I(X;Y) | Eff. Classes | Best NT BC | Best NT Arch | Transformer BC
Copy-Last | 1.3863 | 0.2519 | 1.1344 | 4.00 | 0.8125 | SSM | 0.4844
Copy-First | 1.3863 | 0.2505 | 1.1358 | 4.00 | 0.3906 | GRU | 0.3438
Majority | 1.3009 | 0.2493 | 1.0516 | 3.67 | 0.5000 | SSM | 0.5312
Reverse-Sum | 1.3863 | 0.2515 | 1.1348 | 4.00 | 0.2969 | GRU/TCN | 0.3125
Pattern-Detect | 0.8318 | 0.2513 | 0.5805 | 2.30 | 0.5625 | TCN | 0.5625
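Under the teacher model sketched earlier (uniform inputs, mass 1 - epsilon on the task target), the information-theoretic profile can be computed as below; because that noise model is an assumption, the numbers need not match the table's H(Y|X) column exactly.

```python
import itertools
import numpy as np

def task_information_profile(task_fn, V=4, T=3, eps=0.05):
    """Sketch: H(Y), H(Y|X), and I(X;Y) in nats for one task, assuming uniform
    inputs and the noisy-teacher model sketched earlier."""
    H = lambda p: float(-np.sum(p[p > 0] * np.log(p[p > 0])))  # entropy in nats

    inputs = [np.array(x) for x in itertools.product(range(V), repeat=T)]
    p_y = np.zeros(V)
    for x in inputs:
        p = np.full(V, eps / V)
        p[task_fn(x)] += 1.0 - eps           # per-input teacher distribution p(Y|X=x)
        p_y += p / len(inputs)               # marginal p(Y) under uniform inputs

    h_y_given_x = H(np.array([1 - eps + eps / V] + [eps / V] * (V - 1)))  # same for every x
    return H(p_y), h_y_given_x, H(p_y) - h_y_given_x   # H(Y), H(Y|X), I(X;Y)
```

For Copy-Last, p(Y) is uniform over the four tokens, so H(Y) = ln 4 ≈ 1.3863 nats, matching the table's first column.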
Architecture Efficiency on Reverse-Sum
Fine-grained scaling on the hardest compositional task. All architectures plateau around 0.25--0.33 BC regardless of model size.
Behavioral Consistency vs. Hidden Dimension
Reverse-Sum: d in {4, 8, 12, 16, 24, 32, 48, 64}
KL Divergence vs. Parameters
Reverse-Sum: Transformer KL decreases but stays above 1.4
Conclusions and Implications
1. SSMs as Viable Effective Models
SSMs achieve 0.8125 behavioral consistency on Copy-Last with only 928 parameters -- less than one-third of the Transformer baseline's 3,200 parameters, which reaches only 0.4844 BC. Non-transformer architectures are viable effective models for memory-access tasks.

2. Compositional Tasks Remain Hard
All architectures struggle on Reverse-Sum (best BC: 0.3125). Compositional reasoning requiring arithmetic composition may need attention-like mechanisms that none of the small effective models fully provide.

3. Distributed Perturbation Structure
SSM Frobenius perturbation ratios of 0.88--0.92 indicate a distributed rather than low-rank structure. Effective dimensions of 11.1--12.4 (out of 16) point to near-uniform utilization, suggesting a perturbation theory different from the transformer case.

4. Task Structure Determines Architecture Choice
Copy-Last and Reverse-Sum have near-identical mutual information (1.13 nats) yet very different architecture rankings. The nature of the task -- memory access vs. arithmetic composition -- is the key determinant, not information-theoretic complexity alone.

5. Spectral Radius as a Complexity Measure
The SSM spectral radius increases from 0.192 (Copy-Last) to 0.353 (Reverse-Sum), providing a natural complexity measure: harder tasks require longer recurrent memory, reflected in larger spectral radii.

6. TCN Excels on Local Patterns
TCN matches the Transformer at 0.5625 BC on Pattern-Detect, consistent with the task's local pattern structure aligning with convolutional receptive fields: architectural inductive bias should match task structure.