Research Track · LG

Non-Transformer Effective Seq2Seq Models for Capturing LLM Operation

Can an LLM's operation be captured by a non-transformer effective model? We address this open question through a systematic multi-architecture distillation competition across Transformer, SSM, GRU, and TCN architectures on five sequence tasks of varying complexity.

4 Architectures · 5 Sequence Tasks · 0.8125 Best SSM Consistency · 928 SSM Parameters
Extending Effective Model Theory Beyond Transformers
Raju et al. (2026) model an LLM's operation on a fixed prompt as a small effective transformer with perturbed parameters, but leave open whether non-transformer architectures can capture LLM behavior equally well. We address this question through systematic experimentation.
Simulated LLM Teacher Setup
Vocabulary size V = 4, sequence length T = 3, V^T = 64 distinct inputs, noise level epsilon = 0.05
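As a concrete reading of this setup, the sketch below enumerates all V^T = 64 inputs and builds a noisy teacher distribution that places mass 1 - epsilon on a task-defined target token and spreads the remaining epsilon uniformly over the vocabulary; the exact noise model of the simulated teacher is an assumption on our part.

```python
import itertools
import numpy as np

V, T, EPS = 4, 3, 0.05  # vocabulary size, sequence length, noise level

def all_inputs():
    """Enumerate all V**T = 64 distinct input sequences."""
    return [np.array(x) for x in itertools.product(range(V), repeat=T)]

def teacher_distribution(target_token):
    """Hypothetical noisy teacher: mass 1 - EPS on the task's target token,
    the remaining EPS spread uniformly over the V tokens."""
    p = np.full(V, EPS / V)
    p[target_token] += 1.0 - EPS
    return p

inputs = all_inputs()
assert len(inputs) == 64
```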
Copy-Last (Memory Access)
Copy-First (Memory Retention)
Majority (Aggregation)
Reverse-Sum (Composition)
Pattern-Detect (Local Pattern)
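The target functions below are inferred from the task names and categories; the Reverse-Sum and Pattern-Detect rules in particular are assumptions, and the paper's exact specifications may differ.

```python
import numpy as np

V = 4  # vocabulary size

# Hypothetical target functions mapping an input sequence x to a single token.
TASKS = {
    "copy_last":      lambda x: int(x[-1]),                                 # memory access
    "copy_first":     lambda x: int(x[0]),                                  # memory retention
    "majority":       lambda x: int(np.bincount(x, minlength=V).argmax()),  # aggregation
    "reverse_sum":    lambda x: int(np.dot(np.arange(1, len(x) + 1), x[::-1])) % V,  # composition
    "pattern_detect": lambda x: int(x[0] == x[1]),                          # local pattern
}
```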
Candidate Architecture Families
Four architecture families compete as candidate effective models, each with distinct inductive biases for sequence processing.
Transformer (3,200 parameters): Single-layer, 2-head causal self-attention with residual connections and a 2-layer FFN. The baseline architecture from effective model theory.

SSM, Mamba-style (928 parameters): Diagonal state-transition matrix A with tanh stability, input/output projections B and C, a skip connection D, and selective gating. A minimal forward-pass sketch appears after this list.

GRU (1,664 parameters): Single-layer gated recurrent unit with update gate z, reset gate r, and a candidate hidden state for sequential processing.

TCN (1,664 parameters): Two-layer dilated causal convolution with kernel size 3, dilation factors {1, 2}, ReLU activations, and residual connections.
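The sketch below is one plausible NumPy reading of the Mamba-style student: a diagonal recurrence stabilized through tanh, input/output projections B and C, a skip connection D, and an input-dependent (selective) gate. The shapes are assumptions; at d = 16 and V = 4 they happen to total 928 parameters, matching the reported count, but the paper's exact parameterization may differ.

```python
import numpy as np

class TinySSM:
    """Minimal sketch of a Mamba-style effective model; shapes are assumed."""

    def __init__(self, vocab=4, d=16, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(scale=0.1, size=(vocab, d))    # token embedding
        self.a = rng.normal(scale=0.1, size=d)             # diagonal of A (pre-tanh)
        self.B = rng.normal(scale=0.1, size=(d, d))        # input projection
        self.C = rng.normal(scale=0.1, size=(d, d))        # output projection
        self.D = rng.normal(scale=0.1, size=d)             # skip connection
        self.Wg = rng.normal(scale=0.1, size=(d, d))       # selective gate
        self.Wo = rng.normal(scale=0.1, size=(d, vocab))   # readout

    def forward(self, tokens):
        A = np.tanh(self.a)                    # tanh keeps each |A_ii| < 1 (stability)
        h = np.zeros_like(self.a)
        for t in tokens:
            u = self.E[t]
            g = 1.0 / (1.0 + np.exp(-(u @ self.Wg)))   # input-dependent gate
            h = A * h + g * (u @ self.B)               # diagonal state recurrence
            y = h @ self.C + self.D * u                # output projection + skip
        logits = y @ self.Wo
        p = np.exp(logits - logits.max())
        return p / p.sum()                             # next-token distribution
```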

Agreement Metrics
Five metrics measure multi-dimensional agreement between teacher and student over all 64 inputs
Metric | Notation | Description | Ideal
Behavioral Consistency | BC | Fraction of inputs where teacher and student agree on argmax prediction | 1.0
KL Divergence | D_KL | Mean KL divergence of teacher from student distributions | 0.0
Total Variation | TV | Mean half L1-norm between teacher and student distributions | 0.0
Calibration Error | ECE | Expected calibration error with 10 confidence bins | 0.0
Error Correlation | r_err | Pearson correlation of teacher and student error indicators | 1.0
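A sketch of how these five metrics could be computed from the teacher's and student's full distribution tables is shown below; the ECE reference (student confidence vs. agreement with the teacher's argmax) and the error indicators (disagreement with the task label) are assumed readings of details the table leaves open.

```python
import numpy as np

def agreement_metrics(p_teacher, p_student, targets, n_bins=10):
    """Sketch of the five agreement metrics over the 64-input table.
    p_teacher, p_student: (N, V) probability arrays; targets: (N,) task labels,
    used only for the error indicators (an assumed reading of r_err)."""
    t_pred, s_pred = p_teacher.argmax(1), p_student.argmax(1)
    pt = np.clip(p_teacher, 1e-12, None)
    ps = np.clip(p_student, 1e-12, None)

    bc = float(np.mean(t_pred == s_pred))                            # behavioral consistency
    kl = float(np.mean(np.sum(pt * np.log(pt / ps), axis=1)))        # KL(teacher || student)
    tv = float(np.mean(0.5 * np.abs(p_teacher - p_student).sum(1)))  # total variation

    # ECE with 10 confidence bins: student confidence vs. agreement with the
    # teacher's argmax (the choice of reference here is an assumption).
    conf, hit = p_student.max(1), (s_pred == t_pred).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = sum(np.abs(hit[bins == b].mean() - conf[bins == b].mean()) * np.mean(bins == b)
              for b in range(n_bins) if np.any(bins == b))

    # Pearson correlation of teacher/student error indicators w.r.t. task labels.
    t_err, s_err = (t_pred != targets).astype(float), (s_pred != targets).astype(float)
    r_err = float(np.corrcoef(t_err, s_err)[0, 1])

    return {"BC": bc, "KL": kl, "TV": tv, "ECE": ece, "r_err": r_err}
```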
Multi-Architecture Distillation Competition
Behavioral consistency and distributional agreement across architectures on five tasks.
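As a rough sketch of the protocol behind these results (assuming gradient-based distillation that minimizes KL(teacher || student) over the full 64-input table; the optimizer, learning rate, and epoch count are illustrative assumptions), one distillation run might look like this:

```python
import itertools
import torch
import torch.nn.functional as F

def distill(student, teacher_probs, V=4, T=3, epochs=2000, lr=1e-2):
    """Fit a student to the teacher's full (64, V) output table by minimizing
    KL(teacher || student). `student` is any torch.nn.Module mapping a (N, T)
    LongTensor of tokens to (N, V) logits."""
    inputs = torch.tensor(list(itertools.product(range(V), repeat=T)))
    p = torch.as_tensor(teacher_probs, dtype=torch.float32)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        log_q = F.log_softmax(student(inputs), dim=-1)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i), averaged over inputs
        loss = (p * (p.clamp_min(1e-12).log() - log_q)).sum(-1).mean()
        loss.backward()
        opt.step()
    return student
```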
Behavioral Consistency by Task
Higher is better. SSM leads on Copy-Last (0.8125) with only 928 parameters.
KL Divergence by Task
Lower is better. SSM and GRU consistently outperform Transformer.
Full Distillation Results
Scaling Analysis
How does behavioral consistency scale with model size? SSMs show the steepest scaling on Copy-Last.
Scaling on Copy-Last Task
Behavioral consistency vs. hidden dimension d in {4, 8, 16, 32, 64}
Parameter Efficiency
Behavioral consistency vs. number of parameters (log scale)
Scaling Data: Copy-Last Task
Architecture | d=4 | d=8 | d=16 | d=32 | d=64 | Params (d=16)
Transformer | 0.4219 | 0.4844 | 0.3906 | 0.4844 | 0.5625 | 3,200
SSM | 0.5000 | 0.5000 | 0.5156 | 0.7500 | 0.7500 | 928
GRU | 0.4688 | 0.4844 | 0.5781 | 0.5625 | 0.5938 | 1,664
TCN | 0.3594 | 0.3281 | 0.2656 | 0.3438 | 0.4062 | 1,664
SSM Perturbation Structure
SSM parameters decompose as M = M_ideal + Delta_M. Frobenius ratios of 0.88--0.92 indicate distributed, not low-rank, structure.
Frobenius Perturbation Ratios
|Delta M|_F / |M|_F for input (B) and output (C) projections
Spectral Radius vs. Behavioral Consistency
A higher spectral radius reflects the need for longer recurrent memory
Full Perturbation Analysis
SSM parameter decomposition across all five tasks
Task | B Frob. Ratio | C Frob. Ratio | B Rank-1 Var. | C Rank-1 Var. | Spectral Radius | B Eff. Dim. | C Eff. Dim.
Copy-Last | 0.9009 | 0.8820 | 0.1884 | 0.2221 | 0.1920 | 11.92 | 11.34
Copy-First | 0.8860 | 0.8742 | 0.2151 | 0.2358 | 0.2138 | 11.11 | 11.52
Majority | 0.8939 | 0.8940 | 0.2010 | 0.2008 | 0.2642 | 11.21 | 11.60
Reverse-Sum | 0.9041 | 0.9162 | 0.1825 | 0.1605 | 0.3529 | 11.84 | 12.36
Pattern-Detect | 0.8797 | 0.8851 | 0.2261 | 0.2166 | 0.2125 | 11.55 | 11.38
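The sketch below shows one way the table's quantities could be computed for a single projection matrix M (B or C), its idealized counterpart M_ideal, and the diagonal of A. The rank-1 variance (top singular value's share of the squared spectrum), the effective dimension (participation ratio of singular values), and taking the spectral radius from tanh of the diagonal are plausible readings of the column names, not confirmed definitions.

```python
import numpy as np

def ssm_perturbation_stats(M, M_ideal, a_diag):
    """Sketch of the perturbation quantities for one SSM projection matrix."""
    dM = M - M_ideal
    frob_ratio = np.linalg.norm(dM) / np.linalg.norm(M)        # |Delta M|_F / |M|_F

    s = np.linalg.svd(dM, compute_uv=False)                    # singular values of Delta M
    rank1_var = float(s[0] ** 2 / np.sum(s ** 2))              # variance captured by rank-1
    eff_dim = float(np.sum(s) ** 2 / np.sum(s ** 2))           # participation ratio

    spectral_radius = float(np.max(np.abs(np.tanh(a_diag))))   # of the diagonal A
    return frob_ratio, rank1_var, eff_dim, spectral_radius
```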
Task Complexity Taxonomy
Information-theoretic measures reveal why some tasks are harder. Mutual information I(X;Y) ranges from 0.58 to 1.14 nats.
Information-Theoretic Task Profile
Output entropy H(Y), conditional entropy H(Y|X), and mutual information I(X;Y)
Mutual Information vs. Best Non-Transformer BC
Task complexity alone does not predict architecture success
Task Complexity and Architecture Suitability
Task | H(Y) | H(Y|X) | I(X;Y) | Eff. Classes | Best NT BC | Best NT Arch | Transformer BC
Copy-Last | 1.3863 | 0.2519 | 1.1344 | 4.00 | 0.8125 | SSM | 0.4844
Copy-First | 1.3863 | 0.2505 | 1.1358 | 4.00 | 0.3906 | GRU | 0.3438
Majority | 1.3009 | 0.2493 | 1.0516 | 3.67 | 0.5000 | SSM | 0.5312
Reverse-Sum | 1.3863 | 0.2515 | 1.1348 | 4.00 | 0.2969 | GRU/TCN | 0.3125
Pattern-Detect | 0.8318 | 0.2513 | 0.5805 | 2.30 | 0.5625 | TCN | 0.5625
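Under the teacher model sketched earlier (uniform inputs, mass 1 - epsilon on the task target), the information-theoretic profile can be computed as below; because that noise model is an assumption, the numbers need not match the table's H(Y|X) column exactly.

```python
import itertools
import numpy as np

def task_information_profile(task_fn, V=4, T=3, eps=0.05):
    """Sketch: H(Y), H(Y|X), and I(X;Y) in nats for one task, assuming uniform
    inputs and the noisy-teacher model sketched earlier."""
    H = lambda p: float(-np.sum(p[p > 0] * np.log(p[p > 0])))  # entropy in nats

    inputs = [np.array(x) for x in itertools.product(range(V), repeat=T)]
    p_y = np.zeros(V)
    for x in inputs:
        p = np.full(V, eps / V)
        p[task_fn(x)] += 1.0 - eps           # per-input teacher distribution p(Y|X=x)
        p_y += p / len(inputs)               # marginal p(Y) under uniform inputs

    h_y_given_x = H(np.array([1 - eps + eps / V] + [eps / V] * (V - 1)))  # same for every x
    return H(p_y), h_y_given_x, H(p_y) - h_y_given_x   # H(Y), H(Y|X), I(X;Y)
```

For Copy-Last, p(Y) is uniform over the four tokens, so H(Y) = ln 4 ≈ 1.3863 nats, matching the table's first column.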
Architecture Efficiency on Reverse-Sum
Fine-grained scaling on the hardest compositional task. All architectures plateau around 0.25--0.33 BC regardless of model size.
Behavioral Consistency vs. Hidden Dimension
Reverse-Sum: d in {4, 8, 12, 16, 24, 32, 48, 64}
KL Divergence vs. Parameters
Reverse-Sum: Transformer KL decreases but stays above 1.4
Conclusions and Implications
1. SSMs as Viable Effective Models
SSMs achieve 0.8125 behavioral consistency on Copy-Last with only 928 parameters -- less than one-third of the Transformer baseline's 3,200 parameters, which reaches only 0.4844 BC. Non-transformer architectures are viable effective models for memory-access tasks.

2. Compositional Tasks Remain Hard
All architectures struggle on Reverse-Sum (best BC: 0.3125). Compositional reasoning requiring arithmetic composition may need attention-like mechanisms that none of the small effective models fully provide.

3. Distributed Perturbation Structure
SSM Frobenius perturbation ratios of 0.88--0.92 indicate a distributed rather than low-rank structure. Effective dimensions of 11.1--12.4 (out of 16) point to near-uniform utilization, suggesting a perturbation theory different from the transformer case.

4. Task Structure Determines Architecture Choice
Copy-Last and Reverse-Sum have near-identical mutual information (1.13 nats) yet very different architecture rankings. The nature of the task -- memory access vs. arithmetic composition -- is the key determinant, not information-theoretic complexity alone.

5. Spectral Radius as a Complexity Measure
The SSM spectral radius increases from 0.192 (Copy-Last) to 0.353 (Reverse-Sum), providing a natural complexity measure: harder tasks require longer recurrent memory, reflected in larger spectral radii.

6. TCN Excels on Local Patterns
TCN matches the Transformer at 0.5625 BC on Pattern-Detect, consistent with the task's local pattern structure aligning with convolutional receptive fields: architectural inductive bias should match task structure.