cs.CV — Computer Vision · Research Track

Scaling Autoregressive Model Capacity with Increasing iFSQ Codebook Size

Establishing a power-law relationship between iFSQ codebook size and transformer capacity requirements for image generation, revealing sub-linear scaling and effective vocabulary saturation.

Headline results:
  • Capacity scaling exponent: 0.768
  • Scaling-law fit: R² = 0.996
  • Effective vocabulary: ~17.6 bits
  • Maximum parameter savings (factored heads): 144x

Problem Statement

How should autoregressive model capacity scale as the iFSQ codebook grows?

Background

Improved Finite Scalar Quantization (iFSQ) replaces learned VQ codebooks with a fixed grid of quantization levels per latent dimension. Each dimension is quantized to L = 2^K + 1 levels, yielding an implicit codebook of size V = L^d. While iFSQ avoids codebook collapse and achieves full per-dimension utilization, autoregressive generation quality peaks at K = 4 bits per dimension and degrades at higher K, even as reconstruction quality continues to improve.
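
A minimal sketch of the quantization step described above, assuming a tanh bounding followed by rounding to the fixed integer grid; the function and the bounding choice are illustrative, not the exact iFSQ implementation.

```python
import numpy as np

def ifsq_quantize(z, K=4):
    """Quantize each latent dimension to a fixed grid of L = 2**K + 1 levels.

    z : array of shape (..., d) with unbounded real entries.
    Returns integer codes in [-(L-1)/2, (L-1)/2] for each dimension.
    """
    half = 2 ** (K - 1)              # grid half-width, (L - 1) / 2
    bounded = np.tanh(z) * half      # squash into [-half, half]
    return np.round(bounded).astype(int)

d, K = 8, 4
L = 2 ** K + 1
V = L ** d                           # implicit codebook size
print(L, V)                          # 17 levels per dimension, 17**8 ≈ 7.0e9 codes
```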

Lin et al. (2026) conjecture that a fixed-capacity autoregressive model cannot effectively predict tokens drawn from increasingly large vocabularies. This work tests that conjecture through information-theoretic analysis, systematic scaling experiments, and architectural innovations.

Research Questions

  • Does increasing the iFSQ codebook size require proportionally scaling the autoregressive transformer capacity?
  • What is the quantitative relationship (scaling law) between codebook size V and required model parameters N?
  • How does the effective vocabulary compare to the nominal codebook size for natural image distributions?
  • Can factored prediction heads fundamentally alter the scaling relationship by reducing output layer complexity?
  • How does spatial correlation in image data affect the capacity requirements?

Methods

Three complementary analyses: information-theoretic, empirical scaling, and architectural

Experimental Setup

  • Synthetic data: AR(1) process with correlation rho = 0.8, S = 64 tokens, d = 8 latent dimensions
  • Codebook sweep: K in {2, 3, 4, 5, 6} bits per dimension
  • Training: 10 epochs, 5,000 sequences, batch size 64, AdamW with lr = 3e-4
  • Evaluation: 1,000 held-out sequences, cross-entropy loss
  • Scaling law fit: log(L) = a + b·log(N) + c·log(V), where L is evaluation loss, N is model parameter count, and V is nominal codebook size
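
As a sketch of the last bullet, the scaling-law coefficients can be recovered with ordinary least squares on log-transformed quantities; the arrays below are illustrative placeholders, not the measured sweep.

```python
import numpy as np

# Placeholder measurements: substitute the actual (params, codebook size, loss)
# triples from the sweep. The values below are illustrative only.
N    = np.array([1.75e5, 1.3e6, 8e6, 2.4e7])   # model parameters
V    = np.array([5**8, 9**8, 17**8, 33**8])    # nominal codebook sizes (K = 2..5)
loss = np.array([3.1, 2.8, 2.6, 2.5])          # evaluation cross-entropy

# Fit log(loss) = a + b*log(N) + c*log(V) by ordinary least squares.
X = np.column_stack([np.ones_like(N), np.log(N), np.log(V)])
(a, b, c), *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)

gamma = -c / b                                  # capacity-scaling exponent
print(f"a={a:.3f}  b={b:.3f}  c={c:.3f}  gamma={gamma:.3f}")
```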

Model Architectures

Causal transformer with pre-norm residual connections, factored embeddings, and factored output heads

Model   d_model  Layers  Heads  Params
Tiny    64       2       2      ~175K
Small   128      4       4      ~1.3M
Medium  256      6       8      ~8M
Large   384      8       8      ~24M

Information-Theoretic Analysis

Codebook growth and output layer parameter requirements (d = 16 latent dimensions)

Codebook Capacity vs. Output Parameters

Factored output heads grow linearly with the number of levels L per dimension, while joint heads grow exponentially as L^d

Data Table: Information-Theoretic Analysis

d = 16 latent dimensions, d_model = 768
K | L | log2(V) | Bits/Token | Factored Params | log2(Joint Params)
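
The linear-versus-exponential contrast can be reproduced with a short parameter-count sketch. It assumes the joint head is a single V-way linear projection from d_model and the factored head is one L-way projection per latent dimension, with bias terms omitted; these are modeling assumptions for illustration.

```python
import math

def output_head_params(K, d=16, d_model=768):
    """Compare output-layer sizes for joint vs. factored prediction heads."""
    L = 2 ** K + 1                    # levels per latent dimension
    V = L ** d                        # nominal codebook size
    joint = V * d_model               # one V-way softmax projection
    factored = d * L * d_model        # d independent L-way projections
    return L, math.log2(V), factored, math.log2(joint)

for K in range(2, 8):
    L, bits, factored, log2_joint = output_head_params(K)
    print(f"K={K}  L={L:3d}  log2(V)={bits:5.1f}  "
          f"factored={factored:,}  log2(joint params)={log2_joint:5.1f}")
```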

Scaling Law Analysis

Power-law relationship between model size, codebook size, and evaluation loss

log(L) = 0.711 − 0.112 · log(N) + 0.086 · log(V)
R² = 0.996 | Capacity exponent gamma = −c/b = 0.768 | Sub-linear scaling: 2^0.768 ≈ 1.70x per doubling of V
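
The capacity exponent follows directly from the fitted coefficients: holding loss fixed, b·Δlog(N) + c·Δlog(V) = 0, so N must scale as V^(−c/b). A few lines reproduce the quoted numbers.

```python
b, c = -0.112, 0.086        # fitted exponents on log(N) and log(V)

gamma = -c / b              # capacity-scaling exponent, ≈ 0.768
per_doubling = 2 ** gamma   # parameter multiplier needed when V doubles, ≈ 1.70

print(f"gamma = {gamma:.3f}, params per doubling of V = {per_doubling:.2f}x")
```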

Evaluation Loss vs. Model Parameters

By codebook size K, showing consistent power-law decay

Predicted vs. Actual Loss

Tight clustering around the identity line confirms the R² = 0.996 fit

Scaling Law Data Points

K | Model Parameters | log2(V) | Eval Loss | Predicted Loss | Error

Codebook Utilization Analysis

Effective vocabulary saturates near 17.6 bits regardless of nominal codebook size

Effective vs. Nominal Vocabulary

log2 scale; effective vocabulary plateaus while nominal grows linearly with K

Utilization Ratio (log scale)

Drops from 0.929 at K=2 to 0.314 at K=7

Codebook Utilization Data (zero correlation)

K | L | log2(Nominal V) | Unique Codes | log2(Effective V) | Utilization Ratio | Per-Dim Utilization
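
One way to estimate an effective vocabulary like the one tabulated here is to count distinct codes in a large sample of quantized tokens. The sketch below uses uniform random codes purely as a stand-in for encoder outputs, so its saturation reflects sample size rather than image statistics; the counting logic is the point.

```python
import numpy as np

def effective_vocab_bits(codes):
    """codes: int array of shape (num_tokens, d); returns log2(#unique codes)."""
    num_unique = np.unique(codes, axis=0).shape[0]
    return np.log2(num_unique)

rng = np.random.default_rng(0)
d = 8
num_tokens = 5_000 * 64                     # 5,000 sequences of 64 tokens

for K in range(2, 8):
    L = 2 ** K + 1
    # Stand-in for encoder outputs; in practice codes come from real data.
    codes = rng.integers(0, L, size=(num_tokens, d))
    effective = effective_vocab_bits(codes)
    nominal = d * np.log2(L)
    print(f"K={K}  nominal={nominal:5.1f} bits  "
          f"effective≈{effective:5.1f} bits  ratio={effective / nominal:.3f}")
```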

Scaling Heatmap

Evaluation loss across codebook sizes (K) and model sizes

Loss Heatmap: K vs. Model Size

Larger models achieve disproportionately larger improvements at higher K

Training Curves: Small Model Across Codebook Sizes

Higher K consistently leads to higher loss throughout training

Joint vs. Factored Prediction Heads

Factored heads match joint performance with dramatically fewer parameters

Loss Comparison at K = 2

Joint vs. factored head performance by model size

Parameter Efficiency

Factored heads achieve comparable loss with far fewer parameters

Joint vs. Factored Head Data (K = 2)

Model (d_model) | Joint Loss | Factored Loss | Joint Params | Factored Params | Loss Ratio | Param Ratio
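
A minimal PyTorch-style sketch of a factored prediction head: instead of one L^d-way classifier, d independent L-way classifiers share the transformer's final hidden state. The module and loss below are an illustration of the idea, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class FactoredHead(nn.Module):
    """Predicts d latent dimensions independently, each over L levels."""
    def __init__(self, d_model: int, d: int, L: int):
        super().__init__()
        self.d, self.L = d, L
        self.proj = nn.Linear(d_model, d * L)    # d*L*d_model weights (+ biases)

    def forward(self, h):                        # h: (batch, seq, d_model)
        logits = self.proj(h)                    # (batch, seq, d*L)
        return logits.view(*h.shape[:-1], self.d, self.L)

def factored_loss(logits, targets):
    """logits: (..., d, L); targets: (..., d) integer levels in [0, L)."""
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), targets.reshape(-1)
    )

# Example at K = 2 (L = 5 levels) with d = 8 latent dimensions.
head = FactoredHead(d_model=128, d=8, L=5)
hidden = torch.randn(4, 64, 128)                 # (batch, seq, d_model)
targets = torch.randint(0, 5, (4, 64, 8))
loss = factored_loss(head(hidden), targets)
```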

Correlation Sensitivity

Effect of spatial correlation on capacity requirements

Evaluation Loss by Correlation Strength

Higher correlation reduces loss; effect is stronger at lower K

Correlation Sensitivity Data

Small model (d_model = 128), evaluation loss
K | rho = 0.0 | rho = 0.3 | rho = 0.5 | rho = 0.7 | rho = 0.9 | Reduction
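
A sketch of how spatially correlated synthetic tokens like those in this sweep can be generated: a Gaussian AR(1) process per latent dimension with correlation rho, squashed and rounded onto the iFSQ grid. The exact scaling and seeding are assumptions, not the generator used for the reported numbers.

```python
import numpy as np

def ar1_tokens(num_seqs, seq_len, d, K, rho, seed=0):
    """Gaussian AR(1) latents with correlation rho, quantized to L = 2**K + 1 levels."""
    rng = np.random.default_rng(seed)
    L = 2 ** K + 1
    half = (L - 1) // 2
    z = np.zeros((num_seqs, seq_len, d))
    z[:, 0] = rng.standard_normal((num_seqs, d))
    for t in range(1, seq_len):
        noise = rng.standard_normal((num_seqs, d))
        z[:, t] = rho * z[:, t - 1] + np.sqrt(1 - rho ** 2) * noise
    # Squash to the grid range and round, shifting levels into [0, L).
    return np.round(np.tanh(z) * half).astype(int) + half

tokens = ar1_tokens(num_seqs=5000, seq_len=64, d=8, K=4, rho=0.8)
print(tokens.shape, tokens.min(), tokens.max())   # (5000, 64, 8), values in [0, 16]
```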

Key Findings

Summary of results and practical recommendations

1

Sub-linear Capacity Scaling

The capacity scaling exponent gamma = 0.768 means model parameters need to grow only sub-linearly with codebook size: when V doubles, parameters need to scale by just 2^0.768 ≈ 1.70x, not 2x. This confirms the conjecture of Lin et al. (2026) that capacity must grow with the codebook, while showing that the required growth is sub-linear.

2

Effective Vocabulary Saturation

The effective vocabulary saturates near log2(V_eff) = 17.6 bits regardless of nominal codebook size, with utilization ratio dropping from 0.929 at K=2 to 0.314 at K=7. The AR model only needs to predict over the effective vocabulary, not the full nominal codebook.

3

Factored Heads Resolve Output Bottleneck

Factored prediction heads reduce output-layer parameter growth from exponential, O(L^d), to linear, O(d·L). At K=2 with d=8, factored heads match joint-head performance with 7.3x to 144.2x fewer parameters when sufficient body capacity is available.
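
A rough back-of-envelope check of the quoted savings range, under the assumptions that the joint head is a single V-way projection from d_model, the factored head adds negligibly many parameters, and the body sizes are the approximate counts listed under Model Architectures.

```python
# At K = 2 with d = 8: L = 5 levels, so the joint head predicts over V = 5**8 classes.
V = 5 ** 8                                    # 390,625

bodies = {"Tiny": (64, 175e3), "Small": (128, 1.3e6),
          "Medium": (256, 8e6), "Large": (384, 24e6)}

for name, (d_model, body_params) in bodies.items():
    joint_total = body_params + V * d_model   # body + joint output projection
    print(f"{name:6s}  ~{joint_total / body_params:5.1f}x more params than factored")
```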

4

Practical Recommendations

For iFSQ-based AR generation with K > 4: (1) always use factored prediction heads, (2) scale transformer body sub-linearly with codebook size (exponent ~0.77), (3) prioritize model capacity based on image complexity rather than nominal codebook size.