Establishing a power-law relationship between iFSQ codebook size and transformer capacity requirements for image generation, revealing sub-linear scaling and effective vocabulary saturation.
How should autoregressive model capacity scale as the iFSQ codebook grows?
Improved Finite Scalar Quantization (iFSQ) replaces learned VQ codebooks with a fixed grid of quantization levels per latent dimension. Each dimension is quantized to L = 2^K + 1 levels, yielding an implicit codebook of size V = L^d. While iFSQ avoids codebook collapse and achieves full per-dimension utilization, autoregressive generation quality peaks at K = 4 (roughly 4 bits per dimension) and degrades at higher K, even as reconstruction quality continues to improve.
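As a rough illustration of the formulation above, the sketch below bounds each latent dimension with tanh and rounds it to one of L = 2^K + 1 uniform levels; the function names and the tanh bounding are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of iFSQ-style quantization (assumed implementation):
# each dimension is squashed into (-1, 1) and rounded to L = 2^K + 1 levels,
# so the implicit codebook V = L^d needs no learned embedding table.
import math
import numpy as np

def ifsq_quantize(z, K):
    """Quantize each dimension of z to L = 2^K + 1 levels spanning [-1, 1]."""
    half = 2 ** (K - 1)               # L = 2*half + 1 levels
    bounded = np.tanh(z)              # squash each dimension into (-1, 1)
    return np.round(bounded * half) / half

def nominal_codebook_bits(K, d):
    """log2(V) for the implicit codebook V = L^d with L = 2^K + 1."""
    return d * math.log2(2 ** K + 1)

z = np.random.randn(4, 16)                         # batch of 4 latents, d = 16
print(np.unique(ifsq_quantize(z, K=2)))            # at most 5 distinct values per dimension
print(round(nominal_codebook_bits(K=4, d=16), 1))  # ~65.4 bits per token
```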
Lin et al. (2026) conjecture that a fixed-capacity autoregressive model cannot effectively predict tokens drawn from an increasingly large vocabulary. This work addresses that conjecture through information-theoretic analysis, systematic scaling experiments, and architectural innovations.
Three complementary analyses: information-theoretic, empirical scaling, and architectural
Causal transformer with pre-norm residual connections, factored embeddings, and factored output heads
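To make the "factored embeddings" concrete, here is a minimal PyTorch-style sketch (an assumed design for illustration, not the authors' implementation): each token's d per-dimension level indices are looked up in d small tables and summed, avoiding a V = L^d embedding matrix.

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """Embeds an iFSQ token from its d per-dimension level indices.

    Uses d small tables of L rows each and sums the lookups, instead of a
    single V = L**d embedding table (assumed design, for illustration only).
    """
    def __init__(self, d_model: int, d: int, L: int):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(L, d_model) for _ in range(d))

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, seq, d) integers in [0, L); output: (batch, seq, d_model)
        return sum(table(codes[..., i]) for i, table in enumerate(self.tables))

emb = FactoredEmbedding(d_model=512, d=16, L=2 ** 4 + 1)   # K = 4 -> 17 levels
codes = torch.randint(0, 17, (2, 256, 16))                 # 2 images, 256 tokens each
print(emb(codes).shape)                                    # torch.Size([2, 256, 512])
```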
Codebook growth and output layer parameter requirements (d = 16 latent dimensions)
| K | L | log2(V) | Bits/Token | Factored Params | log2(Joint Params) |
|---|---|---|---|---|---|
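The derived columns of this table can be reproduced roughly as below, assuming the joint head is a single d_model x V softmax projection and the factored head is d independent d_model x L projections; d_model = 1024 is a placeholder choice and biases are ignored.

```python
import math

D, D_MODEL = 16, 1024   # d = 16 latent dimensions; d_model is a placeholder

def head_sizes(K, d=D, d_model=D_MODEL):
    """Rough output-layer sizes: factored = d * L * d_model weights,
    joint = V * d_model = L**d * d_model weights (reported as log2)."""
    L = 2 ** K + 1
    log2_V = d * math.log2(L)                  # bits per token
    factored_params = d * L * d_model
    log2_joint_params = log2_V + math.log2(d_model)
    return L, log2_V, factored_params, log2_joint_params

for K in range(2, 8):
    L, log2_V, fac, log2_joint = head_sizes(K)
    print(f"K={K}  L={L:3d}  log2(V)={log2_V:5.1f}  "
          f"factored={fac:>9,}  log2(joint)={log2_joint:5.1f}")
```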
Power-law relationship between model size, codebook size, and evaluation loss
| K | Model | Parameters | log2(V) | Eval Loss | Predicted Loss | Error |
|---|---|---|---|---|---|---|
Effective vocabulary saturates near 17.6 bits regardless of nominal codebook size
| K | L | log2(Nominal V) | Unique Codes | log2(Effective V) | Utilization Ratio | Per-Dim Utilization |
|---|---|---|---|---|---|---|
Evaluation loss across codebook sizes (K) and model sizes
Larger models achieve disproportionately larger improvements at higher K
Factored heads match joint performance with dramatically fewer parameters
| Model (d_model) | Joint Loss | Factored Loss | Joint Params | Factored Params | Loss Ratio | Param Ratio |
|---|---|---|---|---|---|---|
Effect of spatial correlation on capacity requirements
| K | rho = 0.0 | rho = 0.3 | rho = 0.5 | rho = 0.7 | rho = 0.9 | Reduction |
|---|---|---|---|---|---|---|
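As a hedged illustration of why spatial correlation lowers capacity requirements, the snippet below uses a Gaussian AR(1) latent as a stand-in for spatially correlated image latents (an assumption; the paper's correlation model may differ): the mutual information between neighbouring latents grows with rho, so fewer bits remain for the AR model to predict per dimension.

```python
import numpy as np

def bits_saved_per_dim(rho):
    """Bits the context already supplies for a jointly Gaussian neighbour pair
    with correlation rho: I(x_t; x_{t-1}) = -0.5 * log2(1 - rho^2)."""
    return -0.5 * np.log2(1.0 - rho ** 2)

for rho in (0.0, 0.3, 0.5, 0.7, 0.9):
    print(f"rho={rho:.1f}: {bits_saved_per_dim(rho):.2f} bits/dim saved")
```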
Summary of results and practical recommendations
The capacity scaling exponent gamma = 0.768 means the required model parameters grow sub-linearly with codebook size, i.e., roughly as N ∝ V^0.768. When V doubles, parameters only need to scale by 2^0.768 ≈ 1.70x, not 2x. This confirms the conjecture of Lin et al. (2026) while showing the relationship is sub-linear.
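A worked instance of that relationship, assuming the required parameter count follows N ∝ V^0.768; the 100M-parameter baseline is a placeholder.

```python
GAMMA = 0.768  # fitted capacity scaling exponent

def required_params(base_params, v_ratio, gamma=GAMMA):
    """Parameters needed when the codebook grows by v_ratio, assuming N ∝ V^gamma."""
    return base_params * v_ratio ** gamma

# Doubling V scales required parameters by 2^0.768 ≈ 1.70x, not 2x.
print(required_params(100e6, v_ratio=2) / 100e6)   # ≈ 1.70
# Quadrupling V scales them by 4^0.768 ≈ 2.90x, not 4x.
print(required_params(100e6, v_ratio=4) / 100e6)   # ≈ 2.90
```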
The effective vocabulary saturates near log2(V_eff) = 17.6 bits regardless of nominal codebook size, with utilization ratio dropping from 0.929 at K=2 to 0.314 at K=7. The AR model only needs to predict over the effective vocabulary, not the full nominal codebook.
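One way to estimate the effective vocabulary is sketched below under assumptions: the random codes are placeholders for real encoder outputs, and the utilization ratio is taken here as effective bits over nominal bits (the table's exact definition may differ).

```python
import numpy as np

def effective_vocab_bits(codes):
    """codes: int array of shape (num_tokens, d); each row is one token's
    per-dimension level indices, so distinct rows are distinct codebook entries."""
    return np.log2(len(np.unique(codes, axis=0)))

def utilization_ratio(codes, L, d):
    """Effective bits divided by nominal bits log2(L**d) (assumed definition)."""
    return effective_vocab_bits(codes) / (d * np.log2(L))

# Placeholder data: with uniform random codes almost every row is distinct,
# so the estimate is close to log2(num_tokens); real encoder outputs would
# instead reveal the saturation described above.
d, L = 8, 2 ** 7 + 1                            # K = 7 -> 129 levels per dimension
codes = np.random.randint(0, L, size=(1_000_000, d))
print(effective_vocab_bits(codes))               # ≈ log2(1e6) ≈ 19.9 for this placeholder
print(utilization_ratio(codes, L, d))
```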
Factored prediction heads reduce output layer parameters from exponential O(L^d) to linear O(d*L) growth. At K=2 with d=8, factored heads match joint head performance with 7.3x to 144.2x fewer parameters when sufficient body capacity is available.
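A minimal PyTorch-style sketch of the factored-head idea (assumed design, not the authors' implementation): one shared transformer hidden state feeds d small classifiers over L levels each, instead of one classifier over all L^d joint codes. Training would sum a cross-entropy loss over the d per-dimension logits, which assumes the dimensions are conditionally independent given the context.

```python
import torch
import torch.nn as nn

class FactoredHead(nn.Module):
    """Predicts d per-dimension level indices instead of one joint code.

    Parameters grow as d * L * d_model (linear in L) rather than
    L**d * d_model (exponential in d) for a joint softmax head.
    """
    def __init__(self, d_model: int, d: int, L: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, L) for _ in range(d))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> logits: (batch, seq, d, L)
        return torch.stack([head(h) for head in self.heads], dim=-2)

d_model, d, K = 512, 8, 2
L = 2 ** K + 1                                    # 5 levels per dimension
head = FactoredHead(d_model, d, L)
print(sum(p.numel() for p in head.parameters()))  # d*(d_model+1)*L = 20,520
# A joint head over V = L**d = 390,625 codes would need ~200M weights at d_model = 512.
```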
For iFSQ-based AR generation with K > 4: (1) always use factored prediction heads, (2) scale transformer body sub-linearly with codebook size (exponent ~0.77), (3) prioritize model capacity based on image complexity rather than nominal codebook size.