Continuous Unified Visual Tokenization: Modeling the Understanding--Generation Trade-off

A systematic simulation-based analysis comparing discrete VQ-VAE, semantic encoders, dual tokenizers, and continuous unified tokenizers across reconstruction, understanding, and generation.


Key Results at a Glance

  • Reconstruction PSNR (Continuous Unified): 32.47 dB
  • Semantic Accuracy (Continuous Unified): 0.922
  • Generation FID (Continuous Unified, lower is better): 9.98
  • Continuous Baseline FID (unreachable by discrete tokenizers): 8.04
  • Best Discrete FID (codebook size 16384): 8.42
  • Relative Throughput at 1024 Tokens: 0.200

Problem and Methods

Problem Statement

Unified multimodal models typically use separate tokenizers for visual understanding and image generation, increasing complexity and limiting cross-task synergy. Discrete quantized representations offer unification but introduce discretization errors that degrade generation quality. This work analyzes whether continuous unified tokenizers can resolve the understanding--generation trade-off.

Tokenizer Architectures Compared

  • Discrete VQ-VAE
    Vector-quantized VAE with a codebook of 8192 entries. Quantization snaps each latent onto its nearest codebook entry, introducing discretization error (contrasted with the continuous path in the sketch after this list).
  • Semantic Encoder
    Continuous encoder (CLIP/SigLIP style) optimized for semantic understanding. No pixel-reconstruction pathway.
  • Dual Tokenizer
    Two specialized tokenizers in parallel: one for semantics, one for pixel reconstruction. High quality but doubled complexity.
  • Continuous Unified
    Single continuous encoder-decoder jointly optimizing semantic richness and pixel-level reconstruction, avoiding discretization errors.
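
To make the contrast concrete, here is a minimal NumPy sketch of the two latent paths. The encoder output, latent shapes, and codebook below are placeholders (random tensors standing in for learned components), not the configuration used in this analysis; the point is only that the discrete path incurs a nearest-neighbor lookup error while the continuous path does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder encoder output: 256 latent tokens of dimension 64 (assumed shapes).
z_e = rng.normal(size=(256, 64))
# Placeholder codebook with 8192 entries (random here; learned in a real VQ-VAE).
codebook = rng.normal(size=(8192, 64))

def vq_quantize(z, codes):
    """Discrete path: replace each latent with its nearest codebook entry."""
    # Squared distances via ||z - c||^2 = ||z||^2 - 2 z.c + ||c||^2.
    d = (z ** 2).sum(1, keepdims=True) - 2.0 * z @ codes.T + (codes ** 2).sum(1)
    idx = d.argmin(axis=1)              # nearest-neighbor lookup
    return codes[idx]

z_q = vq_quantize(z_e, codebook)
print("discrete path MSE  :", float(np.mean((z_e - z_q) ** 2)))  # > 0: discretization error

# Continuous unified path: the same lossless latents feed both the decoder
# (reconstruction/generation) and the understanding head, with no lookup step.
z_c = z_e
print("continuous path MSE:", float(np.mean((z_e - z_c) ** 2)))  # exactly 0
```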

Interactive Results

Chart panels in the interactive results:

  • Reconstruction PSNR & Semantic Accuracy
  • Generation FID (lower is better)
  • PSNR vs Latent Dimension
  • Semantic Accuracy & FID vs Latent Dimension
  • Quantization Error vs Codebook Size
  • FID & PSNR vs Codebook Size
  • Understanding--Generation Pareto Frontier
  • Understanding Accuracy & FID vs Token Count
  • Throughput vs Token Count

Data Tables

Table 1: Tokenizer Architecture Comparison

| Architecture | PSNR (dB) | Semantic Accuracy | FID | Quant. Error |
|---|---|---|---|---|
| Discrete VQ-VAE | 31.75 ± 0.83 | 0.740 ± 0.010 | 18.42 ± 1.50 | 0.050 ± 0.049 |
| Semantic Encoder | 17.97 ± 1.98 | 0.908 ± 0.006 | 55.20 ± 5.17 | 0.000 |
| Dual Tokenizer | 29.98 ± 1.02 | 0.880 ± 0.005 | 11.83 ± 2.08 | 0.019 ± 0.019 |
| Continuous Unified | 32.47 ± 0.88 | 0.922 ± 0.004 | 9.98 ± 1.82 | 0.000 |

Table 2: Latent Dimension Sweep (Continuous Unified)

| Dimension | PSNR (dB) | Semantic Accuracy | FID |
|---|---|---|---|
| 16 | 31.00 ± 0.91 | 0.892 ± 0.004 | 12.48 ± 1.80 |
| 32 | 31.39 ± 0.89 | 0.900 ± 0.004 | 11.98 ± 1.77 |
| 64 | 31.80 ± 0.98 | 0.907 ± 0.004 | 11.23 ± 1.76 |
| 128 | 32.12 ± 0.86 | 0.914 ± 0.004 | 10.69 ± 1.78 |
| 256 | 32.51 ± 0.91 | 0.922 ± 0.004 | 9.94 ± 1.78 |
| 512 | 32.83 ± 0.92 | 0.930 ± 0.004 | 9.41 ± 1.82 |
| 1024 | 33.23 ± 0.86 | 0.937 ± 0.004 | 8.72 ± 1.67 |

Table 3: Discretization Error Analysis

| Codebook Size | Quant. Error | FID | PSNR (dB) |
|---|---|---|---|
| 256 | 0.081 ± 0.018 | 11.18 ± 1.45 | 30.78 ± 0.82 |
| 512 | 0.058 ± 0.012 | 10.43 ± 1.52 | 31.09 ± 0.80 |
| 1024 | 0.041 ± 0.010 | 9.66 ± 1.56 | 31.36 ± 0.85 |
| 2048 | 0.029 ± 0.007 | 9.11 ± 1.44 | 31.47 ± 0.81 |
| 4096 | 0.020 ± 0.005 | 8.87 ± 1.48 | 31.66 ± 0.82 |
| 8192 | 0.014 ± 0.003 | 8.48 ± 1.46 | 31.79 ± 0.81 |
| 16384 | 0.010 ± 0.002 | 8.42 ± 1.52 | 31.91 ± 0.76 |
| Continuous Baseline | 0.000 | 8.04 | 32.00 |

Table 4: Token Count Scaling

| Token Count | Understanding Acc. | Generation FID | Relative Throughput |
|---|---|---|---|
| 16 | 0.713 | 38.96 | 0.941 |
| 32 | 0.733 | 37.72 | 0.889 |
| 64 | 0.774 | 33.74 | 0.800 |
| 128 | 0.803 | 28.21 | 0.667 |
| 256 | 0.839 | 21.65 | 0.500 |
| 576 | 0.925 | 11.74 | 0.308 |
| 1024 | 0.921 | 8.64 | 0.200 |

Key Findings

  1. Continuous unified tokenizers achieve the best combined performance. With a PSNR of 32.47 dB, semantic accuracy of 0.922, and FID of 9.98, the continuous unified tokenizer outperforms discrete VQ-VAE (FID 18.42), semantic encoders (FID 55.20), and dual tokenizers (FID 11.83) across all three metrics simultaneously.
  2. Discretization error is a fundamental limitation. Even with a codebook of 16384 entries, the discrete tokenizer reaches an FID of 8.42, still short of the continuous baseline's 8.04; continuous representations eliminate quantization error entirely (a toy illustration follows this list).
  3. The continuous unified tokenizer's Pareto frontier strictly dominates the baselines. Across all understanding--generation trade-off operating points it achieves better performance on both axes; at the balanced operating point it improves understanding accuracy by 0.061 and FID by 8.40 simultaneously.
  4. Latent dimension scaling follows logarithmic gains. Increasing the latent dimension from 16 to 1024 improves PSNR from 31.00 to 33.23 dB and FID from 12.48 to 8.72, with diminishing returns at higher dimensions (a quick fit of these gains is sketched after this list).
  5. Token count scaling reveals an efficiency challenge. At 576 tokens, accuracy reaches 0.925 with FID 11.74, but throughput drops to 0.308. At 1024 tokens, FID improves to 8.64 but throughput falls to 0.200, highlighting the quality-efficiency trade-off.
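
As a sanity check on Finding 2, the toy Monte Carlo below quantizes Gaussian latents against random (untrained) codebooks of increasing size. The setup is an assumption for illustration, not the simulation protocol behind Table 3; it only demonstrates the qualitative effect that nearest-neighbor error shrinks as the codebook grows yet never reaches the zero error of a continuous representation.

```python
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(1024, 16))    # assumed: 1024 latent vectors of dimension 16

for k in (256, 1024, 4096, 16384):
    codebook = rng.normal(size=(k, 16))  # random codebook; a trained one would do better
    d = (latents ** 2).sum(1, keepdims=True) - 2.0 * latents @ codebook.T + (codebook ** 2).sum(1)
    err = np.sqrt(np.maximum(d.min(axis=1), 0.0)).mean()  # mean nearest-neighbor distance
    print(f"codebook {k:>6}: mean quantization error {err:.3f}")

# A continuous latent skips the lookup entirely, so its quantization error is identically 0.
```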
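
Finding 4's logarithmic-gain claim can also be checked directly against the Table 2 means: a least-squares fit of PSNR and FID against log2(latent dimension) (a quick sketch, not part of the original analysis) gives roughly +0.37 dB PSNR and about -0.63 FID per doubling of the latent dimension.

```python
import numpy as np

# Mean values from Table 2 (Latent Dimension Sweep, Continuous Unified).
dims = np.array([16, 32, 64, 128, 256, 512, 1024])
psnr = np.array([31.00, 31.39, 31.80, 32.12, 32.51, 32.83, 33.23])
fid  = np.array([12.48, 11.98, 11.23, 10.69, 9.94, 9.41, 8.72])

x = np.log2(dims)
psnr_slope, psnr_icpt = np.polyfit(x, psnr, 1)  # dB gained per doubling of dimension
fid_slope,  fid_icpt  = np.polyfit(x, fid, 1)   # FID change per doubling of dimension

print(f"PSNR ≈ {psnr_icpt:.2f} + {psnr_slope:.2f} * log2(dim)")
print(f"FID  ≈ {fid_icpt:.2f} {fid_slope:+.2f} * log2(dim)")
```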