Continuous Unified Visual Tokenization: Modeling the Understanding--Generation Trade-off

A systematic simulation-based analysis comparing discrete VQ-VAE, semantic encoders, dual tokenizers, and continuous unified tokenizers across reconstruction, understanding, and generation.


Key Results at a Glance

  • Reconstruction PSNR (Continuous Unified): 32.47 dB
  • Semantic Accuracy (Continuous Unified): 0.922
  • Generation FID (Continuous Unified, lower is better): 9.98
  • Continuous Baseline FID (unreachable by discrete tokenizers): 8.04
  • Best Discrete FID (codebook size 16384): 8.42
  • Relative Throughput at 1024 Tokens: 0.200

Problem and Methods

Problem Statement

Unified multimodal models typically use separate tokenizers for visual understanding and image generation, increasing complexity and limiting cross-task synergy. Discrete quantized representations offer unification but introduce discretization errors that degrade generation quality. This work analyzes whether continuous unified tokenizers can resolve the understanding--generation trade-off.

Tokenizer Architectures Compared

  • Discrete VQ-VAE
    Vector-quantized VAE with a codebook of 8192 entries. Quantization snaps each latent onto its nearest codebook entry, introducing discretization error (contrasted with the continuous path in the sketch after this list).
  • Semantic Encoder
    Continuous encoder (CLIP/SigLIP style) optimized for semantic understanding. No pixel-reconstruction pathway.
  • Dual Tokenizer
    Two specialized tokenizers in parallel: one for semantics, one for pixel reconstruction. High quality but doubled complexity.
  • Continuous Unified
    Single continuous encoder-decoder jointly optimizing semantic richness and pixel-level reconstruction, avoiding discretization errors.
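
To make the contrast concrete, here is a minimal NumPy sketch of the two latent paths. The encoder output, latent shapes, and codebook below are placeholders (random tensors standing in for learned components), not the configuration used in this analysis; the point is only that the discrete path incurs a nearest-neighbor lookup error while the continuous path does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder encoder output: 256 latent tokens of dimension 64 (assumed shapes).
z_e = rng.normal(size=(256, 64))
# Placeholder codebook with 8192 entries (random here; learned in a real VQ-VAE).
codebook = rng.normal(size=(8192, 64))

def vq_quantize(z, codes):
    """Discrete path: replace each latent with its nearest codebook entry."""
    # Squared distances via ||z - c||^2 = ||z||^2 - 2 z.c + ||c||^2.
    d = (z ** 2).sum(1, keepdims=True) - 2.0 * z @ codes.T + (codes ** 2).sum(1)
    idx = d.argmin(axis=1)              # nearest-neighbor lookup
    return codes[idx]

z_q = vq_quantize(z_e, codebook)
print("discrete path MSE  :", float(np.mean((z_e - z_q) ** 2)))  # > 0: discretization error

# Continuous unified path: the same lossless latents feed both the decoder
# (reconstruction/generation) and the understanding head, with no lookup step.
z_c = z_e
print("continuous path MSE:", float(np.mean((z_e - z_c) ** 2)))  # exactly 0
```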

Interactive Results

Chart panels in the interactive results:

  • Reconstruction PSNR & Semantic Accuracy
  • Generation FID (lower is better)
  • PSNR vs Latent Dimension
  • Semantic Accuracy & FID vs Latent Dimension
  • Quantization Error vs Codebook Size
  • FID & PSNR vs Codebook Size
  • Understanding--Generation Pareto Frontier
  • Understanding Accuracy & FID vs Token Count
  • Throughput vs Token Count

Data Tables

Table 1: Tokenizer Architecture Comparison

| Architecture | PSNR (dB) | Semantic Accuracy | FID | Quant. Error |
|---|---|---|---|---|
| Discrete VQ-VAE | 31.75 ± 0.83 | 0.740 ± 0.010 | 18.42 ± 1.50 | 0.050 ± 0.049 |
| Semantic Encoder | 17.97 ± 1.98 | 0.908 ± 0.006 | 55.20 ± 5.17 | 0.000 |
| Dual Tokenizer | 29.98 ± 1.02 | 0.880 ± 0.005 | 11.83 ± 2.08 | 0.019 ± 0.019 |
| Continuous Unified | 32.47 ± 0.88 | 0.922 ± 0.004 | 9.98 ± 1.82 | 0.000 |

Table 2: Latent Dimension Sweep (Continuous Unified)

| Dimension | PSNR (dB) | Semantic Accuracy | FID |
|---|---|---|---|
| 16 | 31.00 ± 0.91 | 0.892 ± 0.004 | 12.48 ± 1.80 |
| 32 | 31.39 ± 0.89 | 0.900 ± 0.004 | 11.98 ± 1.77 |
| 64 | 31.80 ± 0.98 | 0.907 ± 0.004 | 11.23 ± 1.76 |
| 128 | 32.12 ± 0.86 | 0.914 ± 0.004 | 10.69 ± 1.78 |
| 256 | 32.51 ± 0.91 | 0.922 ± 0.004 | 9.94 ± 1.78 |
| 512 | 32.83 ± 0.92 | 0.930 ± 0.004 | 9.41 ± 1.82 |
| 1024 | 33.23 ± 0.86 | 0.937 ± 0.004 | 8.72 ± 1.67 |

Table 3: Discretization Error Analysis

| Codebook Size | Quant. Error | FID | PSNR (dB) |
|---|---|---|---|
| 256 | 0.081 ± 0.018 | 11.18 ± 1.45 | 30.78 ± 0.82 |
| 512 | 0.058 ± 0.012 | 10.43 ± 1.52 | 31.09 ± 0.80 |
| 1024 | 0.041 ± 0.010 | 9.66 ± 1.56 | 31.36 ± 0.85 |
| 2048 | 0.029 ± 0.007 | 9.11 ± 1.44 | 31.47 ± 0.81 |
| 4096 | 0.020 ± 0.005 | 8.87 ± 1.48 | 31.66 ± 0.82 |
| 8192 | 0.014 ± 0.003 | 8.48 ± 1.46 | 31.79 ± 0.81 |
| 16384 | 0.010 ± 0.002 | 8.42 ± 1.52 | 31.91 ± 0.76 |
| Continuous Baseline | 0.000 | 8.04 | 32.00 |

Table 4: Token Count Scaling

| Token Count | Understanding Acc. | Generation FID | Relative Throughput |
|---|---|---|---|
| 16 | 0.713 | 38.96 | 0.941 |
| 32 | 0.733 | 37.72 | 0.889 |
| 64 | 0.774 | 33.74 | 0.800 |
| 128 | 0.803 | 28.21 | 0.667 |
| 256 | 0.839 | 21.65 | 0.500 |
| 576 | 0.925 | 11.74 | 0.308 |
| 1024 | 0.921 | 8.64 | 0.200 |

Key Findings

  1. Continuous unified tokenizers achieve the best combined performance. With a PSNR of 32.47 dB, semantic accuracy of 0.922, and FID of 9.98, the continuous unified tokenizer outperforms discrete VQ-VAE (FID 18.42), semantic encoders (FID 55.20), and dual tokenizers (FID 11.83) across all three metrics simultaneously.
  2. Discretization error is a fundamental limitation. Even with a codebook of 16384 entries, the discrete tokenizer reaches an FID of 8.42, still short of the continuous baseline's 8.04; continuous representations eliminate quantization error entirely (a toy illustration follows this list).
  3. The continuous unified tokenizer's Pareto frontier strictly dominates the baselines. Across all understanding--generation trade-off operating points it achieves better performance on both axes; at the balanced operating point it improves understanding accuracy by 0.061 and FID by 8.40 simultaneously.
  4. Latent dimension scaling follows logarithmic gains. Increasing the latent dimension from 16 to 1024 improves PSNR from 31.00 to 33.23 dB and FID from 12.48 to 8.72, with diminishing returns at higher dimensions (a quick fit of these gains is sketched after this list).
  5. Token count scaling reveals an efficiency challenge. At 576 tokens, accuracy reaches 0.925 with FID 11.74, but throughput drops to 0.308. At 1024 tokens, FID improves to 8.64 but throughput falls to 0.200, highlighting the quality-efficiency trade-off.
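
As a sanity check on Finding 2, the toy Monte Carlo below quantizes Gaussian latents against random (untrained) codebooks of increasing size. The setup is an assumption for illustration, not the simulation protocol behind Table 3; it only demonstrates the qualitative effect that nearest-neighbor error shrinks as the codebook grows yet never reaches the zero error of a continuous representation.

```python
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(1024, 16))    # assumed: 1024 latent vectors of dimension 16

for k in (256, 1024, 4096, 16384):
    codebook = rng.normal(size=(k, 16))  # random codebook; a trained one would do better
    d = (latents ** 2).sum(1, keepdims=True) - 2.0 * latents @ codebook.T + (codebook ** 2).sum(1)
    err = np.sqrt(np.maximum(d.min(axis=1), 0.0)).mean()  # mean nearest-neighbor distance
    print(f"codebook {k:>6}: mean quantization error {err:.3f}")

# A continuous latent skips the lookup entirely, so its quantization error is identically 0.
```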
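
Finding 4's logarithmic-gain claim can also be checked directly against the Table 2 means: a least-squares fit of PSNR and FID against log2(latent dimension) (a quick sketch, not part of the original analysis) gives roughly +0.37 dB PSNR and about -0.63 FID per doubling of the latent dimension.

```python
import numpy as np

# Mean values from Table 2 (Latent Dimension Sweep, Continuous Unified).
dims = np.array([16, 32, 64, 128, 256, 512, 1024])
psnr = np.array([31.00, 31.39, 31.80, 32.12, 32.51, 32.83, 33.23])
fid  = np.array([12.48, 11.98, 11.23, 10.69, 9.94, 9.41, 8.72])

x = np.log2(dims)
psnr_slope, psnr_icpt = np.polyfit(x, psnr, 1)  # dB gained per doubling of dimension
fid_slope,  fid_icpt  = np.polyfit(x, fid, 1)   # FID change per doubling of dimension

print(f"PSNR ≈ {psnr_icpt:.2f} + {psnr_slope:.2f} * log2(dim)")
print(f"FID  ≈ {fid_icpt:.2f} {fid_slope:+.2f} * log2(dim)")
```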