A systematic simulation-based analysis comparing discrete VQ-VAE, semantic encoders, dual tokenizers, and continuous unified tokenizers across reconstruction, understanding, and generation.
Category IV
Unified multimodal models typically use separate tokenizers for visual understanding and image generation, which increases architectural complexity and limits cross-task synergy. Discrete quantized representations offer a single shared token space but introduce discretization errors that degrade generation quality. This work analyzes whether continuous unified tokenizers can resolve the understanding-generation trade-off.
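To make the notion of discretization error concrete, the following is a minimal sketch of nearest-neighbour vector quantization on toy data. The latent dimension, sample count, codebook sizes, and the random (not learned) codebooks are all illustrative assumptions and do not reproduce the simulation behind the tables.

```python
import random


def nearest_code(vector, codebook):
    """Return the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vector, c)))


def mean_quantization_error(vectors, codebook):
    """Average squared distance between each vector and its nearest codebook entry."""
    total = 0.0
    for v in vectors:
        q = nearest_code(v, codebook)
        total += sum((a - b) ** 2 for a, b in zip(v, q))
    return total / len(vectors)


rng = random.Random(0)
dim = 4  # toy latent dimension (assumption)
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]

# Toy codebooks sampled at random; a real VQ-VAE would learn its codebook.
small_codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
# The large codebook contains the small one, so its error cannot be higher.
large_codebook = small_codebook + [
    [rng.gauss(0, 1) for _ in range(dim)] for _ in range(248)
]

err_small = mean_quantization_error(latents, small_codebook)
err_large = mean_quantization_error(latents, large_codebook)
err_continuous = 0.0  # continuous latents are passed through unquantized

print(f"8 codes: {err_small:.3f}, 256 codes: {err_large:.3f}, continuous: {err_continuous}")
```

A larger codebook shrinks the error but never removes it, whereas a continuous representation has no quantization step at all.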

| Architecture | PSNR (dB) | Semantic Accuracy | FID | Quantization Error |
|---|---|---|---|---|
| Discrete VQ-VAE | 31.75 ± 0.83 | 0.740 ± 0.010 | 18.42 ± 1.50 | 0.050 ± 0.049 |
| Semantic Encoder | 17.97 ± 1.98 | 0.908 ± 0.006 | 55.20 ± 5.17 | 0.000 |
| Dual Tokenizer | 29.98 ± 1.02 | 0.880 ± 0.005 | 11.83 ± 2.08 | 0.019 ± 0.019 |
| Continuous Unified | 32.47 ± 0.88 | 0.922 ± 0.004 | 9.98 ± 1.82 | 0.000 |
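The architecture comparison can be read more easily once the direction of each metric is fixed. The sketch below assumes the usual conventions (higher PSNR and semantic accuracy are better, lower FID is better) and uses only the mean values from the table; the dictionary keys are illustrative names.

```python
# Mean values copied from the architecture table (the ± spreads are ignored here).
results = {
    "Discrete VQ-VAE": {"psnr": 31.75, "semantic_acc": 0.740, "fid": 18.42},
    "Semantic Encoder": {"psnr": 17.97, "semantic_acc": 0.908, "fid": 55.20},
    "Dual Tokenizer": {"psnr": 29.98, "semantic_acc": 0.880, "fid": 11.83},
    "Continuous Unified": {"psnr": 32.47, "semantic_acc": 0.922, "fid": 9.98},
}

# Assumed metric directions: True means "higher is better".
higher_is_better = {"psnr": True, "semantic_acc": True, "fid": False}


def best_architecture(metric):
    """Return the architecture with the best mean value for `metric`."""
    pick = max if higher_is_better[metric] else min
    return pick(results, key=lambda name: results[name][metric])


winners = {metric: best_architecture(metric) for metric in higher_is_better}
for metric, name in winners.items():
    print(f"{metric}: {name}")
```

This comparison uses means only; several of the reported spreads overlap, so it says nothing about statistical significance.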

| Dimension | PSNR (dB) | Semantic Accuracy | FID |
|---|---|---|---|
| 16 | 31.00 ± 0.91 | 0.892 ± 0.004 | 12.48 ± 1.80 |
| 32 | 31.39 ± 0.89 | 0.900 ± 0.004 | 11.98 ± 1.77 |
| 64 | 31.80 ± 0.98 | 0.907 ± 0.004 | 11.23 ± 1.76 |
| 128 | 32.12 ± 0.86 | 0.914 ± 0.004 | 10.69 ± 1.78 |
| 256 | 32.51 ± 0.91 | 0.922 ± 0.004 | 9.94 ± 1.78 |
| 512 | 32.83 ± 0.92 | 0.930 ± 0.004 | 9.41 ± 1.82 |
| 1024 | 33.23 ± 0.86 | 0.937 ± 0.004 | 8.72 ± 1.67 |
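The dimension sweep changes fairly steadily with each doubling. As a rough summary, the sketch below computes the average change per doubling between the smallest and largest settings, using the reported means only; the helper name `per_doubling` is illustrative.

```python
import math

# (dimension, PSNR in dB, semantic accuracy, FID) means from the dimension table.
rows = [
    (16, 31.00, 0.892, 12.48),
    (32, 31.39, 0.900, 11.98),
    (64, 31.80, 0.907, 11.23),
    (128, 32.12, 0.914, 10.69),
    (256, 32.51, 0.922, 9.94),
    (512, 32.83, 0.930, 9.41),
    (1024, 33.23, 0.937, 8.72),
]


def per_doubling(table, column):
    """Average change in `column` per doubling of the dimension, end to end."""
    doublings = math.log2(table[-1][0] / table[0][0])
    return (table[-1][column] - table[0][column]) / doublings


psnr_gain = per_doubling(rows, 1)
acc_gain = per_doubling(rows, 2)
fid_change = per_doubling(rows, 3)
print(f"PSNR {psnr_gain:+.2f} dB, accuracy {acc_gain:+.4f}, FID {fid_change:+.2f} per doubling")
```

Across the six doublings from 16 to 1024 this works out to about +0.37 dB PSNR, +0.0075 accuracy, and -0.63 FID per doubling, each of which is smaller than the corresponding per-row spread.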

| Codebook Size | Quantization Error | FID | PSNR (dB) |
|---|---|---|---|
| 256 | 0.081 ± 0.018 | 11.18 ± 1.45 | 30.78 ± 0.82 |
| 512 | 0.058 ± 0.012 | 10.43 ± 1.52 | 31.09 ± 0.80 |
| 1024 | 0.041 ± 0.010 | 9.66 ± 1.56 | 31.36 ± 0.85 |
| 2048 | 0.029 ± 0.007 | 9.11 ± 1.44 | 31.47 ± 0.81 |
| 4096 | 0.020 ± 0.005 | 8.87 ± 1.48 | 31.66 ± 0.82 |
| 8192 | 0.014 ± 0.003 | 8.48 ± 1.46 | 31.79 ± 0.81 |
| 16384 | 0.010 ± 0.002 | 8.42 ± 1.52 | 31.91 ± 0.76 |
| Continuous Baseline | 0.000 | 8.04 | 32.00 |
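The quantization error in the codebook sweep roughly halves each time the codebook size quadruples, which suggests a power-law trend. The sketch below fits `error ≈ a * K^b` by least squares in log-log space to the seven reported means. The power-law form is an assumption made here for illustration, not a model stated in the source.

```python
import math

# Codebook sizes and mean quantization errors from the codebook table.
codebook_sizes = [256, 512, 1024, 2048, 4096, 8192, 16384]
quant_errors = [0.081, 0.058, 0.041, 0.029, 0.020, 0.014, 0.010]


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = math.exp(my - b * mx)
    return a, b


scale, exponent = fit_power_law(codebook_sizes, quant_errors)
print(f"error ≈ {scale:.3f} * K^{exponent:.3f}")
```

The fitted exponent is close to -0.5, so under this trend the error approaches the continuous baseline of zero only slowly as the codebook grows. The fit ignores the reported ± spreads and should not be extrapolated far beyond the tested range.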

| Token Count | Understanding Accuracy | Generation FID | Relative Throughput |
|---|---|---|---|
| 16 | 0.713 | 38.96 | 0.941 |
| 32 | 0.733 | 37.72 | 0.889 |
| 64 | 0.774 | 33.74 | 0.800 |
| 128 | 0.803 | 28.21 | 0.667 |
| 256 | 0.839 | 21.65 | 0.500 |
| 576 | 0.925 | 11.74 | 0.308 |
| 1024 | 0.921 | 8.64 | 0.200 |
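The relative throughput column is consistent with a simple cost model, `256 / (256 + n)` for `n` tokens. This formula and its constant of 256 are inferred from the reported numbers rather than stated in the source, so treat them as an assumption. The sketch below checks the inferred formula against every row.

```python
# (token count, reported relative throughput) from the token-count table.
reported = [
    (16, 0.941),
    (32, 0.889),
    (64, 0.800),
    (128, 0.667),
    (256, 0.500),
    (576, 0.308),
    (1024, 0.200),
]

FIXED_COST = 256  # inferred constant (assumption), in token-equivalent units


def relative_throughput(token_count, fixed_cost=FIXED_COST):
    """Inferred model: throughput falls as fixed_cost / (fixed_cost + token_count)."""
    return fixed_cost / (fixed_cost + token_count)


# Each reported value should agree with the model to three decimal places.
matches = [
    abs(relative_throughput(n) - value) < 5e-4 for n, value in reported
]
print(all(matches))
```

Under this reading, moving from 576 to 1024 tokens cuts throughput from about 0.31 to 0.20 while understanding accuracy stays nearly flat (0.925 versus 0.921) and FID continues to improve (11.74 to 8.64).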