End-to-End Controllable Song Generation with Multi-Condition Inputs

A cross-modal fusion framework for jointly conditioning song generation on style descriptions, lyrics, and reference audio. Results cover 12,800 generations across 8 genres and 4 fusion methods.

Best OCI (Gated Attention): 0.774 (+123.5% vs baseline)
Best FAD (lower is better): 9.08 (-56% vs unconditional)
Synergy (Superadditivity), Gated Attention: 70.9%
Total Generations: 12,800 (8 genres x 8 configs x 4 models)
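
The headline figures above can be cross-checked with a little arithmetic. The sketch below rests on three assumptions that the page does not state: that the 8 configs are all subsets of the three condition types (from unconditional to triple-condition), that the 12,800 generations are split evenly across the genre x config x model grid, and that the percentages are relative changes against a reference value. The reference values it backs out are derived estimates, not reported results, and the function names are illustrative only.

```python
from itertools import combinations

# Condition types named in the summary; the identifiers are illustrative.
CONDITIONS = ("style", "lyrics", "reference_audio")


def condition_configs(conditions=CONDITIONS):
    """All subsets of the conditions (assumed meaning of '8 configs')."""
    return [
        subset
        for size in range(len(conditions) + 1)
        for subset in combinations(conditions, size)
    ]


def implied_reference(value, relative_change_pct):
    """Back out r from value = r * (1 + pct / 100), assuming a relative change."""
    return value / (1.0 + relative_change_pct / 100.0)


configs = condition_configs()
n_genres, n_models, total = 8, 4, 12_800

# Number of grid cells and generations per cell, assuming an even split.
cells = n_genres * len(configs) * n_models
per_cell = total / cells

print(len(configs), cells, per_cell)
print(round(implied_reference(0.774, 123.5), 3))  # implied baseline OCI
print(round(implied_reference(9.08, -56.0), 2))   # implied unconditional FAD
```

Under these assumptions the grid has 256 cells, which would mean 50 generations per cell, and the implied reference values are roughly 0.346 for the baseline OCI and roughly 20.6 for the unconditional FAD.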

Figures (chart images not reproduced here): OCI by Condition Configuration; Fusion Comparison (Triple-Condition); Condition Ablation (Gated Attention); Synergy Analysis; Genre Performance.
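
The page reports a Synergy (Superadditivity) figure of 70.9% for Gated Attention but does not define how it is computed. One plausible reading, sketched here purely as an assumption, is the share of cases in which the gain from combining all conditions exceeds the sum of the gains from each condition alone. The function names and the toy scores are hypothetical and are not taken from the study.

```python
def synergy_gain(score_joint, scores_single, score_baseline):
    """Joint gain minus the sum of single-condition gains (hypothetical measure)."""
    joint_gain = score_joint - score_baseline
    additive_gain = sum(s - score_baseline for s in scores_single)
    return joint_gain - additive_gain


def superadditive_rate(cases):
    """Fraction of cases whose synergy gain is strictly positive."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        1
        for joint, singles, base in cases
        if synergy_gain(joint, singles, base) > 0
    )
    return hits / len(cases)


# Made-up placeholder scores, NOT results from the study.
# Each case: (joint score, single-condition scores, baseline score).
toy_cases = [
    (0.80, (0.40, 0.45, 0.42), 0.30),  # joint gain 0.50 > additive gain 0.37
    (0.55, (0.45, 0.45, 0.40), 0.30),  # joint gain 0.25 < additive gain 0.40
]

print(superadditive_rate(toy_cases))
```

With the two toy cases, one is superadditive and one is not, so the rate is 0.5. Other definitions (for example, a ratio of joint gain to additive gain) are equally possible readings of the reported percentage.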

Detailed Results

Model | Config | FAD | Melody | Lyrics | Style | OCI