Cross-modal fusion framework for jointly conditioning on style descriptions, lyrics, and reference audio. Includes 12,800 generations spanning 8 genres and 4 fusion methods.
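
As a quick sanity check on the stated counts, the sketch below works out how many generations fall to each genre and fusion-method pair. It assumes a fully crossed, balanced design, which the description does not state; the variable names are illustrative only.

```python
# Counts taken from the description.
total_generations = 12_800
num_genres = 8
num_fusion_methods = 4

# ASSUMPTION (not stated in the source): every genre is paired with every
# fusion method, and generations are split evenly across those pairs.
cells = num_genres * num_fusion_methods
per_cell, remainder = divmod(total_generations, cells)

print(f"genre x method pairs: {cells}")
print(f"generations per pair (if balanced): {per_cell}, remainder: {remainder}")
```

The zero remainder means the total is consistent with an even split, but it does not confirm that the generations were actually distributed that way.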