Fine-Grained Spatiotemporal Control in Human Motion Generation

Hierarchical composition for per-body-part spatial control with temporal phase alignment.

cs.CV | Motion Generation | 5 Methods | 2-12 Constraints
Composite (Ours, C=4): 0.762
Spatial Error Reduction: 5.8x
Temporal Alignment: 0.940
Speed vs. ST-Graph: 8.4x faster

Results at 4 Simultaneous Constraints

Method           Spatial Err. (lower is better)  Temporal Al.  Part Ind.  Natural.  Composite
Global-Text      1.427                           0.028         0.000      0.143     0.123
Part-Masked      0.724                           0.226         0.248      0.196     0.348
Keyframe Interp  0.652                           0.926         0.523      0.257     0.636
ST-Graph         0.302                           0.821         0.653      0.328     0.697
Ours             0.165                           0.940         0.650      0.372     0.762

Hierarchical Composition (Ours) achieves the best composite score at every constraint count, with the lowest spatial error and the highest temporal alignment, and runs 8.4x faster than the Spatiotemporal Graph approach (ST-Graph). At C=4, ST-Graph remains marginally ahead on Part Ind. (0.653 vs. 0.650).
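The C=4 comparison can be checked directly from the table. The sketch below transcribes the five rows, picks the best method per column, and computes each baseline's spatial error relative to Ours. The column keys and helper names are illustrative, and treating the four non-error columns as higher-is-better is an assumption (the source only marks spatial error as "lower"). The headline 5.8x and 8.4x figures are not derivable from this table alone and may be aggregated over other constraint counts or taken from the timing data.

```python
# C=4 results transcribed from the table above.
# Tuple order: spatial error, temporal alignment, part ind., naturalness, composite.
RESULTS_C4 = {
    "Global-Text":     (1.427, 0.028, 0.000, 0.143, 0.123),
    "Part-Masked":     (0.724, 0.226, 0.248, 0.196, 0.348),
    "Keyframe Interp": (0.652, 0.926, 0.523, 0.257, 0.636),
    "ST-Graph":        (0.302, 0.821, 0.653, 0.328, 0.697),
    "Ours":            (0.165, 0.940, 0.650, 0.372, 0.762),
}

# Illustrative column keys (not names used by the source).
COLUMNS = ("spatial_err", "temporal_al", "part_ind", "natural", "composite")
# Assumption: only spatial error is lower-is-better; the rest are higher-is-better.
LOWER_IS_BETTER = {"spatial_err"}


def best_per_column(results):
    """Return {column: name of the best-scoring method} for each metric."""
    best = {}
    for i, col in enumerate(COLUMNS):
        pick = min if col in LOWER_IS_BETTER else max
        best[col] = pick(results, key=lambda method: results[method][i])
    return best


def spatial_error_ratios(results, ours="Ours"):
    """Return each baseline's spatial error divided by that of `ours`."""
    ours_err = results[ours][0]
    return {m: row[0] / ours_err for m, row in results.items() if m != ours}


best = best_per_column(RESULTS_C4)
ratios = spatial_error_ratios(RESULTS_C4)

for col, method in best.items():
    print(f"{col:12s} best: {method}")
for method, ratio in ratios.items():
    print(f"{method:16s} spatial error is {ratio:.2f}x that of Ours")
```

On these numbers, Ours leads every column except Part Ind., and the C=4 spatial error ratios range from about 1.83x (ST-Graph) to about 8.65x (Global-Text).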

[Figure panels: Composite Score vs Number of Constraints; Generation Time (ms); Performance Retention (% of C=2); Spatial Error at C=4; Temporal Alignment at C=4]
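The "Performance Retention (% of C=2)" panel title suggests that each method's score at C constraints is expressed as a percentage of its own score at C=2. A minimal sketch of that reading follows. The function name is hypothetical, the formula is an assumption inferred from the panel title, and the example scores are synthetic placeholders rather than values from the source.

```python
def performance_retention(scores_by_c, baseline_c=2):
    """Express each score as a percentage of the score at `baseline_c`.

    Assumed definition: retention(C) = 100 * score(C) / score(baseline_c).
    `scores_by_c` maps a constraint count C to one method's score at that C.
    """
    base = scores_by_c[baseline_c]
    if base == 0:
        raise ValueError("baseline score must be non-zero")
    return {c: 100.0 * score / base for c, score in sorted(scores_by_c.items())}


# Synthetic placeholder scores for illustration only; NOT results from the source.
example_scores = {2: 1.0, 4: 0.5, 8: 0.25}
retention = performance_retention(example_scores)
print(retention)  # {2: 100.0, 4: 50.0, 8: 25.0}
```

Normalizing each method by its own C=2 score makes the curves comparable across methods whose absolute scores differ widely, which is presumably why the panel reports percentages rather than raw values.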

Method Descriptions

Global-Text: Standard text-conditioned diffusion, no part-level control
Part-Masked Diffusion: Part-specific attention masks during diffusion for spatial control
Temporal Keyframe Interp: Keyframe generation at constraint boundaries with interpolation
Spatiotemporal Graph: Graph modeling of part-temporal interactions (high compute cost)
Hierarchical Composition: Two-layer: part-level spatial conditioning + temporal phase alignment