Effective Training of Flow Policies for Boltzmann Distributions

Comparing five training methodologies (RFM, SFT, KL, VFM, and a diffusion baseline) for continuous normalizing flow policies targeting Boltzmann distributions in maximum entropy RL.
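In maximum entropy RL the policy being learned is the Boltzmann distribution over actions, π(a|s) ∝ exp(Q(s,a)/α), where α is the temperature. As shared context for the comparisons below, here is a minimal sketch of one conditional flow-matching training step with the standard linear interpolant; the exact objectives behind RFM, SFT, KL, and VFM are not spelled out in this section, and `velocity_net`, `optimizer`, and `target_actions` are hypothetical placeholders.

```python
import torch

def flow_matching_step(velocity_net, optimizer, target_actions):
    """One gradient step on E_t ||v_theta(x_t, t) - (x1 - x0)||^2."""
    x1 = target_actions                # samples from the Boltzmann target
    x0 = torch.randn_like(x1)          # base (noise) samples
    t = torch.rand(x1.shape[0], 1)     # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1      # linear interpolant between noise and data
    v_target = x1 - x0                 # target velocity along the interpolant
    loss = ((velocity_net(x_t, t) - v_target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```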

Key results:
- RFM quality score: 0.964
- Improvement over Diffusion: 21.1% (0.9642 vs. 0.7963 quality in the table below)
- RFM convergence iteration: 57
- RFM ESS ratio: 0.930
- Methods compared: 5

Figures: Method Comparison (Quality Scores); Quality Across Q-Function Types; Dimension Scaling (Quality Score); Temperature Sensitivity; Convergence (Training Loss); ESS Ratio Across Dimensions.

Method Comparison (Quadratic Q, d = 8, α = 1.0)

| Method    | Quality | Energy Distance | MMD    | ESS Ratio | Loss   | Conv. Iter |
|-----------|---------|-----------------|--------|-----------|--------|------------|
| RFM       | 0.9642  | 0.0508          | 0.0146 | 0.9298    | 0.0446 | 57         |
| SFT       | 0.7966  | 0.4663          | 0.0746 | 0.5966    | 0.0840 | 101        |
| KL        | 0.7712  | 0.5070          | 0.0672 | 0.5758    | 0.1260 | 126        |
| VFM       | 0.8589  | 0.2919          | 0.0371 | 0.6970    | 0.0694 | 84         |
| Diffusion | 0.7963  | 0.4013          | 0.0588 | 0.6589    | 0.0700 | 87         |
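The ESS column reads as a normalized effective sample size (the headline RFM ESS ratio of 0.930 matches the 0.9298 here). Below is a sketch of the standard self-normalized importance-sampling estimator, assuming weights are taken between the unnormalized Boltzmann target and the flow's model density; both log-density arrays are placeholder inputs.

```python
import numpy as np

def ess_ratio(log_p_target: np.ndarray, log_q_model: np.ndarray) -> float:
    """Normalized ESS of importance weights w_i ∝ p(x_i)/q(x_i):
    ESS/N = (sum w)^2 / (N * sum w^2), which lies in (0, 1]."""
    log_w = log_p_target - log_q_model
    log_w -= log_w.max()               # stabilize before exponentiating
    w = np.exp(log_w)
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))
```

A ratio near 1 (RFM's 0.930) means samples from the flow are nearly as informative as exact draws from the target; a ratio near 0 means a handful of samples dominate the weights.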

Temperature Sensitivity

| Method    | α = 0.1 | α = 0.5 | α = 1.0 | α = 2.0 | α = 5.0 |
|-----------|---------|---------|---------|---------|---------|
| RFM       | 0.919   | 0.986   | 0.978   | 0.941   | 0.938   |
| SFT       | 0.752   | 0.777   | 0.808   | 0.841   | 0.782   |
| KL        | 0.726   | 0.703   | 0.794   | 0.751   | 0.719   |
| VFM       | 0.814   | 0.819   | 0.827   | 0.867   | 0.805   |
| Diffusion | 0.751   | 0.824   | 0.857   | 0.847   | 0.780   |
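One way to interpret the sweep: for a quadratic Q the Boltzmann target exp(Q(a)/α) is exactly Gaussian, so temperature has a closed-form effect on the target. The sketch below assumes Q(a) = -||a||²/2 (the specific quadratic form used in the experiments is not given), under which the target is N(0, αI) and raising α simply flattens it.

```python
import numpy as np

# Assuming Q(a) = -0.5 * ||a||^2, the Boltzmann target exp(Q(a)/alpha)
# is the Gaussian N(0, alpha * I), so exact samples are available for checks.
def sample_boltzmann_quadratic(alpha: float, n: int, d: int = 8, seed: int = 0):
    rng = np.random.default_rng(seed)
    return rng.normal(scale=np.sqrt(alpha), size=(n, d))

for alpha in (0.1, 0.5, 1.0, 2.0, 5.0):    # the sweep from the table above
    a = sample_boltzmann_quadratic(alpha, n=10_000)
    print(f"alpha={alpha}: empirical per-dim variance ~ {a.var(axis=0).mean():.3f}")
```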