A principled framework that replaces discrete per-frame appearance embeddings with a continuous, pose-conditioned mapping, enabling smooth interpolation of photometric parameters to novel viewpoints in radiance field pipelines.
Modern radiance field pipelines optimize per-frame photometric parameters independently for each training image. At inference, how should one assign these parameters to a novel viewpoint for which no ground-truth image exists?
Each image gets its own photometric compensation parameters (exposure, white balance, etc.) optimized via reconstruction loss against the ground-truth image. This works well for known views.
No ground-truth image exists for the novel viewpoint. The parameters were optimized independently per frame, so there is no mechanism to generalize them to unseen camera poses.
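A minimal sketch of this per-frame optimization, assuming a simple per-image affine model (one scale and one bias per color channel); the exact parameterization (exposure, white balance, vignetting, ...) varies by pipeline.

```python
import torch

# Sketch of per-frame photometric compensation under an assumed affine model.
# Each training image owns its own scale/bias; nothing ties them to camera pose.
class PerFrameAppearance(torch.nn.Module):
    def __init__(self, num_images: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(num_images, 3))   # per-image RGB scale
        self.bias = torch.nn.Parameter(torch.zeros(num_images, 3))   # per-image RGB bias

    def forward(self, rendered_rgb: torch.Tensor, image_idx: int) -> torch.Tensor:
        # rendered_rgb: (..., 3) colors predicted by the radiance field.
        return rendered_rgb * self.scale[image_idx] + self.bias[image_idx]

# Training view i: loss = ((appearance(render_i, i) - gt_image_i) ** 2).mean()
# A novel view has no index and no ground-truth image, so these parameters do not transfer.
```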
Use the mean of all training embeddings. Simple but discards all spatial information about which appearance regime applies at the novel view.
Assign the parameters of the closest training view by pose distance. Fails when the novel view sits between training clusters with different exposures.
Networks like Ha-NeRF or CR-NeRF map an input image to its embedding. But this requires an image at the novel view, contradicting the problem setup.
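For concreteness, a minimal sketch of the first two heuristics above (mean embedding and nearest neighbor by pose distance); the pose representation and tensor layout are illustrative.

```python
import torch

def mean_embedding(train_params: torch.Tensor) -> torch.Tensor:
    """Average of all per-frame parameters; ignores the query pose entirely."""
    return train_params.mean(dim=0)                      # train_params: (N, D)

def nearest_neighbor(train_params: torch.Tensor,
                     train_poses: torch.Tensor,          # (N, 3) camera positions
                     query_pose: torch.Tensor) -> torch.Tensor:
    """Copy the parameters of the closest training view by pose distance.

    Jumps discontinuously when the query crosses between training clusters
    with different exposures.
    """
    dists = torch.linalg.norm(train_poses - query_pose, dim=-1)
    return train_params[dists.argmin()]
```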
Our approach learns a continuous mapping from camera pose to appearance parameters, regularized by low-frequency positional encoding, and optionally refined at test time via multi-view consistency.
The MLP is trained with three loss terms: reconstruction loss on training views, a leave-one-out cross-validation term for generalization, and a Lipschitz smoothness penalty on weight matrices.
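A minimal PyTorch sketch of how these pieces could fit together; the layer sizes, loss weights, and the simplified leave-one-out term are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 2) -> torch.Tensor:
    # Low-frequency encoding (L = 2) acts as a smoothness prior: appearance is
    # assumed to vary slowly with viewpoint.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                        # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)

class AppearanceField(nn.Module):
    def __init__(self, num_freqs: int = 2, hidden: int = 64, out_dim: int = 6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 + 3 * 2 * num_freqs
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),                  # e.g. 3 scales + 3 biases
        )

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        return self.net(positional_encoding(pose, self.num_freqs))

def lipschitz_penalty(model: nn.Module) -> torch.Tensor:
    # Penalize the spectral norm of each weight matrix to keep the pose-to-appearance
    # mapping smooth.
    return sum(torch.linalg.matrix_norm(m.weight, ord=2)
               for m in model.modules() if isinstance(m, nn.Linear))

def training_loss(model, poses, target_params, holdout_idx, w_cv=1.0, w_lip=1e-3):
    pred = model(poses)
    keep = torch.ones(len(poses), dtype=torch.bool)
    keep[holdout_idx] = False
    recon = ((pred[keep] - target_params[keep]) ** 2).mean()
    # Simplified leave-one-out term: the held-out view is excluded from the
    # reconstruction term and scored separately, standing in for full cross-validation.
    cv = ((pred[holdout_idx] - target_params[holdout_idx]) ** 2).mean()
    return recon + w_cv * cv + w_lip * lipschitz_penalty(model)
```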
Optionally refines the MLP prediction by optimizing against photometric consistency with nearby training views. An anchor term in the refinement objective ties the result to the MLP's prediction, preventing divergence.
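A minimal sketch of this refinement loop; `render_with_params` and `neighbor_pixels` are placeholders for the pipeline-specific rendering and reprojection machinery, and the step count, learning rate, and anchor weight are illustrative.

```python
import torch

def refine_appearance(mlp_pred: torch.Tensor,
                      render_with_params,             # params -> rendered novel-view pixels
                      neighbor_pixels: torch.Tensor,  # photometric targets from nearby views
                      steps: int = 100, lr: float = 1e-2, w_anchor: float = 0.1):
    params = mlp_pred.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Multi-view photometric consistency with nearby training views.
        consistency = ((render_with_params(params) - neighbor_pixels) ** 2).mean()
        # Anchor keeps the refined parameters near the MLP prediction, preventing
        # divergence when the consistency signal is weak or ill-conditioned.
        anchor = ((params - mlp_pred) ** 2).mean()
        (consistency + w_anchor * anchor).backward()
        opt.step()
    return params.detach()
```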
Comparison of methods for assigning per-frame appearance parameters to 15 held-out novel views from a 50-view synthetic benchmark with realistic photometric variation.
| Method | Scale MAE | Bias MAE | Log-Exp Error | Param PSNR (dB) | Correlation |
|---|---|---|---|---|---|
The central finding: increasing positional encoding frequency monotonically decreases training loss but increases test error beyond L=2. This is the classic bias-variance tradeoff manifested in the frequency domain.
| L (Frequencies) | Scale MAE | PSNR (dB) | Final Train Loss | Gap vs Optimal |
|---|---|---|---|---|
The MLP's advantage grows with noise level. Its smooth parametric form provides implicit denoising that non-parametric methods like k-NN lack.
| Noise sigma | Mean | k-NN | Ours (MLP) | Winner | MLP Advantage |
|---|---|---|---|---|---|
The MLP benefits more from additional training views than k-NN. The crossover occurs at approximately N=30, where the MLP has sufficient coverage to learn the underlying function.
| N (Views) | Mean | k-NN | Ours (MLP) | Winner |
|---|---|---|---|---|
Multi-view photometric consistency can further refine the MLP prediction for individual views, with up to 72% reduction in scale error for well-conditioned views.
| View | Scale Before | Scale After | PSNR Before | PSNR After | Scale Improved? |
|---|---|---|---|---|---|
A comprehensive characterization of when and why continuous appearance fields outperform discrete alternatives.
The pose-conditioned appearance MLP reduces Scale MAE by 10.0% over k-NN interpolation and 38.5% over mean embedding, achieving 21.25 dB parameter PSNR.
Using L=2 frequencies (vs. L=6-8 typical for spatial encoding) enforces the smoothness prior that appearance varies slowly with viewpoint. This single design choice accounts for a 30% error gap.
At high noise levels (sigma=0.20), the MLP reduces error by 20.4% over k-NN because its smooth parametric form averages out per-image noise.
Multi-view consistency can improve individual predictions by up to 72% in scale error, though it requires careful tuning of the smoothness anchor.