Pose-Conditioned Appearance Fields for Assigning Per-Frame Photometric Parameters to Novel Views

A principled framework that replaces discrete per-frame appearance embeddings with a continuous, pose-conditioned mapping, enabling smooth interpolation of photometric parameters to novel viewpoints in radiance field pipelines.

Novel View Synthesis · Neural Radiance Fields · Appearance Modeling · Photometric Compensation · Per-Frame Parameters

The Open Problem

Modern radiance field pipelines optimize per-frame photometric parameters independently for each training image. At inference, how should one assign these parameters to a novel viewpoint for which no ground-truth image exists?

During Training

Each image gets its own photometric compensation parameters (exposure, white balance, etc.) optimized via reconstruction loss against the ground-truth image. This works well for known views.

Per-Frame Optimization
$$\min_{\theta,\,\{\alpha_i\}} \; \sum_i \mathcal{L}_{\text{photo}}\!\left( R(F_\theta, \mathbf{x}_i, \mathbf{d}_i, \alpha_i),\; I_i \right)$$
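To make the per-frame setup concrete, here is a minimal PyTorch sketch of the compensation step, assuming an affine color model (3-channel scale plus 3-channel bias) applied to each rendered view; all names are illustrative rather than taken from a specific codebase.

```python
# Per-frame photometric compensation: one 6D parameter vector per training image,
# optimized jointly with the radiance field via the reconstruction loss above.
import torch

num_views = 50
# Initialize scale to 1 and bias to 0 for every training view.
alphas = torch.nn.Parameter(
    torch.cat([torch.ones(num_views, 3), torch.zeros(num_views, 3)], dim=1)
)

def apply_photometric(rgb, alpha):
    """Apply the affine color transform: scale * rgb + bias (broadcast over pixels)."""
    scale, bias = alpha[:3], alpha[3:]
    return rgb * scale + bias

# Inside the training loop (rendered_i comes from the radiance field):
# loss_i = (apply_photometric(rendered_i, alphas[i]) - gt_image_i).abs().mean()
```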

At Inference (Novel Views)

No ground-truth image exists for the novel viewpoint. The parameters were optimized independently per frame, so there is no mechanism to generalize them to unseen camera poses.

The Open Question
$$\alpha^* = \;?\quad \text{for novel pose } (\mathbf{x}^*, \mathbf{d}^*) \text{ without } I^*$$

Existing Approaches and Limitations

Mean / Zero Embedding

Use the mean of all training embeddings. Simple but discards all spatial information about which appearance regime applies at the novel view.

Nearest-Neighbor Lookup

Assign parameters of the closest training view by pose distance. Fails when the novel view sits between training clusters with different exposures.
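For concreteness, the two lookup baselines can be sketched in a few lines; using Euclidean distance on camera positions as the pose distance is an assumption here, and k=1 recovers plain nearest-neighbor lookup.

```python
# Discrete baselines for assigning appearance parameters to a novel pose.
import numpy as np

def mean_embedding(train_alphas):
    """Ignore pose entirely: return the average of all training parameters."""
    return train_alphas.mean(axis=0)

def knn_embedding(query_pos, train_pos, train_alphas, k=3):
    """Average the parameters of the k pose-nearest training views (k=1 is NN lookup)."""
    d = np.linalg.norm(train_pos - query_pos, axis=1)
    idx = np.argsort(d)[:k]
    return train_alphas[idx].mean(axis=0)
```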

Appearance Encoders

Networks like Ha-NeRF or CR-NeRF map an input image to its embedding. But this requires an image at the novel view, contradicting the problem setup.

Key insight: Photometric properties (exposure, white balance) vary smoothly with camera pose. A continuous mapping from pose to appearance parameters can exploit this spatial structure to generalize to novel viewpoints.

Two-Stage Framework

Our approach learns a continuous mapping from camera pose to appearance parameters, regularized by low-frequency positional encoding, and optionally refined at test time via multi-view consistency.

1. Camera Pose: 6-DoF input, position (x, y, z) plus viewing direction.
2. Low-Frequency Encoding: positional encoding with L=2 octaves (a critical choice).
3. Appearance MLP: 3 layers, 128 units, SiLU activations (see the sketch below).
4. Appearance Parameters: 6D affine color transform, scale (3) + bias (3).
5. TTA (Optional): multi-view consistency refinement at inference.
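A minimal PyTorch sketch of steps 2-4 of this pipeline, following the architecture described above (L=2 octaves, 3 layers, 128 hidden units, SiLU, 6D output); whether the raw pose is concatenated with its sin/cos features is an assumption.

```python
# Pose-conditioned appearance MLP: low-frequency positional encoding of the
# 6-DoF pose, followed by a small SiLU MLP predicting the 6D color transform.
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs=2):
    """gamma_L(p): raw pose plus sin/cos features at octaves 2^0 ... 2^(L-1)."""
    feats = [p]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * torch.pi * p),
                  torch.cos((2.0 ** k) * torch.pi * p)]
    return torch.cat(feats, dim=-1)

class AppearanceMLP(nn.Module):
    def __init__(self, pose_dim=6, num_freqs=2, hidden=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = pose_dim * (1 + 2 * num_freqs)   # 30 features for L=2
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 6),                  # 3 scale + 3 bias
        )

    def forward(self, pose):
        return self.net(positional_encoding(pose, self.num_freqs))
```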

Stage 1: Pose-Conditioned Appearance MLP

Continuous Mapping
$$\alpha^* = g_\phi\!\left( \gamma_L(\mathbf{p}^*) \right)$$

The MLP is trained with three loss terms: reconstruction loss on training views, a leave-one-out cross-validation term for generalization, and a Lipschitz smoothness penalty on weight matrices.

Training Objective
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + 0.5\,\mathcal{L}_{\text{loo}} + 10^{-3}\,\mathcal{L}_{\text{lip}}$$
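A sketch of this objective, assuming the per-frame parameters from the independent optimization serve as regression targets and taking the Lipschitz penalty as the sum of spectral norms of the weight matrices; the leave-one-out term is passed in precomputed because its exact protocol is not spelled out here.

```python
# Stage-1 training objective: L = L_recon + 0.5 * L_loo + 1e-3 * L_lip.
import torch
import torch.nn as nn

def lipschitz_penalty(mlp: nn.Module) -> torch.Tensor:
    """One choice of Lipschitz penalty: sum of spectral norms of the weight matrices."""
    return sum(torch.linalg.matrix_norm(m.weight, ord=2)
               for m in mlp.modules() if isinstance(m, nn.Linear))

def stage1_loss(mlp, poses, target_alphas, loo_term=0.0):
    """Combine the three terms; loo_term is the precomputed leave-one-out error."""
    recon = (mlp(poses) - target_alphas).abs().mean()
    return recon + 0.5 * loo_term + 1e-3 * lipschitz_penalty(mlp)
```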

Stage 2: Test-Time Adaptation

Multi-View Consistency
$$\min_{\alpha^*} \; \sum_k \left\| R(F, \mathbf{p}^*, \alpha^*) - I_k \right\|_1 \;+\; \lambda \left\| \alpha^* - g_\phi(\gamma(\mathbf{p}^*)) \right\|_2^2$$

Optionally refines the MLP prediction by optimizing against photometric consistency with nearby training views. The second term anchors refinement to the MLP's prediction, preventing divergence.
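A sketch of the refinement loop; `render_fn` stands in for a differentiable radiance-field renderer, `nearby_images` for the photometric reference views, and the step count, learning rate, and lambda value are placeholder choices.

```python
# Test-time adaptation: refine alpha* against multi-view photometric consistency,
# anchored to the Stage-1 MLP prediction.
import torch

def test_time_adapt(render_fn, field, mlp, pose, nearby_images,
                    lam=0.1, steps=100, lr=1e-2):
    with torch.no_grad():
        alpha_init = mlp(pose)                          # g_phi(gamma(p*))
    alpha = alpha_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        photo = sum((render_fn(field, pose, alpha) - img).abs().mean()
                    for img in nearby_images)           # L1 multi-view consistency
        anchor = lam * (alpha - alpha_init).pow(2).sum()  # stay near the MLP output
        (photo + anchor).backward()
        opt.step()
    return alpha.detach()
```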

Critical design choice: The positional encoding uses only L=2 frequency bands (vs. L=6-10 typical for spatial encoding in NeRF). This enforces the physical prior that photometric properties vary slowly across the camera pose space, preventing memorization of per-image noise.
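For reference, under the standard NeRF-style encoding (including the raw pose in the feature vector is an assumption), the feature count grows linearly with L:

$$\gamma_L(\mathbf{p}) = \big(\mathbf{p},\ \sin(2^{0}\pi\mathbf{p}),\ \cos(2^{0}\pi\mathbf{p}),\ \ldots,\ \sin(2^{L-1}\pi\mathbf{p}),\ \cos(2^{L-1}\pi\mathbf{p})\big)$$

For a 6-DoF pose this gives 6(1 + 2L) features: 30 at L=2 versus 102 at L=8, which is why the low-frequency choice strongly limits how finely the predicted parameters can vary across pose space.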

Main Results

Comparison of methods for assigning per-frame appearance parameters to 15 held-out novel views, on a synthetic benchmark with 50 training views and realistic photometric variation.

Scale MAE: 0.0665 (-10.0% vs. k-NN)
Bias MAE: 0.0155 (best overall)
Param PSNR: 21.25 dB (+0.89 dB vs. k-NN)
Correlation: 0.993 (highest)

Full Comparison Table

Method | Scale MAE | Bias MAE | Log-Exp Error | Param PSNR (dB) | Correlation

[Charts: Scale MAE and Log-Exposure Error; Parameter PSNR (dB)]

Progressive improvement: Mean embedding ignores pose entirely. Nearest neighbor uses a single reference. k-NN averages multiple references. The MLP learns the underlying functional relationship, achieving a 38.5% reduction in Scale MAE over the mean baseline.

Positional Encoding Frequency Ablation

The central finding: raising the number of positional encoding frequencies monotonically decreases training loss but increases test error beyond L=2. This is the classic bias-variance tradeoff manifested in the frequency domain.

[Charts: Scale MAE vs. Frequency Bands (L); Train Loss vs. Test PSNR]

Detailed Frequency Ablation

L (Frequencies) | Scale MAE | PSNR (dB) | Final Train Loss | Gap vs. Optimal

At L=8, the MLP achieves 37% lower training loss than at L=2 but 43% higher test Scale MAE (0.0952 vs. 0.0665). The optimal L=2 provides just enough capacity to capture the smooth spatial structure of exposure and white-balance variation without fitting per-image noise.

Noise-Level Ablation

The MLP's advantage grows with noise level. Its smooth parametric form provides implicit denoising that non-parametric methods like k-NN lack.

Scale MAE Across Noise Levels

Sigma | Mean | k-NN | Ours (MLP) | Winner | MLP Advantage

At high noise (sigma = 0.20), the MLP reduces error by 20.4% over k-NN. The smooth parametric form averages out per-image noise through the learned function, rather than propagating it directly as k-NN does.

Training Set Size Ablation

The MLP benefits more from additional training views than k-NN. The crossover occurs at approximately N=30, where the MLP has sufficient coverage to learn the underlying function.

Scale MAE by Training Set Size

N (Views) | Mean | k-NN | Ours (MLP) | Winner

From N=10 to N=50, MLP error decreases by 43% (0.1171 to 0.0665) while k-NN decreases by 38% (0.1195 to 0.0739). The MLP requires sufficient pose-space coverage to learn the underlying function, but then extrapolates more effectively than local interpolation.

Test-Time Adaptation

Multi-view photometric consistency can further refine the MLP prediction for individual views, with up to 72% reduction in scale error for well-conditioned views.

[Charts: Scale MAE before vs. after TTA; Param PSNR before vs. after TTA]

Per-View TTA Results

View | Scale Before | Scale After | PSNR Before | PSNR After | Scale Improved?

Mixed results: TTA improves scale MAE for 4 of 5 views (up to 72% for view 0), but can increase bias error. TTA is most beneficial when the MLP's initial prediction is already close, and when nearby training views have similar appearance.

Key Conclusions

A comprehensive characterization of when and why continuous appearance fields outperform discrete alternatives.

1. Continuous mappings outperform discrete lookup

The pose-conditioned appearance MLP reduces Scale MAE by 10.0% over k-NN interpolation and 38.5% over mean embedding, achieving 21.25 dB parameter PSNR.

2. Low-frequency positional encoding is critical

Using L=2 frequencies (vs. L=6-8 typical for spatial encoding) enforces the smoothness prior that appearance varies slowly with viewpoint. This single design choice accounts for a 30% error gap.

3. Learned mappings provide implicit denoising

At high noise levels (sigma=0.20), the MLP reduces error by 20.4% over k-NN because its smooth parametric form averages out per-image noise.

4. Test-time adaptation is a complementary refinement

Multi-view consistency can improve individual predictions by up to 72% in scale error, though it requires careful tuning of the smoothness anchor.

Experimental Setup

Training views: 50
Test views: 15
Image resolution: 64 × 64
Parameter dimension: 6D (3 scale + 3 bias)
Training epochs: 2000
MLP hidden units: 128