A principled framework that replaces discrete per-frame appearance embeddings with a continuous, pose-conditioned mapping, enabling smooth interpolation of photometric parameters to novel viewpoints in radiance field pipelines.
Modern radiance field pipelines optimize per-frame photometric parameters independently for each training image. At inference, how should one assign these parameters to a novel viewpoint for which no ground-truth image exists?
Each image gets its own photometric compensation parameters (exposure, white balance, etc.) optimized via reconstruction loss against the ground-truth image. This works well for known views.
No ground-truth image exists for the novel viewpoint. The parameters were optimized independently per frame, so there is no mechanism to generalize them to unseen camera poses.
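A minimal sketch of this per-frame optimization, assuming a simple per-image affine model (one scale and one bias per color channel); the exact parameterization (exposure, white balance, vignetting, ...) varies by pipeline.

```python
import torch

# Sketch of per-frame photometric compensation under an assumed affine model.
# Each training image owns its own scale/bias; nothing ties them to camera pose.
class PerFrameAppearance(torch.nn.Module):
    def __init__(self, num_images: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(num_images, 3))   # per-image RGB scale
        self.bias = torch.nn.Parameter(torch.zeros(num_images, 3))   # per-image RGB bias

    def forward(self, rendered_rgb: torch.Tensor, image_idx: int) -> torch.Tensor:
        # rendered_rgb: (..., 3) colors predicted by the radiance field.
        return rendered_rgb * self.scale[image_idx] + self.bias[image_idx]

# Training view i: loss = ((appearance(render_i, i) - gt_image_i) ** 2).mean()
# A novel view has no index and no ground-truth image, so these parameters do not transfer.
```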
Use the mean of all training embeddings. Simple but discards all spatial information about which appearance regime applies at the novel view.
Assign the parameters of the closest training view by pose distance. Fails when the novel view sits between training clusters with different exposures.
Networks like Ha-NeRF or CR-NeRF map an input image to its embedding. But this requires an image at the novel view, contradicting the problem setup.
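For concreteness, a minimal sketch of the first two heuristics above (mean embedding and nearest neighbor by pose distance); the pose representation and tensor layout are illustrative.

```python
import torch

def mean_embedding(train_params: torch.Tensor) -> torch.Tensor:
    """Average of all per-frame parameters; ignores the query pose entirely."""
    return train_params.mean(dim=0)                      # train_params: (N, D)

def nearest_neighbor(train_params: torch.Tensor,
                     train_poses: torch.Tensor,          # (N, 3) camera positions
                     query_pose: torch.Tensor) -> torch.Tensor:
    """Copy the parameters of the closest training view by pose distance.

    Jumps discontinuously when the query crosses between training clusters
    with different exposures.
    """
    dists = torch.linalg.norm(train_poses - query_pose, dim=-1)
    return train_params[dists.argmin()]
```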
Our approach learns a continuous mapping from camera pose to appearance parameters, regularized by low-frequency positional encoding, and optionally refined at test time via multi-view consistency.
The MLP is trained with three loss terms: reconstruction loss on training views, a leave-one-out cross-validation term for generalization, and a Lipschitz smoothness penalty on weight matrices.
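A minimal PyTorch sketch of how these pieces could fit together; the layer sizes, loss weights, and the simplified leave-one-out term are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 2) -> torch.Tensor:
    # Low-frequency encoding (L = 2) acts as a smoothness prior: appearance is
    # assumed to vary slowly with viewpoint.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                        # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)

class AppearanceField(nn.Module):
    def __init__(self, num_freqs: int = 2, hidden: int = 64, out_dim: int = 6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 + 3 * 2 * num_freqs
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),                  # e.g. 3 scales + 3 biases
        )

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        return self.net(positional_encoding(pose, self.num_freqs))

def lipschitz_penalty(model: nn.Module) -> torch.Tensor:
    # Penalize the spectral norm of each weight matrix to keep the pose-to-appearance
    # mapping smooth.
    return sum(torch.linalg.matrix_norm(m.weight, ord=2)
               for m in model.modules() if isinstance(m, nn.Linear))

def training_loss(model, poses, target_params, holdout_idx, w_cv=1.0, w_lip=1e-3):
    pred = model(poses)
    keep = torch.ones(len(poses), dtype=torch.bool)
    keep[holdout_idx] = False
    recon = ((pred[keep] - target_params[keep]) ** 2).mean()
    # Simplified leave-one-out term: the held-out view is excluded from the
    # reconstruction term and scored separately, standing in for full cross-validation.
    cv = ((pred[holdout_idx] - target_params[holdout_idx]) ** 2).mean()
    return recon + w_cv * cv + w_lip * lipschitz_penalty(model)
```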
Optionally refines the MLP prediction by optimizing against photometric consistency with nearby training views. An anchor term in the refinement objective ties the result to the MLP's prediction, preventing divergence.
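A minimal sketch of this refinement loop; `render_with_params` and `neighbor_pixels` are placeholders for the pipeline-specific rendering and reprojection machinery, and the step count, learning rate, and anchor weight are illustrative.

```python
import torch

def refine_appearance(mlp_pred: torch.Tensor,
                      render_with_params,             # params -> rendered novel-view pixels
                      neighbor_pixels: torch.Tensor,  # photometric targets from nearby views
                      steps: int = 100, lr: float = 1e-2, w_anchor: float = 0.1):
    params = mlp_pred.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Multi-view photometric consistency with nearby training views.
        consistency = ((render_with_params(params) - neighbor_pixels) ** 2).mean()
        # Anchor keeps the refined parameters near the MLP prediction, preventing
        # divergence when the consistency signal is weak or ill-conditioned.
        anchor = ((params - mlp_pred) ** 2).mean()
        (consistency + w_anchor * anchor).backward()
        opt.step()
    return params.detach()
```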
Comparison of methods for assigning per-frame appearance parameters to 15 held-out novel views from a 50-view synthetic benchmark with realistic photometric variation.
| Method | Scale MAE | Bias MAE | Log-Exp Error | Param PSNR (dB) | Correlation |
|---|---|---|---|---|---|
The central finding: increasing positional encoding frequency monotonically decreases training loss but increases test error beyond L=2. This is the classic bias-variance tradeoff manifested in the frequency domain.
| L (Frequencies) | Scale MAE | PSNR (dB) | Final Train Loss | Gap vs Optimal |
|---|---|---|---|---|
The MLP's advantage grows with noise level. Its smooth parametric form provides implicit denoising that non-parametric methods like k-NN lack.
| Noise sigma | Mean | k-NN | Ours (MLP) | Winner | MLP Advantage |
|---|---|---|---|---|---|
The MLP benefits more from additional training views than k-NN. The crossover occurs at approximately N=30, where the MLP has sufficient coverage to learn the underlying function.
| N (Views) | Mean | k-NN | Ours (MLP) | Winner |
|---|---|---|---|---|
Multi-view photometric consistency can further refine the MLP prediction for individual views, with up to 72% reduction in scale error for well-conditioned views.
| View | Scale Before | Scale After | PSNR Before | PSNR After | Scale Improved? |
|---|---|---|---|---|---|
A comprehensive characterization of when and why continuous appearance fields outperform discrete alternatives.
The pose-conditioned appearance MLP reduces Scale MAE by 10.0% over k-NN interpolation and 38.5% over mean embedding, achieving 21.25 dB parameter PSNR.
Using L=2 frequencies (vs. L=6-8 typical for spatial encoding) enforces the smoothness prior that appearance varies slowly with viewpoint. This single design choice accounts for a 30% error gap.
At high noise levels (sigma=0.20), the MLP reduces error by 20.4% over k-NN because its smooth parametric form averages out per-image noise.
Multi-view consistency can improve individual predictions by up to 72% in scale error, though it requires careful tuning of the smoothness anchor.