Bridging classical geometric consistency with learned robustness through differentiable pose graph optimization and chunked attention
Learning physically consistent geometry at scale remains a challenging open problem (Xu et al., GPA-VGGT 2026). Without structured constraints, learned predictions suffer from scale drift, inconsistent geometry across viewpoints, and violation of physical laws. This work resolves the tension by making classical constraints differentiable and embedding them as hierarchical losses within a learning framework.
Symmetric Sampson distance between frame pairs ensures predicted depth and pose are mutually consistent. Largest single contribution: removing it causes +29.0% translation error increase.
Penalizes deviation from identity for pose cycles and enforces depth scale agreement across overlapping windows via log-ratio loss.
Self-supervised enforcement of consistent gravity direction and coplanar ground points. Reduces gravity misalignment by ~60% without ground-truth gravity.
| Noise | No Optimization | Seq. Only | With Loops | |||
|---|---|---|---|---|---|---|
| Trans(m) | Rot(deg) | Trans | Rot | Trans | Rot | |
| 0.01 | 0.40 | 1.41 | 0.40 | 1.41 | 0.34 | 1.18 |
| 0.05 | 1.09 | 6.80 | 1.09 | 6.80 | 0.94 | 6.27 |
| 0.10 | 1.52 | 17.68 | 1.52 | 17.68 | 1.32 | 14.02 |
| 0.20 | 4.19 | 47.38 | 4.19 | 47.38 | 3.34 | 40.64 |
| 0.30 | 4.62 | 37.67 | 4.62 | 37.67 | 4.21 | 30.63 |
| Config | Trans(m) | Rot(deg) | Scale CV | Grav |
|---|---|---|---|---|
| Full (Ours) | 0.359 | 3.46 | 7.8e-5 | 0.972 |
| No Epipolar | 0.463 | 3.93 | 10.0e-5 | 0.971 |
| No Composition | 0.411 | 3.70 | 8.9e-5 | 0.971 |
| No Gravity | 0.379 | 3.56 | 8.3e-5 | 0.971 |
| No Scale | 0.411 | 3.70 | 8.9e-5 | 0.971 |
| No Ground | 0.369 | 3.51 | 8.1e-5 | 0.972 |
| Baseline | 0.598 | 4.46 | 12.6e-5 | 0.970 |