A factorial framework decomposing perspective tolerance in vision architectures into innate (architectural) and learned (training) components across 288 experimental conditions.
Modern vision architectures vary dramatically in their ability to recognize objects under perspective distortions, yet the source of this tolerance remains poorly understood.
Two broad sources of perspective tolerance exist: innate tolerance arising from architectural design choices (e.g., convolutional weight sharing, pooling hierarchies) and learned tolerance acquired through training on perspective-diverse data. Prior studies typically conflate these two sources, making it difficult to attribute observed robustness to either one. This work disentangles the two contributions through a controlled factorial experimental design.
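To make the decomposition concrete, the sketch below shows one plausible way to derive the innate and learned components from measured accuracies, consistent with the quantities reported in the result tables (τ_innate, τ_learned, φ). Normalizing by the undistorted baseline a0 and treating the restricted-training accuracy as the innate floor are our assumptions about the protocol, not a quotation of it; the function name and example numbers are illustrative.

```python
def decompose_tolerance(acc_diverse, acc_restricted, acc_clean):
    """Split perspective tolerance into innate and learned parts.

    Assumed reading of the protocol: tolerance is accuracy under a given
    distortion normalized by the clean baseline a0. The restricted-training
    model supplies the innate (architectural) floor; whatever diverse
    training adds on top of it counts as learned.
    """
    tau_innate = acc_restricted / acc_clean                   # architecture alone
    tau_learned = (acc_diverse - acc_restricted) / acc_clean  # gain from diverse data
    phi = tau_learned / (tau_innate + tau_learned)            # learned fraction
    return tau_innate, tau_learned, phi

# Illustrative numbers only (not taken from the result tables):
tau_i, tau_l, phi = decompose_tolerance(acc_diverse=0.58,
                                        acc_restricted=0.49,
                                        acc_clean=0.76)
print(f"innate={tau_i:.3f}  learned={tau_l:.3f}  learned fraction={phi:.1%}")
```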
Factorial design with two training regimes (diverse vs. restricted) across calibrated perspective distortions.
- **Tilt:** rotation around the horizontal axis, up to 60 degrees, simulating looking up or down at an object.
- **Pan:** rotation around the vertical axis, up to 60 degrees, simulating a lateral viewpoint change.
- **Off-axis:** translation of the principal point, simulating objects at the periphery of the field of view.
- **Combined:** composition of all three primitives, representing worst-case compound distortion (see the sketch below).
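As a concrete illustration of these primitives, the sketch below constructs the corresponding image-plane homographies under a simple pinhole camera model. The intrinsics, angles, and principal-point shift are placeholder values chosen for a 224×224 input, not the calibration used in the experiments, and modeling the off-axis case as a bare principal-point shift is a deliberate simplification.

```python
import numpy as np

def rotation_homography(K, axis, angle_deg):
    """Homography induced by rotating the camera about its optical center.
    axis='x' tilts the view (look up/down); axis='y' pans it laterally."""
    t = np.deg2rad(angle_deg)
    c, s = np.cos(t), np.sin(t)
    if axis == "x":   # tilt: rotation about the horizontal image axis
        R = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    else:             # pan: rotation about the vertical image axis
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return K @ R @ np.linalg.inv(K)

def off_axis_homography(K, dx, dy):
    """Shift the principal point by (dx, dy) pixels. On its own this is an
    image-plane translation; it matters mainly in composition with rotations."""
    K_shift = K.copy()
    K_shift[0, 2] += dx
    K_shift[1, 2] += dy
    return K_shift @ np.linalg.inv(K)

# Placeholder intrinsics for 224x224 inputs (assumed, not the paper's calibration).
K = np.array([[500.0,   0.0, 112.0],
              [  0.0, 500.0, 112.0],
              [  0.0,   0.0,   1.0]])

# "Combined" distortion: compose all three primitives.
H = (off_axis_homography(K, 20, 0)
     @ rotation_homography(K, "y", 30)
     @ rotation_homography(K, "x", 30))
# An image can then be warped with, e.g., cv2.warpPerspective(img, H, (224, 224)).
```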
| Architecture | Family | Spatial Inv. Score (σ) | Depth | Params (M) |
|---|---|---|---|---|
| ResNet-50 | Convolutional | 0.72 | 50 | 25.6 |
| ConvNeXt-T | Convolutional | 0.68 | 28 | 28.6 |
| ViT-B/16 | Attention | 0.45 | 12 | 86.6 |
| DeiT-S | Attention | 0.48 | 12 | 22.1 |
| Swin-T | Attention | 0.61 | 24 | 28.3 |
| MLP-Mixer-B | MLP | 0.35 | 12 | 59.9 |
How accuracy drops with increasing distortion severity. Solid = diverse training, dashed = restricted training.
Stacked innate and learned components per architecture, averaged across all severity levels and distortion types.
A near-perfect negative correlation (r = -0.9937) reveals a fundamental capacity-data tradeoff.
Off-axis distortions are best tolerated innately. Combined distortions yield the highest learned fraction.
Learned fraction at high severity (s >= 0.6) across all architecture-distortion combinations. Darker shading indicates greater dependence on diverse training.
Complete numerical results from all experimental conditions.
| Architecture | Family | σ | τ_innate | τ_learned | φ (Learned Frac.) |
|---|---|---|---|---|---|
| ResNet-50 | Conv | 0.72 | 0.6432 ± 0.1307 | 0.1197 ± 0.0357 | 16.46% |
| ConvNeXt-T | Conv | 0.68 | 0.6322 ± 0.1300 | 0.1208 ± 0.0313 | 16.80% |
| ViT-B/16 | Attention | 0.45 | 0.5658 ± 0.1431 | 0.1377 ± 0.0432 | 20.78% |
| DeiT-S | Attention | 0.48 | 0.5759 ± 0.1394 | 0.1335 ± 0.0337 | 19.90% |
| Swin-T | Attention | 0.61 | 0.6171 ± 0.1422 | 0.1310 ± 0.0421 | 18.50% |
| MLP-Mixer-B | MLP | 0.35 | 0.5405 ± 0.1383 | 0.1475 ± 0.0417 | 22.59% |
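The r = -0.9937 quoted for the capacity-data tradeoff figure is consistent with correlating each architecture's spatial invariance score σ against its learned fraction φ in the table above. Reading the figure that way is our assumption, but it can be checked directly from the tabulated values:

```python
import numpy as np

# sigma and phi (in %) per architecture, transcribed from the table above
# (ResNet-50, ConvNeXt-T, ViT-B/16, DeiT-S, Swin-T, MLP-Mixer-B).
sigma = np.array([0.72, 0.68, 0.45, 0.48, 0.61, 0.35])
phi = np.array([16.46, 16.80, 20.78, 19.90, 18.50, 22.59])

r = np.corrcoef(sigma, phi)[0, 1]
print(f"Pearson r(sigma, phi) = {r:.4f}")  # approximately -0.99
```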
| Distortion | τ_innate | τ_learned | φ (Learned Frac.) |
|---|---|---|---|
| Tilt | 0.6148 ± 0.1344 | 0.1329 ± 0.0366 | 18.67% |
| Pan | 0.5823 ± 0.1367 | 0.1324 ± 0.0389 | 19.57% |
| Off-axis | 0.6509 ± 0.1299 | 0.1299 ± 0.0428 | 17.40% |
| Combined | 0.5352 ± 0.1422 | 0.1317 ± 0.0390 | 21.04% |
| Architecture | Base Accuracy (a0) |
|---|---|
| ResNet-50 | 0.7640 |
| ConvNeXt-T | 0.8173 |
| ViT-B/16 | 0.8101 |
| DeiT-S | 0.7980 |
| Swin-T | 0.8211 |
| MLP-Mixer-B | 0.7480 |
| Architecture | Severity | Acc. (Diverse) | Acc. (Restricted) | τ_innate | τ_learned | φ |
|---|---|---|---|---|---|---|