Multi-Scale Trajectory Forensics for Robotic Demonstration Authenticity

Problem Statement

When a robotic manipulation trajectory appears successful, existing benchmarks provide no mechanism to verify whether the behavior was generated by an autonomous policy or via hidden human teleoperation. This source authenticity ambiguity undermines trustworthy evaluation and enables result fabrication.

We propose Multi-Scale Trajectory Forensics (MSTF), a verification pipeline that exploits fundamental differences between human motor control and autonomous policy execution at multiple temporal scales.

Pipeline Architecture

Trajectory Input (positions, timestamps) → Spectral Forensics | Submovement Decomposition | Watermark Verification → Score Fusion & Classification → Verdict (Autonomous / Teleoperated / Inconclusive)

Key Results

Classification Accuracy: 86%
Composite AUC: 1.000
Precision (Autonomous): 100%
Watermark False Positives: 0%

Core Insight

Human neuromuscular control leaves multi-scale statistical fingerprints that are jointly difficult to forge:

Spectral: Bandwidth limited to ~8 Hz with physiological tremor at 8-12 Hz
Temporal: Velocity profiles decompose into minimum-jerk submovements
Smoothness: Low dimensionless jerk from motor planning optimization

Autonomous policies (diffusion, transformer) lack these biomechanical signatures and exhibit distinct spectral/temporal patterns.
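The smoothness cue can be illustrated with a small sketch. It assumes the dimensionless jerk of Hogan & Sternad (2009), reported here as the natural log of the positive quantity so that smoother movements give smaller values. The source does not state its exact definition, and the function names and sampling choices below are hypothetical.

```python
import math

def log_dimensionless_jerk(velocity, dt):
    """Natural log of dimensionless jerk for a uniformly sampled 1-D velocity profile.

    Assumed definition: ln( D^3 / v_peak^2 * integral of (d2v/dt2)^2 dt ).
    """
    n = len(velocity)
    duration = (n - 1) * dt
    v_peak = max(abs(v) for v in velocity)
    integral = 0.0
    # Central second difference of velocity approximates jerk at interior samples.
    for i in range(1, n - 1):
        jerk = (velocity[i + 1] - 2.0 * velocity[i] + velocity[i - 1]) / dt ** 2
        integral += jerk * jerk * dt
    return math.log(duration ** 3 / v_peak ** 2 * integral)

def min_jerk_profile(n, amplitude=1.0):
    """Minimum-jerk velocity profile sampled at n points over unit duration."""
    return [amplitude * 30.0 * (i / (n - 1)) ** 2 * (1.0 - i / (n - 1)) ** 2
            for i in range(n)]

dt = 0.001
smooth = min_jerk_profile(1001)
# Illustrative "policy-like" profile: the same movement plus a 20 Hz ripple.
rippled = [v + 0.05 * math.sin(2.0 * math.pi * 20.0 * i * dt)
           for i, v in enumerate(smooth)]

smooth_ldlj = log_dimensionless_jerk(smooth, dt)
rippled_ldlj = log_dimensionless_jerk(rippled, dt)
```

Under this definition an ideal minimum-jerk profile has a dimensionless jerk of 204.8 analytically, so its log value is about 5.3. The rippled profile scores noticeably higher.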

Module 1: Spectral Forensics

Exploits the bandwidth limits of human neuromuscular control. We compute the power spectral density of velocity signals and extract diagnostic band ratios:

Band	Range	Interpretation
Submovement	0.5 - 4.0 Hz	Correction frequency for biological movements
Voluntary	4.0 - 8.0 Hz	Upper limit of voluntary motor control
Tremor	8.0 - 12.0 Hz	Physiological tremor (diagnostic for human origin)
High-frequency	12.0 - 50.0 Hz	Above human voluntary bandwidth (policy artifacts)

Key feature: The presence of 8-12 Hz tremor is a strong indicator of human teleoperation; its absence suggests autonomous execution.
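A minimal sketch of the band-ratio features follows. It uses a naive discrete Fourier transform so that it needs only the standard library. A Welch-style estimator would be the more usual choice in practice, and the source does not say which estimator it uses. The function name and the normalisation by total in-band power are assumptions.

```python
import cmath
import math

# Band edges in Hz, taken from the table above.
BANDS_HZ = {
    "submovement": (0.5, 4.0),
    "voluntary": (4.0, 8.0),
    "tremor": (8.0, 12.0),
    "high_frequency": (12.0, 50.0),
}

def band_power_ratios(velocity, fs):
    """Fraction of 0.5-50 Hz periodogram power that falls in each band.

    velocity: uniformly sampled 1-D velocity signal; fs: sampling rate in Hz.
    """
    n = len(velocity)
    mean = sum(velocity) / n
    x = [v - mean for v in velocity]  # remove the DC offset
    band_power = {name: 0.0 for name in BANDS_HZ}
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        for name, (lo, hi) in BANDS_HZ.items():
            if lo <= freq < hi:
                band_power[name] += power
    total = sum(band_power.values())
    if total == 0.0:
        return {name: 0.0 for name in BANDS_HZ}
    return {name: p / total for name, p in band_power.items()}
```

For example, a 2 Hz movement component with a 10 Hz component at one tenth of its amplitude yields a tremor ratio near 0.01, since power scales with amplitude squared.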

Module 2: Submovement Decomposition

Human reaching movements decompose into overlapping minimum-jerk submovements (Flash & Hogan, 1985):

            v(t) = A · 30 · τ² · (1 - τ)²,   τ = (t - t0) / D
        

We fit up to 8 submovements using a greedy iterative algorithm and evaluate:

R²: Reconstruction quality (dominant feature, weight 50%)
Physiological fraction: Submovements with durations in [0.15, 1.0] s
Interval regularity: Inter-onset intervals exceeding 80 ms
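The three features above can be sketched as follows for an already fitted set of submovements. The greedy fitting step itself is not shown. The `Submovement` container is hypothetical, as is the reading of the last two features as fractions of submovements and of inter-onset gaps.

```python
from typing import NamedTuple

class Submovement(NamedTuple):
    amplitude: float  # A in the velocity formula
    onset: float      # t0, seconds
    duration: float   # D, seconds

def min_jerk_velocity(t, sub):
    """v(t) = A * 30 * tau^2 * (1 - tau)^2, and zero outside the submovement."""
    tau = (t - sub.onset) / sub.duration
    if tau <= 0.0 or tau >= 1.0:
        return 0.0
    return sub.amplitude * 30.0 * tau ** 2 * (1.0 - tau) ** 2

def reconstruct(times, subs):
    """Velocity predicted by summing overlapping submovements."""
    return [sum(min_jerk_velocity(t, s) for s in subs) for t in times]

def r_squared(observed, fitted):
    """Coefficient of determination of the reconstruction."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot if ss_tot > 0.0 else 0.0

def physiological_fraction(subs, lo=0.15, hi=1.0):
    """Share of submovements whose duration lies in [lo, hi] seconds."""
    return sum(lo <= s.duration <= hi for s in subs) / len(subs)

def interval_regularity(subs, min_interval=0.08):
    """Share of inter-onset intervals exceeding min_interval seconds."""
    onsets = sorted(s.onset for s in subs)
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    return sum(g > min_interval for g in gaps) / len(gaps) if gaps else 0.0
```

With this parameterisation the peak of a single submovement is 1.875 · A, reached at τ = 0.5.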

Module 3: Cryptographic Watermarking

Adapts text watermarking (Kirchenbauer et al., 2023) to continuous action spaces. During inference, the policy biases action sampling so that:

            SHA-256(quantized_action | nonce | step) mod M < K
        

Verification uses a one-sided binomial test against the null rate K/M. A watermark is detected when the observed hit rate significantly exceeds the null expectation (p < 0.01).
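The hash test and the binomial check can be sketched with the standard library alone. Several choices here are assumptions, since the source does not specify them: the values of M and K, the quantisation resolution, the byte layout of the hashed payload, and the embedding by picking the first passing candidate action. The last of these is a simple stand-in for biased sampling.

```python
import hashlib
import math
import random

M, K = 256, 128  # hypothetical values; null hit rate K / M = 0.5

def passes_hash_test(action, nonce, step, resolution=1e-3):
    """True when SHA-256(quantized_action | nonce | step) mod M < K."""
    quantized = [round(a / resolution) for a in action]
    payload = f"{quantized}|{nonce}|{step}".encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest, "big") % M < K

def embed(candidates, nonce, step):
    """Pick the first candidate action that passes; fall back to the first one."""
    for action in candidates:
        if passes_hash_test(action, nonce, step):
            return action
    return candidates[0]

def binomial_p_value(hits, n, p):
    """Exact one-sided tail P(X >= hits) for X ~ Binomial(n, p)."""
    tail = sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(hits, n + 1))
    return min(1.0, tail)

def verify(actions, nonce, alpha=0.01):
    """Return (hits, p_value, detected) for a trajectory under a given nonce."""
    hits = sum(passes_hash_test(a, nonce, step) for step, a in enumerate(actions))
    p_value = binomial_p_value(hits, len(actions), K / M)
    return hits, p_value, p_value < alpha

def simulate_watermarked(nonce, steps=100, n_candidates=8, seed=0):
    """Toy policy: draw random 2-D candidate actions and embed at every step."""
    rng = random.Random(seed)
    return [embed([[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
                   for _ in range(n_candidates)], nonce, step)
            for step in range(steps)]
```

Calling `verify(simulate_watermarked("demo-nonce"), "demo-nonce")` checks a toy trajectory against the nonce it was embedded with. Verifying with a different nonce should leave the hit rate near K/M.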

Score Fusion

Composite scores combine module outputs with reliability-weighted fusion:

            S_auto = 0.35 · s_spec + 0.40 · s_sub + 0.25 · s_wm
        

Classification uses a margin-based decision rule with consensus relaxation: when two or more modules agree, the decision threshold is lowered to increase sensitivity.
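A sketch of the fusion and decision step is given below. The weights come from the formula above. The 0.5 midpoint, both margin values, and the per-module voting rule are hypothetical placeholders, since the source does not report its thresholds.

```python
WEIGHTS = {"spectral": 0.35, "submovement": 0.40, "watermark": 0.25}

def composite_score(scores):
    """Weighted sum of per-module autonomous scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def classify(scores, margin=0.15, relaxed_margin=0.05):
    """Margin-based verdict; the margin shrinks when two or more modules agree.

    margin and relaxed_margin are illustrative values, not the source's.
    """
    s_auto = composite_score(scores)
    votes_auto = sum(scores[name] > 0.5 for name in WEIGHTS)
    votes_teleop = sum(scores[name] < 0.5 for name in WEIGHTS)
    consensus = max(votes_auto, votes_teleop) >= 2
    m = relaxed_margin if consensus else margin
    if s_auto >= 0.5 + m:
        return "autonomous"
    if s_auto <= 0.5 - m:
        return "teleoperated"
    return "inconclusive"

example = {"spectral": 0.9, "submovement": 0.8, "watermark": 1.0}
example_score = composite_score(example)   # 0.315 + 0.320 + 0.250 = 0.885
example_verdict = classify(example)
```

Scores that sit close to the midpoint without agreement between modules fall through to "inconclusive", which mirrors the third verdict class.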

Classification Performance

Class	Precision	Recall	F1 Score	Inconclusive
Autonomous	1.000	0.720	0.837	2
Teleoperated	0.806	1.000	0.893	0
Overall	Accuracy = 86.0%

Confusion Matrix

Actual / Predicted	Pred: Auto	Pred: Teleop	Pred: Inc.
Actual: Auto	36	12	2
Actual: Teleop	0	50	0

Conservative bias: the pipeline never falsely labels a human trajectory as autonomous.
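The per-class figures in the performance table follow directly from the confusion matrix. The short check below recomputes them, counting inconclusive verdicts as misses for recall and accuracy. That convention is inferred from the reported numbers, and the helper names are hypothetical.

```python
# Rows: actual class; columns: predicted verdict (counts from the confusion matrix).
CONFUSION = {
    "auto": {"auto": 36, "teleop": 12, "inconclusive": 2},
    "teleop": {"auto": 0, "teleop": 50, "inconclusive": 0},
}

def class_metrics(confusion, label):
    """Precision, recall, and F1 for one class."""
    true_pos = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())
    actual = sum(confusion[label].values())
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def accuracy(confusion):
    """Correct verdicts over all trajectories, inconclusive ones included."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

for label in CONFUSION:
    m = class_metrics(CONFUSION, label)
    print(label, round(m["precision"], 3), round(m["recall"], 3), round(m["f1"], 3))
print("accuracy", accuracy(CONFUSION))
```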

ROC Curves (AUC by Module)

Module	AUC
Composite Score	1.000
Spectral Forensics	0.994
Submovement Decomp.	0.985

Module Ablation Study

Configuration	Accuracy
Submovement Only	86.0%
Spectral + Sub.	85.0%
Full Pipeline	85.0%
Spectral Only	78.0%
Watermark Only	51.0%

Watermark Verification

Condition	Detection Rate	Distortion
Correct key, watermarked	50.0%	0.114
Wrong key, watermarked	0.0%
Correct key, unwatermarked	0.0%
Correct key, human traj.	0.0%

Zero false positives across all negative conditions. The 50% true positive rate indicates room for improvement in embedding strength.

Duration Sensitivity

Classification accuracy improves monotonically from 75% at 1.0s to 100% at 10.0s. Longer trajectories provide more statistical evidence for both spectral and submovement analysis.

Interactive Trajectory Verifier

Adjust the parameters below to simulate a trajectory and see how the verification pipeline classifies it. These controls model the key signal characteristics that distinguish human from autonomous control.

Trajectory Parameters

Parameter	Value	Human	Policy
Tremor Amplitude	0.003	0.001-0.005	~0
High-Freq Power Ratio	0.05	<0.05	0.1-0.3
Submovement R²	0.85	>0.7	<0.5
Log-Dimensionless Jerk	12	5-15 (smooth)	>20 (jerky)
Watermark Detected	No

Verification Result

Verdict: TELEOPERATED

Score	Value
Spectral Score (Autonomous)	0.20
Submovement Score (Autonomous)	0.15
Watermark Score	0.00
Composite Score (Autonomous)	0.14

Bandwidth concentration and tremor presence suggest human teleoperation. Submovement decomposition is consistent with biological motor control.


Primary Reference

Source Paper: Liu et al. (2026). Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods. arXiv:2601.18723.

Key References

Motor Control: Flash, T. & Hogan, N. (1985). The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience, 5(7), 1688-1703.

Motor Control: Hogan, N. & Sternad, D. (2009). Sensitivity of smoothness measures to movement duration, amplitude, and arrests. Journal of Motor Behavior, 41(6), 529-534.

Motor Control: Balasubramanian, S. et al. (2012). On the analysis of movement smoothness. J. NeuroEngineering and Rehabilitation, 9(1), 1-12.

Watermarking: Kirchenbauer, J. et al. (2023). A watermark for large language models. ICML, 17061-17084.

Robot Learning: Chi, C. et al. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. IJRR, 43(2), 159-178.

Robot Learning: Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL.

Robot Learning: Mandlekar, A. et al. (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. CoRL, 1678-1690.

Benchmarks: James, S. et al. (2020). RLBench: The Robot Learning Benchmark & Learning Environment. IEEE RA-L, 5(2), 3019-3026.
