Determining whether a robotic manipulation demonstration was generated by an autonomous policy or by hidden human teleoperation
When a robotic manipulation trajectory appears successful, existing benchmarks provide no mechanism to verify whether the behavior was generated by an autonomous policy or via hidden human teleoperation. This source authenticity ambiguity undermines trustworthy evaluation and enables result fabrication.
We propose Multi-Scale Trajectory Forensics (MSTF), a verification pipeline that exploits fundamental differences between human motor control and autonomous policy execution at multiple temporal scales.
Human neuromuscular control leaves multi-scale statistical fingerprints that are jointly difficult to forge:
Autonomous policies (diffusion, transformer) lack these biomechanical signatures and exhibit distinct spectral/temporal patterns.
Spectral analysis exploits the bandwidth limits of human neuromuscular control. We compute the power spectral density of velocity signals and extract diagnostic band ratios:
| Band | Range | Interpretation |
|---|---|---|
| Submovement | 0.5 - 4.0 Hz | Correction frequency for biological movements |
| Voluntary | 4.0 - 8.0 Hz | Upper limit of voluntary motor control |
| Tremor | 8.0 - 12.0 Hz | Physiological tremor (diagnostic for human origin) |
| High-frequency | 12.0 - 50.0 Hz | Above human voluntary bandwidth (policy artifacts) |
Key feature: The presence of 8-12 Hz tremor is a strong indicator of human teleoperation; its absence suggests autonomous execution.
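As a minimal sketch of the band-ratio computation, assuming a one-dimensional end-effector velocity signal sampled above 100 Hz (so the 50 Hz band edge is below Nyquist), Welch's method can estimate the power spectral density and the band edges from the table can partition it. The function name `band_power_fractions`, the 4-second segment length, and the normalization by total in-band power are illustrative assumptions, not the pipeline's actual settings.

```python
import numpy as np
from scipy.signal import welch

# Band edges in Hz, taken from the table above.
BANDS_HZ = {
    "submovement": (0.5, 4.0),
    "voluntary": (4.0, 8.0),
    "tremor": (8.0, 12.0),
    "high_frequency": (12.0, 50.0),
}


def band_power_fractions(velocity, fs):
    """Return the fraction of 0.5-50 Hz velocity power that falls in each band.

    velocity: 1-D array of velocity samples.
    fs: sampling rate in Hz (assumed > 100 Hz).
    """
    velocity = np.asarray(velocity, dtype=float)
    # Segment length of about 4 s is an assumption; it gives 0.25 Hz resolution.
    nperseg = min(len(velocity), int(4 * fs))
    freqs, psd = welch(velocity, fs=fs, nperseg=nperseg)
    df = freqs[1] - freqs[0]

    power = {}
    for name, (lo, hi) in BANDS_HZ.items():
        mask = (freqs >= lo) & (freqs < hi)
        power[name] = psd[mask].sum() * df  # rectangle-rule band power

    total = sum(power.values())
    return {name: p / total for name, p in power.items()}
```

Under these assumptions, a trajectory with a 10 Hz component superimposed on slow motion yields a visibly larger `"tremor"` fraction than the same motion without it, which is the quantity the key feature above relies on.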
Human reaching movements decompose into overlapping minimum-jerk submovements (Flash & Hogan, 1985):
We fit up to 8 submovements using a greedy iterative algorithm and evaluate:
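The fitting procedure itself is not reproduced here. As a rough sketch of one way a greedy decomposition could work, the code below uses the standard minimum-jerk speed profile, $v(t) = \frac{A}{T}\,30\,\tau^2(1-\tau)^2$ with $\tau = (t - t_0)/T$, and repeatedly fits one submovement at the largest peak of the residual. The duration grid, the peak-centered onset, and the R² summary are assumptions made for illustration; `min_jerk_speed` and `greedy_fit` are hypothetical names.

```python
import numpy as np


def min_jerk_speed(t, onset, duration, amplitude):
    """Minimum-jerk speed profile; integrates to `amplitude` over its support."""
    tau = (t - onset) / duration
    inside = (tau >= 0.0) & (tau <= 1.0)
    profile = np.zeros_like(t, dtype=float)
    profile[inside] = (
        amplitude / duration * 30.0 * tau[inside] ** 2 * (1.0 - tau[inside]) ** 2
    )
    return profile


def greedy_fit(t, speed, max_submovements=8, durations=None):
    """Greedily subtract minimum-jerk submovements from a speed signal.

    Returns a list of (onset, duration, amplitude) and the R^2 of the fit.
    """
    t = np.asarray(t, dtype=float)
    speed = np.asarray(speed, dtype=float)
    if durations is None:
        durations = np.linspace(0.2, 1.0, 17)  # seconds; assumed search grid

    residual = speed.copy()
    fitted = []
    for _ in range(max_submovements):
        t_peak = t[np.argmax(residual)]
        best = None
        for T in durations:
            onset = t_peak - T / 2.0  # profile is symmetric about its peak
            basis = min_jerk_speed(t, onset, T, 1.0)
            denom = basis @ basis
            if denom == 0.0:
                continue
            # Least-squares amplitude for this basis, constrained to be >= 0.
            amp = max((residual @ basis) / denom, 0.0)
            err = np.sum((residual - amp * basis) ** 2)
            if best is None or err < best[0]:
                best = (err, onset, T, amp)
        if best is None or best[3] == 0.0:
            break
        _, onset, T, amp = best
        residual -= min_jerk_speed(t, onset, T, amp)
        fitted.append((onset, T, amp))

    r2 = 1.0 - np.sum(residual**2) / np.sum((speed - speed.mean()) ** 2)
    return fitted, r2
```

A high R² with few submovements would be consistent with human-like motion under this model; what thresholds the pipeline actually applies is not stated here.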
Action watermarking adapts text watermarking (Kirchenbauer et al., 2023) to continuous action spaces. During inference, the policy biases action sampling so that:
Verification uses a one-sided binomial test against the null rate K/M. The watermark is detected when the observed rate significantly exceeds the null expectation (p < 0.01).
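The test can be sketched as follows, assuming (by analogy with the green list in text watermarking) that the key marks K of M action bins as favored, so an unwatermarked action lands in a favored bin with probability K/M. The names `binomial_tail` and `watermark_detected` are hypothetical, and the exact tail sum below is only practical for trajectories of up to roughly a thousand steps.

```python
from math import comb


def binomial_tail(hits, n, p0):
    """One-sided p-value: P(X >= hits) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, i) * p0**i * (1.0 - p0) ** (n - i) for i in range(hits, n + 1)
    )


def watermark_detected(hits, n, k, m, alpha=0.01):
    """Test whether `hits` favored-bin actions out of `n` exceed the null rate k/m.

    Returns (detected, p_value). alpha = 0.01 matches the threshold in the text.
    """
    p_value = binomial_tail(hits, n, k / m)
    return p_value < alpha, p_value
```

For example, with K/M = 0.5 and 100 actions, 70 hits give a p-value far below 0.01, while 50 hits do not. For longer trajectories, `scipy.stats.binomtest(hits, n, k / m, alternative="greater")` computes the same one-sided p-value without the large intermediate integers.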
Composite scores combine module outputs with reliability-weighted fusion:
Classification uses a margin-based decision rule with consensus relaxation: when two or more modules agree, the decision threshold is lowered for higher sensitivity.
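The fusion formula and thresholds are not given here, so the following is only an illustrative sketch of how reliability-weighted fusion and consensus relaxation might fit together. It assumes each module emits a signed score in [-1, 1] (positive meaning teleoperated) and a non-negative reliability weight. The margins `0.3` and `0.15`, the vote threshold `0.25`, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModuleOutput:
    name: str
    score: float        # assumed convention: > 0 teleoperated, < 0 autonomous
    reliability: float  # assumed non-negative weight


def fuse(outputs):
    """Reliability-weighted mean of the module scores."""
    total = sum(o.reliability for o in outputs)
    if total == 0.0:
        return 0.0
    return sum(o.reliability * o.score for o in outputs) / total


def classify(outputs, margin=0.3, relaxed_margin=0.15, vote_threshold=0.25):
    """Margin-based decision; the margin is relaxed when >= 2 modules agree."""
    composite = fuse(outputs)
    sign = 1.0 if composite >= 0.0 else -1.0
    # A module "agrees" if it votes in the composite's direction with some strength.
    agreeing = sum(1 for o in outputs if sign * o.score >= vote_threshold)
    threshold = relaxed_margin if agreeing >= 2 else margin

    if composite > threshold:
        return "teleoperated"
    if composite < -threshold:
        return "autonomous"
    return "inconclusive"
```

In this sketch, a composite score of about 0.23 is labeled teleoperated when two modules back it, but stays inconclusive when only one module carries the evidence, which is the sensitivity trade-off the relaxation is meant to provide.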
| Class | Precision | Recall | F1 Score | Inconclusive |
|---|---|---|---|---|
| Autonomous | 1.000 | 0.720 | 0.837 | 2 |
| Teleoperated | 0.806 | 1.000 | 0.893 | 0 |
| Overall | Accuracy = 86.0% | | | |
Conservative bias: in this evaluation, the pipeline never mislabels a teleoperated trajectory as autonomous (autonomous precision 1.000, teleoperated recall 1.000).
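The F1 scores in the table are the harmonic mean of the reported precision and recall. A short check, using only the values from the table (the helper name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
autonomous_f1 = f1_score(1.000, 0.720)
teleoperated_f1 = f1_score(0.806, 1.000)

print(f"Autonomous F1:   {autonomous_f1:.3f}")
print(f"Teleoperated F1: {teleoperated_f1:.3f}")
```

Both values round to the F1 scores listed in the table (0.837 and 0.893).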
| Condition | Detection Rate | Distortion |
|---|---|---|
| Correct key, watermarked | 50.0% | 0.114 |
| Wrong key, watermarked | 0.0% | |
| Correct key, unwatermarked | 0.0% | |
| Correct key, human traj. | 0.0% | |
Zero false positives across all negative conditions. The 50% true positive rate indicates room for improvement in embedding strength.
Classification accuracy improves monotonically with trajectory duration, from 75% at 1.0 s to 100% at 10.0 s. Longer trajectories provide more statistical evidence for both spectral and submovement analysis.