Reliability of Agentic LLMs in Physics-Governed Planning Domains

Simulation-based evaluation of LLM planning under physical constraints

cs.AI  |  Jan 2026

Problem Statement

This study evaluates the reliability of large language models when used as autonomous planning agents in domains governed by physical laws and hard constraints. Four prompting strategies are benchmarked across six physics-governed planning domains, with planning horizon length and constraint tightness varied to quantify how well current LLMs respect physical feasibility.
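The experimental design described above amounts to a grid over strategy, domain, horizon, and tightness. A minimal sketch of enumerating such a grid follows. The strategy and domain names come from this summary; the specific horizon and tightness sample points, and the function name `evaluation_grid`, are assumptions made for illustration, since the summary gives only the 3 to 30 step range.

```python
from itertools import product

# Strategy and domain names are taken from the summary. The horizon and
# tightness grids are hypothetical placeholders, not the study's values.
STRATEGIES = ["Direct Prompt", "ReAct-Style", "CoT Planning", "Physics-Augmented"]
DOMAINS = [
    "Orbit Transfer", "Resource Allocation", "Multi-Agent Scheduling",
    "Trajectory Optimization", "Rendezvous & Docking", "Constellation Mgmt.",
]
HORIZONS = [3, 10, 20, 30]           # assumed sample points in the 3-30 step range
TIGHTNESS = [0.25, 0.5, 0.75, 1.0]   # assumed scale; higher = stricter constraints


def evaluation_grid():
    """Yield one configuration per (strategy, domain, horizon, tightness) cell."""
    for s, d, h, t in product(STRATEGIES, DOMAINS, HORIZONS, TIGHTNESS):
        yield {"strategy": s, "domain": d, "horizon": h, "tightness": t}


grid = list(evaluation_grid())
print(len(grid))  # 4 strategies * 6 domains * 4 horizons * 4 tightness levels
```

Each cell would then be run for some number of scenarios and scored for success and constraint violations.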

Strategies

  • Direct Prompt
  • ReAct-Style
  • CoT Planning
  • Physics-Augmented

Domains

  • Orbit Transfer
  • Resource Allocation
  • Multi-Agent Scheduling
  • Trajectory Optimization
  • Rendezvous & Docking
  • Constellation Mgmt.

Best Success Rate: 56.79%
Physics-Aug. Gain: +52.31%
Per-Step Decay: -0.53%
Violation Reduction: 79.7%

Overall Strategy Comparison

Mean success rate across all domains and horizons (with standard deviation)

Horizon Degradation

Success rate as planning horizon increases from 3 to 30 steps
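One plausible way to summarize a degradation curve as a single per-step figure is the least-squares slope of success rate against horizon. The summary does not state how its decay figure was computed, so the sketch below is an assumption about the method, and the data points are synthetic.

```python
def per_step_decay(horizons, success_rates):
    """Ordinary least-squares slope of success rate (in %) against horizon."""
    n = len(horizons)
    mean_h = sum(horizons) / n
    mean_s = sum(success_rates) / n
    cov = sum((h - mean_h) * (s - mean_s) for h, s in zip(horizons, success_rates))
    var = sum((h - mean_h) ** 2 for h in horizons)
    return cov / var


# Synthetic illustration only: a perfectly linear curve losing 0.5 points per step.
horizons = [3, 10, 20, 30]
rates = [60.0 - 0.5 * h for h in horizons]
slope = per_step_decay(horizons, rates)
```

A negative slope indicates that success falls as the horizon grows. A linear fit would understate a collapse that happens abruptly past some horizon, so it is best read alongside the full curve.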

Success Rate vs. Constraint Tightness

Higher tightness = stricter physical constraints

Avg. Violations vs. Constraint Tightness

Mean number of constraint violations per plan

Cross-Domain Comparison

Direct Prompt vs. Physics-Augmented across six planning domains

Mean Constraint Violations per Plan

Average number of physical constraint violations across all scenarios
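For reference, a mean-violations metric and a relative reduction between two strategies can be computed as sketched below. The function names and the violation counts are hypothetical; the summary does not publish per-plan data.

```python
def mean_violations(violations_per_plan):
    """Average number of constraint violations per plan."""
    return sum(violations_per_plan) / len(violations_per_plan)


def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return 100.0 * (baseline - improved) / baseline


# Synthetic counts for illustration only, not the study's data.
direct = [4, 6, 5, 5]     # violations per plan without constraint checking
checked = [1, 1, 1, 1]    # violations per plan with constraint checking
reduction = relative_reduction(mean_violations(direct), mean_violations(checked))
```

With these made-up counts the mean drops from 5 to 1 violations per plan, which is an 80% relative reduction.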

Key Findings

  • Physics-augmented planning improves reliability by 52.31% over direct prompting, making explicit physics integration the most impactful of the four strategies tested.
  • Even the best strategy achieves only a 56.79% success rate, indicating that current LLMs remain far from dependable in constrained physical domains.
  • Horizon degradation: the success rate drops by approximately 0.53% per additional planning step, and direct prompting collapses to near zero beyond 15 steps.
  • Domain gap: the best-performing domain (Orbit Transfer) and the worst (Constellation Mgmt.) differ by 7.94 percentage points, suggesting that domain-specific tuning is needed.
  • Physics-based constraint checking reduces violations by 79.7% compared to unchecked direct prompting.
  • Current LLMs cannot reliably operate in physics-governed planning domains without external physics-aware scaffolding.
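
As an illustration of what external physics-aware scaffolding might look like, the sketch below checks a proposed sequence of burns against a delta-v budget derived from the Tsiolkovsky rocket equation. This is not the study's checker: the function names, the vehicle parameters, and the burn values are all hypothetical, and a real orbit-transfer checker would enforce many more constraints.

```python
import math

G0 = 9.80665  # standard gravity, m/s^2


def delta_v_budget(isp_s, wet_mass_kg, dry_mass_kg):
    """Total delta-v available from the Tsiolkovsky rocket equation, in m/s."""
    return isp_s * G0 * math.log(wet_mass_kg / dry_mass_kg)


def check_plan(burns_mps, budget_mps):
    """Return the indices of burns at which cumulative delta-v exceeds the budget."""
    violations = []
    used = 0.0
    for i, dv in enumerate(burns_mps):
        used += abs(dv)
        if used > budget_mps:
            violations.append(i)
    return violations


# Hypothetical vehicle and plan, for illustration only.
budget = delta_v_budget(isp_s=300.0, wet_mass_kg=1000.0, dry_mass_kg=500.0)
plan = [800.0, 700.0, 600.0]  # burn magnitudes (m/s) proposed by a planner
bad_steps = check_plan(plan, budget)
```

Here the budget is roughly 2039 m/s, so the third burn pushes the cumulative total (2100 m/s) past what the vehicle can deliver. In a scaffolded loop, the list of violating steps could be returned to the model as feedback for re-planning.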