Simulation-based evaluation of LLM planning under physical constraints
cs.AI | Jan 2026

This study evaluates the reliability of large language models when used as autonomous planning agents in domains governed by physical laws and hard constraints. Four distinct prompting strategies are benchmarked across six physics-governed planning domains, varying planning horizon length and constraint tightness to quantify how well current LLMs respect physical feasibility.
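To make the evaluation protocol concrete, here is a minimal Python sketch of one way such a harness could be structured. Everything here is illustrative: the paper does not include its code, so `PlanResult`, `evaluate_plan`, `run_benchmark`, the simulator interface (`reset`/`step`/`at_goal`), and the intermediate horizon and tightness grids are assumptions; only the 3-to-30-step horizon range comes from the text.

```python
from dataclasses import dataclass

# Hypothetical harness: all names and interfaces below are illustrative,
# not the paper's actual implementation.

@dataclass
class PlanResult:
    success: bool     # plan reached the goal with zero constraint violations
    violations: int   # number of physical-constraint violations detected

def evaluate_plan(plan, sim):
    """Replay a candidate plan in a domain simulator, counting violations.

    Infeasible actions are tallied rather than aborting the rollout, so a
    long plan can accumulate multiple violations.
    """
    violations = 0
    state = sim.reset()
    for action in plan:
        state, feasible = sim.step(state, action)  # feasible=False on violation
        if not feasible:
            violations += 1
    return PlanResult(success=(violations == 0 and sim.at_goal(state)),
                      violations=violations)

def run_benchmark(llm, strategies, domains,
                  horizons=(3, 10, 20, 30),          # endpoints from the paper;
                  tightness_levels=(0.25, 0.5, 1.0)):  # grids otherwise assumed
    """Full cross product: strategy x domain x horizon x tightness."""
    results = {strategy.name: [] for strategy in strategies}
    for strategy in strategies:
        for domain in domains:
            for horizon in horizons:
                for tightness in tightness_levels:
                    sim = domain.make_simulator(tightness=tightness)
                    prompt = strategy.build_prompt(domain, sim, horizon)
                    plan = strategy.parse_plan(llm(prompt))
                    results[strategy.name].append(evaluate_plan(plan, sim))
    return results
```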
[Figure: Mean success rate across all domains and horizons (with standard deviation)]
[Figure: Success rate as planning horizon increases from 3 to 30 steps]
[Figure: Performance as constraint tightness increases (higher tightness = stricter physical constraints)]
[Figure: Mean number of constraint violations per plan]
[Figure: Direct Prompt vs. Physics-Augmented across six planning domains: average number of physical constraint violations across all scenarios]
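From the per-run records, the aggregate metrics named in the figure captions (mean success rate with its standard deviation, mean violations per plan) could be computed along these lines. The paper does not state whether its standard deviation is taken over individual runs or over per-domain means; this sketch, with the illustrative name `summarize`, uses per-run values.

```python
import statistics

def summarize(results):
    """Per-strategy aggregates matching the reported metrics:
    mean success rate (with standard deviation) and mean violations per plan."""
    summary = {}
    for name, outcomes in results.items():
        successes = [1.0 if r.success else 0.0 for r in outcomes]
        summary[name] = {
            "mean_success": statistics.mean(successes),
            "std_success": statistics.stdev(successes) if len(successes) > 1 else 0.0,
            "mean_violations": statistics.mean(r.violations for r in outcomes),
        }
    return summary
```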