Reliability of Prompt-Induced Long CoT Structures in Instruction-Tuned LLMs

How reliably can molecular-like reasoning structures be induced through prompting alone vs. distillation?

cs.CL · Long CoT · 4 Strategies · 3 Difficulty Levels
  • Reliability Gap (Hard Problems): 12.9%
  • Best Prompt Score (Hard): 0.671
  • Distillation Score (Hard): 0.770

Composite Fidelity Scores

Strategy     Easy    Medium   Hard
Basic        0.403   0.459    0.513
Structured   0.460   0.565    0.586
Molecular    0.508   0.549    0.671
Distilled    0.634   0.610    0.770
A persistent reliability gap exists: the best prompt-based strategy (Molecular) achieves only 87.1% of distillation quality on hard problems. Transition fidelity is the primary bottleneck, with prompts struggling to reproduce fine-grained behavior transitions despite approximating global topology.
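The headline numbers follow directly from the composite scores in the table above; the snippet below is a minimal check, assuming the reliability gap is defined as the relative deficit of the best prompt score against the distillation score.

```python
# Relative deficit of the best prompt strategy (Molecular) vs. Distilled
# on hard problems, using the composite fidelity scores from the table above.
molecular_hard = 0.671
distilled_hard = 0.770

ratio = molecular_hard / distilled_hard  # ~0.871 -> 87.1% of distillation quality
gap = 1 - ratio                          # ~0.129 -> 12.9% reliability gap
print(f"ratio: {ratio:.1%}, gap: {gap:.1%}")
```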

[Figures: Composite Scores by Strategy and Difficulty · Strategy Ranking by Difficulty · Transition Fidelity (Hard) · Topological Similarity (Hard) · Bond Distribution Divergence (Hard, lower is better) · Reliability Gap: Molecular vs Distilled]

The reliability gap between the best prompt strategy (Molecular) and Distillation is largest for Transition Fidelity (23% deficit) and smallest for Topological Similarity (6.3% deficit), revealing that prompts convey global structural intent but fail at fine-grained transition control.
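The divergence between these two metrics is easiest to see with concrete definitions. The sketch below uses assumed formulations (Jaccard overlap of the sets of transition types for topological similarity, frequency-weighted recall of the reference's transitions for transition fidelity); the paper's exact metrics may differ, but the qualitative point carries over: a prompted trace can contain every transition type a distilled trace uses while under-reproducing how often those transitions occur. The single-letter behavior codes are defined in the next section.

```python
from collections import Counter

def bonds(trace: str) -> list[str]:
    """Adjacent behavior-code pairs (transitions) in an atom sequence like 'IDVDV'."""
    return [a + b for a, b in zip(trace, trace[1:])]

def topological_similarity(candidate: str, reference: str) -> float:
    """Assumed TS: Jaccard overlap of the *sets* of transition types."""
    c, r = set(bonds(candidate)), set(bonds(reference))
    return len(c & r) / len(c | r) if c | r else 1.0

def transition_fidelity(candidate: str, reference: str) -> float:
    """Assumed TF: frequency-weighted recall of the reference's transitions."""
    c, r = Counter(bonds(candidate)), Counter(bonds(reference))
    return sum(min(c[t], n) for t, n in r.items()) / sum(r.values()) if r else 1.0

reference = "IDVDVDVBDV"   # distilled-style trace: repeated deduce-verify cycles
candidate = "IDVDVBD"      # prompted trace: same transition types, fewer repetitions
print(topological_similarity(candidate, reference))  # 1.0  (global topology matches)
print(transition_fidelity(candidate, reference))     # ~0.67 (repeated transitions missing)
```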

Reasoning Behavior Types (Atoms)

  • I - Initialization: Problem setup and restating
  • D - Deduction: Logical inference steps
  • B - Backtracking: Revising previous reasoning
  • E - Exploration: Considering alternative approaches
  • V - Verification: Checking intermediate results
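Under this molecular view, a reasoning trace becomes a sequence of atom codes, and each adjacent pair is a bond (a behavior transition). The sketch below shows how a labeled trace maps to a bond distribution; it assumes the trace has already been segmented and each step tagged with one of the five codes above (the tagging step itself is not shown).

```python
from collections import Counter

ATOMS = {"I": "Initialization", "D": "Deduction", "B": "Backtracking",
         "E": "Exploration", "V": "Verification"}

def bond_distribution(trace: str) -> dict[str, float]:
    """Normalized frequency of each bond (adjacent atom pair) in a labeled trace."""
    assert set(trace) <= set(ATOMS), "trace must use the five atom codes"
    bonds = Counter(a + b for a, b in zip(trace, trace[1:]))
    total = sum(bonds.values())
    return {bond: count / total for bond, count in bonds.items()}

# Example: init, two deduce-verify cycles, a backtrack, then deduce and verify.
print(bond_distribution("IDVDVBDV"))
# {'ID': 0.14..., 'DV': 0.42..., 'VD': 0.14..., 'VB': 0.14..., 'BD': 0.14...}
```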

Key Findings

  • Global vs. local: Prompts approximate global topology (topological similarity, TS, up to 0.899) but miss fine-grained transitions (transition fidelity, TF, of only 0.464 for Molecular on hard problems).
  • Difficulty scaling: Molecular prompting narrows the relative gap as difficulty increases, becoming proportionally more valuable for complex reasoning.
  • Bond distributions: Prompts induce approximately correct proportions of reasoning behaviors even when specific transitions are missed (a divergence sketch follows this list).
  • Practical impact: When generating synthetic Long CoT data via prompting, approximately 20-30% of expected transitions may be missing.
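To make the bond-distribution finding concrete, the sketch below compares two bond distributions with Jensen-Shannon divergence, one plausible "lower is better" measure; the paper's exact divergence is not specified here. The distributions are illustrative placeholders, not the paper's data, and the same computation applies whether the bond distribution is taken over transition frequencies (as assumed here) or over atom frequencies.

```python
import math

def js_divergence(p: dict, q: dict, base: float = 2.0) -> float:
    """Jensen-Shannon divergence between two normalized distributions (dicts)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict) -> float:
        # KL(a || m); terms with zero probability in `a` contribute nothing.
        return sum(a[k] * math.log(a[k] / m[k], base) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical bond (transition) distributions -- illustrative numbers only,
# not the paper's data. Keys are "from->to" behavior transitions.
prompted  = {"I->D": 0.20, "D->D": 0.25, "D->V": 0.45, "B->D": 0.10}
distilled = {"I->D": 0.15, "D->D": 0.20, "D->V": 0.35, "B->D": 0.15, "V->B": 0.15}
print(f"bond distribution divergence: {js_divergence(prompted, distilled):.3f}")  # lower is better
```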