A systematic comparison of LLM-based time series forecasting approaches against established baselines on five intermittent demand datasets with matched distribution heads
Addressing the open question from Damato et al. (2026): Do LLM-based forecasting models improve upon established neural architectures for intermittent demand?
D-Linear with a negative binomial head remains the best approach for intermittent demand forecasting.
D-Linear (NB) achieves the lowest QL50 with only 0.12M parameters and 45.2s training time
Lag-Llama (NB) is the best LLM-based model at QL50=0.2170, but it requires 48M parameters and 943.6s of training time, and its QL50 is 7.00% worse than that of D-Linear (NB)
Replacing the LLM backbone with a single attention or linear layer degrades QL50 by only 2-4%
Lag-Llama requires 20.9x more training time than D-Linear for 7% worse accuracy
Zero-shot Chronos achieves QL50=0.2831, worse than even the FNN and DeepAR baselines
Averaged across all models, the negative binomial (NB) head achieves QL50=0.2471, versus 0.2491 for HSNB and 0.2515 for Tweedie
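The training-time ratio and the gaps between distribution heads can be re-derived from the figures reported above. The short check below is only a sketch: the helper `relative_gap_pct` is a hypothetical name, and it assumes that "worse" means a relative increase in QL50 over the reference value.

```python
def relative_gap_pct(value: float, reference: float) -> float:
    """Relative increase of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


# Training times reported above: Lag-Llama (NB) 943.6s, D-Linear (NB) 45.2s.
time_ratio = 943.6 / 45.2

# Average QL50 per distribution head reported above.
hsnb_gap = relative_gap_pct(0.2491, 0.2471)
tweedie_gap = relative_gap_pct(0.2515, 0.2471)

print(f"training-time ratio: {time_ratio:.1f}x")
print(f"HSNB vs NB: {hsnb_gap:.2f}% worse")
print(f"Tweedie vs NB: {tweedie_gap:.2f}% worse")
```

Under that assumption, the differences between distribution heads are on the order of 1-2% of QL50.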
Average QL50 across five intermittent demand datasets (lower is better). Blue = baseline, red = LLM-based.
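This page does not define QL50. As a minimal sketch, assume it is the pinball (quantile) loss at the 0.5 quantile, scaled by the total absolute demand in the style of a weighted quantile loss; the scaling factor and the function names below are assumptions, and the demand values are made-up illustrative numbers, not data from the study.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Sum of pinball losses at quantile level q (0 < q < 1)."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total


def scaled_quantile_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Pinball loss scaled by total absolute demand (assumed normalization)."""
    denom = sum(abs(y) for y in y_true)
    if denom == 0:
        raise ValueError("all-zero series: scaled loss is undefined")
    return 2.0 * pinball_loss(y_true, y_pred, q) / denom


# Illustrative intermittent series (mostly zeros) and a median forecast.
demand = [0, 0, 3, 0, 1, 0, 0, 2]
median_forecast = [0, 0, 1, 0, 1, 0, 1, 1]

ql50 = scaled_quantile_loss(demand, median_forecast, 0.5)
ql90 = scaled_quantile_loss(demand, median_forecast, 0.9)
print(f"QL50={ql50:.4f}, QL90={ql90:.4f}")
```

QL90 and QL99 would follow from the same function with `q=0.9` and `q=0.99`, applied to the corresponding quantile forecasts rather than the median.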
Best baseline vs best LLM model on each dataset. The LLM deficit grows with higher zero rates.
Comparing NB, HSNB, and Tweedie distribution heads across all models. NB achieves the best accuracy; HSNB achieves the best calibration.
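To make the role of a distribution head concrete, the sketch below evaluates a negative binomial log-likelihood for count demand. It assumes a mean/dispersion parameterization (`mu`, `alpha`, with variance `mu + alpha * mu**2`); the parameterization actually used by the models compared here is not stated on this page, so treat the names and form as illustrative.

```python
import math


def nb_log_pmf(y: int, mu: float, alpha: float) -> float:
    """Log-probability of count y under NB with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # "number of failures" parameter
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb_nll(counts, mu: float, alpha: float) -> float:
    """Negative log-likelihood a network head would minimize (summed over counts)."""
    return -sum(nb_log_pmf(y, mu, alpha) for y in counts)


# Illustrative intermittent series: mostly zeros, occasional demand.
demand = [0, 0, 3, 0, 1, 0, 0, 2]
print(f"P(y=0) = {math.exp(nb_log_pmf(0, mu=0.5, alpha=2.0)):.4f}")
print(f"NLL    = {nb_nll(demand, mu=0.75, alpha=2.0):.4f}")
```

With a small mean and a large dispersion, most of the probability mass sits on zero, which is why this family suits series with high zero rates. In a trained model, `mu` and `alpha` would be outputs of the network for each time step instead of fixed constants.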
Following Tan et al. (NeurIPS 2024): replacing the LLM backbone with simpler alternatives has minimal impact, confirming the distribution head drives performance.
Training time vs QL50. Bubble size proportional to parameter count. D-Linear (NB) occupies the Pareto-optimal position.
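A model is Pareto-optimal in this plot if no other model is at least as fast and at least as accurate while being strictly better on one of the two axes. The sketch below shows that check; the model names and numbers are hypothetical placeholders, not results from this study.

```python
from typing import List, Tuple

# (name, training time in seconds, QL50); lower is better on both axes.
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[str]:
    """Return names of points not dominated on (time, loss)."""
    front = []
    for name, t, loss in points:
        dominated = any(
            (t2 <= t and l2 <= loss) and (t2 < t or l2 < loss)
            for _, t2, l2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only.
models = [
    ("model_a", 10.0, 0.30),
    ("model_b", 50.0, 0.20),
    ("model_c", 900.0, 0.22),  # slower and less accurate than model_b
    ("model_d", 30.0, 0.35),   # slower and less accurate than model_a
]

print(pareto_front(models))
```

Parameter count, shown as bubble size in the figure, could be added as a third objective by extending the tuple and the dominance test.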
Complete performance metrics for all 19 model configurations averaged across 5 datasets.
| Rank | Model | Type | QL50 | Std | QL90 | QL99 | CRPS | Cal. Err. | Time (s) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|

| Dataset | Zero Rate | Best Baseline | Baseline QL50 | Best LLM | LLM QL50 | Gap (%) |
|---|---|---|---|---|---|---|