Do LLM-based Forecasting Models Improve Probabilistic Prediction of Intermittent Demand?

A systematic comparison of LLM-based time series forecasting approaches against established baselines on five intermittent demand datasets with matched distribution heads

19 Model Configurations · 5 Datasets · 95 Experiments · 7.00% Best-LLM Gap

Problem & Methodology

Addressing the open question from Damato et al. (2026): Do LLM-based forecasting models improve upon established neural architectures for intermittent demand?

Baseline Models

  • D-Linear (0.12-0.14M params)
  • DeepAR (2.5-2.7M params)
  • Transformer (8.2-8.4M params)
  • FNN (1.1M params)

LLM-based Models

  • Chronos -- zero-shot & fine-tuned (710M)
  • Lag-Llama -- NB, HSNB, Tweedie (48M)
  • Time-LLM -- NB, HSNB (7B)
  • Moirai -- zero-shot & fine-tuned (311M)

Distribution Heads

  • Negative Binomial (NB)
  • Hurdle-Shifted NB (HSNB)
  • Tweedie
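The hurdle-shifted NB head can be sketched as a log-pmf in plain Python. The mean/dispersion parameterization and the shift-by-one construction of the positive part are assumptions for illustration; the page does not spell out the exact form used.

```python
import math

def nb_log_pmf(k: int, mu: float, r: float) -> float:
    """Negative Binomial log-pmf with mean `mu` and dispersion `r`
    (variance = mu + mu**2 / r); an assumed parameterization."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def hsnb_log_pmf(k: int, pi0: float, mu: float, r: float) -> float:
    """Hurdle-shifted NB sketch: P(0) = pi0; for k >= 1 the positive part
    is an NB evaluated at k - 1 (shifted off zero), scaled by 1 - pi0."""
    if k == 0:
        return math.log(pi0)
    return math.log(1.0 - pi0) + nb_log_pmf(k - 1, mu, r)

if __name__ == "__main__":
    # Sanity check: the pmf should sum to ~1 over a wide support.
    total = sum(math.exp(hsnb_log_pmf(k, pi0=0.7, mu=2.0, r=1.5))
                for k in range(200))
    print(round(total, 6))
```

The hurdle structure lets the zero probability be modeled separately from the positive demand sizes, which is why such heads are natural candidates for series with high zero rates.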

Datasets

  • M5 (zero rate 0.72; 3,049 series)
  • CarParts (zero rate 0.68; 2,674 series)
  • RAF (zero rate 0.81; 5,000 series)
  • Auto (zero rate 0.65; 3,200 series)
  • OldParts (zero rate 0.85; 1,442 series)

Key Findings

D-Linear with a negative binomial head remains the best approach for intermittent demand forecasting.

Best Overall Model: QL50 = 0.2028

D-Linear (NB) achieves the lowest QL50 with only 0.12M parameters and 45.2 s of training time.

Best LLM Model: QL50 = 0.2170

Lag-Llama (NB) is the strongest LLM-based model, requiring 48M parameters and 943.6 s of training, 7.00% worse than D-Linear (NB).

Ablation Insight: 2-4%

Replacing the LLM backbone with a single attention or linear layer degrades QL50 by only 2-4%.

Efficiency Gap: 20.9x

Lag-Llama requires 20.9x more training time than D-Linear for 7% worse accuracy.

Zero-Shot Failure: QL50 = 0.2831

Zero-shot Chronos achieves QL50 = 0.2831, worse than even the FNN and DeepAR baselines.

Distribution Head Matters: NB Best

Negative Binomial achieves an average QL50 of 0.2471, vs. 0.2491 for HSNB and 0.2515 for Tweedie across all models.

Overall Model Ranking

Average QL50 across five intermittent demand datasets (lower is better). Blue = baseline, red = LLM-based.
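The QL50 ranking metric used throughout is the quantile (pinball) loss at the median. A minimal sketch of the per-point loss follows; note the page does not state how per-series losses are normalized or aggregated, so any scaling is an assumption.

```python
def quantile_loss(y: float, pred: float, q: float) -> float:
    """Pinball loss at quantile level q; QL50 uses q = 0.5.
    Over- and under-prediction are penalized asymmetrically for q != 0.5."""
    diff = y - pred
    return max(q * diff, (q - 1.0) * diff)

# At q = 0.5 the pinball loss is half the absolute error:
print(quantile_loss(4.0, 2.0, 0.5))   # 0.5 * |4 - 2|
# At q = 0.9, over-prediction is penalized with weight 0.1:
print(quantile_loss(2.0, 4.0, 0.9))
```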


Per-Dataset Performance

Best baseline vs best LLM model on each dataset. The LLM deficit grows with higher zero rates.

Dataset Comparison

LLM Deficit by Dataset

Per-Dataset Breakdown (All Models)

Distribution Head Analysis

Comparing NB, HSNB, and Tweedie distribution heads across all models. NB achieves best accuracy, HSNB best calibration.

QL50 & CRPS by Distribution Head

Calibration Error by Head

Ablation Study

Following Tan et al. (NeurIPS 2024): replacing the LLM backbone with simpler alternatives has minimal impact, confirming the distribution head drives performance.

Cost-Accuracy Tradeoff

Training time vs QL50. Bubble size proportional to parameter count. D-Linear (NB) occupies the Pareto-optimal position.
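The Pareto-optimality claim reduces to a dominance check on (training time, QL50) pairs, sketched below with the two configurations whose time and accuracy are both reported on this page.

```python
def pareto_front(points):
    """Return the points not dominated on (training_time, ql50): a point
    is dominated if another point is <= in both coordinates and differs."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# (training time in s, QL50) pairs reported on this page.
models = {
    "D-Linear (NB)": (45.2, 0.2028),
    "Lag-Llama (NB)": (943.6, 0.2170),
}
front = pareto_front(list(models.values()))
# D-Linear (NB) is both faster and more accurate, so it alone survives.
print(front)
```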

Full Results

Complete performance metrics for all 19 model configurations averaged across 5 datasets.

Model Summary (Averaged Across Datasets)

Rank | Model | Type | QL50 | Std | QL90 | QL99 | CRPS | Cal. Err. | Time (s) | Params (M)

Per-Dataset Best Models

Dataset | Zero Rate | Best Baseline | QL50 | Best LLM | QL50 | Gap (%)