Formalizing the compression-performance trade-off as a rate-distortion problem and characterizing the Pareto frontier for extractive, abstractive, and latent memory compression operators.
LLM agents accumulate memory episodes that must be re-injected into finite context windows. Aggressive compression reduces cost but risks discarding task-critical information. How do we optimally balance this trade-off?
LLM agents process fixed-length context windows. When memory exceeds this limit, the agent must compress. Three dimensions make this hard:
- **Operator heterogeneity:** extractive, abstractive, and latent compression have distinct information-loss profiles.
- **Unequal episodes:** some hold task-critical facts, others routine observations.
- **Variable budgets:** the optimal compression ratio depends on the token budget, which varies across deployment scenarios.
Three families of operators spanning the spectrum of compression techniques used in LLM agent systems.
- **Extractive:** selects a subset of sentences, preserving their exact wording. Sentences are scored by informativeness, and retention is binary per sentence: a fact is fully retained iff its containing sentence is selected.
- **Abstractive:** simulates LLM-based summarization. Per-fact retention is modeled by a logistic function with steepness k = 8 and half-retention threshold at r = 0.35.
- **Latent:** simulates embedding-based storage with an encode-decode step. Per-fact retention follows a Beta distribution with a sub-linear exponent, modeling how efficiently embeddings capture distributional semantics.
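The three retention models above can be sketched directly. The logistic steepness `k = 8` and threshold `r = 0.35` come from the text; the sub-linear exponent `alpha = 0.5` is an assumed illustrative value, and the latent model is shown as an expected retention rather than a Beta draw:

```python
import math

def extractive_retention(selected: set, fact_sentence: int) -> float:
    """Binary: a fact survives iff its containing sentence was selected."""
    return 1.0 if fact_sentence in selected else 0.0

def abstractive_retention(r: float, k: float = 8.0, r_half: float = 0.35) -> float:
    """Logistic per-fact retention: 50% at r = r_half, steepness k."""
    return 1.0 / (1.0 + math.exp(-k * (r - r_half)))

def latent_retention(r: float, alpha: float = 0.5) -> float:
    """Sub-linear expected retention r**alpha (alpha < 1): early tokens buy
    disproportionate retention, mimicking efficient distributional encoding."""
    return r ** alpha
```

The sub-linearity is what gives the latent operator its characteristic shape: strong retention at aggressive ratios, but a flatter payoff from additional tokens.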
ITAMC solves the budget-constrained allocation by assigning compression ratios proportionally to saliency scores. Two-phase approach:
1. **Proportional allocation:** each episode receives a token allocation proportional to its saliency times its original size, normalized to fit the budget.
2. **Iterative rescaling:** clipped ratios are iteratively rescaled to satisfy the budget constraint, converging in 5-10 iterations (under 1 ms for 100 episodes).
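A minimal sketch of the two phases, with assumed parameter names and ratio bounds (`r_min`, `r_max` are not specified in the text):

```python
import numpy as np

def itamc_allocate(saliency, sizes, budget_tokens, r_min=0.05, r_max=1.0,
                   max_iter=10):
    """Sketch of ITAMC's two-phase allocation (names and bounds assumed).

    Phase 1: compression ratios proportional to saliency, normalized so the
    resulting token counts fit the budget.
    Phase 2: clip ratios to [r_min, r_max] and rescale unclipped episodes
    until total allocated tokens match the budget.
    """
    saliency = np.asarray(saliency, dtype=float)
    sizes = np.asarray(sizes, dtype=float)

    # Phase 1: proportional allocation.
    weights = saliency * sizes
    tokens = budget_tokens * weights / weights.sum()
    ratios = tokens / sizes

    # Phase 2: iterative clip-and-rescale.
    for _ in range(max_iter):
        clipped = np.clip(ratios, r_min, r_max)
        at_bound = (ratios <= r_min) | (ratios >= r_max)
        fixed = float(np.sum(clipped[at_bound] * sizes[at_bound]))
        free = float(np.sum(clipped[~at_bound] * sizes[~at_bound]))
        if free == 0.0:
            ratios = clipped
            break
        # Redistribute the remaining budget over unclipped episodes.
        scale = (budget_tokens - fixed) / free
        new_ratios = clipped.copy()
        new_ratios[~at_bound] = clipped[~at_bound] * scale
        if np.allclose(new_ratios, ratios):
            ratios = new_ratios
            break
        ratios = new_ratios
    return np.clip(ratios, r_min, r_max)
```

Higher-saliency episodes receive proportionally larger ratios, and the loop typically terminates in a handful of iterations, consistent with the 5-10 reported above.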
Compression ratio vs. information retention trade-off. All operators exhibit concave frontiers: the first 40% of token savings come at modest information cost.
| Operator | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 |
|---|---|---|---|---|---|
| Extractive | 33.3% | 70.7% | 87.0% | 98.0% | 100.0% |
| Abstractive | 22.0% | 61.7% | 87.3% | 96.3% | 98.0% |
| Latent | 39.3% | 59.0% | 74.7% | 88.7% | 100.0% |
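The concavity claim can be checked directly from the table: each operator's marginal retention gain shrinks as r grows, i.e. the second differences of the retention column are non-positive.

```python
import numpy as np

# Retention values (%) from the table above, sampled at r = 0.2 ... 1.0.
frontiers = {
    "extractive":  [33.3, 70.7, 87.0, 98.0, 100.0],
    "abstractive": [22.0, 61.7, 87.3, 96.3, 98.0],
    "latent":      [39.3, 59.0, 74.7, 88.7, 100.0],
}

for name, retention in frontiers.items():
    marginal = np.diff(retention)        # retention gain per +0.2 step in r
    concave = bool(np.all(np.diff(marginal) <= 0))
    print(f"{name}: marginal gains {marginal.round(1)}, concave={concave}")
```

For the extractive operator, for example, the gains fall from 37.4 points (r = 0.2 to 0.4) to just 2.0 points (r = 0.8 to 1.0).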
Knee-point analysis identifies the point of maximum curvature on the Pareto frontier: the compression ratio beyond which additional compression begins to cause disproportionate retention loss.
- Extractive: r* = 0.42, retention 76.0%
- Abstractive: r* = 0.59, retention 87.0%
- Latent: r* = 0.26, retention 49.0%
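One standard way to locate such a knee is maximum discrete curvature, sketched below with finite-difference derivatives; the knee ratios reported in this document were presumably computed on a denser sampling of the frontier than the five-point table.

```python
import numpy as np

def knee_point(r, retention):
    """Return the r with maximum curvature |y''| / (1 + y'^2)^1.5 on a
    sampled frontier, using finite-difference derivatives."""
    r = np.asarray(r, dtype=float)
    y = np.asarray(retention, dtype=float)   # retention as a fraction of 1
    dy = np.gradient(y, r)
    d2y = np.gradient(dy, r)
    curvature = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return float(r[int(np.argmax(curvature))])
```

On a coarse grid the estimate is correspondingly coarse; in practice one would interpolate or fit the frontier before taking derivatives.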
ITAMC's saliency-guided allocation compared against uniform compression across budget levels; values are retention differences in percentage points (adaptive minus uniform). Adaptive excels under extreme budget constraints, while uniform is competitive or better at moderate budgets.

| Budget | Extractive (Δ pp) | Abstractive (Δ pp) | Latent (Δ pp) |
|---|---|---|---|
| 10% | +10.2 | -0.1 | -2.5 |
| 20% | -1.5 | +3.9 | +0.5 |
| 30% | -11.9 | +1.4 | -1.2 |
| 40% | -9.8 | -5.8 | -0.7 |
| 50% | -8.1 | -11.0 | -2.7 |
| 60% | -9.3 | -14.2 | -3.1 |
Does compression error compound over many episodes? At moderate ratios, retention remains remarkably stable across long agent horizons.
- Extractive: declines from 93.3% at h=10 to 87.0% at h=100.
- Abstractive: declines from 100% at h=10 to 98.3% at h=100.
- Latent: retention is already low at this ratio; per-step quality dominates, not accumulation.
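Two toy decay models make the distinction concrete. The parameters below are illustrative, not fitted, though eps = 0.063 happens to reproduce the 93.3% → 87.0% decline reported above:

```python
import math

def compounding(p: float, h: int) -> float:
    """If each re-compression multiplied retention by a survival rate p,
    retention would decay geometrically with horizon: p**h."""
    return p ** h

def per_step_floor(q: float, eps: float, h: int) -> float:
    """If single-pass quality q dominates, retention stays near q with only
    a small penalty eps per decade of horizon."""
    return q - eps * math.log10(h)

# Geometric decay would be catastrophic over 100 episodes even at p = 0.99
# (0.99**100 is about 0.37), while the floor model yields gentle declines
# like the ones observed.
print(compounding(0.99, 100))
print(per_step_floor(0.996, 0.063, 10), per_step_floor(0.996, 0.063, 100))
```

The near-flat observed curves are consistent with the floor model, not geometric compounding.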
At a fixed compression ratio (r=0.4), retention is largely independent of saliency level -- validating that saliency should determine allocation, not predict compressibility.
Four principal findings from controlled experiments on a synthetic benchmark with exact ground-truth fact retention.
1. **Concave frontiers.** All three operators exhibit concave frontiers: moderate compression (r = 0.4-0.6) achieves 60-87% retention while reducing tokens by 40-60%. The first 40% of savings come at modest information cost.
2. **Operator-specific knees.** Knee-point analysis yields r* = 0.42 (extractive), r* = 0.59 (abstractive), and r* = 0.26 (latent). System designers should calibrate compression targets to their specific operator.
3. **Adaptive allocation pays off under tight budgets.** Saliency-guided adaptive compression gains up to +10.2 pp at a 10% budget for extractive compression. At moderate budgets (30%+), uniform compression is competitive and simpler.
4. **No catastrophic compounding.** Moderate compression does not compound catastrophically over 100 episodes: retention declines by at most 6.3 pp from h=10 to h=100 at r=0.6. Per-step quality dominates.
Full numerical results from the experimental evaluation. 100 synthetic episodes, 300 ground-truth facts, 8 downstream task queries, seed 42.
| Target Ratio | Ext. Retention | Abs. Retention | Lat. Retention | Ext. Tokens | Abs. Tokens | Lat. Tokens |
|---|---|---|---|---|---|---|
| Horizon | Ratio | Ext. Retention | Abs. Retention | Lat. Retention |
|---|---|---|---|---|
| Task ID | Query |
|---|---|
| 0 | What errors or failures occurred in the system recently? |
| 1 | Which components need capacity expansion or migration? |
| 2 | Summarize all security-related events and certificate updates. |
| 3 | What is the current health status of the database and cache layers? |
| 4 | List all performance anomalies and latency issues. |
| 5 | Which components experienced resource exhaustion? |
| 6 | Describe all deployment and scaling activities. |
| 7 | What authentication and access-related events occurred? |