Formalizing the compression-performance trade-off as a rate-distortion problem and characterizing the Pareto frontier for extractive, abstractive, and latent memory compression operators.
LLM agents accumulate memory episodes that must be re-injected into finite context windows. Aggressive compression reduces cost but risks discarding task-critical information. How do we optimally balance this trade-off?
LLM agents process fixed-length context windows. When memory exceeds this limit, the agent must compress. Three dimensions make this hard:
- **Operator heterogeneity:** extractive, abstractive, and latent compression have distinct information-loss profiles.
- **Unequal episodes:** some hold task-critical facts, others routine observations.
- **Variable budgets:** the optimal compression ratio depends on the token budget, which varies across deployment scenarios.
Three families of operators spanning the spectrum of compression techniques used in LLM agent systems.
- **Extractive:** selects a subset of sentences, preserving their exact wording. Sentences are scored by informativeness, and retention is binary per sentence: a fact is fully retained iff its containing sentence is selected.
- **Abstractive:** simulates LLM-based summarization. Per-fact retention is modeled by a logistic function with steepness k = 8 and half-retention threshold at r = 0.35.
- **Latent:** simulates embedding-based storage with an encode-decode step. Per-fact retention follows a Beta distribution with a sub-linear exponent, modeling how efficiently embeddings capture distributional semantics.
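The three retention models above can be sketched directly. The logistic steepness `k = 8` and threshold `r = 0.35` come from the text; the sub-linear exponent `alpha = 0.5` is an assumed illustrative value, and the latent model is shown as an expected retention rather than a Beta draw:

```python
import math

def extractive_retention(selected: set, fact_sentence: int) -> float:
    """Binary: a fact survives iff its containing sentence was selected."""
    return 1.0 if fact_sentence in selected else 0.0

def abstractive_retention(r: float, k: float = 8.0, r_half: float = 0.35) -> float:
    """Logistic per-fact retention: 50% at r = r_half, steepness k."""
    return 1.0 / (1.0 + math.exp(-k * (r - r_half)))

def latent_retention(r: float, alpha: float = 0.5) -> float:
    """Sub-linear expected retention r**alpha (alpha < 1): early tokens buy
    disproportionate retention, mimicking efficient distributional encoding."""
    return r ** alpha
```

The sub-linearity is what gives the latent operator its characteristic shape: strong retention at aggressive ratios, but a flatter payoff from additional tokens.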
ITAMC solves the budget-constrained allocation by assigning compression ratios proportionally to saliency scores. Two-phase approach:
1. **Proportional allocation:** each episode receives a token allocation proportional to its saliency times its original size, normalized to fit the budget.
2. **Iterative rescaling:** clipped ratios are iteratively rescaled to satisfy the budget constraint, converging in 5-10 iterations (under 1 ms for 100 episodes).
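A minimal sketch of the two phases, with assumed parameter names and ratio bounds (`r_min`, `r_max` are not specified in the text):

```python
import numpy as np

def itamc_allocate(saliency, sizes, budget_tokens, r_min=0.05, r_max=1.0,
                   max_iter=10):
    """Sketch of ITAMC's two-phase allocation (names and bounds assumed).

    Phase 1: compression ratios proportional to saliency, normalized so the
    resulting token counts fit the budget.
    Phase 2: clip ratios to [r_min, r_max] and rescale unclipped episodes
    until total allocated tokens match the budget.
    """
    saliency = np.asarray(saliency, dtype=float)
    sizes = np.asarray(sizes, dtype=float)

    # Phase 1: proportional allocation.
    weights = saliency * sizes
    tokens = budget_tokens * weights / weights.sum()
    ratios = tokens / sizes

    # Phase 2: iterative clip-and-rescale.
    for _ in range(max_iter):
        clipped = np.clip(ratios, r_min, r_max)
        at_bound = (ratios <= r_min) | (ratios >= r_max)
        fixed = float(np.sum(clipped[at_bound] * sizes[at_bound]))
        free = float(np.sum(clipped[~at_bound] * sizes[~at_bound]))
        if free == 0.0:
            ratios = clipped
            break
        # Redistribute the remaining budget over unclipped episodes.
        scale = (budget_tokens - fixed) / free
        new_ratios = clipped.copy()
        new_ratios[~at_bound] = clipped[~at_bound] * scale
        if np.allclose(new_ratios, ratios):
            ratios = new_ratios
            break
        ratios = new_ratios
    return np.clip(ratios, r_min, r_max)
```

Higher-saliency episodes receive proportionally larger ratios, and the loop typically terminates in a handful of iterations, consistent with the 5-10 reported above.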
Compression ratio vs. information retention trade-off. All operators exhibit concave frontiers: the first 40% of token savings come at modest information cost.
| Operator | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 |
|---|---|---|---|---|---|
| Extractive | 33.3% | 70.7% | 87.0% | 98.0% | 100.0% |
| Abstractive | 22.0% | 61.7% | 87.3% | 96.3% | 98.0% |
| Latent | 39.3% | 59.0% | 74.7% | 88.7% | 100.0% |
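The concavity claim can be checked directly from the table: each operator's marginal retention gain shrinks as r grows, i.e. the second differences of the retention column are non-positive.

```python
import numpy as np

# Retention values (%) from the table above, sampled at r = 0.2 ... 1.0.
frontiers = {
    "extractive":  [33.3, 70.7, 87.0, 98.0, 100.0],
    "abstractive": [22.0, 61.7, 87.3, 96.3, 98.0],
    "latent":      [39.3, 59.0, 74.7, 88.7, 100.0],
}

for name, retention in frontiers.items():
    marginal = np.diff(retention)        # retention gain per +0.2 step in r
    concave = bool(np.all(np.diff(marginal) <= 0))
    print(f"{name}: marginal gains {marginal.round(1)}, concave={concave}")
```

For the extractive operator, for example, the gains fall from 37.4 points (r = 0.2 to 0.4) to just 2.0 points (r = 0.8 to 1.0).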
Knee-point analysis identifies the point of maximum curvature on the Pareto frontier: the compression ratio beyond which additional compression begins to cause disproportionate retention loss.
- Extractive: r* = 0.42, retention 76.0%
- Abstractive: r* = 0.59, retention 87.0%
- Latent: r* = 0.26, retention 49.0%
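One standard way to locate such a knee is maximum discrete curvature, sketched below with finite-difference derivatives; the knee ratios reported in this document were presumably computed on a denser sampling of the frontier than the five-point table.

```python
import numpy as np

def knee_point(r, retention):
    """Return the r with maximum curvature |y''| / (1 + y'^2)^1.5 on a
    sampled frontier, using finite-difference derivatives."""
    r = np.asarray(r, dtype=float)
    y = np.asarray(retention, dtype=float)   # retention as a fraction of 1
    dy = np.gradient(y, r)
    d2y = np.gradient(dy, r)
    curvature = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return float(r[int(np.argmax(curvature))])
```

On a coarse grid the estimate is correspondingly coarse; in practice one would interpolate or fit the frontier before taking derivatives.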
ITAMC's saliency-guided allocation compared against uniform compression across budget levels; values are retention differences in percentage points (adaptive minus uniform). Adaptive excels under extreme budget constraints, while uniform is competitive or better at moderate budgets.

| Budget | Extractive (Δ pp) | Abstractive (Δ pp) | Latent (Δ pp) |
|---|---|---|---|
| 10% | +10.2 | -0.1 | -2.5 |
| 20% | -1.5 | +3.9 | +0.5 |
| 30% | -11.9 | +1.4 | -1.2 |
| 40% | -9.8 | -5.8 | -0.7 |
| 50% | -8.1 | -11.0 | -2.7 |
| 60% | -9.3 | -14.2 | -3.1 |
Does compression error compound over many episodes? At moderate ratios, retention remains remarkably stable across long agent horizons.
- Extractive: declines from 93.3% at h=10 to 87.0% at h=100.
- Abstractive: declines from 100% at h=10 to 98.3% at h=100.
- Latent: retention is already low at this ratio; per-step quality dominates, not accumulation.
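Two toy decay models make the distinction concrete. The parameters below are illustrative, not fitted, though eps = 0.063 happens to reproduce the 93.3% → 87.0% decline reported above:

```python
import math

def compounding(p: float, h: int) -> float:
    """If each re-compression multiplied retention by a survival rate p,
    retention would decay geometrically with horizon: p**h."""
    return p ** h

def per_step_floor(q: float, eps: float, h: int) -> float:
    """If single-pass quality q dominates, retention stays near q with only
    a small penalty eps per decade of horizon."""
    return q - eps * math.log10(h)

# Geometric decay would be catastrophic over 100 episodes even at p = 0.99
# (0.99**100 is about 0.37), while the floor model yields gentle declines
# like the ones observed.
print(compounding(0.99, 100))
print(per_step_floor(0.996, 0.063, 10), per_step_floor(0.996, 0.063, 100))
```

The near-flat observed curves are consistent with the floor model, not geometric compounding.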
At a fixed compression ratio (r=0.4), retention is largely independent of saliency level -- validating that saliency should determine allocation, not predict compressibility.
Four principal findings from controlled experiments on a synthetic benchmark with exact ground-truth fact retention.
1. **Concave frontiers.** All three operators exhibit concave frontiers: moderate compression (r = 0.4-0.6) achieves 60-87% retention while reducing tokens by 40-60%. The first 40% of savings come at modest information cost.
2. **Operator-specific knees.** Knee-point analysis yields r* = 0.42 (extractive), r* = 0.59 (abstractive), and r* = 0.26 (latent). System designers should calibrate compression targets to their specific operator.
3. **Adaptive allocation pays off under tight budgets.** Saliency-guided adaptive compression gains up to +10.2 pp at a 10% budget for extractive compression. At moderate budgets (30%+), uniform compression is competitive and simpler.
4. **No catastrophic compounding.** Moderate compression does not compound catastrophically over 100 episodes: retention declines by at most 6.3 pp from h=10 to h=100 at r=0.6. Per-step quality dominates.
Full numerical results from the experimental evaluation. 100 synthetic episodes, 300 ground-truth facts, 8 downstream task queries, seed 42.
| Target Ratio | Ext. Retention | Abs. Retention | Lat. Retention | Ext. Tokens | Abs. Tokens | Lat. Tokens |
|---|---|---|---|---|---|---|
| Horizon | Ratio | Ext. Retention | Abs. Retention | Lat. Retention |
|---|---|---|---|---|
| Task ID | Query |
|---|---|
| 0 | What errors or failures occurred in the system recently? |
| 1 | Which components need capacity expansion or migration? |
| 2 | Summarize all security-related events and certificate updates. |
| 3 | What is the current health status of the database and cache layers? |
| 4 | List all performance anomalies and latency issues. |
| 5 | Which components experienced resource exhaustion? |
| 6 | Describe all deployment and scaling activities. |
| 7 | What authentication and access-related events occurred? |