Investigating whether persistent bias in actor-critic algorithms originates from online Markovian data, coupled actor-critic dynamics, or their interaction, using a controlled 2x2 factorial design.
Neural actor-critic algorithms exhibit a distinct persistent bias component compared to standard neural network regression. This paper investigates the origin of that persistent bias.
Georgoudios et al. (2026) observe that neural actor-critic algorithms trained via SGD under polynomial step-size schedules exhibit a more persistent bias than neural network regression. They conjecture this arises from:
A 2x2 factorial design that independently varies data distribution (i.i.d. vs. Markovian) and network coupling (single vs. actor-critic pair).
Supervised regression baseline. A single critic network trained on i.i.d. state samples with exact targets. No online or coupling effects.
TD learning with a fixed policy. A single critic learns from Markovian trajectory data, isolating the effect of non-stationary data.
Actor-critic with oracle sampling. Both networks updated, but states sampled approximately i.i.d. from the current policy's stationary distribution.
Full online actor-critic. Both sources of persistent bias are present. The primary setting of interest.
Squared bias trajectories and power-law decay exponents across all four regimes, averaged over 5 random seeds.
| Regime | Data Distribution | Network Coupling | Decay Rate (ρ) | Tail Bias | Persistence Gap |
|---|---|---|---|---|---|
| R1 (i.i.d. + single) | i.i.d. | Decoupled | -0.0344 | 0.3213 | — |
| R2 (Markov + single) | Markovian | Decoupled | 1.3338 | 0.2703 | +1.3682 |
| R3 (i.i.d. + coupled) | i.i.d. | Coupled | 1.3776 | 0.2107 | +1.4120 |
| R4 (Markov + coupled) | Markovian | Coupled | 1.2180 | 0.3668 | +1.2524 |
A simplified stochastic approximation model provides a cleaner separation of the two mechanisms.
| Analytical Regime | Coupling (c) | Mixing (m) | Decay Rate | Rate Reduction vs. Baseline |
|---|---|---|---|---|
| A1 (baseline) | 0 | 1 | 1.1297 | — |
| A2 (online only) | 0 | 3 | 1.1210 | -0.0087 (0.8%) |
| A3 (coupled only) | 0.3 | 1 | 0.7944 | -0.3353 (29.7%) |
| A4 (full AC) | 0.3 | 3 | 0.8354 | -0.2943 (26.1%) |
Robustness of findings across the full range of step-size exponent β ∈ (1/2, 1), sweeping nine values.
| β | Rate R1 | Rate R2 | Rate R3 | Rate R4 | Tail R1 | Tail R4 |
|---|---|---|---|---|---|---|
| 0.55 | 0.7479 | 3.1921 | 1.5566 | -1.1797 | 0.2568 | 0.0290 |
| 0.60 | 5.4024 | 4.4751 | 2.9465 | -0.1956 | 0.0013 | 0.0298 |
| 0.65 | 0.2890 | 2.7718 | 2.5421 | 2.5994 | 0.3028 | 0.0292 |
| 0.70 | 0.2634 | 1.3868 | 1.3944 | 1.5672 | 0.2415 | 0.1423 |
| 0.75 | 0.3767 | 0.6648 | 0.6864 | 0.7553 | 0.0548 | 0.4808 |
| 0.80 | 0.0214 | 0.3185 | 0.3278 | 0.3543 | 0.0637 | 0.9166 |
| 0.85 | -0.0931 | 0.1588 | 0.1633 | 0.1699 | 0.2500 | 1.2728 |
| 0.90 | -0.0660 | 0.0839 | 0.0835 | 0.0791 | 0.2509 | 1.5186 |
| 0.95 | -0.1027 | 0.0465 | 0.0439 | 0.0363 | 0.2436 | 1.6739 |
ANOVA-style decomposition of tail bias reveals a dominant interaction effect that exceeds both marginal effects.
Bias-variance trade-off comparison between regression baseline (R1) and full actor-critic (R4).
| Metric | R1 (Baseline) | R4 (Full AC) | Ratio (R1/R4) |
|---|---|---|---|
| Mean Bias (last 100 steps) | 0.5173 | 0.1510 | 3.43x |
| Cross-Seed Variance | 0.0968 | 0.0181 | 5.35x |
| Bias / Variance Ratio | 5.34 | 8.34 | — |
Summary of evidence regarding the origin of persistent bias in neural actor-critic algorithms.
Actor-critic coupling reduces the analytical decay rate from 1.1297 to 0.7944 (a 29.7% reduction), creating a persistent bias floor through the perpetual drift of the critic's target as actor parameters update.
Online (Markovian) sampling has a comparatively smaller structural effect, reducing the analytical rate from 1.1297 to 1.1210 (only 0.8% reduction). It primarily amplifies the bias constant rather than changing the decay structure.
The factorial decomposition reveals an interaction effect of 0.2071 that exceeds both marginal effects in magnitude (online: -0.0510, coupling: -0.1106). The combination of non-stationary data and coupled dynamics produces emergent persistence not captured by either factor alone.
The full actor-critic (R4) achieves a tail bias of 0.3668, compared to the baseline of 0.3213. This confirms that both sources contribute, with their interaction as the primary driver of persistent bias in the combined setting.
The findings are robust across the full range of step-size exponents β in (1/2, 1). The persistence gap varies with β, with coupling and online effects showing similar magnitudes at most values.