Disentangling Persistent Bias in Neural Actor-Critic:
A Factorial Analysis of Online Coupling vs. Markovian Sampling

Investigating whether persistent bias in actor-critic algorithms originates from online Markovian data, coupled actor-critic dynamics, or their interaction, using a controlled 2x2 factorial design.

Interaction Effect: 0.2071
Decay Rate Reduction: 29.7%
Factorial Design: 2 x 2
Experimental Regimes: 4

Problem Statement

Neural actor-critic algorithms exhibit a distinct persistent bias component compared to standard neural network regression. This paper investigates the origin of that persistent bias.

The Conjecture

Georgoudios et al. (2026) observe that neural actor-critic algorithms trained via SGD under polynomial step-size schedules exhibit a more persistent bias than neural network regression. They conjecture this arises from:

  1. Online (Markovian) sampling: Non-i.i.d. data from trajectory rollouts introduces correlated noise.
  2. Actor-critic coupling: The coupled dynamics between actor and critic networks create a moving target for the critic.
  3. Their interaction: The combination may produce emergent persistence mechanisms.

Experimental Parameters

Step-size exponent: β = 0.7
Initial step-size: α₀ = 0.5
SGD steps: N = 3000
Hidden units: H = 16
Discount factor: γ = 0.9
Random seeds: 5
State space: s ∈ [-1, 1]
Actions: {0, 1}

Methods

A 2x2 factorial design that independently varies data distribution (i.i.d. vs. Markovian) and network coupling (single vs. actor-critic pair).

Key Equations

Step-Size Schedule
$\alpha_n = \alpha_0 / n^{\beta}$, with $\beta \in (1/2, 1)$
Analytical Bias Dynamics (Equation 1)
$B_{n+1} = (1 - 2A\alpha_n)\, B_n + c\, \alpha_n^2 + m\, \sigma^2 \alpha_n / n$
where $A = 0.5$ (contraction rate), $c$ is the coupling strength, $m$ is the Markovian mixing factor, and $\sigma^2 = 0.1$
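
Equation 1 can be iterated directly. The following is a minimal sketch, not the authors' implementation: it uses α₀ = 0.5, β = 0.7, A = 0.5, σ² = 0.1 and N = 3000 from the text, and the (c, m) pairs from the analytical regimes A1 to A4. The initial value B₁ = 1.0 and all function names are assumptions introduced for illustration.

```python
import numpy as np

def step_size(n, alpha0=0.5, beta=0.7):
    """Polynomial schedule alpha_n = alpha0 / n**beta."""
    return alpha0 / n ** beta

def bias_trajectory(c, m, n_steps=3000, A=0.5, sigma2=0.1, b1=1.0,
                    alpha0=0.5, beta=0.7):
    """Iterate B_{n+1} = (1 - 2*A*a_n)*B_n + c*a_n**2 + m*sigma2*a_n/n."""
    B = np.empty(n_steps)
    B[0] = b1  # assumed initial bias B_1; not given in the source
    for n in range(1, n_steps):
        a = step_size(n, alpha0, beta)
        # B[n] holds B_{n+1}, computed from B_n = B[n - 1]
        B[n] = (1.0 - 2.0 * A * a) * B[n - 1] + c * a ** 2 + m * sigma2 * a / n
    return B

# (c, m) pairs for the four analytical regimes
regimes = {"A1": (0.0, 1.0), "A2": (0.0, 3.0), "A3": (0.3, 1.0), "A4": (0.3, 3.0)}
trajectories = {name: bias_trajectory(c, m) for name, (c, m) in regimes.items()}
```

Because the initial condition and the fitting window are assumptions, this sketch should not be expected to reproduce the reported decay rates digit for digit.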
MDP Transitions
$s' = \mathrm{clip}(\gamma_{\mathrm{env}}\, s + a_{\mathrm{eff}}(a) + \varepsilon,\ -1,\ 1)$, with $\varepsilon \sim \mathcal{N}(0, 0.01)$
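
A rollout under this transition rule might look like the sketch below. The source does not give γ_env or the mapping a_eff, so the values used here (`gamma_env=0.8`, action effects of ±0.1) are hypothetical placeholders. The sketch also assumes that 0.01 in N(0, 0.01) is a variance, and it uses a uniform random policy purely for illustration.

```python
import numpy as np

def mdp_step(s, a, rng, gamma_env=0.8, noise_var=0.01):
    """One transition s' = clip(gamma_env*s + a_eff(a) + eps, -1, 1)."""
    a_eff = {0: -0.1, 1: 0.1}[a]  # hypothetical action effects
    eps = rng.normal(0.0, np.sqrt(noise_var))  # assumes 0.01 is a variance
    return float(np.clip(gamma_env * s + a_eff + eps, -1.0, 1.0))

rng = np.random.default_rng(0)
s = 0.0
states = []
for _ in range(1000):
    a = int(rng.integers(0, 2))  # uniform random policy over {0, 1}
    s = mdp_step(s, a, rng)
    states.append(s)
```

Consecutive entries of `states` are correlated through the `gamma_env * s` term, which is the property that distinguishes the Markovian regimes from the i.i.d. ones.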
Decay Rate Estimation
$\log B_n \approx -\rho \log n + \mathrm{const}$ (fitted in the tail of the trajectory)
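
The decay exponent ρ can be estimated by ordinary least squares on the log-log trajectory. This is a sketch under stated assumptions: the source says only that the fit uses the tail, so the `tail_fraction=0.5` window and the function name are illustrative choices, not the paper's settings.

```python
import numpy as np

def fit_decay_rate(B, tail_fraction=0.5):
    """Estimate rho in log B_n ~ -rho * log n + const over the trajectory tail."""
    B = np.asarray(B, dtype=float)
    n = np.arange(1, len(B) + 1, dtype=float)
    start = int(len(B) * (1.0 - tail_fraction))  # assumed tail window
    n_tail, B_tail = n[start:], B[start:]
    keep = B_tail > 0  # the logarithm needs strictly positive values
    slope, _intercept = np.polyfit(np.log(n_tail[keep]), np.log(B_tail[keep]), 1)
    return -slope

# Sanity check on a synthetic power law with a known exponent of 1.2
n = np.arange(1, 3001, dtype=float)
rho_hat = fit_decay_rate(5.0 * n ** -1.2)
```

A negative estimate, such as the -0.0344 reported for R1, means the fitted tail is rising slightly rather than decaying.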

Four Experimental Regimes

R1 — Baseline

i.i.d. + Single Network

Supervised regression baseline. A single critic network trained on i.i.d. state samples with exact targets. No online or coupling effects.

R2 — Online Only

Markov + Single Network

TD learning with a fixed policy. A single critic learns from Markovian trajectory data, isolating the effect of temporally correlated (non-i.i.d.) data.

R3 — Coupling Only

i.i.d. + Coupled Networks

Actor-critic with oracle sampling. Both networks are updated, but states are sampled approximately i.i.d. from the current policy's stationary distribution.

R4 — Full AC

Markov + Coupled Networks

Full online actor-critic. Both sources of persistent bias are present. The primary setting of interest.
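
The four regimes are the cells of a 2x2 grid, which can be enumerated programmatically. The sketch below is illustrative only: the dictionary keys and the seed values 0 to 4 are assumptions (the source states that 5 seeds were used, not which ones).

```python
from itertools import product

DATA = ("iid", "markov")         # data distribution factor
NETWORK = ("single", "coupled")  # network coupling factor

LABELS = {
    ("iid", "single"): "R1",      # baseline
    ("markov", "single"): "R2",   # online only
    ("iid", "coupled"): "R3",     # coupling only
    ("markov", "coupled"): "R4",  # full actor-critic
}

# Iterate coupling in the outer loop so the regimes come out as R1..R4
regimes = [
    {"name": LABELS[(d, k)], "data": d, "network": k, "seeds": list(range(5))}
    for k, d in product(NETWORK, DATA)
]

for r in regimes:
    print(r["name"], r["data"], r["network"])
```

Crossing the two factors this way is what allows the later decomposition into two marginal effects and one interaction term.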

Factorial Simulation Results

Squared bias trajectories and power-law decay exponents across all four regimes, averaged over 5 random seeds.

Figure: Bias Trajectories (Simulation)

Figure: Decay Rate Comparison

Factorial Summary Table

Regime                | Data Distribution | Network Coupling | Decay Rate (ρ) | Tail Bias | Persistence Gap (ρ − ρ_R1)
R1 (i.i.d. + single)  | i.i.d.            | Decoupled        | -0.0344        | 0.3213    | (baseline)
R2 (Markov + single)  | Markovian         | Decoupled        | 1.3338         | 0.2703    | +1.3682
R3 (i.i.d. + coupled) | i.i.d.            | Coupled          | 1.3776         | 0.2107    | +1.4120
R4 (Markov + coupled) | Markovian         | Coupled          | 1.2180         | 0.3668    | +1.2524

Analytical Model

A simplified stochastic approximation model provides a cleaner separation of the two mechanisms.

Figure: Analytical Bias Trajectories

Figure: Analytical Decay Rates

Analytical Model Parameters and Results

Analytical Regime | Coupling (c) | Mixing (m) | Decay Rate | Rate Reduction vs. Baseline
A1 (baseline)     | 0            | 1          | 1.1297     | (baseline)
A2 (online only)  | 0            | 3          | 1.1210     | -0.0087 (0.8%)
A3 (coupled only) | 0.3          | 1          | 0.7944     | -0.3353 (29.7%)
A4 (full AC)      | 0.3          | 3          | 0.8354     | -0.2943 (26.1%)
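
The reduction column follows from the decay rates by simple arithmetic. The short check below recomputes it from the table values; the variable names are illustrative.

```python
# Decay rates copied from the analytical results table
rates = {"A1": 1.1297, "A2": 1.1210, "A3": 0.7944, "A4": 0.8354}
base = rates["A1"]

# For each non-baseline regime: (absolute change, percentage reduction)
reductions = {
    name: (rate - base, 100.0 * (base - rate) / base)
    for name, rate in rates.items()
    if name != "A1"
}

for name, (delta, pct) in reductions.items():
    print(f"{name}: {delta:+.4f} ({pct:.1f}%)")
```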

Beta Sweep Analysis

Robustness of findings across the full range of step-size exponent β ∈ (1/2, 1), sweeping nine values.

Figure: Decay Rates vs. β

Figure: Tail Bias: R1 vs. R4

Beta Sweep Data

β    | Rate R1 | Rate R2 | Rate R3 | Rate R4 | Tail R1 | Tail R4
0.55 | 0.7479  | 3.1921  | 1.5566  | -1.1797 | 0.2568  | 0.0290
0.60 | 5.4024  | 4.4751  | 2.9465  | -0.1956 | 0.0013  | 0.0298
0.65 | 0.2890  | 2.7718  | 2.5421  | 2.5994  | 0.3028  | 0.0292
0.70 | 0.2634  | 1.3868  | 1.3944  | 1.5672  | 0.2415  | 0.1423
0.75 | 0.3767  | 0.6648  | 0.6864  | 0.7553  | 0.0548  | 0.4808
0.80 | 0.0214  | 0.3185  | 0.3278  | 0.3543  | 0.0637  | 0.9166
0.85 | -0.0931 | 0.1588  | 0.1633  | 0.1699  | 0.2500  | 1.2728
0.90 | -0.0660 | 0.0839  | 0.0835  | 0.0791  | 0.2509  | 1.5186
0.95 | -0.1027 | 0.0465  | 0.0439  | 0.0363  | 0.2436  | 1.6739
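
The sweep data can be queried directly to see where the regimes differ. The sketch below uses only the values in the table above; the 0.03 tolerance for calling the R2 and R3 rates "close" is an arbitrary illustrative threshold, not one from the source.

```python
# Rows: (beta, rate_R1, rate_R2, rate_R3, rate_R4, tail_R1, tail_R4)
sweep = [
    (0.55, 0.7479, 3.1921, 1.5566, -1.1797, 0.2568, 0.0290),
    (0.60, 5.4024, 4.4751, 2.9465, -0.1956, 0.0013, 0.0298),
    (0.65, 0.2890, 2.7718, 2.5421, 2.5994, 0.3028, 0.0292),
    (0.70, 0.2634, 1.3868, 1.3944, 1.5672, 0.2415, 0.1423),
    (0.75, 0.3767, 0.6648, 0.6864, 0.7553, 0.0548, 0.4808),
    (0.80, 0.0214, 0.3185, 0.3278, 0.3543, 0.0637, 0.9166),
    (0.85, -0.0931, 0.1588, 0.1633, 0.1699, 0.2500, 1.2728),
    (0.90, -0.0660, 0.0839, 0.0835, 0.0791, 0.2509, 1.5186),
    (0.95, -0.1027, 0.0465, 0.0439, 0.0363, 0.2436, 1.6739),
]

# Values of beta where the full actor-critic tail bias exceeds the baseline
r4_exceeds_r1 = [b for b, _, _, _, _, t1, t4 in sweep if t4 > t1]

# Values of beta where online-only and coupling-only decay rates nearly coincide
r2_r3_close = [b for b, _, r2, r3, _, _, _ in sweep if abs(r2 - r3) < 0.03]

print("R4 tail > R1 tail at beta =", r4_exceeds_r1)
print("|rate R2 - rate R3| < 0.03 at beta =", r2_r3_close)
```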

Factorial Decomposition

ANOVA-style decomposition of tail bias reveals a dominant interaction effect that exceeds both marginal effects.

Figure: Tail Bias by Regime

Figure: Factorial Decomposition

Decomposition Summary

Baseline (R1): 0.3213
Online Marginal Effect (R2 - R1): -0.0510
Coupling Marginal Effect (R3 - R1): -0.1106
Interaction Effect (R4 - R2 - R3 + R1): +0.2071
Total Excess (R4 - R1): +0.0455
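
These quantities follow from the four tail-bias values by the standard 2x2 factorial contrasts. The sketch below recomputes them from the Factorial Summary Table and checks that the two marginal effects plus the interaction sum to the total excess; variable names are illustrative.

```python
# Tail bias per regime, copied from the Factorial Summary Table
tail = {"R1": 0.3213, "R2": 0.2703, "R3": 0.2107, "R4": 0.3668}

online = tail["R2"] - tail["R1"]    # effect of Markovian data alone
coupling = tail["R3"] - tail["R1"]  # effect of actor-critic coupling alone
# Interaction: what R4 adds beyond the sum of the two marginal effects
interaction = tail["R4"] - tail["R2"] - tail["R3"] + tail["R1"]
total = tail["R4"] - tail["R1"]

# The decomposition is exact by construction
assert abs(total - (online + coupling + interaction)) < 1e-12

print(f"online={online:+.4f} coupling={coupling:+.4f} "
      f"interaction={interaction:+.4f} total={total:+.4f}")
```

Both marginal effects are negative while the interaction is positive, so the small positive total is the net of opposing terms.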

Variance Decomposition

Bias-variance trade-off comparison between regression baseline (R1) and full actor-critic (R4).

Figure: R1 (Regression Baseline): Bias vs. Variance

Figure: R4 (Full Actor-Critic): Bias vs. Variance

Final Training Step Statistics

Metric                     | R1 (Baseline) | R4 (Full AC) | Ratio (R1/R4)
Mean Bias (last 100 steps) | 0.5173        | 0.1510       | 3.43x
Cross-Seed Variance        | 0.0968        | 0.0181       | 5.35x
Bias / Variance Ratio      | 5.34          | 8.34         |
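
The derived entries in this table can be recomputed from the four raw statistics, as in the following check (variable names are illustrative):

```python
# Final-step statistics copied from the table
mean_bias = {"R1": 0.5173, "R4": 0.1510}
cross_seed_var = {"R1": 0.0968, "R4": 0.0181}

# Ratio column: baseline value divided by full actor-critic value
bias_ratio = mean_bias["R1"] / mean_bias["R4"]
var_ratio = cross_seed_var["R1"] / cross_seed_var["R4"]

# Last row: bias divided by variance within each regime
bias_over_var = {k: mean_bias[k] / cross_seed_var[k] for k in mean_bias}

print(f"bias ratio {bias_ratio:.2f}x, variance ratio {var_ratio:.2f}x")
print({k: round(v, 2) for k, v in bias_over_var.items()})
```

The variance ratio is larger than the bias ratio, which is why R4 ends up with the higher bias-to-variance ratio despite its lower mean bias.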

Key Findings

Summary of evidence regarding the origin of persistent bias in neural actor-critic algorithms.

1. Coupling is the Structural Mechanism

Actor-critic coupling reduces the analytical decay rate from 1.1297 to 0.7944 (a 29.7% reduction). The bias persists because the critic's target keeps drifting as the actor parameters update.

2. Markovian Sampling is an Amplifier

Online (Markovian) sampling has a much smaller structural effect, reducing the analytical rate from 1.1297 to 1.1210 (a reduction of only 0.8%). It primarily amplifies the bias constant rather than changing the decay structure.

3. Dominant Interaction Effect

The factorial decomposition reveals an interaction effect of +0.2071 that exceeds both marginal effects in magnitude (online: -0.0510, coupling: -0.1106). The combination of Markovian data and coupled dynamics produces emergent persistence not captured by either factor alone.

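4. Full Actor-Critic Shows Highest Tail Bias

The full actor-critic (R4) has a tail bias of 0.3668, the highest of the four regimes, compared to the baseline of 0.3213. Because each factor alone lowers tail bias relative to the baseline (R2: 0.2703, R3: 0.2107), the excess of +0.0455 in R4 comes from the interaction term, which makes the interaction the primary driver of persistent bias in the combined setting.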

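5. Robustness Across β

The β sweep covers nine step-size exponents in (1/2, 1), from 0.55 to 0.95. The persistence gap varies with β. The online-only (R2) and coupling-only (R3) decay rates agree to within 0.03 for β ≥ 0.70 and diverge for β ≤ 0.65, so the coupling and online effects show similar magnitudes at most values.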