Necessity of Linear Embedding Dimension for Dual Encoder Retrieval Separation

Must the embedding dimension grow linearly with the number of relevant documents?

Tags: cs.IR · Dual Encoders · Theoretical + Empirical · n up to 50

Key results: lower bound d* = n; separation rate at sublinear dimensions 0%; ratio d*/n = 1.0 for all tested n.

Theoretical Lower Bounds

n (Relevant Docs) | Lower Bound d* | Ratio d*/n | Constraints (n·m pairs)
2                 | 2              | 1.0        | 996
5                 | 5              | 1.0        | 2,475
10                | 10             | 1.0        | 4,900
20                | 20             | 1.0        | 9,600
30                | 30             | 1.0        | 14,100
50                | 50             | 1.0        | 22,500
The embedding dimension must grow linearly with the number of relevant documents: the lower bound d* = n holds exactly (ratio d*/n = 1.0) at every tested n. Empirically, every sublinear dimension fails to achieve separation, and the margin experiments show even d = 2n falling short.
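
As an illustration of the constraint counts in the table, the sketch below (Python with NumPy/SciPy) tests whether any query vector can separate a fixed relevant set. This is an assumption-laden reconstruction, not the experiment behind the table: for fixed document embeddings the condition ⟨q, r_i⟩ > ⟨q, z_j⟩ is linear in q, so achievability reduces to LP feasibility over the n·m difference vectors r_i − z_j. The 500-document split (m = 500 − n) is inferred from the constraints column.

    import numpy as np
    from scipy.optimize import linprog

    def separating_query_exists(R, Z):
        """LP feasibility: does any query q satisfy <q, r_i> > <q, z_j>
        for every pair?  By scale invariance this is equivalent to
        (r_i - z_j) . q >= 1 over all n*m difference vectors."""
        n, d = R.shape
        m = Z.shape[0]
        A = (R[:, None, :] - Z[None, :, :]).reshape(n * m, d)  # one row per pair
        res = linprog(c=np.zeros(d), A_ub=-A, b_ub=-np.ones(n * m),
                      bounds=(None, None), method="highs")
        return res.status == 0  # 0 = feasible point found, 2 = infeasible

    # Illustration: n = 20 relevant, m = 480 irrelevant random documents.
    rng = np.random.default_rng(0)
    for d in (5, 10, 20, 40):
        R = rng.normal(size=(20, d))
        Z = rng.normal(size=(480, d))
        print(f"d = {d:3d}: separable = {separating_query_exists(R, Z)}")

With random document embeddings this check fails across the swept dimensions, in line with the 0% sublinear separation rate reported above; the theoretical d* = n bound concerns the best possible embeddings, which this random-document sketch does not construct.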

[Figure: Lower Bound d* vs Number of Relevant Documents]

[Figure: Number of Constraints vs n]

[Figure: Mean Margin vs Dimension Fraction (n = 20)]

The mean separation margin (minimum relevant similarity minus maximum irrelevant similarity) is negative at every tested dimension fraction: even at d = 2n (fraction = 2.0) the mean margin is still -0.77, and no sublinear dimension achieves separation.
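
A short sketch of how such a margin can be measured (Python/NumPy). The query here is a ridge-regression fit, a stand-in chosen for this sketch since the page does not specify how its queries were obtained:

    import numpy as np

    def mean_margin(R, Z, reg=1e-3):
        """Fit a query by ridge regression (relevant docs target +1,
        irrelevant docs target -1), then report the separation margin
        min_i <q, r_i> - max_j <q, z_j>.  Positive = full separation."""
        X = np.vstack([R, Z])
        y = np.concatenate([np.ones(len(R)), -np.ones(len(Z))])
        q = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
        return (R @ q).min() - (Z @ q).max()

    # Sweep dimension fractions d/n at n = 20 with random embeddings.
    rng = np.random.default_rng(0)
    n, m = 20, 480
    for frac in (0.25, 0.5, 1.0, 2.0):
        d = int(frac * n)
        margin = mean_margin(rng.normal(size=(n, d)), rng.normal(size=(m, d)))
        print(f"d/n = {frac:4.2f}: margin = {margin:+.3f}")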

[Figure: Separation Rate vs Dimension]

Problem Formulation

A dual encoder maps queries and documents to d-dimensional embeddings. Let q be a query embedding, r_1, …, r_n the embeddings of its relevant documents, and z_1, …, z_m those of its irrelevant documents. Retrieval separation requires that every relevant document has strictly higher inner-product similarity with the query than every irrelevant document:

⟨q, r_i⟩ > ⟨q, z_j⟩   for all i ∈ [n], j ∈ [m]

The question: must d grow linearly with n for this to be achievable?
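
The condition translates directly into code (Python/NumPy; function and argument names are chosen here for illustration):

    import numpy as np

    def is_separated(q, R, Z):
        """True iff every relevant similarity exceeds every irrelevant
        one, i.e. min_i <q, r_i> > max_j <q, z_j>."""
        return (R @ q).min() > (Z @ q).max()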

Practical Implications

  • Standard dimensions suffice: for typical queries with fewer than 100 relevant documents, d = 768 provides ample capacity.
  • Faceted search risk: tasks with many relevant documents per query may hit the dimension barrier.
  • Multi-vector retrieval: ColBERT-style approaches, which score with multiple embeddings per document, sidestep this limitation (see the MaxSim sketch after this list).
  • Worst-case vs average-case: d* = n is a worst-case bound; natural document distributions may permit separation at smaller dimensions.
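
A minimal sketch of ColBERT-style late interaction (Python/NumPy; simplified, since real ColBERT also normalizes token embeddings and applies learned query/document encoders):

    import numpy as np

    def maxsim_score(Q, D):
        """ColBERT-style MaxSim: Q is (q_tokens, d) query token embeddings,
        D is (d_tokens, d) document token embeddings.  Each query token is
        matched to its most similar document token; matches are summed."""
        sims = Q @ D.T                 # (q_tokens, d_tokens) similarities
        return sims.max(axis=1).sum()  # best match per query token, summed

Because each document contributes d_tokens vectors rather than one, the effective representational capacity scales with the number of token embeddings, not only with d.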