Necessity of Linear Embedding Dimension for Dual Encoder Retrieval Separation

Must the embedding dimension grow linearly with the number of relevant documents?

Tags: cs.IR · Dual Encoders · Theoretical + Empirical · n up to 50

Key results: lower bound d* = n; separation rate at sublinear dimensions 0%; ratio d*/n = 1.0 for all tested n.

Theoretical Lower Bounds

n (Relevant Docs) | Lower Bound d* | Ratio d*/n | Constraints (n·m pairs)
2                 | 2              | 1.0        | 996
5                 | 5              | 1.0        | 2,475
10                | 10             | 1.0        | 4,900
20                | 20             | 1.0        | 9,600
30                | 30             | 1.0        | 14,100
50                | 50             | 1.0        | 22,500
The embedding dimension must grow linearly with the number of relevant documents: the lower bound d* = n holds exactly (ratio d*/n = 1.0) at every tested n. Empirically, every sublinear dimension fails to achieve separation, and the margin experiments show even d = 2n falling short.
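
As an illustration of the constraint counts in the table, the sketch below (Python with NumPy/SciPy) tests whether any query vector can separate a fixed relevant set. This is an assumption-laden reconstruction, not the experiment behind the table: for fixed document embeddings the condition ⟨q, r_i⟩ > ⟨q, z_j⟩ is linear in q, so achievability reduces to LP feasibility over the n·m difference vectors r_i − z_j. The 500-document split (m = 500 − n) is inferred from the constraints column.

    import numpy as np
    from scipy.optimize import linprog

    def separating_query_exists(R, Z):
        """LP feasibility: does any query q satisfy <q, r_i> > <q, z_j>
        for every pair?  By scale invariance this is equivalent to
        (r_i - z_j) . q >= 1 over all n*m difference vectors."""
        n, d = R.shape
        m = Z.shape[0]
        A = (R[:, None, :] - Z[None, :, :]).reshape(n * m, d)  # one row per pair
        res = linprog(c=np.zeros(d), A_ub=-A, b_ub=-np.ones(n * m),
                      bounds=(None, None), method="highs")
        return res.status == 0  # 0 = feasible point found, 2 = infeasible

    # Illustration: n = 20 relevant, m = 480 irrelevant random documents.
    rng = np.random.default_rng(0)
    for d in (5, 10, 20, 40):
        R = rng.normal(size=(20, d))
        Z = rng.normal(size=(480, d))
        print(f"d = {d:3d}: separable = {separating_query_exists(R, Z)}")

With random document embeddings this check fails across the swept dimensions, in line with the 0% sublinear separation rate reported above; the theoretical d* = n bound concerns the best possible embeddings, which this random-document sketch does not construct.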

[Figure: Lower Bound d* vs Number of Relevant Documents]

[Figure: Number of Constraints vs n]

[Figure: Mean Margin vs Dimension Fraction (n = 20)]

The mean separation margin (minimum relevant similarity minus maximum irrelevant similarity) is negative at every tested dimension fraction: even at d = 2n (fraction = 2.0) the mean margin is still -0.77, and no sublinear dimension achieves separation.
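
A short sketch of how such a margin can be measured (Python/NumPy). The query here is a ridge-regression fit, a stand-in chosen for this sketch since the page does not specify how its queries were obtained:

    import numpy as np

    def mean_margin(R, Z, reg=1e-3):
        """Fit a query by ridge regression (relevant docs target +1,
        irrelevant docs target -1), then report the separation margin
        min_i <q, r_i> - max_j <q, z_j>.  Positive = full separation."""
        X = np.vstack([R, Z])
        y = np.concatenate([np.ones(len(R)), -np.ones(len(Z))])
        q = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
        return (R @ q).min() - (Z @ q).max()

    # Sweep dimension fractions d/n at n = 20 with random embeddings.
    rng = np.random.default_rng(0)
    n, m = 20, 480
    for frac in (0.25, 0.5, 1.0, 2.0):
        d = int(frac * n)
        margin = mean_margin(rng.normal(size=(n, d)), rng.normal(size=(m, d)))
        print(f"d/n = {frac:4.2f}: margin = {margin:+.3f}")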

[Figure: Separation Rate vs Dimension]

Problem Formulation

A dual encoder maps queries and documents to d-dimensional embeddings. Let q be a query embedding, r_1, …, r_n the embeddings of its relevant documents, and z_1, …, z_m those of its irrelevant documents. Retrieval separation requires that every relevant document has strictly higher inner-product similarity with the query than every irrelevant document:

⟨q, r_i⟩ > ⟨q, z_j⟩   for all i ∈ [n], j ∈ [m]

The question: must d grow linearly with n for this to be achievable?
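
The condition translates directly into code (Python/NumPy; function and argument names are chosen here for illustration):

    import numpy as np

    def is_separated(q, R, Z):
        """True iff every relevant similarity exceeds every irrelevant
        one, i.e. min_i <q, r_i> > max_j <q, z_j>."""
        return (R @ q).min() > (Z @ q).max()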

Practical Implications

  • Standard dimensions suffice: for typical queries with fewer than 100 relevant documents, d = 768 provides ample capacity.
  • Faceted search risk: tasks with many relevant documents per query may hit the dimension barrier.
  • Multi-vector retrieval: ColBERT-style approaches, which score with multiple embeddings per document, sidestep this limitation (see the MaxSim sketch after this list).
  • Worst-case vs average-case: d* = n is a worst-case bound; natural document distributions may permit separation at smaller dimensions.
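
A minimal sketch of ColBERT-style late interaction (Python/NumPy; simplified, since real ColBERT also normalizes token embeddings and applies learned query/document encoders):

    import numpy as np

    def maxsim_score(Q, D):
        """ColBERT-style MaxSim: Q is (q_tokens, d) query token embeddings,
        D is (d_tokens, d) document token embeddings.  Each query token is
        matched to its most similar document token; matches are summed."""
        sims = Q @ D.T                 # (q_tokens, d_tokens) similarities
        return sims.max(axis=1).sum()  # best match per query token, summed

Because each document contributes d_tokens vectors rather than one, the effective representational capacity scales with the number of token embeddings, not only with d.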