A RAG system has two halves: a retriever that ranks documents and a
generator that answers from them. If the right document never makes the
top-k, no prompt engineering can save the answer. The retrieval-ranking family
scores the retriever in isolation, using the classical metrics of
information retrieval.
The math is domain-agnostic: it consumes a ranked list of opaque ids (and,
for containment, texts) plus the relevant ground truth, and returns a number in
$[0, 1]$. This keeps the easy-to-get-wrong ranking math in a single tested source of truth.
The input contract
Your SUT emits actualOutput as JSON — an ordered list, rank 1 first:
{"retrieved": [{"id": "doc-7", "text": "..."}, {"id": "doc-3", "text": "..."}]}
A bare array (["doc-7", "doc-3"]) is also accepted. The sample’s
expected_output is the relevant ground truth — either binary relevance (a
list of relevant ids) or graded gains (an id → gain map, used by nDCG):
samples:
  - id: q-1
    input: { question: "What is our refund window?" }
    expected_output: ["doc-3", "doc-9"]   # binary relevance
    metadata: { k: 5 }                    # optional per-sample cutoff
  - id: q-2
    input: { question: "..." }
    expected_output: { "doc-3": 3, "doc-9": 1 }   # graded gains
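
The two accepted output shapes and the two ground-truth shapes can be normalized before any metric runs. The following is a minimal sketch of that idea, not the library's own parser: `rankedIds` and `gainMap` are hypothetical helper names, and treating a plain list of ids as gain 1 each is an assumption. It needs PHP 8.1+ for `array_is_list`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): turn the SUT's JSON output
 * into an ordered list of ids, rank 1 first. Accepts both documented shapes.
 *
 * @return list<string>
 */
function rankedIds(string $actualOutput): array
{
    $decoded = json_decode($actualOutput, true, 512, JSON_THROW_ON_ERROR);

    // Object form: {"retrieved": [{"id": ..., "text": ...}, ...]}
    // Bare form:   ["doc-7", "doc-3"]
    $items = is_array($decoded) && array_key_exists('retrieved', $decoded)
        ? $decoded['retrieved']
        : $decoded;

    $ids = [];
    foreach ($items as $item) {
        // Each entry is either an {"id": ...} object or a plain id string.
        $ids[] = is_array($item) ? (string) $item['id'] : (string) $item;
    }

    return $ids;
}

/**
 * Hypothetical helper: turn expected_output into an id => gain map.
 * Assumption: a plain list of ids means binary relevance (gain 1.0 each).
 *
 * @param array<int|string, mixed> $expected
 * @return array<string, float>
 */
function gainMap(array $expected): array
{
    if (array_is_list($expected)) {
        return array_fill_keys($expected, 1.0);
    }

    // With a single input array, array_map keeps the id keys intact.
    return array_map('floatval', $expected);
}
```

With both sides reduced to a list of ids and an id-to-gain map, each metric only has to deal with one input shape.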
How k resolves
The cutoff k can be set per sample through metadata.k, as in q-1 above.
The metrics
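Let the ranked result list be $d_1, d_2, \dots, d_n$, where $d_i$ is the document at
rank $i$, let $R$ be the set of relevant ids, and let
top-$k$ denote the first $k$ entries of the list.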
hit@k
Did any relevant document make the top-k? A binary success signal.
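
As an illustration, hit@k can be sketched in a few lines. This is a minimal version written for this page, not the library's implementation; the function name `hitAtK` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * hit@k sketch: 1.0 if any of the first $k ranked ids is relevant, else 0.0.
 *
 * @param list<string> $ranked   retrieved ids, rank 1 first
 * @param list<string> $relevant ground-truth relevant ids
 */
function hitAtK(array $ranked, array $relevant, int $k): float
{
    // Guard against negative cutoffs, which array_slice would treat as
    // "stop this many elements from the end".
    $topK = array_slice($ranked, 0, max(0, $k));

    return count(array_intersect($topK, $relevant)) > 0 ? 1.0 : 0.0;
}
```

Because the result is binary, averaging hit@k over a dataset gives the share of queries with at least one relevant document in the top-k.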
recall@k
What fraction of all relevant documents made the top-k? The completeness of
retrieval.
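
A minimal sketch of recall@k follows. The name `recallAtK` is hypothetical, and returning 0.0 for an empty ground truth is an assumed convention, not something this page confirms.

```php
<?php
declare(strict_types=1);

/**
 * recall@k sketch: distinct relevant ids found in the top-k, divided by the
 * total number of distinct relevant ids.
 *
 * @param list<string> $ranked   retrieved ids, rank 1 first
 * @param list<string> $relevant ground-truth relevant ids
 */
function recallAtK(array $ranked, array $relevant, int $k): float
{
    $relevant = array_unique($relevant);
    if ($relevant === []) {
        return 0.0; // assumed convention for an empty ground truth
    }

    // De-duplicate so a document retrieved twice is not counted twice.
    $topK = array_unique(array_slice($ranked, 0, max(0, $k)));

    return count(array_intersect($topK, $relevant)) / count($relevant);
}
```

Unlike hit@k, the denominator is the full relevant set, so a retriever that finds one of two relevant documents scores 0.5 rather than 1.0.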
MRR (mean reciprocal rank)
How high did the first relevant document rank? Reciprocal rank rewards getting
a relevant hit early; it is cutoff-independent in spirit (a later hit is worth
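less). For a single query with first relevant document at rank $r$, the reciprocal rank is $\mathrm{RR} = 1/r$.
Averaged across the dataset's queries $Q$ this is the familiar $\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}$. Rank 1
scores 1.0, rank 2 scores 0.5, rank 4 scores 0.25.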
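
The reciprocal-rank computation can be sketched as below. The function names are hypothetical. The sketch applies no cutoff and scores 0.0 when no relevant document is retrieved; both choices are assumptions and may differ from the library's behavior.

```php
<?php
declare(strict_types=1);

/**
 * Reciprocal rank for one query: 1 / rank of the first relevant id.
 *
 * @param list<string> $ranked   retrieved ids, rank 1 first
 * @param list<string> $relevant ground-truth relevant ids
 */
function reciprocalRank(array $ranked, array $relevant): float
{
    foreach (array_values($ranked) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($index + 1); // $index is 0-based, ranks are 1-based
        }
    }

    return 0.0; // assumed convention: no relevant document retrieved
}

/**
 * Mean reciprocal rank over a dataset.
 *
 * @param list<array{0: list<string>, 1: list<string>}> $queries [ranked, relevant] pairs
 */
function meanReciprocalRank(array $queries): float
{
    if ($queries === []) {
        return 0.0;
    }

    $sum = 0.0;
    foreach ($queries as [$ranked, $relevant]) {
        $sum += reciprocalRank($ranked, $relevant);
    }

    return $sum / count($queries);
}
```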
nDCG@k (normalized discounted cumulative gain)
The richest ranking metric: it rewards placing highly relevant documents
high in the list, with a logarithmic positional discount. It supports
graded gains, so “very relevant” beats “somewhat relevant”.
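Discounted Cumulative Gain at $k$, where $g_i$ is the gain of the document at rank $i$ (0 if it is not in the ground truth):

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{g_i}{\log_2(i + 1)}$$

Normalize by the ideal DCG (the same gains sorted into their best possible
order, truncated to $k$):

$$\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$

With binary relevance, $g_i \in \{0, 1\}$. Because the discount is logarithmic, the gap between rank 1 and rank 2
matters far more than the gap between rank 19 and rank 20. When IDCG@k is 0 (no
relevant gain reachable within $k$), the score is 0.0.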
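
A minimal nDCG@k sketch follows. It assumes linear gains and a log2(rank + 1) discount; some implementations use an exponential gain (2^gain − 1) instead, which gives the same result for binary relevance but not for graded gains. The function names are hypothetical and this is not the library's code.

```php
<?php
declare(strict_types=1);

/**
 * DCG of a list of gains given in rank order (rank 1 first).
 *
 * @param list<float> $gains
 */
function dcg(array $gains): float
{
    $sum = 0.0;
    foreach (array_values($gains) as $index => $gain) {
        // rank = $index + 1, so the discount log2(rank + 1) is log2($index + 2).
        $sum += $gain / log($index + 2, 2);
    }

    return $sum;
}

/**
 * nDCG@k sketch with linear gains.
 *
 * @param list<string>         $ranked   retrieved ids, rank 1 first
 * @param array<string, float> $gainById id => gain (use 1 for binary relevance)
 */
function ndcgAtK(array $ranked, array $gainById, int $k): float
{
    $k = max(0, $k);

    // Gains in retrieved order; a repeated id earns its gain only once.
    $seen = [];
    $gains = [];
    foreach (array_slice($ranked, 0, $k) as $id) {
        $gains[] = isset($seen[$id]) ? 0.0 : (float) ($gainById[$id] ?? 0.0);
        $seen[$id] = true;
    }

    // Ideal ordering: all ground-truth gains, largest first, truncated to k.
    $ideal = array_map('floatval', array_values($gainById));
    rsort($ideal);
    $idcg = dcg(array_slice($ideal, 0, $k));

    return $idcg > 0.0 ? dcg($gains) / $idcg : 0.0;
}
```

Computing the ideal ordering from the full ground truth, rather than from the retrieved list, is what penalizes a retriever for relevant documents it never returned.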
answer-containment@k
An end-to-end reachability check: does the expected answer span actually appear
in the text of any top-k retrieved document?
This bridges retrieval and generation: even with perfect ranking ids, if the
answer text is not present in the retrieved passages, the generator cannot ground
its answer. Containment catches that.
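
The containment check can be sketched as a substring test over the top-k passage texts. How the library obtains the expected answer span and how it normalizes text are not stated here, so the `$answer` parameter, the case-insensitive and whitespace-collapsing comparison, and both function names are assumptions. The sketch needs PHP 8.0+ for `str_contains`.

```php
<?php
declare(strict_types=1);

/**
 * Assumed normalization: lowercase (ASCII) and collapse runs of whitespace.
 */
function normalizeText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', strtolower($text)) ?? '');
}

/**
 * answer-containment@k sketch: 1.0 if the answer span occurs in the text of
 * any of the first $k retrieved documents, else 0.0.
 *
 * @param list<array{id: string, text?: string}> $retrieved rank 1 first
 */
function answerContainmentAtK(array $retrieved, string $answer, int $k): float
{
    $needle = normalizeText($answer);
    if ($needle === '') {
        return 0.0; // assumed convention: nothing to look for
    }

    foreach (array_slice($retrieved, 0, max(0, $k)) as $doc) {
        $text = normalizeText((string) ($doc['text'] ?? ''));
        if (str_contains($text, $needle)) {
            return 1.0;
        }
    }

    return 0.0;
}

// Illustrative, made-up passages (not taken from any real dataset).
$retrieved = [
    ['id' => 'doc-7', 'text' => 'Shipping takes 3 days.'],
    ['id' => 'doc-3', 'text' => 'Refunds are accepted within 30 days of purchase.'],
];
$score = answerContainmentAtK($retrieved, 'within 30 days', 5);
```

An exact substring match is strict: a paraphrased answer will not be detected, so a score of 0.0 can also mean that the wording differs from the expected span.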
A worked comparison
Consider relevant documents {doc-3, doc-9}, a cutoff of k = 5, and a
retriever returning [doc-7, doc-3, doc-1, doc-9, doc-2]:
| metric | value | reasoning |
|---|---|---|
| hit@5 | 1.0 | doc-3 is in the top-5 |
| recall@5 | 1.0 | both relevant ids are in the top-5 |
| retrieval-mrr | 0.5 | first relevant (doc-3) is at rank 2 → $1/2$ |
| nDCG@5 (binary) | ≈ 0.65 | hits at ranks 2 and 4 ($\mathrm{DCG} \approx 1.06$, $\mathrm{IDCG} \approx 1.63$) |
The same retriever is complete (recall 1.0) but poorly ordered (MRR 0.5,
nDCG ≈ 0.65) — exactly the diagnostic split these metrics are designed to surface.
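
The nDCG@5 value can be checked by hand. Assuming binary gains and a $\log_2(i + 1)$ positional discount, the hits at ranks 2 and 4 and the ideal ordering (both relevant documents at ranks 1 and 2) give:

$$
\begin{aligned}
\mathrm{DCG@}5 &= \frac{1}{\log_2 3} + \frac{1}{\log_2 5} \approx 0.631 + 0.431 = 1.062 \\
\mathrm{IDCG@}5 &= \frac{1}{\log_2 2} + \frac{1}{\log_2 3} \approx 1 + 0.631 = 1.631 \\
\mathrm{nDCG@}5 &= \frac{\mathrm{DCG@}5}{\mathrm{IDCG@}5} \approx \frac{1.062}{1.631} \approx 0.651
\end{aligned}
$$

Moving doc-3 and doc-9 to ranks 1 and 2 would raise nDCG@5 to 1.0 without changing recall@5.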
Registering them
The retrieval aliases and answer-containment-at-k auto-wire from the container
with zero extra binding:
$engine->dataset('rag.retrieval')->withMetrics([
    'retrieval-hit-at-k',
    'retrieval-recall-at-k',
    'retrieval-mrr',
    'retrieval-ndcg-at-k',
    'answer-containment-at-k',
]);
Run recall@k and nDCG@k together. Recall tells you whether the right
documents are retrieved at all; nDCG tells you whether they are ranked well.
A regression that drops nDCG while holding recall is a re-ranking bug, not a
retrieval-coverage bug — and you want to know which.