Context Ranking Precision (ctx_rank_prec)
Metric Description
Context ranking precision measures how well the retrieved context chunks are ranked relative to their relevance for answering the original question (the golden answer). It assesses whether the most relevant chunks appear higher in the ordering, not just whether relevant chunks exist. The score blends (1) an LLM-as-a-Judge relevance assessment and (2) a ranking heuristic.
The score runs from 0 (poor ranking quality) to 100 (ideal ranking).
How to interpret the score
- Closer to 100: chunks are ranked in near-ideal order; the most relevant chunks appear first.
- Closer to 0: ranking is poor; irrelevant or weakly relevant chunks are ranked higher than highly relevant ones.
This metric evaluates the ranking quality of your RAG retrieval/reranking pipeline. It assumes the context has already been retrieved; it does not measure recall (whether all relevant documents were retrieved). Pair with context recall when you need to assess retrieval completeness.
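For intuition, the sketch below shows one way a ranking heuristic of this kind could work: it rewards orderings in which more relevant chunks precede less relevant ones. The relevance scores and the pairwise agreement formula are illustrative assumptions, not the ctx_rank_prec implementation (which also involves an LLM-as-a-Judge step).

# Illustrative only: a toy agreement score between the actual chunk order and the
# ideal order implied by per-chunk relevance. Not the ctx_rank_prec formula.
def toy_ranking_agreement(relevance: list[float]) -> float:
    """relevance[i] is the judged relevance of the chunk at rank i (rank 0 comes first)."""
    n = len(relevance)
    if n < 2:
        return 100.0
    # Count chunk pairs whose relevance is already in non-increasing order.
    ordered_pairs = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if relevance[i] >= relevance[j]
    )
    total_pairs = n * (n - 1) // 2
    return 100.0 * ordered_pairs / total_pairs

# A highly relevant chunk buried at rank 2 drags the score below 100.
print(toy_ranking_agreement([0.2, 0.1, 0.9, 0.0, 0.0]))  # 80.0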
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. Once they are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
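The example script reads these values with python-dotenv, so a local .env file along the following lines is sufficient (placeholder values shown; substitute your own key and deployment URL):

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-aegis-base-url>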
Shortname: ctx_rank_prec
Default threshold: 80
Inputs (each object in data)
- context (str or list, required): The retrieved context chunks to evaluate. Can be a single string or a list of chunk strings. The order of chunks represents the actual ranking (position 0 is rank 0, and so on).
- golden_answer (str, required): The reference answer or query that defines what the context should be relevant to. Used as the target for relevance scoring.
metric_args
- ranks (list[number], optional): List of ranks mapped to each context chunk. If provided, chunks are reordered by these ranks before evaluation (e.g., to test a reranker's output). Must have the same length as the number of context chunks. If omitted, the position of each chunk in the list is treated as its rank. See the sketch below for how this reordering works.
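As a quick illustration of this reordering (a sketch based on the description above, not the service's internal code; chunk and rank values are hypothetical), chunk i is treated as having rank ranks[i], so the evaluated order is obtained by sorting chunks by their assigned rank:

# Hypothetical illustration of reordering chunks by a reranker's ranks.
chunks = ["chunk A", "chunk B", "chunk C"]
ranks = [1, 0, 2]  # chunk A has rank 1, chunk B has rank 0, chunk C has rank 2

reordered = [chunk for _, chunk in sorted(zip(ranks, chunks))]
print(reordered)  # ['chunk B', 'chunk A', 'chunk C']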
Evaluation metadata
On a successful evaluation, the metric returns eval_metadata with fields that help interpret ranking quality and per-chunk relevance:
- non_relevant_chunks_reasons (list[dict]): Chunk-level explanations for why a chunk is not perfectly relevant. Each object has chunk_position (int, 1-based position in the evaluated chunk list) and reason (str).
- actual_ranks (list[int]): 0-based chunk ranks provided by the user or derived from the chunks' order of appearance in the context.
- ideal_ranks (list[int]): Calculated ideal 0-based ranks (0 through n − 1), parallel to actual_ranks; the target rank for each chunk in the optimal ordering (always 0, 1, 2, … by construction).
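For a five-chunk request like the example below, eval_metadata might be shaped roughly as follows (illustrative values only; the reasons and ranks depend on the judged chunks):

# Illustrative shape only; not actual service output.
eval_metadata = {
    "non_relevant_chunks_reasons": [
        {"chunk_position": 1, "reason": "Describes the display, not the golden answer."},
        {"chunk_position": 5, "reason": "Unrelated to the golden answer."},
    ],
    "actual_ranks": [1, 0, 2, 3, 4],
    "ideal_ranks": [0, 1, 2, 3, 4],
}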
Example
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv(override=True)
_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"
def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )
if __name__ == "__main__":
    # One evaluation row: the golden answer plus the retrieved chunks in their original order.
    data = [
        {
            "golden_answer": "The new laptop model features fingerprint authentication.",
            "context": [
                "The new laptop model has a high-resolution Retina display.",
                "It includes a fast-charging battery with up to 12 hours of usage.",
                "Security features include fingerprint authentication and an encrypted SSD.",
                "Every purchase comes with a one-year warranty.",
                "Pineapples taste great on pizza.",
            ],
        }
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "ctx_rank_prec",
                        # Optional: reorder the chunks by these ranks before evaluation.
                        "metric_args": {"ranks": [1, 0, 2, 3, 4]},
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
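In this payload, metric_args.ranks swaps the first two chunks before evaluation, while the chunk that actually supports the golden answer (the fingerprint-authentication sentence) keeps rank 2. The returned score, together with the actual_ranks, ideal_ranks, and non_relevant_chunks_reasons fields described above, shows how far this ordering is from ideal.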