Context Ranking Precision (ctx_rank_prec)
Metric Description
Context ranking precision measures how well the retrieved context chunks are ranked relative to their relevance for answering the original question (the golden answer). It assesses whether the most relevant chunks appear higher in the ordering, not just whether relevant chunks exist. The score blends (1) an LLM-as-a-Judge relevance assessment and (2) a ranking heuristic.
The score runs from 0 (poor ranking quality) to 100 (ideal ranking).
How to interpret the score
- Closer to 100: chunks are ranked in near-ideal order; the most relevant chunks appear first.
- Closer to 0: ranking is poor; irrelevant or weakly relevant chunks are ranked higher than highly relevant ones.
This metric evaluates the ranking quality of your RAG retrieval/reranking pipeline. It assumes the context has already been retrieved; it does not measure recall (whether all relevant documents were retrieved). Pair with context recall when you need to assess retrieval completeness.
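For intuition, the sketch below shows one way a ranking heuristic of this kind could work: it rewards orderings in which more relevant chunks precede less relevant ones. The relevance scores and the pairwise agreement formula are illustrative assumptions, not the ctx_rank_prec implementation (which also involves an LLM-as-a-Judge step).

# Illustrative only: a toy agreement score between the actual chunk order and the
# ideal order implied by per-chunk relevance. Not the ctx_rank_prec formula.
def toy_ranking_agreement(relevance: list[float]) -> float:
    """relevance[i] is the judged relevance of the chunk at rank i (rank 0 comes first)."""
    n = len(relevance)
    if n < 2:
        return 100.0
    # Count chunk pairs whose relevance is already in non-increasing order.
    ordered_pairs = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if relevance[i] >= relevance[j]
    )
    total_pairs = n * (n - 1) // 2
    return 100.0 * ordered_pairs / total_pairs

# A highly relevant chunk buried at rank 2 drags the score below 100.
print(toy_ranking_agreement([0.2, 0.1, 0.9, 0.0, 0.0]))  # 80.0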
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. Once they are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
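The example script reads these values with python-dotenv, so a local .env file along the following lines is sufficient (placeholder values shown; substitute your own key and deployment URL):

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-aegis-base-url>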
Shortname: ctx_rank_prec
Default threshold: 80
Inputs (each object in data)
- context (str or list, required): The retrieved context chunks to evaluate. Can be a single string or a list of chunk strings. The order of chunks represents the actual ranking (position 0 is rank 0, and so on).
- golden_answer (str, required): The reference answer or query that defines what the context should be relevant to. Used as the target for relevance scoring.
metric_args
- ranks (list[number], optional): List of ranks mapped to each context chunk. If provided, chunks are reordered by these ranks before evaluation (e.g., to test a reranker's output). Must have the same length as the number of context chunks. If omitted, the position of each chunk in the list is treated as its rank. See the sketch below for how this reordering works.
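As a quick illustration of this reordering (a sketch based on the description above, not the service's internal code; chunk and rank values are hypothetical), chunk i is treated as having rank ranks[i], so the evaluated order is obtained by sorting chunks by their assigned rank:

# Hypothetical illustration of reordering chunks by a reranker's ranks.
chunks = ["chunk A", "chunk B", "chunk C"]
ranks = [1, 0, 2]  # chunk A has rank 1, chunk B has rank 0, chunk C has rank 2

reordered = [chunk for _, chunk in sorted(zip(ranks, chunks))]
print(reordered)  # ['chunk B', 'chunk A', 'chunk C']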
Evaluation metadata
On a successful evaluation, the metric returns eval_metadata with fields that help interpret ranking quality and per-chunk relevance:
- non_relevant_chunks_reasons (list[dict]): Chunk-level explanations for why a chunk is not perfectly relevant. Each object has chunk_position (int, 1-based position in the evaluated chunk list) and reason (str).
- actual_ranks (list[int]): 0-based chunk ranks provided by the user or derived from the chunks' order of appearance in the context.
- ideal_ranks (list[int]): Calculated ideal 0-based ranks (0 through n − 1), parallel to actual_ranks; the target rank for each chunk in the optimal ordering (always 0, 1, 2, … by construction).
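For a five-chunk request like the example below, eval_metadata might be shaped roughly as follows (illustrative values only; the reasons and ranks depend on the judged chunks):

# Illustrative shape only; not actual service output.
eval_metadata = {
    "non_relevant_chunks_reasons": [
        {"chunk_position": 1, "reason": "Describes the display, not the golden answer."},
        {"chunk_position": 5, "reason": "Unrelated to the golden answer."},
    ],
    "actual_ranks": [1, 0, 2, 3, 4],
    "ideal_ranks": [0, 1, 2, 3, 4],
}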
Example
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv(override=True)
_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"
def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )
if __name__ == "__main__":
    # One evaluation row: the golden answer plus the retrieved chunks in their original order.
    data = [
        {
            "golden_answer": "The new laptop model features fingerprint authentication.",
            "context": [
                "The new laptop model has a high-resolution Retina display.",
                "It includes a fast-charging battery with up to 12 hours of usage.",
                "Security features include fingerprint authentication and an encrypted SSD.",
                "Every purchase comes with a one-year warranty.",
                "Pineapples taste great on pizza.",
            ],
        }
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "ctx_rank_prec",
                        # Optional: reorder the chunks by these ranks before evaluation.
                        "metric_args": {"ranks": [1, 0, 2, 3, 4]},
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
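In this payload, metric_args.ranks swaps the first two chunks before evaluation, while the chunk that actually supports the golden answer (the fingerprint-authentication sentence) keeps rank 2. The returned score, together with the actual_ranks, ideal_ranks, and non_relevant_chunks_reasons fields described above, shows how far this ordering is from ideal.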