Context Relevancy (ctx_rel)

Metric Description

Context relevancy measures whether the retrieved context is relevant to the input (user question or instruction). It evaluates whether the context contains information that actually addresses what the user asked, and whether it avoids irrelevant content that might mislead or distract from the answer. In other words: is the retrieved context on-topic for the input?

The score runs from 0 (poor relevancy) to 100 (strong relevancy). The implementation blends (1) an LLM-as-a-Judge of how relevant each extracted statement from the context is to the input, and (2) a heuristic semantic similarity score between the context and the input.
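
As a rough mental model only (the actual weights and helper components used by ctx_rel are internal), the blend can be pictured as a weighted combination of the average per-statement judge score and the input-context similarity. The sketch below is purely illustrative; the judge and similarity callables stand in for components you do not call directly:

from typing import Callable, Sequence


def blended_relevancy_score(
    statements: Sequence[str],
    user_input: str,
    judge: Callable[[str, str], float],       # LLM-as-a-Judge: statement vs. input -> 0..1
    similarity: Callable[[str, str], float],  # heuristic semantic similarity -> 0..1
    judge_weight: float = 0.5,                # assumed 50/50 split; the real blend is internal
) -> float:
    """Toy illustration: average per-statement judge score blended with a similarity score."""
    if not statements:
        return 0.0
    judge_avg = sum(judge(s, user_input) for s in statements) / len(statements)
    sim = similarity(" ".join(statements), user_input)
    return 100.0 * (judge_weight * judge_avg + (1.0 - judge_weight) * sim)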

How to interpret the score

  • Closer to 100: the context tends to focus on the user’s input; little obvious filler or irrelevant information.
  • Closer to 0: the context contains significant off-topic material, ignores the input, or only partially relates to what was asked.
Important

High context relevancy does not guarantee the context is complete or sufficient for a good answer. Pair this with metrics like context recall and context waste when evaluating retrieval quality end-to-end.

API usage

Prerequisites

The example at the end of this page reads the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables. Once those are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
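
Because the example loads credentials with python-dotenv, a minimal .env file in the working directory is enough. The values below are placeholders; the base URL depends on where your Aegis instance is hosted:

AEGIS_API_KEY=your-api-key
AEGIS_API_BASE_URL=https://your-aegis-host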

Shortname: ctx_rel

Default threshold: 80

Inputs (each object in data)

  • input (str required): The user’s question or instruction (what the context should be relevant to).
  • context (str or list[str] required): The retrieved context chunks (documents or passages) to evaluate; see the example rows below.
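
For instance, both of the following rows are valid, since context accepts either a single string or a list of chunk strings (the question and passages are illustrative):

data = [
    {"input": "What is the capital of France?", "context": "Paris is the capital of France."},
    {
        "input": "What is the capital of France?",
        "context": ["Paris is the capital of France.", "The Eiffel Tower is in Paris."],
    },
]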

metric_args

  • max_n_statements (int optional): Maximum number of statements to extract from the context. Each statement is compared with the input to check relevancy. The number of statements to generate is calculated automatically; this parameter only caps it. Default = 50.

Evaluation metadata

On successful evaluation, the metric returns eval_metadata describing context statements that scored poorly against the input:

  • irrelevant_statements (list[dict]): Statements derived from the context that were judged not relevant to the input. Each object contains statement (str), the extracted context statement, and reason (str), an explanation of why it was only weakly relevant to the input; see the illustrative snippet below.
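
For a row like the France example below, the irrelevant_statements entry in eval_metadata might look roughly like this; the statement and reason text are invented for illustration, and only the field names come from the description above:

{
  "irrelevant_statements": [
    {
      "statement": "France is known for its cuisine, including baguettes and cheese.",
      "reason": "Describes French food culture and does not address which city is the capital of France."
    }
  ]
}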

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    context = [
        "Paris is the capital of France and its largest city.",
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "France is known for its cuisine, including baguettes and cheese.",
    ]
    data = [
        {"input": "What is the capital of France?", "context": context},
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ctx_rel", "metric_args": {"max_n_statements": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))