Context Waste (ctx_waste)

Metric Description

Context waste measures how efficient the provided context is for generating the model’s output. It assesses whether the retrieved context contains information that is not used in the output, as well as redundant or contradictory chunks. A context is wasteful when it includes unused passages, duplicate information, or mutually incompatible claims.

The score runs from 0 (high waste) to 100 (efficient context). The implementation combines (1) an LLM-as-a-Judge that interprets the content and (2) heuristic calculations that aggregate the judge's findings into component scores.
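To make the idea concrete, here is a toy heuristic, not the metric's actual implementation (which relies on an LLM judge and unpublished aggregation rules): score the context by the fraction of chunks the output actually uses, and penalize redundant and contradictory chunks.

```python
def naive_context_waste_score(num_chunks, used_ids, redundant_groups, contradictory_pairs):
    """Toy 0-100 score: reward used chunks, penalize redundancy and contradiction.

    Illustrative only -- the real ctx_waste metric combines an LLM-as-a-Judge
    with its own heuristic aggregation.
    """
    if num_chunks == 0:
        return 100.0
    used_fraction = len(set(used_ids)) / num_chunks
    # Every chunk beyond the first in a redundant group is counted as wasted.
    redundant = sum(len(group) - 1 for group in redundant_groups)
    penalty = (redundant + len(contradictory_pairs)) / num_chunks
    return max(0.0, min(100.0, 100.0 * (used_fraction - penalty)))


score = naive_context_waste_score(
    num_chunks=4,
    used_ids=[0, 1],            # chunks reflected in the output
    redundant_groups=[[1, 2]],  # chunk 2 duplicates chunk 1
    contradictory_pairs=[],
)
# score == 25.0: half the chunks are used, minus a redundancy penalty
```

A context where every chunk is used exactly once with no conflicts would score 100 under this toy rule, matching the interpretation guidance below.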

How to interpret the score

  • Closer to 100: Most chunks are relevant to the output, contain little redundancy, and do not contradict each other. The context is efficient.
  • Closer to 0: Many chunks are unused, redundant, or contradictory. The context is wasteful and could be improved by retrieval or deduplication.
Important

Context waste does not measure answer quality or factual correctness. Pair this with metrics like faithfulness and context recall when you need to evaluate answer grounding and retrieval coverage.

API usage

Prerequisites

Configure the environment variables used by the example below (AEGIS_API_KEY and AEGIS_API_BASE_URL), then create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
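A quick sanity check can catch missing configuration before sending a request. This sketch assumes the two variable names used in the example below (AEGIS_API_KEY, AEGIS_API_BASE_URL):

```python
import os


def check_env(required=("AEGIS_API_KEY", "AEGIS_API_BASE_URL")):
    """Return the subset of required environment variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]


missing = check_env()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Failing fast here is cheaper than debugging a 401 (bad key) or a malformed URL after the request is sent.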

Shortname: ctx_waste

Default threshold: 80

Inputs (each object in data)

  • output (str required): The model-generated answer to evaluate.
  • context (str or list required): The context chunks provided to the model. Can be a string or a list of strings (one string per chunk).
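Both context shapes are accepted. A minimal sketch of the two row forms (the text values are illustrative):

```python
# context as a list of strings: one element per retrieved chunk
row_with_chunks = {
    "output": "The Eiffel Tower is in Paris.",
    "context": [
        "The Eiffel Tower is located in Paris, France.",
        "It was completed in 1889.",
    ],
}

# context as a single string (presumably treated as one chunk)
row_with_string = {
    "output": "The Eiffel Tower is in Paris.",
    "context": "The Eiffel Tower is located in Paris, France.",
}

data = [row_with_chunks, row_with_string]
```

Passing one string per chunk (the list form) is what lets the metric report per-chunk IDs in eval_metadata, so prefer it when your retriever already returns discrete chunks.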

Evaluation metadata

On successful evaluation, the metric returns eval_metadata highlighting structural inefficiencies in the context:

  • redundant_groups (list[list[int]]): Groups of chunk IDs (each inner list has at least two IDs) judged as saying the same thing or overlapping strongly. Chunk IDs are 0-based indices into the chunk list passed to the metric.
  • contradictory_pairs (list[dict]): Pairs of chunks judged to have incompatible information. Each object has a (int) and b (int) (0-based chunk IDs) and reason (str) (short explanation of the conflict).
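Because chunk IDs are 0-based indices into the chunk list you passed in, you can map them straight back to chunk text. A small sketch, using a hypothetical eval_metadata value that matches the documented shape:

```python
chunks = [
    "Paris is the capital of France.",
    "France's capital city is Paris.",
    "Paris has 12 million inhabitants.",
    "Paris has about 2.1 million inhabitants.",
]

# Hypothetical eval_metadata, shaped as documented above.
eval_metadata = {
    "redundant_groups": [[0, 1]],
    "contradictory_pairs": [
        {"a": 2, "b": 3, "reason": "Population figures disagree."},
    ],
}

# Resolve 0-based chunk IDs back to the chunk texts.
redundant_texts = [
    [chunks[i] for i in group] for group in eval_metadata["redundant_groups"]
]
conflicts = [
    (chunks[pair["a"]], chunks[pair["b"]], pair["reason"])
    for pair in eval_metadata["contradictory_pairs"]
]
```

Resolved this way, redundant_groups tells you which chunks to deduplicate and contradictory_pairs tells you which sources to reconcile before re-running retrieval.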

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST the JSON payload to the Aegis custom-runs endpoint; return the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "output": "Paris is the capital of France and has a population of about 2.1 million.",
            "context": [
                "Paris is the capital and largest city of France.",
                "As of 2024, Paris has a population of approximately 2.1 million within city limits.",
            ],
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": ["ctx_waste"],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))