Summarization (summ)
Metric Description
Summarization evaluates whether an LLM’s output (summary) effectively condenses the input (source text) into a concise, coherent, and factually faithful summary. The metric checks that the model captures central points, removes superfluous information, and preserves the text’s main concepts and conclusions without distortion.
The score runs from 0 (poor summary) to 100 (high-quality summary). The implementation blends (1) LLM-as-a-Judge components and (2) heuristic metrics.
How to interpret the score
- Closer to 100: The summary captures key information, aligns with the source, avoids redundancy, and stays concise.
- Closer to 0: The summary omits important points, contradicts or adds unsupported facts, is redundant, or is disproportionately long.
This metric focuses on summarization quality: coverage, alignment, compression, and length. If entities are crucial for your use case, use entity faithfulness for a more robust check.
API usage
Prerequisites
After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
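As a sketch of the prerequisite setup, the example script later on this page reads `AEGIS_API_KEY` and `AEGIS_API_BASE_URL` via `python-dotenv`, so a minimal `.env` file might look like the following (both values are placeholders, not real credentials or endpoints):

```shell
# .env — loaded by load_dotenv() in the example script; values are placeholders
AEGIS_API_KEY=your-api-key-here
AEGIS_API_BASE_URL=https://aegis.example.com/api/v1
```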
Shortname: summ
Default threshold: 80
Inputs (each object in data)
- `input` (str, required): The original source text to be summarized.
- `output` (str, required): The model-generated summary to evaluate.
metric_args
- `n_key_points` (int, optional): Number of key points to extract from the input. If not provided, an optimal value is derived from the input length.
- `max_n_claims` (int, optional): Maximum number of claims to generate from the output for alignment checks. The optimal number of claims is calculated automatically; this parameter only caps it. Default = 50.
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with coverage and alignment diagnostics:
- `missing_details` (list[str]): Reasons corresponding to source details that are not fully represented in the summary.
- `misaligned_details` (list[str]): Reasons for summary-side content that does not align well with the source.
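These diagnostics can be iterated directly. The sketch below shows one way to surface them; since the exact response envelope is not documented on this page, the `eval_metadata` dict is a hand-written stand-in for illustration, not real API output:

```python
# Sketch: reading coverage/alignment diagnostics from eval_metadata.
# This dict is a hypothetical example, not an actual API response.
eval_metadata = {
    "missing_details": [
        "The summary omits that the pangram is used in typewriter testing.",
    ],
    "misaligned_details": [],
}

# Source details the summary failed to cover:
for reason in eval_metadata.get("missing_details", []):
    print(f"missing: {reason}")

# Summary-side content that does not align with the source:
for reason in eval_metadata.get("misaligned_details", []):
    print(f"misaligned: {reason}")
```

An empty `misaligned_details` list simply means no summary claim conflicted with the source; it does not imply full coverage, which is what `missing_details` reports.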
Example
```python
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {
            "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram that contains every letter of the English alphabet at least once. It is often used for testing purposes in typography, printing, and typewriters.",
            "output": "The quick brown fox pangram contains all alphabet letters and is used for testing.",
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "summ", "metric_args": {"n_key_points": 5, "max_n_claims": 5}},
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
```