Answer Correctness (ans_corr)

Metric Description

Answer correctness measures how well the model’s output matches a reference (golden) answer in terms of factual content.

The score ranges from 0 (no match) to 100 (perfect match). The implementation blends (1) an LLM-as-a-Judge assessment of how well the answer stays on topic and (2) heuristic metrics.

This metric evaluates content correctness only, not format. Output and golden answer may use different formats (e.g., golden in JSON, output in XML); if they convey the same information, they can receive a high score. For format-related validation, consider structural metrics (e.g. json_schema_match).
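
For instance, the two hypothetical outputs below express the same fact in different formats; compared against the same golden answer, both could score highly because the content matches (an illustrative pair, not taken from the metric's implementation):

# Two hypothetical model outputs that convey the same fact in different formats.
output_json = '{"capital_of_france": "Paris"}'
output_xml = "<answer><capital_of_france>Paris</capital_of_france></answer>"

# Both would be compared against the same golden answer on content, not structure.
golden_answer = "Paris is the capital of France."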

How to interpret the score

  • Closer to 100: the output covers the expected facts from the golden answer with few extra or missing claims.
  • Closer to 0: the output omits important facts, adds incorrect claims, or deviates significantly from the golden answer.
Important

Answer correctness requires a golden answer as ground truth. For use cases without a reference answer (e.g., open-ended generation), consider metrics like answer relevancy or faithfulness instead.

API usage

Prerequisites

Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the client (for example, in a .env file). After the environment variables are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
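
As a minimal sketch, assuming the same variable names as the example script at the bottom of this page, the environment can be prepared and checked like this (the .env file and python-dotenv usage are optional):

import os

from dotenv import load_dotenv

# Load AEGIS_API_KEY and AEGIS_API_BASE_URL from a local .env file, if present.
load_dotenv(override=True)

for var in ("AEGIS_API_KEY", "AEGIS_API_BASE_URL"):
    if not os.getenv(var):
        raise RuntimeError(f"Missing required environment variable: {var}")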

Shortname: ans_corr

Default threshold: 80

Inputs (each object in data)

  • output (str, required): The model-generated answer to evaluate.
  • golden_answer (str, required): The reference (ground truth) answer to compare against.

metric_args

  • max_n_statements (int, optional): Maximum number of statements to extract from each text for comparison. The optimal number of statements is calculated automatically; this parameter only sets an upper cap. Default = 50.

Evaluation metadata

On successful evaluation, the metric returns eval_metadata with statement-level gaps between the output and the golden answer:

  • statements_not_covered_in_golden_answer (list[dict]): Entries for statements produced from the output that are not fully reflected in the golden answer. Each item has statement (the output-side statement text) and reason (why it is not fully covered).
  • statements_not_covered_in_output (list[dict]): Entries for statements produced from the golden answer that are not fully reflected in the output. Each item has statement (the golden-side statement text) and reason (why it is not fully covered).
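
The surrounding response envelope depends on the run, but the eval_metadata for a single row might look like the sketch below (illustrative statements and reasons, not real output):

# Illustrative eval_metadata for one evaluated row (values are made up).
eval_metadata = {
    "statements_not_covered_in_golden_answer": [
        {
            "statement": "Paris has more than two million inhabitants.",
            "reason": "The golden answer says nothing about the city's population.",
        },
    ],
    "statements_not_covered_in_output": [
        {
            "statement": "Paris lies on the Seine river.",
            "reason": "The output does not mention the river.",
        },
    ],
}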

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Records must match the columns your metric expects (output, golden_answer).
    data = [
        {
            "output": "Paris is the capital of France.",
            "golden_answer": "Paris is the capital of France.",
        },
    ]

    payload = {
        "threshold": 80,  # threshold on the run level
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {"metric": "ans_corr", "metric_args": {"max_n_statements": 5}},
                ],
                "threshold": 80,  # threshold on the metric level
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
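
Both threshold fields in the payload are set to the metric's default of 80, presumably meaning a row passes when its ans_corr score meets or exceeds that value. With is_blocking set to True, the printed response is expected to include the per-row results, including the eval_metadata fields described above.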