ROUGE Score (rouge)

Metric description

ROUGE measures recall-oriented n-gram overlap between the model output and the golden answer. Choose ROUGE-N by passing an integer n, or ROUGE-L by passing the string "L". Optional stemming and case-sensitivity flags adjust tokenization.
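As an illustration (a simplified sketch, not the service's exact implementation), ROUGE-N F1 can be computed in pure Python from clipped n-gram overlap counts:

```python
from collections import Counter


def rouge_n_f1(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N F1 from overlapping n-gram counts (simplified sketch)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped overlap counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Same pair as the API example below; *100 maps F1 onto the 0-100 scale.
score = rouge_n_f1("the cat sat", "the cat sat on the mat", n=2)
print(round(score * 100, 1))  # → 57.1
```

Here the candidate contributes 2 bigrams, the reference 5, and both overlap, giving precision 1.0, recall 0.4, and F1 ≈ 0.571.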

How to interpret the score

  • Closer to 100: higher F1-style overlap in the selected ROUGE variant.
  • Closer to 0: little overlap with the reference.

API usage

Prerequisites

After the environment variables (AEGIS_API_KEY and AEGIS_API_BASE_URL) are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
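The example script below loads these variables with python-dotenv. A minimal .env file might look like the following (both values are placeholders; substitute your real key and base URL):

```shell
# .env — placeholder values only
AEGIS_API_KEY=your-api-key-here
AEGIS_API_BASE_URL=https://aegis.example.com/api/v1
```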

Shortname: rouge

Default threshold: 70

Structural metrics run without an LLM (deterministic checks). Your run may still include model_slug where the API expects it; scoring does not depend on it for this category.

Inputs (each object in data)

  • output (str, required): Candidate text.
  • golden_answer (str, required): Reference text.

metric_args

  • rouge_type (number | string, optional): ROUGE-N order as an integer, or "L" for ROUGE-L. Default: 2.

  • use_stemmer (boolean, optional): Apply stemming when tokenizing. Default: false.

  • case_sensitive (boolean, optional): Compare with case sensitivity. Default: false.
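Passing rouge_type="L" switches from fixed n-grams to longest-common-subsequence matching. A rough pure-Python sketch of ROUGE-L F1, including the case_sensitive flag (an illustration under assumed tokenization, not the service's implementation):

```python
def rouge_l_f1(candidate: str, reference: str, case_sensitive: bool = False) -> float:
    """ROUGE-L F1 via longest common subsequence of tokens (simplified sketch)."""
    a = candidate.split() if case_sensitive else candidate.lower().split()
    b = reference.split() if case_sensitive else reference.lower().split()
    # Dynamic-programming LCS table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


print(round(rouge_l_f1("the cat sat", "the cat sat on the mat") * 100, 1))  # → 66.7
```

With case_sensitive=False (the default), tokens are lowercased before matching, so "The Cat" and "the cat" match fully; with case_sensitive=True they do not.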

Eval metadata

Structural metrics do not populate eval_metadata; the field is omitted or null on the result object.

Example

import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    data = [
        {"output": "the cat sat", "golden_answer": "the cat sat on the mat"}
    ]

    payload = {
        "threshold": 70,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "rouge",
                        "metric_args": {
                            "rouge_type": 2,
                            "use_stemmer": False,
                            "case_sensitive": False,
                        },
                    },
                ],
                "threshold": 70,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))