Format Alignment (format_align)
Metric Description
Format alignment measures how well the model’s output follows formatting and style instructions—structure (JSON, Markdown, required sections), style (bullets vs paragraphs, concision), length or scope, tone, locale (dates, currency, spelling), audience, or brand-style constraints. It does not judge whether the substantive content is correct or grounded in a source; it focuses on how the answer is shaped relative to those instructions.
The evaluator first builds a list of instructions, either taken directly from metric_args.instructions or extracted from your prompt and/or input with an LLM (only explicit output-format rules are extracted). It then evaluates each instruction independently and combines the results into a single final score.
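For a sense of what extraction produces, a prompt such as "Answer in Markdown with exactly three bullet points, under 100 words" might yield an instruction list along these lines. This is an illustrative sketch only; the actual list depends on the extracting LLM:

# Illustrative only: instructions an extractor might derive from the prompt
# "Answer in Markdown with exactly three bullet points, under 100 words."
extracted_instructions = [
    "Format the answer as Markdown",
    "Use exactly three bullet points",
    "Keep the answer under 100 words",
]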
How to interpret the score
- Closer to 100: the output tends to satisfy most or all extracted (or supplied) formatting and style instructions.
- Closer to 0: many instructions are missed, or only weakly followed.
High format alignment does not mean the answer is factually correct, safe, or faithful to retrieved context. Pair with factfulness, content generation faithfulness, or other metrics when those matter. For longer outputs, consider using format consistency to make sure the entire text follows the same patterns.
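As a mental model, you can think of the combined score as the fraction of instructions satisfied. The sketch below is a simplification under that assumption, not the evaluator's exact aggregation, which may weight instructions or give partial credit for weakly followed ones:

def approximate_score(followed: int, total: int) -> float:
    """Simplified mental model: percentage of instructions satisfied.

    The real evaluator judges each instruction independently and may
    weight or partially credit them; treat this as an approximation.
    """
    return 100.0 * followed / total if total else 0.0

# 4 of 5 instructions followed -> 80.0, right at the default threshold.
assert approximate_score(4, 5) == 80.0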
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. After they are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
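The example script reads these from a .env file; the values shown here are placeholders:

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-base-url>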
Shortname: format_align
Default threshold: 80
Inputs (each object in data)
- output (str, required): The model-generated text to evaluate.
- prompt (str, optional): System or developer instructions that may contain format/style rules. If both prompt and input are valid non-empty strings, they are combined before instruction extraction.
- input (str, optional): User message or task text that may contain format/style rules.
At least one of the following must be present so the metric can obtain instructions: a non-empty prompt, a non-empty input, or a non-empty instructions value under metric_args. If prompt and input are both missing or invalid and instructions is empty, evaluation cannot run.
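For instance, either of the following rows would be valid: the first carries its format rules in prompt, while the second has no prompt or input and is usable only when instructions is supplied through metric_args. Both rows are illustrative:

# Rules come from the prompt; the metric extracts them with an LLM.
row_with_prompt = {
    "prompt": "Reply in Markdown with a level-2 heading.",
    "output": "## Summary\nWalking is good for you.",
}

# No prompt or input: valid only if metric_args.instructions is non-empty.
row_output_only = {
    "output": "## Summary\nWalking is good for you.",
}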
metric_args
- instructions (str or list[str], optional): Explicit list of formatting/style instructions to check against output. If omitted or empty, instructions are extracted by an LLM from the combined prompt/input (or from whichever of prompt or input is available).
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with structured details about instructions the output did not fully satisfy:
- unfollowed_instructions (list[str]): Descriptions of the instructions the output did not follow, each with a reason. If every instruction is followed, this list is empty.
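A failing row might come back with metadata shaped like the following. The structure matches the field described above, but the wording of each reason is produced by the evaluator and will vary:

{
  "eval_metadata": {
    "unfollowed_instructions": [
      "The output uses American English spelling despite the instruction to use British English."
    ]
  }
}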
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Each row supplies the output to grade plus the prompt/input that
    # carry the formatting rules.
    data = [
        {
            "prompt": "Respond in JSON with keys summary and bullets. Use British English.",
            "input": "Summarize the benefits of walking.",
            "output": '{"summary": "Walking improves health.", "bullets": ["Low impact", "Free"]}',
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "format_align",
                        # Explicit instructions; omit metric_args to have
                        # them extracted from prompt/input instead.
                        "metric_args": {
                            "instructions": ["Return JSON only", "Use British English"],
                        },
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
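Note that raise_for_status only confirms the HTTP request succeeded; whether a row counts as passing is determined by comparing its score against the threshold (80 here). In the parsed body, check each row's score and the eval_metadata.unfollowed_instructions list described above to see which formatting rules, if any, the output missed.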