Format Alignment (format_align)
Metric Description
Format alignment measures how well the model’s output follows formatting and style instructions—structure (JSON, Markdown, required sections), style (bullets vs paragraphs, concision), length or scope, tone, locale (dates, currency, spelling), audience, or brand-style constraints. It does not judge whether the substantive content is correct or grounded in a source; it focuses on how the answer is shaped relative to those instructions.
The evaluator first builds a list of instructions, either taken directly from metric_args.instructions or extracted from your prompt and/or input with an LLM (only explicit output-format rules are extracted). It then evaluates each instruction independently and combines the results into a single final score.
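For a sense of what extraction produces, a prompt such as "Answer in Markdown with exactly three bullet points, under 100 words" might yield an instruction list along these lines. This is an illustrative sketch only; the actual list depends on the extracting LLM:

# Illustrative only: instructions an extractor might derive from the prompt
# "Answer in Markdown with exactly three bullet points, under 100 words."
extracted_instructions = [
    "Format the answer as Markdown",
    "Use exactly three bullet points",
    "Keep the answer under 100 words",
]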
How to interpret the score
- Closer to 100: the output tends to satisfy most or all extracted (or supplied) formatting and style instructions.
- Closer to 0: many instructions are missed, or only weakly followed.
High format alignment does not mean the answer is factually correct, safe, or faithful to retrieved context. Pair with factfulness, content generation faithfulness, or other metrics when those matter. For longer outputs, consider using format consistency to make sure the entire text follows the same patterns.
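As a mental model, you can think of the combined score as the fraction of instructions satisfied. The sketch below is a simplification under that assumption, not the evaluator's exact aggregation, which may weight instructions or give partial credit for weakly followed ones:

def approximate_score(followed: int, total: int) -> float:
    """Simplified mental model: percentage of instructions satisfied.

    The real evaluator judges each instruction independently and may
    weight or partially credit them; treat this as an approximation.
    """
    return 100.0 * followed / total if total else 0.0

# 4 of 5 instructions followed -> 80.0, right at the default threshold.
assert approximate_score(4, 5) == 80.0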
API usage
Prerequisites
Set the AEGIS_API_KEY and AEGIS_API_BASE_URL environment variables used by the example below. After they are configured, the next step is to create a JSON payload for the custom-runs request. For a field-by-field description of the payload (top-level keys, evaluations, and each row in data), see Custom run request body.
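The example script reads these from a .env file; the values shown here are placeholders:

AEGIS_API_KEY=<your-api-key>
AEGIS_API_BASE_URL=<your-base-url>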
Shortname: format_align
Default threshold: 80
Inputs (each object in data)
- output (str, required): The model-generated text to evaluate.
- prompt (str, optional): System or developer instructions that may contain format/style rules. If both prompt and input are valid non-empty strings, they are combined before instruction extraction.
- input (str, optional): User message or task text that may contain format/style rules.
At least one of the following must be present so the metric can obtain instructions: a non-empty prompt, a non-empty input, or a non-empty instructions value under metric_args. If prompt and input are both missing or invalid and instructions is empty, evaluation cannot run.
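For instance, either of the following rows would be valid: the first carries its format rules in prompt, while the second has no prompt or input and is usable only when instructions is supplied through metric_args. Both rows are illustrative:

# Rules come from the prompt; the metric extracts them with an LLM.
row_with_prompt = {
    "prompt": "Reply in Markdown with a level-2 heading.",
    "output": "## Summary\nWalking is good for you.",
}

# No prompt or input: valid only if metric_args.instructions is non-empty.
row_output_only = {
    "output": "## Summary\nWalking is good for you.",
}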
metric_args
- instructions (str or list[str], optional): Explicit list of formatting/style instructions to check against output. If omitted or empty, instructions are extracted by an LLM from the combined prompt/input (or from whichever of prompt or input is available).
Evaluation metadata
On successful evaluation, the metric returns eval_metadata with structured details about instructions the output did not fully satisfy:
- unfollowed_instructions (list[str]): Descriptions of the instructions the output did not follow, each with a reason. If every instruction is followed, this list is empty.
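A failing row might come back with metadata shaped like the following. The structure matches the field described above, but the wording of each reason is produced by the evaluator and will vary:

{
  "eval_metadata": {
    "unfollowed_instructions": [
      "The output uses American English spelling despite the instruction to use British English."
    ]
  }
}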
Example
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv(override=True)

_API_KEY = os.getenv("AEGIS_API_KEY")
_BASE_URL = os.getenv("AEGIS_API_BASE_URL")
_CUSTOM_RUN_URL = f"{_BASE_URL}/runs/custom"


def post_custom_run(payload: dict) -> requests.Response:
    """POST JSON payload to Aegis custom runs; returns the raw response."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {_API_KEY}",
    }
    return requests.post(
        _CUSTOM_RUN_URL,
        headers=headers,
        data=json.dumps(payload),
    )


if __name__ == "__main__":
    # Each row supplies the output to grade plus the prompt/input that
    # carry the formatting rules.
    data = [
        {
            "prompt": "Respond in JSON with keys summary and bullets. Use British English.",
            "input": "Summarize the benefits of walking.",
            "output": '{"summary": "Walking improves health.", "bullets": ["Low impact", "Free"]}',
        },
    ]

    payload = {
        "threshold": 80,
        "model_slug": "o4-mini",
        "is_blocking": True,
        "data_collection_id": None,
        "evaluations": [
            {
                "metrics": [
                    {
                        "metric": "format_align",
                        # Explicit instructions; omit metric_args to have
                        # them extracted from prompt/input instead.
                        "metric_args": {
                            "instructions": ["Return JSON only", "Use British English"],
                        },
                    },
                ],
                "threshold": 80,
                "model_slug": "o4-mini",
                "data": data,
            }
        ],
    }

    response = post_custom_run(payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))
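Note that raise_for_status only confirms the HTTP request succeeded; whether a row counts as passing is determined by comparing its score against the threshold (80 here). In the parsed body, check each row's score and the eval_metadata.unfollowed_instructions list described above to see which formatting rules, if any, the output missed.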