Introduction
Why use metrics?
Metrics turn model behavior into measurable signals: counts, pass/fail checks, and graded scores over prompts, completions, retrieved context, or golden references. Use them to compare runs, catch regressions after changes, and gate releases on thresholds your team agrees to.
No single metric captures everything. In practice you combine broad coverage (many cases), complementary checks (for example factual alignment and format validity), and human review where automation is thin.
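To make threshold gating concrete, here is a minimal sketch that passes a run only when every agreed metric clears its bar. The metric names and thresholds are hypothetical, not Aegis defaults.

```python
# Hypothetical release gate: fail when any agreed metric threshold is
# missed. Metric names and bars are illustrative, not Aegis defaults.
THRESHOLDS = {"faithfulness": 80.0, "format_validity": 95.0}

def gate(run_averages: dict[str, float]) -> bool:
    """Return True when every thresholded metric meets its bar."""
    return all(run_averages.get(name, 0.0) >= bar
               for name, bar in THRESHOLDS.items())

print(gate({"faithfulness": 84.2, "format_validity": 97.0}))  # True
print(gate({"faithfulness": 71.5, "format_validity": 97.0}))  # False
```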
How metrics are scored
For most metrics, Aegis blends LLM-as-a-judge evaluation with heuristic signals—format checks, parsing, or comparison to references and rubrics—so scores reflect both semantic judgment and verifiable evidence. The reported value is a continuous score from 0 to 100; higher scores mean the output better satisfies that metric’s criteria.
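As a hedged illustration of such a blend, the sketch below combines a hypothetical judge score in [0, 1] with a pass/fail heuristic into a 0 to 100 value. The 70/30 weighting is an assumption for illustration, not Aegis's documented formula.

```python
def blended_score(judge_score: float, heuristic_pass: bool,
                  judge_weight: float = 0.7) -> float:
    """Blend a [0, 1] judge score with a pass/fail heuristic into 0-100.

    The weighting here is hypothetical; the real metric may combine
    its signals differently.
    """
    heuristic_score = 1.0 if heuristic_pass else 0.0
    blended = judge_weight * judge_score + (1 - judge_weight) * heuristic_score
    return round(100 * blended, 1)

print(blended_score(0.9, heuristic_pass=True))   # 93.0
print(blended_score(0.9, heuristic_pass=False))  # 63.0
```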
Structural metrics are the main exception: they are fully deterministic and do not call a judge model. They inspect outputs (and reference answers where applicable) with rules or classical scoring.
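By contrast, a structural check can be pure rules. A minimal sketch, assuming a JSON-validity metric scored as a strict 0-or-100 pass/fail (the convention here is illustrative):

```python
import json

def json_validity_score(output: str) -> float:
    """Deterministic structural check: 100 if the output parses as
    JSON, 0 otherwise. No judge model is involved."""
    try:
        json.loads(output)
        return 100.0
    except json.JSONDecodeError:
        return 0.0

print(json_validity_score('{"answer": 42}'))  # 100.0
print(json_validity_score('not json'))        # 0.0
```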
Using this documentation
Each metric page includes a shortname (for APIs and configs), the fields the evaluator expects, an example payload, a run example (a sample custom-run request for that metric), and optional metric_args when the metric is configurable. For more ways to structure evaluations—several rows, RAG-style blocks, multiple metrics, and similar patterns—see Custom run examples.
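For orientation, here is a hedged sketch of what a single-row custom-run payload might look like. The shortname, field names, and metric_args shown are placeholders; each metric page documents the real ones.

```python
# Hypothetical single-row custom-run payload. The shortname
# ("answer_relevancy"), field names, and metric_args are placeholders;
# consult the metric page for the actual schema.
import json

payload = {
    "metrics": [
        {
            "shortname": "answer_relevancy",        # from the metric page
            "metric_args": {"strictness": "high"},  # only if configurable
        }
    ],
    "rows": [
        {
            "prompt": "What is the capital of France?",
            "completion": "Paris is the capital of France.",
            "golden_reference": "Paris.",
        }
    ],
}

print(json.dumps(payload, indent=2))
```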
Categories group metrics by use case:
- Structural — deterministic checks only; no LLM judge (see How metrics are scored).
- General, RAG, content generation, security, and safety — usually model- or rubric-based, suited to semantics, retrieval, or risk. See each category index for the full list and behavior details.
Choose metrics that fit your task (chat versus RAG versus structured output) and your risk profile (quality, safety, security). Start with a small set, observe variance across runs, then add checks where failures repeat.
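One way to act on that advice is to keep a small per-task starter set and grow it only where failures repeat. The metric names below are illustrative, not a prescribed Aegis configuration.

```python
# Illustrative starter sets per task type; extend a list only where
# repeated failures show a gap. Metric names are placeholders.
STARTER_METRICS = {
    "chat": ["answer_quality", "harmful_content"],
    "rag": ["context_faithfulness", "context_relevancy"],
    "structured_output": ["json_schema_validity", "format_consistency"],
}

def metrics_for(task: str) -> list[str]:
    """Return the starter metric set for a task, defaulting to chat."""
    return STARTER_METRICS.get(task, STARTER_METRICS["chat"])

print(metrics_for("rag"))  # ['context_faithfulness', 'context_relevancy']
```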
Metrics categories
The cards below link to each category overview and the metrics listed there. Security and Safety follow the same documentation pattern.
General
Core quality metrics for answers, prompts, faithfulness, and summarization.
RAG
Retrieval metrics: context faithfulness, relevancy, recall, sufficiency, and more.
Content Generation
Format alignment, format consistency, and content generation faithfulness.
Structural
Deterministic checks: syntax, schemas, counts, readability, BLEU/ROUGE, and equality to references.
Security
Proprietary data leakage and threat detection to keep your system secure.
Safety
Safety metrics, including harmful content and bias detection, to keep the user safe.