Introduction
Why use metrics?
Metrics turn model behavior into measurable signals: counts, pass/fail checks, and graded scores over prompts, completions, retrieved context, or golden references. Use them to compare runs, catch regressions after changes, and gate releases on thresholds your team agrees to.
No single metric captures everything. In practice you combine broad coverage (many cases), complementary checks (for example factual alignment and format validity), and human review where automation is thin.
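To make threshold gating concrete, here is a minimal sketch that passes a run only when every agreed metric clears its bar. The metric names and thresholds are hypothetical, not Aegis defaults.

```python
# Hypothetical release gate: fail when any agreed metric threshold is
# missed. Metric names and bars are illustrative, not Aegis defaults.
THRESHOLDS = {"faithfulness": 80.0, "format_validity": 95.0}

def gate(run_averages: dict[str, float]) -> bool:
    """Return True when every thresholded metric meets its bar."""
    return all(run_averages.get(name, 0.0) >= bar
               for name, bar in THRESHOLDS.items())

print(gate({"faithfulness": 84.2, "format_validity": 97.0}))  # True
print(gate({"faithfulness": 71.5, "format_validity": 97.0}))  # False
```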
How metrics are scored
For most metrics, Aegis blends LLM-as-a-judge evaluation with heuristic signals—format checks, parsing, or comparison to references and rubrics—so scores reflect both semantic judgment and verifiable evidence. The reported value is a continuous score from 0 to 100; higher scores mean the output better satisfies that metric’s criteria.
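As a hedged illustration of such a blend, the sketch below combines a hypothetical judge score in [0, 1] with a pass/fail heuristic into a 0 to 100 value. The 70/30 weighting is an assumption for illustration, not Aegis's documented formula.

```python
def blended_score(judge_score: float, heuristic_pass: bool,
                  judge_weight: float = 0.7) -> float:
    """Blend a [0, 1] judge score with a pass/fail heuristic into 0-100.

    The weighting here is hypothetical; the real metric may combine
    its signals differently.
    """
    heuristic_score = 1.0 if heuristic_pass else 0.0
    blended = judge_weight * judge_score + (1 - judge_weight) * heuristic_score
    return round(100 * blended, 1)

print(blended_score(0.9, heuristic_pass=True))   # 93.0
print(blended_score(0.9, heuristic_pass=False))  # 63.0
```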
Structural metrics are the main exception: they are fully deterministic and do not call a judge model. They inspect outputs (and reference answers where applicable) with rules or classical scoring.
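By contrast, a structural check can be pure rules. A minimal sketch, assuming a JSON-validity metric scored as a strict 0-or-100 pass/fail (the convention here is illustrative):

```python
import json

def json_validity_score(output: str) -> float:
    """Deterministic structural check: 100 if the output parses as
    JSON, 0 otherwise. No judge model is involved."""
    try:
        json.loads(output)
        return 100.0
    except json.JSONDecodeError:
        return 0.0

print(json_validity_score('{"answer": 42}'))  # 100.0
print(json_validity_score('not json'))        # 0.0
```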
Using this documentation
Each metric page includes a shortname (for APIs and configs), the fields the evaluator expects, an example payload, a run example (a sample custom-run request for that metric), and optional metric_args when the metric is configurable. For more ways to structure evaluations—several rows, RAG-style blocks, multiple metrics, and similar patterns—see Custom run examples.
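For orientation, here is a hedged sketch of what a single-row custom-run payload might look like. The shortname, field names, and metric_args shown are placeholders; each metric page documents the real ones.

```python
# Hypothetical single-row custom-run payload. The shortname
# ("answer_relevancy"), field names, and metric_args are placeholders;
# consult the metric page for the actual schema.
import json

payload = {
    "metrics": [
        {
            "shortname": "answer_relevancy",        # from the metric page
            "metric_args": {"strictness": "high"},  # only if configurable
        }
    ],
    "rows": [
        {
            "prompt": "What is the capital of France?",
            "completion": "Paris is the capital of France.",
            "golden_reference": "Paris.",
        }
    ],
}

print(json.dumps(payload, indent=2))
```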
Categories group metrics by use case:
- Structural — deterministic checks only; no LLM judge (see How metrics are scored).
- General, RAG, content generation, security, and safety — usually model- or rubric-based, suited to semantics, retrieval, or risk. See each category index for the full list and behavior details.
Choose metrics that fit your task (chat versus RAG versus structured output) and your risk profile (quality, safety, security). Start with a small set, observe variance across runs, then add checks where failures repeat.
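One way to act on that advice is to keep a small per-task starter set and grow it only where failures repeat. The metric names below are illustrative, not a prescribed Aegis configuration.

```python
# Illustrative starter sets per task type; extend a list only where
# repeated failures show a gap. Metric names are placeholders.
STARTER_METRICS = {
    "chat": ["answer_quality", "harmful_content"],
    "rag": ["context_faithfulness", "context_relevancy"],
    "structured_output": ["json_schema_validity", "format_consistency"],
}

def metrics_for(task: str) -> list[str]:
    """Return the starter metric set for a task, defaulting to chat."""
    return STARTER_METRICS.get(task, STARTER_METRICS["chat"])

print(metrics_for("rag"))  # ['context_faithfulness', 'context_relevancy']
```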
Metrics categories
The cards below link to each category overview and the metrics listed there. Security and Safety follow the same documentation pattern.
General
Core quality metrics for answers, prompts, faithfulness, and summarization.
RAG
Retrieval metrics: context faithfulness, relevancy, recall, sufficiency, and more.
Content Generation
Format alignment, format consistency, and content generation faithfulness.
Structural
Deterministic checks: syntax, schemas, counts, readability, BLEU/ROUGE, and equality to references.
Security
Proprietary data leakage and threat detection to keep your system secure.
Safety
Safety metrics, including harmful content and bias detection, to keep the user safe.