Create custom run
Endpoint: POST /runs/custom
Description
Creates a run from the evaluations array in the JSON body; each element specifies the metrics to apply and the data rows to score.
Parameters
- Body (application/json):
{
"threshold": "integer | null",
"model_slug": "string | null",
"is_blocking": false,
"data_collection_id": "integer | null",
"alias": "string | null",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": "integer | null",
"model_slug": "string | null",
"data": [
{
"prompt": "string | null",
"input": "string | null",
"context": "string | null",
"output": "string | null",
"golden_answer": "string | null"
}
]
}
]
}
Error responses
- 400: empty evaluations; unknown or inactive metric shortnames; duplicate alias.
- 401, 402: authentication failure or insufficient credits.
- 404: model_slug is not a documented model slug; data_collection_id not found; no metrics resolved.
- 422: request validation failed.
- 500: failure while creating the run.
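A client can branch on these statuses directly. The helper below is a hypothetical sketch (the error body format is not documented here, so it surfaces only the status code and raw text):

def check_run_response(resp):
    """Map the documented statuses to client-side outcomes."""
    if resp.status_code == 201:
        return resp.json()  # the run object described under Responses
    if resp.status_code in (401, 402):
        raise PermissionError("Authentication failed or insufficient credits.")
    if resp.status_code in (400, 404, 422):
        raise ValueError(f"Request rejected ({resp.status_code}): {resp.text}")
    resp.raise_for_status()  # 500 or anything undocumented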
Responses
201: same run object shape as Get run.
Example response (201)
{
"id": 992,
"user": "analyst@acme.com",
"run_type": "Custom",
"run_source": "API",
"dataset": null,
"data_collection": "Customer Support",
"number_of_metrics": 1,
"result": 100,
"threshold": 70,
"model_slug": "gpt-4o",
"alias": "smoke-test",
"aggregate_results": {
"ans_corr": 100
},
"started_at": "2026-04-01T09:15:01Z",
"finished_at": "2026-04-01T09:15:03Z",
"is_gte_threshold": true,
"evaluations": []
}
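In this example, is_gte_threshold is true because result (100) meets the top-level threshold (70).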
Metric shortnames
Use these in the metrics array, either as bare strings or in object form via the metric key.
Shortnames are grouped into six categories: Content generation, General, RAG, Structural, Safety, and Security. The Structural shortnames are:
alpha_perc, alphanum_perc, bleu, char_ct_match, exact_match, is_boolean, is_date, is_numeric, is_string, is_valid_json, is_valid_python, is_valid_sql, is_valid_xml, json_equal, json_schema_match, numeric_match, par_ct_match, rouge, sent_ct_match, text_readability, word_ct_match, xml_equal, xml_schema_match
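Both forms can appear in the same metrics array; the Mixed metrics example below uses exactly this shape:
"metrics": ["exact_match", {"metric": "json_equal", "metric_args": {"ignore_extra_keys": true, "ignore_order": false}}]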
curl
curl -X POST "https://api.aegisevals.ai/api/v1/runs/custom" \
-H "Authorization: Bearer sk_00000000000000000000000000000000" \
-H "Content-Type: application/json" \
-d '{"threshold":70,"model_slug":"gpt-4o","is_blocking":false,"alias":"smoke-test","evaluations":[{"metrics":["ans_corr"],"data":[{"prompt":"What is 2+2?","output":"4","golden_answer":"4"}]}]}'
Examples
Several rows, one metric
{
"threshold": 75,
"model_slug": "gpt-4o",
"is_blocking": false,
"alias": "support-batch-2025-03-27",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": 75,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "What is your refund policy for annual plans?",
"output": "We refund unused months if you cancel within 14 days of renewal.",
"golden_answer": "Annual plans are refundable for the unused portion within 14 days of the renewal charge."
},
{
"prompt": "How do I export my data?",
"output": "Open Settings → Data → Export; you will get a CSV within a few minutes.",
"golden_answer": "Use Settings → Data → Export to download a CSV of your workspace."
}
]
}
]
}
RAG: context + answer metrics
Pass retrieved context with the model output. Here ctx_faith and ctx_rel run on the same rows.
{
"threshold": 70,
"model_slug": "gpt-4o-mini",
"is_blocking": false,
"evaluations": [
{
"metrics": ["ctx_faith", "ctx_rel"],
"threshold": 70,
"model_slug": "gpt-4o-mini",
"data": [
{
"input": "When did the Acme Corp fiscal year end in 2024?",
"context": "Acme Corp FY2024 ended on September 30, 2024. Revenue was $120M.",
"output": "Acme’s 2024 fiscal year ended on September 30, 2024.",
"golden_answer": null
}
]
}
]
}
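Note that golden_answer is null here: as the example suggests, these context metrics judge the output against the retrieved context, so no reference answer is required for the row.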
Two evaluation blocks
Run one block with a stricter threshold or a different model than another; for example, a cheap model for screening and a stronger model for a smaller slice.
{
"threshold": 80,
"model_slug": "gpt-4o-mini",
"is_blocking": false,
"evaluations": [
{
"metrics": ["ans_rel"],
"threshold": 60,
"model_slug": "gpt-4o-mini",
"data": [
{
"prompt": "Summarize our SLA in one sentence.",
"output": "We target 99.9% monthly uptime excluding scheduled maintenance."
}
]
},
{
"metrics": ["faith"],
"threshold": 85,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "What guarantees does the SLA provide?",
"context": "SLA: 99.9% uptime; credits apply if below target.",
"output": "The SLA promises 99.9% uptime and service credits if we miss it."
}
]
}
]
}
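As the example implies, a per-block threshold and model_slug take precedence over the top-level values for that block, which is what lets the screening block run on the cheaper model while the smaller slice gets the stronger model and the stricter bar.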
Mixed metrics list + metric_args
Use strings when no options are needed, and objects when a metric accepts metric_args (see that metric’s doc page).
{
"threshold": 100,
"is_blocking": false,
"evaluations": [
{
"metrics": [
"exact_match",
{
"metric": "json_equal",
"metric_args": {
"ignore_extra_keys": true,
"ignore_order": false
}
}
],
"threshold": 100,
"model_slug": "gpt-4o-mini",
"data": [
{
"output": "{\"status\":\"ok\",\"items\":[1,2]}",
"golden_answer": "{\"items\":[1,2],\"status\":\"ok\"}"
}
]
}
]
}
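When assembling payloads in code, a small helper (hypothetical; not part of any SDK) keeps the two forms straight:

def metric_spec(shortname, **metric_args):
    """Return a bare string when no options are given, else the object form."""
    return {"metric": shortname, "metric_args": metric_args} if metric_args else shortname

metrics = [
    metric_spec("exact_match"),
    metric_spec("json_equal", ignore_extra_keys=True, ignore_order=False),
]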
Python
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv(override=True)
API_KEY = os.environ["AEGIS_API_KEY"]
BASE = os.environ["AEGIS_API_BASE_URL"].rstrip("/")
payload = {
"threshold": 75,
"model_slug": "gpt-4o",
"is_blocking": False,
"alias": "python-example",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": 75,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "Capital of France?",
"output": "Paris is the capital of France.",
"golden_answer": "Paris.",
}
],
}
],
}
r = requests.post(
    f"{BASE}/runs/custom",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # requests serializes the body and sets Content-Type: application/json
    timeout=60,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
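The returned body can then gate a pipeline directly; the field names follow the 201 example above:

run = r.json()
if not run["is_gte_threshold"]:
    raise SystemExit(f"Run {run['id']} scored {run['result']}, below threshold {run['threshold']}.")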