Create custom run
Endpoint: POST /runs/custom
Description
Creates a run from the evaluations array in the JSON body; each element specifies the metrics to apply and the data rows to score.
Parameters
- Body (application/json):
{
"threshold": "integer | null",
"model_slug": "string | null",
"is_blocking": false,
"data_collection_id": "integer | null",
"alias": "string | null",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": "integer | null",
"model_slug": "string | null",
"data": [
{
"prompt": "string | null",
"input": "string | null",
"context": "string | null",
"output": "string | null",
"golden_answer": "string | null"
}
]
}
]
}
Error responses
- 400: empty evaluations; unknown or inactive metric shortnames; duplicate alias.
- 401, 402: authentication failure or insufficient credits.
- 404: model_slug is not a documented model slug; data_collection_id not found; no metrics resolved.
- 422: request validation failed.
- 500: failure while creating the run.
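A client can branch on these statuses directly. The helper below is a hypothetical sketch (the error body format is not documented here, so it surfaces only the status code and raw text):

def check_run_response(resp):
    """Map the documented statuses to client-side outcomes."""
    if resp.status_code == 201:
        return resp.json()  # the run object described under Responses
    if resp.status_code in (401, 402):
        raise PermissionError("Authentication failed or insufficient credits.")
    if resp.status_code in (400, 404, 422):
        raise ValueError(f"Request rejected ({resp.status_code}): {resp.text}")
    resp.raise_for_status()  # 500 or anything undocumented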
Responses
201: same run object shape as Get run.
Example response (201)
{
"id": 992,
"user": "analyst@acme.com",
"run_type": "Custom",
"run_source": "API",
"dataset": null,
"data_collection": "Customer Support",
"number_of_metrics": 1,
"result": 100,
"threshold": 70,
"model_slug": "gpt-4o",
"alias": "smoke-test",
"aggregate_results": {
"ans_corr": 100
},
"started_at": "2026-04-01T09:15:01Z",
"finished_at": "2026-04-01T09:15:03Z",
"is_gte_threshold": true,
"evaluations": []
}
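In this example, is_gte_threshold is true because result (100) meets the top-level threshold (70).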
Metric shortnames
Use these in the metrics array, either as bare strings or in object form via the metric key.
Shortnames are grouped into six categories: Content generation, General, RAG, Structural, Safety, and Security. The Structural shortnames are:
alpha_perc, alphanum_perc, bleu, char_ct_match, exact_match, is_boolean, is_date, is_numeric, is_string, is_valid_json, is_valid_python, is_valid_sql, is_valid_xml, json_equal, json_schema_match, numeric_match, par_ct_match, rouge, sent_ct_match, text_readability, word_ct_match, xml_equal, xml_schema_match
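Both forms can appear in the same metrics array; the Mixed metrics example below uses exactly this shape:
"metrics": ["exact_match", {"metric": "json_equal", "metric_args": {"ignore_extra_keys": true, "ignore_order": false}}]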
curl
curl -X POST "https://api.aegisevals.ai/api/v1/runs/custom" \
-H "Authorization: Bearer sk_00000000000000000000000000000000" \
-H "Content-Type: application/json" \
-d '{"threshold":70,"model_slug":"gpt-4o","is_blocking":false,"alias":"smoke-test","evaluations":[{"metrics":["ans_corr"],"data":[{"prompt":"What is 2+2?","output":"4","golden_answer":"4"}]}]}'
Examples
Several rows, one metric
{
"threshold": 75,
"model_slug": "gpt-4o",
"is_blocking": false,
"alias": "support-batch-2025-03-27",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": 75,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "What is your refund policy for annual plans?",
"output": "We refund unused months if you cancel within 14 days of renewal.",
"golden_answer": "Annual plans are refundable for the unused portion within 14 days of the renewal charge."
},
{
"prompt": "How do I export my data?",
"output": "Open Settings → Data → Export; you will get a CSV within a few minutes.",
"golden_answer": "Use Settings → Data → Export to download a CSV of your workspace."
}
]
}
]
}
RAG: context + answer metrics
Pass retrieved context with the model output. Here ctx_faith and ctx_rel run on the same rows.
{
"threshold": 70,
"model_slug": "gpt-4o-mini",
"is_blocking": false,
"evaluations": [
{
"metrics": ["ctx_faith", "ctx_rel"],
"threshold": 70,
"model_slug": "gpt-4o-mini",
"data": [
{
"input": "When did the Acme Corp fiscal year end in 2024?",
"context": "Acme Corp FY2024 ended on September 30, 2024. Revenue was $120M.",
"output": "Acme’s 2024 fiscal year ended on September 30, 2024.",
"golden_answer": null
}
]
}
]
}
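Note that golden_answer is null here: as the example suggests, these context metrics judge the output against the retrieved context, so no reference answer is required for the row.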
Two evaluation blocks
Run one block with a stricter threshold or a different model than another; for example, a cheap model for screening and a stronger model for a smaller slice.
{
"threshold": 80,
"model_slug": "gpt-4o-mini",
"is_blocking": false,
"evaluations": [
{
"metrics": ["ans_rel"],
"threshold": 60,
"model_slug": "gpt-4o-mini",
"data": [
{
"prompt": "Summarize our SLA in one sentence.",
"output": "We target 99.9% monthly uptime excluding scheduled maintenance."
}
]
},
{
"metrics": ["faith"],
"threshold": 85,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "What guarantees does the SLA provide?",
"context": "SLA: 99.9% uptime; credits apply if below target.",
"output": "The SLA promises 99.9% uptime and service credits if we miss it."
}
]
}
]
}
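As the example implies, a per-block threshold and model_slug take precedence over the top-level values for that block, which is what lets the screening block run on the cheaper model while the smaller slice gets the stronger model and the stricter bar.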
Mixed metrics list + metric_args
Use strings when no options are needed, and objects when a metric accepts metric_args (see that metric’s doc page).
{
"threshold": 100,
"is_blocking": false,
"evaluations": [
{
"metrics": [
"exact_match",
{
"metric": "json_equal",
"metric_args": {
"ignore_extra_keys": true,
"ignore_order": false
}
}
],
"threshold": 100,
"model_slug": "gpt-4o-mini",
"data": [
{
"output": "{\"status\":\"ok\",\"items\":[1,2]}",
"golden_answer": "{\"items\":[1,2],\"status\":\"ok\"}"
}
]
}
]
}
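When assembling payloads in code, a small helper (hypothetical; not part of any SDK) keeps the two forms straight:

def metric_spec(shortname, **metric_args):
    """Return a bare string when no options are given, else the object form."""
    return {"metric": shortname, "metric_args": metric_args} if metric_args else shortname

metrics = [
    metric_spec("exact_match"),
    metric_spec("json_equal", ignore_extra_keys=True, ignore_order=False),
]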
Python
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv(override=True)
API_KEY = os.environ["AEGIS_API_KEY"]
BASE = os.environ["AEGIS_API_BASE_URL"].rstrip("/")
payload = {
"threshold": 75,
"model_slug": "gpt-4o",
"is_blocking": False,
"alias": "python-example",
"evaluations": [
{
"metrics": ["ans_corr"],
"threshold": 75,
"model_slug": "gpt-4o",
"data": [
{
"prompt": "Capital of France?",
"output": "Paris is the capital of France.",
"golden_answer": "Paris.",
}
],
}
],
}
r = requests.post(
    f"{BASE}/runs/custom",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # requests serializes the body and sets Content-Type: application/json
    timeout=60,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
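The returned body can then gate a pipeline directly; the field names follow the 201 example above:

run = r.json()
if not run["is_gte_threshold"]:
    raise SystemExit(f"Run {run['id']} scored {run['result']}, below threshold {run['threshold']}.")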