Characteristics
- An eval has a stable ID.
- An eval targets an agent in V1.
- An eval loads examples from inline data, JSON, or JSONL.
- An eval records `input`, optional `reference`, and optional `metadata` for each example.
- An eval uses metrics such as exact match, contains, JSON match, no failed tools, latency, tokens, cost, and rubric judges.
- An eval produces a report with records, metric summaries, pass rate, and optional JUnit XML output.
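To make the metric and pass-rate ideas concrete, here is a minimal sketch of three of the metrics listed above and a pass-rate calculation. The `Metric` type and every function name are assumptions made for illustration, not the `veryfront/eval` API. The whitespace trimming in `exactMatch` and the key-order-insensitive comparison in `jsonMatch` are also assumptions about how such metrics might behave.

```typescript
// Hypothetical names for illustration only; not the veryfront/eval API.
type Metric = (output: string, reference: unknown) => boolean;

// Exact match: output equals the reference string (ignoring outer whitespace).
const exactMatch: Metric = (output, reference) =>
  typeof reference === "string" && output.trim() === reference.trim();

// Contains: the reference string appears somewhere in the output.
const contains: Metric = (output, reference) =>
  typeof reference === "string" && output.includes(reference);

// Serialize a JSON value with sorted object keys so key order is ignored.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    const body = keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

// JSON match: output parses as JSON and equals the reference structurally.
const jsonMatch: Metric = (output, reference) => {
  try {
    return canonical(JSON.parse(output)) === canonical(reference);
  } catch {
    return false; // output was not valid JSON
  }
};

// Pass rate: fraction of records that passed; 0 for an empty run.
function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter(Boolean).length / results.length;
}
```

Budget-style metrics such as latency, tokens, and cost would more likely compare a measured number against a threshold, and a rubric judge would call a model, so they are left out of this sketch.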
Boundary
An eval is the definition. An eval run is one execution of that definition. A report is the result of the run. Durable eval runs use run kind `eval` and target
IDs such as `eval:deep-research`.
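The three concepts can be modeled as separate types. The sketch below is illustrative only: the interface names, field names, and the `startRun` and `summarize` helpers are assumptions, not the shapes used by `veryfront/eval`. It also assumes that a run's target ID is the eval's own ID.

```typescript
// Hypothetical shapes for illustration; all field names are assumptions.
interface EvalDefinition {
  id: string; // stable ID, e.g. "eval:deep-research"
  agent: string; // the agent this eval targets
}

interface EvalRun {
  kind: "eval"; // run kind used for durable eval runs
  targetId: string; // assumed here to equal the eval's ID
  startedAt: string; // ISO timestamp
}

interface EvalReport {
  targetId: string;
  total: number;
  passed: number;
  passRate: number;
}

// One definition can be executed many times; each execution is a new run.
function startRun(def: EvalDefinition, now: Date = new Date()): EvalRun {
  return { kind: "eval", targetId: def.id, startedAt: now.toISOString() };
}

// A report summarizes the per-example outcomes of a single run.
function summarize(run: EvalRun, outcomes: boolean[]): EvalReport {
  const passed = outcomes.filter(Boolean).length;
  const total = outcomes.length;
  return {
    targetId: run.targetId,
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
  };
}
```

Keeping the definition immutable and creating a fresh run object per execution makes it straightforward to compare reports from different runs of the same eval.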
Keep evals separate from tests. Tests protect deterministic code behavior. Evals
measure probabilistic agent behavior, retrieval behavior, tool behavior, and
operational budgets across datasets.
Source files
Eval files live in `evals/` and export an eval definition. By default, the eval
ID is derived from the file path, for example `eval:deep-research`. You can set `id` explicitly when a
stable ID must differ from the file path.
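The relationship between a file path and an eval ID can be sketched as follows. This is a hedged illustration: the `EvalDefinition` shape, the `idFromPath` and `resolveId` helpers, the `.ts` extension, and the exact path-to-ID rule are all assumptions, not the framework's documented behavior.

```typescript
// Hypothetical sketch; none of these names come from veryfront/eval.
interface EvalDefinition {
  id?: string; // optional explicit override
  agent: string; // target agent
}

// Assumed rule: strip everything up to "evals/" and the file extension,
// then prefix with "eval:".
function idFromPath(filePath: string): string {
  const name = filePath
    .replace(/^.*evals\//, "")
    .replace(/\.[^./]+$/, "");
  return `eval:${name}`;
}

// An explicit id wins; otherwise fall back to the path-derived ID.
function resolveId(def: EvalDefinition, filePath: string): string {
  return def.id ?? idFromPath(filePath);
}

// A definition that relies on its file path for the ID.
const deepResearch: EvalDefinition = { agent: "deep-research" };

// A definition that pins its ID so the file can be moved or renamed.
const pinned: EvalDefinition = { id: "eval:deep-research", agent: "deep-research" };
```

Under these assumptions, both definitions resolve to `eval:deep-research` when the first lives at `evals/deep-research.ts`, and the second keeps that ID even if its file is later renamed.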
Dataset fields
| Field | Meaning |
|---|---|
| `id` | Stable example identifier used in reports. |
| `input` | Prompt or structured input sent to the target agent. |
| `reference` | Expected answer, JSON object, or rubric reference. |
| `metadata` | Tags, split names, difficulty, owner, or traceable labels. |
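As an illustration of these fields in a JSONL dataset, the sketch below parses one example per line and validates the required fields. The `EvalExample` interface and the `parseJsonl` function are hypothetical names, and treating `id` as required is an assumption. The framework's own loader may behave differently.

```typescript
// Hypothetical example shape based on the field table; not the library's type.
interface EvalExample {
  id: string;
  input: unknown;
  reference?: unknown;
  metadata?: Record<string, unknown>;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): EvalExample[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line, index) => {
      const row = JSON.parse(line) as Partial<EvalExample>;
      // Assumption: `id` and `input` are required; the rest is optional.
      if (typeof row.id !== "string" || row.input === undefined) {
        throw new Error(`Record ${index + 1}: "id" and "input" are required`);
      }
      const example: EvalExample = { id: row.id, input: row.input };
      if (row.reference !== undefined) example.reference = row.reference;
      if (row.metadata !== undefined) example.metadata = row.metadata;
      return example;
    });
}

// Two sample records: a string input with a reference, and a structured
// input with metadata only.
const sample = [
  '{"id":"q1","input":"What is 2+2?","reference":"4"}',
  '{"id":"q2","input":{"topic":"llamas"},"metadata":{"split":"dev"}}',
].join("\n");

const examples = parseJsonl(sample);
```

Validating at load time surfaces malformed records before any agent call is made, which keeps dataset errors out of the report.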
Studio integration
Studio should discover evals through the project discovery API, not by parsing files directly. The eval source metadata includes `filePath` and `exportName` so
Studio can show a form editor for structured fields and fall back to source
editing when a definition is too dynamic. `createEvalSourceDocument` normalizes a
discovered eval into the form-editable source document used by Studio panels.
For implementation steps, see Evals. For exact APIs, see
veryfront/eval.