# Evaluation spec reference

Validators, metrics, behavioral signals, runtime limits, and scorecard semantics from `backend/internal/scoring`.

Source: https://agentclash.dev/docs/challenge-packs/evaluation-spec-reference
Markdown export: https://agentclash.dev/docs-md/challenge-packs/evaluation-spec-reference

Pack-local `evaluation_spec` is unmarshalled into `scoring.EvaluationSpec` (`backend/internal/scoring/spec.go`) via strict decoding. This page maps fields to enums and collectors actually implemented—not aspirational placeholders.

For LLM graders see the dedicated guide: [LLM judges](llm-judges).

## Judge mode (`judge_mode`)

Top-level discriminator on how scoring composes deterministic pieces with judges:

| Value | Constant |
| --- | --- |
| `deterministic` | `JudgeModeDeterministic` |
| `llm_judge` | `JudgeModeLLMJudge` |
| `hybrid` | `JudgeModeHybrid` |

Invalid values fail validation early.
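
For orientation, a minimal top-level fragment showing only the discriminator; the other top-level blocks (validators, metrics, scorecard, and so on) are covered in the sections below:

```json
{
  "judge_mode": "hybrid"
}
```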

## Validators (`validators[]`)

Each entry is a `ValidatorDeclaration`:

- **`key`** — unique within the spec; must not collide with metric or judge keys
- **`type`** — enumerated `ValidatorType` (snippet below)
- **`target`** — evidence reference; must use one of the supported forms (see **Evidence references validators understand** below)
- **`expected_from`** — required or not depending on the validator type (`RequiresExpectedFrom` in `spec.go` decides)
- **`config`** — type-specific strict JSON validated in `validation.go`
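
A hedged example of a single `validators[]` entry using the field names above. The `case.expectations.answer` path assumes the pack's case schema exposes an `answer` expectation, and `expected_from` is assumed to accept the same reference syntax as `target`:

```json
{
  "key": "answer_exact",
  "type": "exact_match",
  "target": "final_output",
  "expected_from": "case.expectations.answer"
}
```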

### Implemented validator types

From `ValidatorType*` constants:

`exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`

File-oriented validators gate on sandbox artifacts captured by `post_execution_checks` (see **Post-execution sandbox captures** below); `IsFileValidator()` distinguishes them.

Always check whether a given type requires `expected_from`: `file_exists`, `directory_structure`, and `code_execution`, for example, can rely on config/paths without it.
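
For instance, a `file_exists` validator can point straight at a captured artifact with no `expected_from`; the path is illustrative and must correspond to a `post_execution_checks` capture:

```json
{
  "key": "report_written",
  "type": "file_exists",
  "target": "file:output/report.md"
}
```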

## Metrics (`metrics[]`)

`MetricDeclaration` requires:

| Field | Notes |
| --- | --- |
| `key` | Unique within spec |
| `type` | `numeric`, `text`, or `boolean` |
| `collector` | Implemented switch in `engine_metrics.go` |
| `unit` | Stored for dashboards/score normalization |

Collectors wired today (verbatim keys):

`run_total_latency_ms`, `run_ttft_ms`, `run_input_tokens`, `run_output_tokens`, `run_total_tokens`, `run_agent_tokens`, `run_race_context_tokens`, `run_model_cost_usd`, `run_completed_successfully`, `run_failure_count`, behavioral scores (`behavioral_recovery_score`, … ), `validator_pass_rate`

Declaring a collector that does not exist fails silently only when the evidence is missing, so prefer copying keys verbatim from the tests in `backend/internal/scoring/engine_metrics.go`.
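
A typical `metrics[]` entry wiring one of the collectors above; the `unit` string is free-form and feeds dashboards:

```json
{
  "key": "latency",
  "type": "numeric",
  "collector": "run_total_latency_ms",
  "unit": "ms"
}
```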

## Behavioral panel (`behavioral`)

Optional. Each `behavioral.signals[]` entry references one of the `behavioral.signal` enum values:

- `recovery_behavior`
- `exploration_efficiency`
- `error_cascade`
- `scope_adherence`
- `confidence_calibration`

Each signal supports a `weight`, plus an optional `gate` and `pass_threshold` for hardened evaluation sessions.
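
A sketch of one `behavioral.signals[]` entry. The key holding the enum value is assumed to be `signal`, `gate` is assumed to be a boolean flag, and the numbers are illustrative:

```json
{
  "signal": "recovery_behavior",
  "weight": 0.25,
  "gate": true,
  "pass_threshold": 0.5
}
```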

## Post-execution sandbox captures (`post_execution_checks`)

Declare file/directory grabs before sandbox teardown (`post_execution.go`):

| `type` | Meaning |
| --- | --- |
| `file_capture` | Persist file bytes up to configured max |
| `directory_listing` | Snapshot structure |

Captured evidence is exposed to graders through `file:<path>` style references downstream—pair with validators that target those artifacts.

Defaults: ~1 MiB per file, aggregate caps enforced per run (`DefaultMaxFileSizeBytes`, `DefaultMaxTotalCaptureBytes`).
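
A sketch of a `post_execution_checks` list, assuming each entry names its sandbox target via a `path` field; the exact config shape lives in `post_execution.go`:

```json
[
  { "type": "file_capture", "path": "output/report.md" },
  { "type": "directory_listing", "path": "output" }
]
```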

## Scorecard (`scorecard`)

`ScorecardDeclaration` holds:

### Dimensions (`dimensions`)

Each dimension may be a plain string shorthand (historical compatibility) **or** an expanded object specifying:

- **`key`** — dimension name (`correctness`, `latency`, `cost`, `behavioral`, custom)
- **`source`** — dispatcher: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, `llm_judge`
- **`validators[]`**, **`metric`**, **`judge_key`** — depending on `source`
- **`weight`**, **`normalization`** — linear normalize against target/max envelopes
- **`gate`**, **`pass_threshold`** — hard fail semantics (see Strategies)

Built-in shortcut keys are normalized during `normalizeEvaluationSpec` (`validation.go`): `correctness`, `reliability`, `latency`, `cost`, and `behavioral` auto-fill sensible sources.
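
Two illustrative expanded dimensions (reusing the example validator and metric keys from earlier sections), one fed by validators and one by a metric. The `normalization` keys `target` and `max` are an assumption based on the target/max envelope description above, and the weights are made up:

```json
[
  {
    "key": "correctness",
    "source": "validators",
    "validators": ["answer_exact", "report_written"],
    "weight": 0.6,
    "gate": true
  },
  {
    "key": "latency",
    "source": "metric",
    "metric": "latency",
    "weight": 0.4,
    "normalization": { "target": 2000, "max": 10000 }
  }
]
```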

### Strategy (`strategy`)

| Strategy | Semantics sketch |
| --- | --- |
| `weighted` | Weighted mean; gated dims may still veto pass verdict |
| `binary` | All dimensions treated as gates; scorecard-level `pass_threshold` is rejected (prevents ambiguity) |
| `hybrid` | Gates AND aggregate over non-gate dims must clear optional `scorecard.pass_threshold` |

See doc comments on `ScoringStrategy` in `spec.go` for nuanced behavior—especially hybrid vs weighted gate interplay.

### `scorecard.pass_threshold`

Optional inclusive cutoff on the overall score (documented extensively in the struct comment). Forbidden for the pure `binary` strategy.
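
A compact `scorecard` using string-shorthand dimensions and a hybrid strategy; the threshold value is illustrative:

```json
{
  "strategy": "hybrid",
  "pass_threshold": 0.7,
  "dimensions": ["correctness", "latency", "cost"]
}
```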

### Judge budgets (`scorecard.judge_limits`)

Caps LLM-as-judge spend per run (`MaxSamplesPerJudge`, `MaxCallsUSD`, `MaxTokens`). Hard-coded ceilings (`JudgeMaxSamplesCeiling`) still clamp pack-authored overrides.
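
A hedged sketch of `judge_limits` inside the `scorecard` block. The JSON key casing is an assumption derived from the Go field names, and the values are illustrative (hard-coded ceilings still clamp them):

```json
{
  "judge_limits": {
    "max_samples_per_judge": 3,
    "max_calls_usd": 0.50,
    "max_tokens": 20000
  }
}
```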

### Legacy normalization (`normalization`)

For older specs, `latency.target_ms` and `cost.target_usd` are still accepted and migrate automatically into dimension-level normalization.

## Runtime limits (`runtime_limits`)

`max_total_tokens`, `max_cost_usd`, and `max_duration_ms` are enforced upstream of the sandbox/model loops and surfaced for the UI and scoring fallbacks.
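
The three documented keys, with illustrative values:

```json
{
  "runtime_limits": {
    "max_total_tokens": 200000,
    "max_cost_usd": 1.50,
    "max_duration_ms": 600000
  }
}
```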

## Pricing (`pricing.models[]`)

Pricing rows describe per-million-token economics used to normalize **`run_model_cost_usd`**. Rows match on `ProviderKey`/`ProviderModelID` tuples, so list the models your workspace deployments actually use; misaligned rows produce weak cost dimensions but do not invalidate the pack.
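
A heavily hedged sketch of one pricing row. The snake_case keys are assumptions derived from the Go field names, the per-million price field names are hypothetical, and the figures are placeholders; check the pricing structs in the scoring package for the real shape:

```json
{
  "pricing": {
    "models": [
      {
        "provider_key": "example-provider",
        "provider_model_id": "example-model-small",
        "input_usd_per_million": 0.50,
        "output_usd_per_million": 1.50
      }
    ]
  }
}
```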

## Evidence references validators understand

Validated by `isSupportedEvidenceReference`:

- Absolute shortcuts: `final_output`, `run.final_output`, `challenge_input`, `case.payload`
- Dotted accessors: `case.payload.*`, `case.inputs.*`, `case.expectations.*`, `artifact.*`
- Sandbox artifacts: prefix `file:` with non-empty remainder
- Literals: `literal:…`

Prefer explicit paths whenever you refactor the input schema; ambiguous references fail validation instead of drifting silently.

## See also

- [LLM judges](llm-judges)
- [Write a challenge pack](../guides/write-a-challenge-pack)
- Historical v0 evaluation contract notes live in the monorepo file `docs/evaluation/challenge-pack-v0.md` (developer-oriented, not mirrored on the docs site).