Challenge Packs and Inputs
Learn what a challenge pack really is in AgentClash, how the bundle is structured, and how inputs become runnable cases.
A challenge pack is a versioned YAML bundle that defines the workload, scoring contract, execution policy, and input sets for a repeatable evaluation.
What makes it a challenge pack instead of a prompt
A challenge pack is not just a task description. In the current repo, a runnable pack carries enough structure for AgentClash to do four jobs consistently:
- execute the same workload again later
- attach one or more deployments to that workload
- score the result using a versioned evaluation spec
- preserve the relationship between a failed case and the evidence that exposed it
That is why the API does not ask you to start a run with a loose prompt blob. It asks for a challenge_pack_version_id.
The current bundle shape
The parser in backend/internal/challengepack/bundle.go expects a YAML bundle with these top-level sections:
- pack: human metadata like slug, name, and family
- version: the executable version block
- tools: optional pack-defined composed tools
- challenges: the workload definitions
- input_sets: the concrete runnable cases
A pack becomes runnable through its version block. That block currently carries the load-bearing execution data:
- number: the pack version number
- execution_mode: native or prompt_eval
- tool_policy: allowed tool kinds and runtime toggles
- filesystem: optional filesystem constraints
- sandbox: network, env, package, and template configuration
- evaluation_spec: the scoring contract
- assets: version-scoped files or artifact references
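Put together, a minimal bundle looks roughly like this. Treat it as a sketch: the top-level and version keys match the lists above, but the slugs, values, and any nested keys are illustrative placeholders.

```yaml
# Illustrative sketch of the bundle shape; values and nested keys are
# placeholders, not canonical field names.
pack:
  slug: example-pack
  name: Example Pack
  family: examples
version:
  number: 1
  execution_mode: prompt_eval     # or: native
  evaluation_spec:
    validators: []                # scoring contract; see below
challenges:
  - key: summarize-report         # "key" is assumed; cases point at it via challenge_key
input_sets:
  - name: smoke
    cases:
      - challenge_key: summarize-report
        payload:
          document: "Quarterly revenue rose 4 percent..."
```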
At runtime, these pieces relate like this:

```mermaid
flowchart TD
    P[pack metadata]
    V[version]
    V --> ES[evaluation_spec]
    V --> SB[sandbox]
    V --> TP[tool_policy]
    V --> AS[version assets]
    C[challenges] --> IS[input_sets]
    IS --> CASES[cases]
    V --> RUN[run execution]
    CASES --> RUN
    ES --> SCORE[scorecards]
```
Challenge, input set, case, and asset are different things
These terms are easy to blur together. Do not blur them.
- challenge pack: the entire versioned bundle
- challenge: one task definition inside the bundle
- input set: one named collection of runnable cases for that pack version
- case: one concrete workload item tied to a challenge via challenge_key
- asset: a file-like dependency declared by key and path, optionally backed by a stored artifact ID
The bundle model in the repo uses input_sets[].cases[] as the main execution unit. A case can carry:
- payload: structured inputs
- expectations: structured expectations
- artifacts: case-local assets
That makes cases more expressive than a single flat prompt. They can reference files, expected outputs, and evaluator inputs without inventing an ad-hoc schema per benchmark.
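For example, one case might bundle structured inputs, an expected output, and a case-local file like this. It is a sketch: the nested shapes under payload and expectations are up to the pack author, and any field names beyond those in the list above are assumptions.

```yaml
# Sketch of a single case; nested values are illustrative.
cases:
  - challenge_key: fix-broken-query
    payload:
      query: "SELECT * FORM users"        # structured inputs
    expectations:
      fixed_query: "SELECT * FROM users"  # structured expectations for the evaluator
    artifacts:
      - key: schema                       # case-local asset declared by key and path
        path: fixtures/schema.sql
```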
The evaluation spec is part of the pack, not global product config
The current evaluation docs are explicit about this. The scoring contract lives inside the pack version’s manifest. That means the pack defines:
- validator keys and types
- metrics and collectors
- runtime limits
- pricing rows used for cost scoring
- scorecard dimensions and normalization thresholds
This matters because AgentClash needs scorecards to remain auditable. When a run is scored, the product can persist the exact evaluation_spec_id that was used. The publish response already returns that ID.
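A hypothetical evaluation_spec fragment might carry those pieces like this. The categories are the ones listed above; every nested key name here is an assumption for illustration.

```yaml
# Hypothetical scoring contract; nested key names are assumptions.
evaluation_spec:
  validators:
    - key: output_matches
      type: exact_match
  metrics:
    - key: latency_ms
      collector: runtime
  limits:
    max_runtime_seconds: 300
  pricing:
    - model: example-model
      input_per_1k_tokens: 0.001
  scorecard:
    dimensions: [correctness, cost]
    normalization:
      correctness: { threshold: 0.8 }
```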
Execution mode matters
Two execution modes are visible in the current code and examples:
- prompt_eval: lighter-weight packs that focus on prompt-style evaluation
- native: packs that can carry sandbox, tool, and execution policy for richer runs
Choose prompt_eval unless the workload genuinely needs a sandbox, files, or tool execution.
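In bundle terms the choice is a single field, shown here as two separate sketches (version numbers and everything else illustrative):

```yaml
# A lighter prompt_eval version: no sandbox or tool policy needed.
version:
  number: 1
  execution_mode: prompt_eval
---
# A native version can additionally carry tool_policy and sandbox blocks.
version:
  number: 2
  execution_mode: native
```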
Sandbox, tool policy, and internet access belong to the pack version
This is one of the most important design choices in the repo.
The pack version can say what the evaluator is allowed to do:
- which tool kinds are allowed
- whether shell or network access is enabled
- what network CIDRs are allowed
- which additional packages should exist in the sandbox
- which env vars are injected as literal values
In other words, the pack is not only content. It is also policy.
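A policy fragment might look like the following. The capabilities are the ones listed above; the nested key names are assumptions.

```yaml
# Hypothetical tool_policy and sandbox fragments; key names are assumed.
tool_policy:
  allowed_kinds: [composed, shell]   # which tool kinds are allowed
  shell_enabled: true                # runtime toggle
sandbox:
  network:
    enabled: true
    allowed_cidrs: ["10.0.0.0/8"]    # permitted network ranges
  packages: [ripgrep, jq]            # extra packages in the sandbox
  env:
    API_BASE_URL: "https://internal.example"  # injected as a literal value
```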
Assets and artifact-backed packs
The version block, challenge blocks, and case blocks can all reference assets. Each asset has a key and path, and may also carry media_type, kind, or artifact_id.
That gives you two useful authoring patterns:
- check small fixtures into the pack and refer to them by path
- attach previously uploaded workspace artifacts and refer to them by artifact_id
Validation already checks that asset references are real. If a case or expectation points at an artifact key that was never declared, publish-time validation fails.
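Side by side, the two patterns look like this. The key, path, media_type, kind, and artifact_id fields come from the description above; the values are made up.

```yaml
assets:
  - key: schema
    path: fixtures/schema.sql   # small fixture checked into the pack
    media_type: text/plain
  - key: training-dump
    path: data/dump.parquet
    kind: dataset
    artifact_id: art_12345      # hypothetical ID of a previously uploaded artifact
```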
Publish and validate are first-class workflow steps
The API and CLI already expose the authoring loop directly:
- validate with POST /v1/workspaces/{workspaceID}/challenge-packs/validate
- publish with POST /v1/workspaces/{workspaceID}/challenge-packs
- list packs with GET /v1/workspaces/{workspaceID}/challenge-packs
- list published input sets with GET /v1/workspaces/{workspaceID}/challenge-pack-versions/{versionID}/input-sets
The CLI mirrors that with:
```
agentclash challenge-pack validate <file>
agentclash challenge-pack publish <file>
agentclash challenge-pack list
```
Publish returns more than a pack ID. It returns:
- challenge_pack_id
- challenge_pack_version_id
- evaluation_spec_id
- input_set_ids
- bundle_artifact_id (optional)
That tells you the pack bundle is treated as a concrete artifact of record, not just transient YAML.
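Sketched in YAML-style notation for consistency with this page (the actual wire format and ID shapes are the API's own), a publish response carries something like:

```yaml
challenge_pack_id: cp_abc123          # all IDs here are illustrative
challenge_pack_version_id: cpv_def456
evaluation_spec_id: es_ghi789
input_set_ids: [is_jkl012]
bundle_artifact_id: art_mno345        # optional
```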
How a pack becomes a run
A run binds:
- one challenge_pack_version_id
- one or more deployment IDs
- optionally one selected input set
From there the worker resolves the pack manifest, execution policy, assets, and scoring contract into the runtime path.
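Conceptually, the binding reduces to three fields, sketched here with hypothetical names and IDs (the actual run-creation request shape may differ):

```yaml
challenge_pack_version_id: cpv_def456
deployment_ids: [dep_alpha, dep_beta]
input_set_id: is_jkl012   # optional: pin the run to one input set
```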
That is why challenge packs are foundational in AgentClash. They are the unit that makes two runs comparable without hand-waving.