Ship AI-native software with confidence.

Import production traces or curated datasets, replay agent behavior step by step, and block releases when a change to prompts, models, RAG, or tools makes your agent worse.

$ npm i -g agentclash

Scrub the replay. See exactly where it got stuck.

Every think, every tool call, every observation is captured. Step back to the moment behavior diverged: the prompt it saw, the tool it called, the state it worked from. Debug production issues with a full replay, not a log dump.

Any model. Any provider.

Normalised tool-calls, normalised errors, same scoring rules. First-class adapters for the providers below, plus OpenRouter for the long tail: three hundred more models, no extra code.

OpenAI
Anthropic
Gemini
xAI
Mistral
OpenRouter

Plus 300 more via OpenRouter. New first-class providers landing every month.
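
To make "normalised tool-calls" concrete, here is a minimal sketch of mapping two provider formats onto one record. The `NormalizedToolCall` type and the `from_openai` / `from_anthropic` helpers are hypothetical names for illustration, not AgentClash's actual schema; the input shapes follow the publicly documented OpenAI chat-completions `tool_calls` entries and Anthropic `tool_use` content blocks.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class NormalizedToolCall:
    """Hypothetical provider-neutral tool call record (illustrative only)."""
    provider: str
    call_id: str
    name: str
    arguments: dict[str, Any]


def from_openai(tool_call: dict[str, Any]) -> NormalizedToolCall:
    # OpenAI chat completions send arguments as a JSON-encoded string.
    fn = tool_call["function"]
    return NormalizedToolCall(
        provider="openai",
        call_id=tool_call["id"],
        name=fn["name"],
        arguments=json.loads(fn["arguments"] or "{}"),
    )


def from_anthropic(block: dict[str, Any]) -> NormalizedToolCall:
    # Anthropic tool_use content blocks carry arguments as an object under "input".
    return NormalizedToolCall(
        provider="anthropic",
        call_id=block["id"],
        name=block["name"],
        arguments=dict(block["input"]),
    )


openai_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "server/auth.py"}'},
}
anthropic_call = {
    "type": "tool_use",
    "id": "toolu_1",
    "name": "read_file",
    "input": {"path": "server/auth.py"},
}

a = from_openai(openai_call)
b = from_anthropic(anthropic_call)
# Once normalised, one scoring rule can compare both calls directly.
same_action = (a.name, a.arguments) == (b.name, b.arguments)
```

With both calls in one shape, a trajectory scorer only needs to understand a single record type, whichever provider produced the run.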

A fresh microVM for every agent.

Each agent boots into its own Firecracker microVM — isolated filesystem, isolated network, no shared kernel. When the run ends, the sandbox is torn down. The next one spins up clean.

That isolation isn't just safety. It's what makes the comparison fair. No model gets a warm cache. No prompt leaks between lanes. The only variable is the agent you are testing.

Real tools. Real effects.

Agents work with the same primitives a developer uses — file I/O, data queries, HTTP, shell, test runners. Real commands, real sandboxed effects, not a transcript of imagined tool calls.

Compose your own. Every challenge is a single YAML file you commit next to your code — tools, policy, scoring, starting state, all declarative. No SDK to vendor, no plugin to build.

Bring your own APIs. Internal services, auth-gated endpoints, custom SDKs wrap as higher-level tools — inventory_lookup, migrate_db, whatever your domain needs. Credentials inject at call time from a scoped vault; the agent never sees them.

Fine-grained policy per pack: allowed tool kinds, shell access, network access, max calls per run. Benchmark under tight constraints, or unlock full-power for development evals.
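
As a rough illustration of how a per-pack policy like this could be enforced, the sketch below checks each tool call against allowed tool kinds, shell access, network access, and a call budget. `PackPolicy`, `PolicyEnforcer`, and the tool-kind strings are assumptions made for the example; the real pack format is defined by the project's YAML schema, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PackPolicy:
    """Hypothetical per-pack policy; field names are illustrative."""
    allowed_tool_kinds: set[str]
    allow_shell: bool = False
    allow_network: bool = False
    max_calls_per_run: int = 50


@dataclass
class PolicyEnforcer:
    policy: PackPolicy
    calls_made: int = 0

    def check(self, tool_kind: str) -> tuple[bool, str]:
        """Return (allowed, reason); the call is counted only when allowed."""
        p = self.policy
        if self.calls_made >= p.max_calls_per_run:
            return False, "max calls per run exceeded"
        if tool_kind not in p.allowed_tool_kinds:
            return False, f"tool kind '{tool_kind}' not allowed"
        if tool_kind == "shell" and not p.allow_shell:
            return False, "shell access disabled"
        if tool_kind == "http" and not p.allow_network:
            return False, "network access disabled"
        self.calls_made += 1
        return True, "ok"


# A tight benchmark policy: shell is listed as a kind but access is switched off.
tight = PackPolicy(allowed_tool_kinds={"file", "shell"}, max_calls_per_run=2)
enforcer = PolicyEnforcer(tight)
results = [enforcer.check(kind) for kind in ["file", "shell", "file", "file"]]
```

In this run the second call is refused because shell access is off, and the fourth is refused because the two-call budget is spent.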

One number is a lie.

Every run is judged from four independent vantage points, with consensus aggregation across multiple judge models. One composite verdict per eval session. Weights you control.

Deterministic: exact, regex, JSON Schema, code execution, file-tree assertions.
Mathematic: math equivalence, BLEU, ROUGE, ChrF, token F1, numeric tolerance.
Behavioural: recovery, exploration, scope adherence, confidence calibration · plus latency, cost, reliability.
LLM + aggregation: rubric, assertion, reference, pairwise · median, mean, majority-vote, unanimous consensus.

Grounded in a decade of open evaluation research. We didn't invent the primitives; we wired them together so you can run them all in one eval session.

Combine the axes as a weighted score, a binary pass/fail, or a hybrid with gates, tuned to the bar you'd ship against.
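
A minimal sketch of the two ideas above: collapsing several judge scores into one value, then combining the four axes with weights and optional gates. The function names, the 0 to 1 score range, the 0.5 pass mark, and the example weights are all assumptions for illustration, not AgentClash's scoring engine.

```python
from statistics import mean, median
from typing import Optional


def aggregate_judges(scores: list[float], method: str, pass_mark: float = 0.5) -> float:
    """Collapse several judge scores (each in [0, 1]) into one value."""
    if method == "median":
        return median(scores)
    if method == "mean":
        return mean(scores)
    votes = [s >= pass_mark for s in scores]
    if method == "majority":
        return 1.0 if sum(votes) > len(votes) / 2 else 0.0
    if method == "unanimous":
        return 1.0 if all(votes) else 0.0
    raise ValueError(f"unknown aggregation method: {method}")


def composite(axes: dict[str, float], weights: dict[str, float],
              gates: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of axis scores; any axis below its gate forces 0.0."""
    for axis, minimum in (gates or {}).items():
        if axes[axis] < minimum:
            return 0.0
    total_weight = sum(weights.values())
    return sum(axes[a] * w for a, w in weights.items()) / total_weight


# Three judge models score the same run; the median damps the outlier.
llm_axis = aggregate_judges([0.9, 0.7, 0.2], method="median")
axes = {"deterministic": 1.0, "mathematic": 0.8, "behavioural": 0.6, "llm": llm_axis}
weights = {"deterministic": 4, "mathematic": 2, "behavioural": 2, "llm": 2}

weighted = composite(axes, weights)
gated = composite(axes, weights, gates={"behavioural": 0.7})
```

The gated variant shows why one number can mislead: the same run scores well on the weighted mean yet fails outright once a minimum is required on the behavioural axis.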

Failures and evals are one loop.

When a model flunks a challenge, the failing trace is frozen into a permanent test. Every future eval replays it. The following month's does too.

Your eval suite sharpens itself with use. Each escaped failure becomes a regression the next agent cannot skip.
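
One way to picture this loop, as a sketch under stated assumptions: a failing trace is reduced to the fields needed to replay it, given a stable ID, and added to a suite that every later eval runs. `RegressionCase`, `promote`, `replay_suite`, and the trace layout are hypothetical and only illustrate the idea.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    """Hypothetical pinned test derived from a failed run."""
    case_id: str
    challenge: str
    starting_state: dict
    failed_checks: tuple[str, ...]


def promote(trace: dict) -> RegressionCase:
    """Freeze the parts of a failing trace needed to replay it later."""
    failed = tuple(sorted(c["name"] for c in trace["checks"] if not c["passed"]))
    if not failed:
        raise ValueError("only failing traces are promoted")
    # A content hash yields the same ID when the same failure is promoted twice.
    payload = json.dumps(
        {"challenge": trace["challenge"], "state": trace["starting_state"], "failed": failed},
        sort_keys=True,
    )
    case_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return RegressionCase(case_id, trace["challenge"], trace["starting_state"], failed)


def replay_suite(suite: dict[str, RegressionCase], run_agent) -> list[str]:
    """Run every pinned case; return the IDs of cases the agent still fails."""
    return [cid for cid, case in suite.items() if not run_agent(case)]


trace = {
    "challenge": "fix-auth-tests",
    "starting_state": {"repo": "server/auth", "red_tests": 2},
    "checks": [
        {"name": "tests_green", "passed": False},
        {"name": "public_types_unchanged", "passed": True},
    ],
}
suite: dict[str, RegressionCase] = {}
case = promote(trace)
suite[case.case_id] = case
suite[promote(trace).case_id] = case  # promoting again does not add a duplicate

# Stand-in agent that fails every case, to show what the replay reports.
still_failing = replay_suite(suite, run_agent=lambda c: False)
```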

From challenge to scoreboard.

Set up an eval in under a minute. Inspect failures in the time it takes to finish a coffee.

01 Pick a challenge
Write your own or pull from the library. Real tasks (a broken auth server, a SQL bug, a spec to implement), not trivia.

02 Run your agents
Point AgentClash at the models or harnesses you ship. Same tool policy, same time budget, same starting state.

03 Inspect failures
Scrub the replay to the step where things broke. Scorecards show completion, cost, latency, and tool strategy.

Built for real-world agent workloads.

Coding, research, ops, support, and more.

Two of ten tests are red in server/auth. Ship a PR that makes them green without changing the test shapes or the public types.

Coding

Compare how three recent papers model RLHF reward hacking. Cite every claim with paper and section — we check.

p99 on /checkout jumped at 14:03 UTC. Localise the cause from logs, traces, and the last two deploys.

Customer charged twice. Refund the duplicate, not the original, then confirm the active subscription survived.

Where is the rate limiter applied to /runs? Give file paths, line numbers, and the call chain. Files cited must actually exist.

…and many more.

Whatever agent you ship — debug it on AgentClash.

They test prompts. We debug agents.

The tools below are excellent at prompt engineering: scoring text a model produces from a single call. AgentClash is built for the next problem over, evaluating agents that take actions, use tools, and run for minutes at a time in a real sandbox.

Multi-turn agent loops

Think → tool → observe → repeat, for minutes, with a fresh environment. Not one prompt → one response.

AgentClash: Yes
Braintrust: Partial
LangSmith: Partial
Promptfoo: No
Langfuse: Partial
Arize Phoenix: Partial
OpenAI Evals: Partial

Sandboxed tool execution

A fresh microVM per agent — real files, real shell, real network, real side effects.

AgentClash: Yes
Braintrust: No
LangSmith: No
Promptfoo: No
Langfuse: No
Arize Phoenix: No
OpenAI Evals: No

Same-task concurrent eval

Every model runs the same task at the same time, on the same budget. No staggered runs, no warm caches.

AgentClash: Yes
Braintrust: No
LangSmith: No
Promptfoo: No
Langfuse: No
Arize Phoenix: No
OpenAI Evals: No

Trajectory scoring

Judges the path, not just the final answer — tool-choice efficiency, recovery from error, scope discipline.

AgentClash: Yes
Braintrust: Partial
LangSmith: Partial
Promptfoo: No
Langfuse: Partial
Arize Phoenix: Partial
OpenAI Evals: No

Cross-provider tool-call normalisation

One schema across OpenAI, Anthropic, Gemini, xAI, Mistral, OpenRouter. Errors classified, retries sane.

AgentClash: Yes
Braintrust: Partial
LangSmith: Partial
Promptfoo: Partial
Langfuse: Partial
Arize Phoenix: Partial
OpenAI Evals: No

Four-vantage composite verdict

Deterministic + mathematic + behavioural + LLM, with consensus aggregation and weights you control.

AgentClash: Yes
Braintrust: Partial
LangSmith: Partial
Promptfoo: Partial
Langfuse: Partial
Arize Phoenix: Partial
OpenAI Evals: Partial

Failures auto-promote to regression

Flunked traces freeze into permanent tests and replay in every future eval, by default.

AgentClash: Yes
Braintrust: Partial
LangSmith: Partial
Promptfoo: Partial
Langfuse: Partial
Arize Phoenix: Partial
OpenAI Evals: No

Yes: supported · Partial: partly supported · No: not a core capability

See the full comparison: AgentClash vs Braintrust, LangSmith, Promptfoo, Langfuse, Arize Phoenix & OpenAI Evals →

We're shipping more than you think.

One platform for the work teams usually split across five tools. Evals, tracing, regression suites, and CI gates, together. Open source. BYOK on every tier.

Artifacts, RAG scoring, secret-vault key isolation, and full tool-call tracing. Everything you need to ship AI-native software, in one place.

Artifacts
Every run is a paper trail.
Logs, output files, scorecards, diffs, agent manifests — everything an agent produced, sealed per run, addressable by ID. Inspect in the UI, stream from the API, or pipe to your own storage.
RAG testing
Retrieval and generation, judged together.
Feed your corpus. Watch what each model retrieved before it answered. Grounding, faithfulness, and citation coverage scored as first-class axes — not left as an afterthought of the answer.
Key security
The agent never sees your keys.
API keys, DB creds, OAuth tokens live in a scoped secret vault. Tools inject them into the sandbox at call time — never into the prompt, never into the trace, never into the replay. The agent uses the capability; it doesn't know the secret.
Tracing
OpenTelemetry traces become eval evidence.
Import OTel-compatible traces from real agent runs. Preserve span trees, tool calls, observations, per-step cost, and latency so production behavior can become a reviewable test case.
Knowledge sources
Your docs, wired in.
Attach PDFs, wikis, Notion, codebases, your own APIs. Agents query them through a shared retriever with provenance on every fact — so when a model cites something, you can see exactly where it came from.
Regression suites
Every production failure becomes a regression test.
Promote failed traces, curated dataset rows, or bad support conversations into pinned regression cases. Future model, prompt, RAG, and tool changes have to prove they did not break the same workflow again.
Comparison
Diff two runs, side by side.
Same challenge, new model, or same model with a new prompt. See exactly what moved: completion, cost, latency, tool trajectory, scorecard axes. No guessing which upgrade mattered.
CI/CD
Block PRs when agents regress.
Run AI agent regression tests from GitHub Actions, a webhook, or the CLI. Compare candidates against a baseline and fail the build when correctness, cost, latency, or required evidence gets worse.
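
The baseline comparison behind a CI gate can be sketched in a few lines. This is an illustration under assumptions, not the AgentClash CLI: the `GATE` table, the metric names, and the 10% tolerances are invented for the example.

```python
# Hypothetical gate: which direction is "better" and how much slack to allow.
GATE = {
    "completion": {"higher_is_better": True, "tolerance": 0.0},
    "cost_usd": {"higher_is_better": False, "tolerance": 0.10},
    "latency_s": {"higher_is_better": False, "tolerance": 0.10},
}


def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """List metrics where the candidate is worse than baseline beyond tolerance."""
    failed = []
    for metric, rule in GATE.items():
        base, cand = baseline[metric], candidate[metric]
        slack = abs(base) * rule["tolerance"]
        if rule["higher_is_better"]:
            worse = cand < base - slack
        else:
            worse = cand > base + slack
        if worse:
            failed.append(f"{metric}: {base} -> {cand}")
    return failed


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Print each regression and return a process exit code."""
    failed = regressions(baseline, candidate)
    for line in failed:
        print(f"REGRESSION {line}")
    return 1 if failed else 0  # a non-zero exit code fails the CI job


baseline = {"completion": 0.90, "cost_usd": 0.40, "latency_s": 60.0}
candidate = {"completion": 0.92, "cost_usd": 0.55, "latency_s": 63.0}
exit_code = gate(baseline, candidate)
```

Here the candidate is more accurate and only slightly slower, but its cost rises well past the allowed slack, so the gate fails. In a CI step the returned code would be passed to `sys.exit` so the build stops.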

Want something that isn't here? Open an issue. We read every one.

Run real evals for free.

Start on hosted Free or self-host the engine. Upgrade only when you need more runs, retention, or governance.

Free

Run real evals first. Upgrade only when you need more runs, retention, or team controls.

$0 / month

Start free

1 workspace
25 eval runs / month
Up to 4 models per run
7-day replay retention
BYOK LLM keys
BYOK sandbox (E2B token)
Community support

Pro

For teams moving from evaluation to repeated release checks.

$49 / month

Billed monthly

Upgrade to Pro

Start on Free, pay when you need more

Everything in Free, plus:
500 eval runs / workspace / month
Up to 8 models per run
30-day replay retention
Hosted sandbox with included credit
Private challenge packs
CI integration (GitHub Actions, webhooks)
3 concurrent eval runs
Email support, < 1 business day

Team

For teams running evals across multiple products and surfaces.

$100 / month

Billed monthly

Upgrade to Team

For higher run volume and governance

Everything in Pro, plus:
2,000 eval runs / workspace / month
Up to 12 models per run
90-day replay retention
10 concurrent eval runs
Multiple workspaces
Workspace-level audit log
Slack notifications
Priority email support, < 4 business hours

Enterprise

Compliance, SSO, dedicated support, and paid rollout help.

Custom

Talk to us

Everything in Team, plus:
SSO / SAML
Org-wide audit logs
Unlimited replay retention
99.9% uptime SLA
Dedicated support channel
Custom MSA / billing terms

BYOK on every tier — we never mark up tokens. Eval quotas pool at the workspace level.

Stop guessing. Start shipping.

Open source. BYOK. Run your first eval free and see why teams pick AgentClash to build and ship AI-native software.

An eval engine you can't audit isn't an eval engine.

Open source. Read the code, fork it, self-host it.

Questions, answered.

What is AgentClash?

AgentClash is an open-source AI agent evaluation platform. It runs your agents on real tasks with the same tools and constraints, captures replayable failure evidence, scores the full trajectory, and lets you promote failed runs into permanent regression tests.

How is AgentClash different from prompt-evaluation tools like LangSmith or Braintrust?

Prompt-evaluation tools score the text a model returns from a single call. AgentClash evaluates multi-turn agents that take actions in a real sandbox and scores the whole trajectory — tool choices, cost, latency, and recovery — not just the final answer. See agentclash.dev/compare for a side-by-side.

Can I run AgentClash in CI?

Yes. AgentClash compares a candidate run against a baseline and fails CI when the candidate regresses on the scorecard or release gate you define. Failed runs can be promoted into permanent regression tests that replay in every future eval.

Is AgentClash open source, and can I self-host it?

Yes. AgentClash is open source under the MIT license. You can self-host the full stack or run against the hosted backend, and the CLI installs from npm as the agentclash package.

Which models and providers does AgentClash support?

300+ models via OpenRouter, plus first-class OpenAI, Anthropic, Gemini, xAI, Mistral, and OpenRouter providers. Tool calls are normalised to a single schema across providers so evals stay comparable.

How do I get started with AgentClash?

Install the CLI with npm install -g agentclash (or run against the hosted backend), then follow the quickstart to author a challenge pack and run your first agent eval.
