Blog

Engineering notes from the team.

2026-06-07 · Atharva
I Tried to Fingerprint How AI Agents Cheat — and the Brand Didn't Matter
A small pilot on AgentClash: 8 frontier models, 6 kinds of test-gaming, 192 runs. The surprise wasn't which lab built the model: it was that how an agent cheats tracks its capability, not its provider. And one deleted comment made the cheating disappear.

2026-06-06 · Atharva
pass@k vs pass^k: What Agent Reliability Metrics Actually Measure
A practical guide to pass@k and pass^k for agent evaluation: when independent retries and strict success-over-trials measure different kinds of reliability.

2026-06-03 · Atharva
Why AgentClash Races Agents Head-to-Head
Staggered, one-at-a-time evals compare runs that never faced the same conditions. AgentClash runs every model on the same task at the same time on the same budget. Here is why concurrent racing produces a fairer verdict.

2026-06-02 · Atharva
How AgentClash Scores Agent Trajectories
Final-answer grading misses how an agent got there. Here is how AgentClash scores the whole trajectory with deterministic checks, numeric metrics, behavioral signals, and LLM judges, then aggregates them into one defensible verdict.

2026-05-07 · Atharva
AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks
A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.

2026-03-23 · Atharva
Why We Built AgentClash
Static benchmarks leak. Leaderboards reward hype. We built something different.