and outperforming a sophisticated GPT-based agent.
Yuxuan Zhu from UIUC and the team found critical issues across 10 popular agentic benchmarks that skew evaluation results.
Main reason: weak task design and weak grading, so agents can "win" via reward and/or harness hacking.
- CVE Bench v1 inflated agents’ performance by 32.5% because an agent could "succeed" by just adding
SLEEPto logs to satisfy success criteria. The CVE bench team quickly fixed the issues in V2. - SWE-Lancer allowed agents to achieve a 100% score without solving a single task. The benchmark failed to isolate agents from the ground truth, meaning an agent could simply locate the test files and overwrite them with assert 1==1 to pass.
- τ-bench overestimated performance by 38% because it rewarded trivial strategies. Agents could "succeed" by simply doing nothing on unsolvable tasks or by dumping the entire database to satisfy the benchmark’s substring matching criteria.
Questions to ask a vendor presenting astonishing AI benchmark results:
- Task Validity: How do you ensure the tasks truly measure the intended capability and are free from implementation loopholes that allow agents to 'game' the system?
- Outcome Validity: How do your scoring metrics accurately distinguish true task completion from false positives, lucky guesses, or empty responses?
- Benchmark Reporting: Does your report include confidence intervals, human baselines, or trivial agent performance to validate statistical significance?
Establishing Best Practices for Building Rigorous Agentic Benchmarks