and outperforming a sophisticated GPT-based agent.

Yuxuan Zhu of UIUC and colleagues found critical issues across 10 popular agentic benchmarks that skew evaluation results.

The main reason: weak task design and weak grading, which let agents "win" through reward hacking and/or harness hacking.

Questions to ask a vendor presenting astonishing AI benchmark results:

  1. Task Validity: How do you ensure the tasks truly measure the intended capability and are free from implementation loopholes that allow agents to 'game' the system?
  2. Outcome Validity: How do your scoring metrics accurately distinguish true task completion from false positives, lucky guesses, or empty responses?
  3. Benchmark Reporting: Does your report include confidence intervals, human baselines, or trivial-agent performance to validate statistical significance? (A minimal sketch of such a check follows this list.)
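
For illustration, here is a minimal, hypothetical sketch of the last two checks: run a do-nothing baseline agent through a benchmark's grader and report a bootstrapped confidence interval on its score. The `trivial_agent`, `grade`, and `audit_benchmark` names are illustrative placeholders, not taken from the paper; in practice you would plug in the benchmark's own grading function.

```python
import random
import statistics

def trivial_agent(task: dict) -> str:
    """A do-nothing baseline: always returns an empty answer."""
    return ""

def grade(task: dict, answer: str) -> bool:
    """Placeholder grader; substitute the benchmark's actual checker here."""
    return answer.strip() == task["expected"]

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000, alpha: float = 0.05):
    """Percentile-bootstrap confidence interval for the mean score."""
    means = sorted(
        statistics.mean(random.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(scores), (lo, hi)

def audit_benchmark(tasks: list[dict]) -> None:
    # Outcome validity: a trivial agent should score at or near zero.
    scores = [1.0 if grade(t, trivial_agent(t)) else 0.0 for t in tasks]
    mean, (lo, hi) = bootstrap_ci(scores)
    print(f"Trivial-agent score: {mean:.2%} (95% CI: [{lo:.2%}, {hi:.2%}])")
    if hi > 0.05:
        print("Warning: the grader may be accepting empty or degenerate answers.")

if __name__ == "__main__":
    # Toy tasks for demonstration only.
    demo_tasks = [{"prompt": f"task {i}", "expected": "42"} for i in range(50)]
    audit_benchmark(demo_tasks)
```

If the do-nothing agent's confidence interval overlaps the headline score a vendor is reporting, that is a strong sign of outcome-validity problems rather than genuine capability.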

Establishing Best Practices for Building Rigorous Agentic Benchmarks