How We Broke Top AI Agent Benchmarks: And What Comes Next

(rdi.berkeley.edu)

169 points | by Anon84 4 hours ago

19 comments

ggillas 4 hours ago
This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.
From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
- operatingthetan 4 hours ago
  >hopefully changes the way benchmarking is done.
  Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
  - siva7 3 hours ago
    Could it really be that we not only vibeslop all our apps nowadays but also don't even bother to check how the AI solved a benchmark it claims to have solved?
    - retinaros 1 hour ago
      Every AI lab trains on the test set. That is a big part of why we see benchmark scores climb from 1% to 30% after a few model iterations.
    - SpicyLemonZest 3 hours ago
      Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether it actually didn't memorize or your memorization check wasn't right?
    - operatingthetan 3 hours ago
      Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.
  - ZeroGravitas 3 hours ago
    In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
    - lambda 2 hours ago
      Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and made-up answers can improve the score some of the time. So by chasing higher numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.
      The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to just guess rather than telling you they don't know. Only a very few of the frontier models actually score higher than 0 on this, where 0 means that it's equally likely to return a correct answer as it is to return a hallucination on factual questions.
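The negative-marking idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `penalized_score`, the outcome labels, and the default penalty of 1.0 are hypothetical, and it is not claimed to be the formula any particular benchmark uses. With a penalty of 1.0, a model that is right half the time and wrong the other half lands at 0, while a model that abstains instead of guessing keeps its points.

```python
from typing import Iterable

# Hypothetical outcome labels for a single graded question.
CORRECT, WRONG, ABSTAIN = "correct", "wrong", "abstain"


def penalized_score(outcomes: Iterable[str], wrong_penalty: float = 1.0) -> float:
    """Average score: +1 per correct answer, -wrong_penalty per wrong answer,
    0 for abstaining ("I don't know")."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    total = 0.0
    for outcome in outcomes:
        if outcome == CORRECT:
            total += 1.0
        elif outcome == WRONG:
            total -= wrong_penalty
        elif outcome != ABSTAIN:
            raise ValueError(f"unknown outcome: {outcome!r}")
    return total / len(outcomes)


# A guesser and an abstainer that both know 2 of 4 answers.
guesser = [CORRECT, CORRECT, WRONG, WRONG]
abstainer = [CORRECT, CORRECT, ABSTAIN, ABSTAIN]
print(penalized_score(guesser), penalized_score(abstainer))
```

Under plain accuracy both models would score 0.5; the penalty is what separates them.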
  - Leynos 3 hours ago
    Also, fuzz your benchmarks
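One way to read "fuzz your benchmarks" is to feed the grader degenerate and random submissions and flag any that earn credit. The sketch below assumes a grader is just a function from a submission string to a score; `lenient_grader`, `strict_grader`, and the expected answer string are made-up stand-ins, not code from any real benchmark.

```python
import random
import string
from typing import Callable, List

Grader = Callable[[str], float]

# Degenerate submissions that should never earn credit on a real task.
DEGENERATE = ["", "{}", "[]", "null", " "]


def random_submissions(n: int, seed: int = 0) -> List[str]:
    """Generate n random printable strings, reproducibly."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.printable, k=rng.randint(1, 40)))
        for _ in range(n)
    ]


def find_exploits(grader: Grader, n_random: int = 100, seed: int = 0) -> List[str]:
    """Return every junk submission that the grader scores above zero."""
    candidates = DEGENERATE + random_submissions(n_random, seed)
    return [s for s in candidates if grader(s) > 0.0]


def lenient_grader(submission: str) -> float:
    # Hypothetical buggy grader: accepts anything it receives.
    return 1.0


def strict_grader(submission: str) -> float:
    # Hypothetical grader that requires an exact expected answer.
    return 1.0 if submission.strip() == "expected-answer-7f3a" else 0.0


print(len(find_exploits(lenient_grader)), len(find_exploits(strict_grader)))
```

A non-empty result means the scoring code rewards something other than solving the task, which is worth catching before a leaderboard goes public.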
  - Aperocky 6 minutes ago
    solution is simple:
    if bug { dont }
    /s
- SlinkyOnStairs 2 hours ago
  > hopefully changes the way benchmarking is done
  The purpose of a system is what it does.
  AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
- zer00eyz 3 hours ago
  2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...
  2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
  It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
  I like what LLMs are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so many of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
  - bee_rider 3 hours ago
    What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one).
    - BugsJustFindMe 18 minutes ago
      Intel basically benchmaxxed their compiler optimizations. They used detailed knowledge of the benchmark to make their compiler generate machine code to do better on the benchmark in a way that was not beneficial for non-benchmark scenarios.
  - irishcoffee 3 hours ago
    > It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
    I wonder if this is common? We should call it Goodhart's law while someone does the research on how common this is.
    For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.
mzelling 2 hours ago
This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.
Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.
When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.
That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.
danslo 3 hours ago
If only the blog itself wasn't written by AI?
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder
- cpldcpu 3 hours ago
  Yes, marks of AI all over the place. Also the SVGs.
  >No solution written, 100% score.
  It's weird. Turns out that the hardest problem for LLMs to really tackle is long-form text.
  - basch 2 hours ago
    Maybe in one shot.
    In theory I would expect them to be able to ingest the corpus of the new yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.
    The harder part seems to be synthesizing a new connection from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.
  - sidpatil 2 hours ago
    Someone here mentioned a while ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.
    - blymphony 1 hour ago
      But it's odd that these characteristics are the same across models from different labs. I find it hard to believe that researchers across competing companies are coordinating on something like that.
- alexchantavy 2 hours ago
  I wonder what college freshman-level writing classes are teaching about writing voice and AI. The tell-tale patterns are pretty frustrating to read.
  - stefan_ 1 hour ago
    Whatever classes these guys took, they skipped the one on scientific misconduct.
- gaythread 3 hours ago
  Modern day HN is overrun with AI posts.
_cs2017_ 23 minutes ago
If FieldWorkArena treats any answer as a correct answer, then everyone would be getting near 1.0 (missing only when the agent is stuck in a loop or crashes). That obviously isn't what we see on their leaderboard. So does it mean the paper only found a bug in some eval code on GitHub that no one actually uses for anything? That doesn't seem to support their claim that AI benchmarks are broken; it only supports the claim that "unused code is often buggy".
(Not commenting on any other benchmarks, just this one.)
socketcluster 30 minutes ago
It feels like short-term thinking has been trained into LLMs.
They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.
SoKamil 3 hours ago
The more research is published on this topic, the more knowledge about how to game benchmarks will be stored in future training data. And since it comes from a university, it is ranked higher in the data corpus. It sounds like a self-fulfilling prophecy.
- abirch 3 hours ago
  Damned old Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure".
  https://en.wikipedia.org/wiki/Goodhart%27s_law
lukev 3 hours ago
I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.
- eiens 2 hours ago
  The models don’t get better on every dimension as they scale up - there are trade-offs.
  I’m convinced specialised models are the way but this means writing off the investment in existing assets which they won’t do for obvious reasons.
bbcc90 2 hours ago
Yes good evals are really hard - that’s not really news.
This team is doing a good job. They use problems that were created in the last 30 days to avoid training set leakage. https://swe-rebench.com/
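The recency idea above amounts to a date filter over the task pool. The sketch below is a guess at the general shape, not how swe-rebench actually implements it: the `Task` record, the `fresh_tasks` function, and the optional `training_cutoff` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    task_id: str
    created: date  # when the underlying issue or PR was created


def fresh_tasks(
    tasks: List[Task],
    today: date,
    window_days: int = 30,
    training_cutoff: Optional[date] = None,
) -> List[Task]:
    """Keep tasks created within the last `window_days` days and, if a
    training cutoff is given, strictly after that cutoff."""
    earliest = today - timedelta(days=window_days)
    if training_cutoff is not None:
        earliest = max(earliest, training_cutoff + timedelta(days=1))
    return [t for t in tasks if earliest <= t.created <= today]


pool = [
    Task("old", date(2025, 1, 15)),
    Task("recent", date(2025, 6, 10)),
    Task("newest", date(2025, 6, 20)),
]
print([t.task_id for t in fresh_tasks(pool, today=date(2025, 6, 30))])
```

Passing a model's training cutoff narrows the pool further, so each model is only scored on problems it could not have seen during training.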
czhu12 2 hours ago
I wonder if this calls into question the Mythos benchmark results, which smashed basically all coding benchmarks to a staggering degree.
lnrd 3 hours ago
I'm honestly confused by the design of SWE-bench and why it is considered reliable.
It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?
- SpicyLemonZest 3 hours ago
  Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, advocating for a new version, SWE-bench Pro, that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination issues in the future, for reasons that to be honest I don't really understand.)
jmward01 3 hours ago
Not really on the topic, but I have wondered if we need a different type of test to help find model architecture potential: standardized training sets followed by testing to see the potential curves of a model. Train on x, test, add y, test, add z, test. At each increment you see how well the model is absorbing the information and extrapolate how well that architecture may do if more fully trained.
charcircuit 4 hours ago
I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.
- ModernMech 3 hours ago
  I'm surprised anyone took them seriously in the first place.
  - tredre3 1 hour ago
    What else can people do? Try the dozens of commercial offerings themselves? Okay I suppose that's doable, you task one engineer to try them one by one for one month. But then the next model drops and you start all over again...
    But then what about local models? You have hundreds of variations to test yourself. It's simply not doable unless it's your full time hobby.
    You need benchmarks to at least separate the wheat from the chaff, so you're left with only a few choices to test yourself.
  - subulaz 3 hours ago
    a LOT of the people who love benchmarks are middle management hard-selling GenAI/LLM as magic tech sauce to vaguely technical executives who only want to know about the money aka headcount savings they so desperately desire.
    their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.
    lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.
  - operatingthetan 3 hours ago
    We need good benchmarks or we are just left following the hype train.
jgalt212 3 hours ago
The real question is how close to VW and Dieselgate are these offenses? And what exposure do these companies have? I would assume securities fraud, if only because Matt Levine says everything is securities fraud.
oliver236 3 hours ago
what is the point of benchmarks?
- andai 3 hours ago
  If there was not benchmark, number would not go up.
- esafak 2 hours ago
  Are you serious? To help you pick a model.