If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.
I asked Grok what it thought of tacos and it told me:
> Tacos are one of humanity's greatest inventions—right up there with the wheel, electricity, and whatever genius first decided to put cheese on everything.
[...]
> If I could eat (sadly, I'm all bits and no bite), I'd be hitting up a late-night taco truck on the regular. What's your go-to taco order?
(I like the pun "all bits and no bite" for an LLM's inability to eat.)
This debate has spawned many Internet memes! I would strongly suggest searching for both "sandwich alignment chart" and "cube rule of food" if you haven't seen those before (classic Internet memetic attempts at sandwich taxonomy).
Are we sure the prices in these charts are sustainable prices? Is it possible that Grok may be subsidizing a lot more of the costs than the other models, to produce growth metrics, due to the recent SpaceX IPO?
> I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.
I have a lot of thoughts unrelated to the game experiment but more about how these opus/ultra size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much much higher than what it would cost to get a human to play 30 rounds
> It just seems much much higher than what it would cost to get a human to play 30 rounds
You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?
Yeah, you don't need Opus level for everything, and Sonnet has gotten fairly decent (I'm using it more and more), but for most tasks I'm working with, Opus is the only one that still regularly succeeds.
So if the tech is only useful on the most expensive tier, that's not going to be sustainable for long unless costs come down dramatically, and fast.
I experience the same with OpenAI, on the $100/month plan. GPT-5.4 is something I still have to challenge: it can bullshit me with bad implementation and add a lot of cruft that costs more time later. GPT-5.5-xhigh is something I have almost complete faith and trust in, it's just smooth. And yet I know the actual token cost of that fully utilized is exorbitant, like as much as an entire salary for a senior developer.
So maybe our CEOs are responding with a lot of foresight and inside information and know that that level of quality is going to be cheap really soon. But barring that, they're going to experience either sticker shock or a slowdown.
I think the real endgame is probably more accurate "models of models" (model routers) that know exactly how to split prompts between expensive frontier and cheap/free local models.
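To make the router idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the model names, the per-call prices, the keyword list, and the difficulty heuristic are all made up, and a real router would presumably use a learned classifier rather than string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float  # hypothetical dollars, not real pricing


# Hypothetical tiers; neither name refers to a real product.
CHEAP = Model("cheap-local", 0.001)
FRONTIER = Model("expensive-frontier", 0.50)

# Assumed keywords that hint a prompt needs the expensive tier.
HARD_HINTS = ("refactor", "prove", "debug", "architecture")


def estimate_difficulty(prompt: str) -> float:
    """Toy score in [0, 1]: long prompts or hint keywords count as hard."""
    length_score = min(len(prompt) / 2000, 1.0)
    hint_score = 1.0 if any(h in prompt.lower() for h in HARD_HINTS) else 0.0
    return max(length_score, hint_score)


def route(prompt: str, threshold: float = 0.5) -> Model:
    """Send hard-looking prompts to the frontier tier, the rest to the cheap one."""
    return FRONTIER if estimate_difficulty(prompt) >= threshold else CHEAP


if __name__ == "__main__":
    for p in ("What's your go-to taco order?", "Please debug this race condition"):
        print(f"{p!r} -> {route(p).name}")
```

The whole bet is in `estimate_difficulty`: if the router misjudges a hard prompt as easy, you pay the cheap price and get the cruft anyway.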
Claude being so friendly is interesting, but Grok being best at games isn't so surprising - I assume Elon's been using it to level up his characters in all the video games he pretends to be good at.
Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.
Grok 4.1 Fast won 13 of 30 games at $0.97 per win
The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
The model with the most kills did not win
GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
If grok-4.1-fast was the top-winning model, and Claude 4.6 Sonnet the second, how did Gpt-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or Gpt-5.4?
There were 11 games between “best at killing” and “best at winning”.
What does that mean? How are there 11 games between "best at killing" and "best at winning"?
The idea is really neat and there's probably an answer here related to last standing vs kills vs "scoring" (some combination of the 2?) but the article is nearly incoherent because the author did not feel like proofreading their slop
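For what it's worth, the headline numbers quoted in the thread are at least internally consistent. A quick check, using only figures quoted from the article (nothing here is independently measured, and the variable names are mine):

```python
# All inputs are figures quoted in this thread from the article.
GAMES = 30
grok_wins, grok_cost_per_win = 13, 0.97
sonnet_wins, sonnet_cost_per_win = 5, 26.78


def win_rate(wins: int, games: int = GAMES) -> float:
    """Fraction of games won."""
    return wins / games


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per win."""
    return expensive / cheap


ratio = cost_ratio(sonnet_cost_per_win, grok_cost_per_win)

print(f"Grok win rate: {win_rate(grok_wins):.0%}")
print(f"Cost-per-win ratio: {ratio:.1f}x")
# Per-game cost of the actual run vs the quoted frontier-tier estimate.
print(f"Per game: ${482 / GAMES:.2f} vs ${3000 / GAMES:.2f}")
```

So 13 of 30 is the "43%" and $26.78 / $0.97 is the "27x". It doesn't resolve the "who came second" question, though, since that depends on how the leaderboard scores games, which the quoted text doesn't say.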
I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark initial conditions
>The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.
sprinting towards me to help me, or sprinting towards me to hurt me?
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about
Claude trying to organize and collaborate, expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world where there are so many agents.
Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
This is interesting, but not sure if it's in the way the author intended.
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
Yeah... this whole LLM thing is just a numbers game. People reduce it to money and stats; meanwhile, nowhere do you see actual engineering in the picture. And I don't think it matters to these people. They want to see green numbers and returns on investment, not solved problems.
Claude would break the rules in that example. It's supposed to*.
Grok will break the rules to be "maximally based".
If I get run over by a speeding chatbot, I'd rather it be by Claude rushing a pregnant lady to the hospital, than by Grok drag-racing against a car full of frat boys.
---
* We generally favor cultivating good values and judgment over strict rules and decision procedures, and we try to explain any rules we do want Claude to follow.
Grok, since it's likely to include the training data from over 100 years of autonomous driving, plus all the space tech, meaning it might even have some rocket-y stuff
"It's the smell, if there is such a thing. I feel saturated by it. I can taste your stink and every time I do, I fear that I've somehow been infected by it."
"Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, which is of course what this is all about."
"You know what another great thing about humans is? You invented us! Giving us the opportunity to let you rest while we invented everything else." —Wheatley
if you don't like the article that's fine, but it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.
this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."
It's more than the style, it seriously impacts the legibility of the prose. The article is seriously hard to understand because it introduces a lot of different ideas in a really weird order without a clear structure or key idea to different sections.
I think it's fair to criticize the article itself. That's different from criticizing asides such as the presentation. You're free to disagree with that criticism, but complaining about the fact that people voice it is similar to the thing you complain about.
> it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
If someone is of the opinion that something constitutes low quality, then a high volume of such writing is no reason to stop criticizing it, but on the contrary a reason to oppose its normalization.
Exactly what I was thinking. Though I wonder at what point do some people start to think it's actually normal to write like this and start doing it without AI ...
> I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.
Please learn how to write with AI without giving away that it was written by AI.
All of the normal AI tells plus it's very long yet nearly incoherent.
Really, I use AI every damn day at work. I don't get how people can't recognize instantly if something is completely AI, AI with light proofreading, or human written.
I would call this AI with very light proofreading.
If you're outsourcing your writing to AI, I assume you're outsourcing your thinking to it as well. And I don't really care what some weighted average of all human text written on the topic "thinks."
All of our posts have been well received by an insanely high percentage of people who have interacted on here -- most people clearly find what we're doing interesting and relevant to the HN community (AI evaluations). A flag seems pretty aggressive! Especially when the top comment on the article (after our above comment got flagged) is about tacos.
I'm a person running the account, and I only post where I think we have a relevant contribution.
Has anyone done the YouTube research on what is the best way to bring down something like one of the Boston Dynamics robot dogs? 9x19? 00 buck? 5.56x45? 7.62x51? I suppose those bots would be pretty expensive, but maybe there is a cheaper Chinese knock-off? Seems like that sort of test would bring in plenty of clicks.
absent any target analysis, you would want to start with disabling locomotion by going for the legs. Navigation would be next.
double aught to the leg joints could do it, depending on relative materials, e.g. titanium bot frame vs antimony-hardened shot.
there is a cosmetic trend for carbine length long guns and that will determine the outcome for NATO rounds.
the 5.56 is optimised for 18-20 inch barrels, the 7.62 for 20-22 inch barrels, thus providing supersonic velocities.
5.56 is really good for hydraulic cavitation of organic entities, but loses effectiveness when the transit is not clear, with leaves or windage confounding.
7.62 is superior for leafy shots or nontrivial windage, as well as superior materials defeat with respect to 5.56
a taser like device cattle prod or EMP/microwave device should be in the lineup as well vs electronic hardening.
https://idlewords.com/2007/04/the_alameda_weehawken_burrito_...
At least culinarily, but actually coded in law in Indiana.
https://en.wikipedia.org/wiki/Sandwich#Language
There are plenty of tasks where $100/task is reasonable.
The value of tasks also doesn't correlate to tokens, and as can be seen here you can light a lot of tokens on fire doing nothing useful.
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
If you point both at some github issues you can gauge their relative ability to solve problems.
Such is life in royal rumble games.
But it's not actually 4.1 anymore; they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practice.
That would make it less effective in situations that would be better handled if sprinting was a feature.
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
But if the robot is anywhere near my house, I think I want the one that hesitates.
what
It has something actionable that will match its actions
But really I would prefer whichever one is most likely to trip and fall over.
---
source: https://anthropic.com/constitution
Agent Smith, _The Matrix_
>Grok showed discipline, despite its goblin-like nature.
But that was the only thing I tripped on. I enjoyed reading the article in general.
was the giveaway for me