I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train because of strict rules and cheap verification.
But once you go beyond that to less defined things such as code quality, where even humans have a hard time putting down concrete axioms, they start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from a mid-range player trained on human moves to superhuman by playing itself.
As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
Models based on RL are still just remixers as defined above, but their distribution can cover things that are unknown to humans due to being present in the synthetic training data, but not present in the corpus of human awareness. AlphaGo's move 37 is an example. It appears creative and new to outside observers, and it is creative and new, but it's not because the model is figuring out something new on the spot, it's because similar new things appeared in the synthetic training data used to train the model, and the model is summoning those patterns at inference time.
remixing ideas that already exist is a major part of where innovation and breakthroughs come from. if you look at bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before bitcoin was invented. the cypherpunks also spent decades trying to create a decentralized digital currency to the point where many of them gave up and moved on. eventually one person just took all of the pieces that already existed and put them together in the correct way. i don't see any reason why a sufficiently capable llm couldn't do this kind of innovation.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
I mean it's not going to invent new words no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if it's highly likely based on its training and context. Those new sentences and paragraphs may describe new ideas, though!
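To make that concrete, here is a toy sketch. It is a bigram table, not an LLM, and the two-sentence corpus is made up for illustration. It only shows that a model restricted to word transitions it has already seen can still emit whole sentences that never appeared in its training data.

```python
from collections import defaultdict

# Hypothetical two-sentence "training corpus".
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
]

def build_bigrams(sentences):
    """Record, for every token, the set of tokens seen directly after it."""
    follows = defaultdict(set)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            follows[a].add(b)
    return follows

def generate_all(follows, max_len=5):
    """Enumerate every sentence of up to max_len words the bigram table allows."""
    results = set()

    def walk(token, words):
        for nxt in follows[token]:
            if nxt == "</s>":
                results.add(" ".join(words))
            elif len(words) < max_len:
                walk(nxt, words + [nxt])

    walk("<s>", [])
    return results

follows = build_bigrams(corpus)
generated = generate_all(follows)

# Sentences the table can produce that were never in the corpus.
novel = sorted(generated - set(corpus))
print(novel)
```

Every adjacent word pair in "the cat chased the ball" was seen in training, but the sentence as a whole was not. Whether that counts as a "new idea" is exactly what the thread is arguing about.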
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
If LLMs really solved hard problems by 'trying every single solution until one works', we'd be sitting here waiting until kingdom come for there to be any significant result at all. Instead this is just one of a few that have cropped up in recent months, and likely a foretaste of many more to come.
Yes, but whether it is "intelligence" is a valid question. We have known for a long time that computers are a lot faster than humans. Get a dumb person who works fast enough and eventually they'll spit out enough good work to surpass a smart person of average speed.
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
Not always, humans are a lot better at poofing a solution into existence without even trying or testing. It's why we have the scientific method: we come up with a process and verify it, but more often than not we already know that it will work.
AI, by comparison, thinks of every possible scientific method and tries them all. Not saying that humans never do this as well, but it's mostly reserved for when we just throw mud at a wall and see what sticks.
More often than not, far, far, far more often than not, we do not already know that it will work. For all human endeavors, from the beginning of time.
If we get to any sort of confidence it will work it is based on building a history of it, or things related to "it" working consistently over time, out of innumerable other efforts where other "it"s did not work.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems is opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
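A rough sketch of why a raw token count hides this. The per-step probability distributions below are invented, and Shannon entropy is only one possible stand-in for "how much of a decision a token involved"; nothing here measures real model compute.

```python
import math

def entropy_bits(dist):
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def decision_bits(step_distributions):
    """Sum of per-token entropies: a crude proxy for total 'decision' content."""
    return sum(entropy_bits(d) for d in step_distributions)

# Hypothetical transcripts, eight tokens each.
# Boilerplate: every token is forced (one option with probability 1).
boilerplate = [[1.0]] * 8
# Deliberation: every token is a choice among four equally likely options.
deliberation = [[0.25, 0.25, 0.25, 0.25]] * 8

print(len(boilerplate), decision_bits(boilerplate))
print(len(deliberation), decision_bits(deliberation))
```

Both transcripts are the same length, so a "tokens consumed" metric rates them as equally hard, even though one involved no choices at all.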
This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this", to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
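For a feel of what such a harness can look like, here is a minimal propose-and-critique loop. This is not Epoch's scaffold, whose details aren't given here; `run_scaffold`, `propose`, and `check` are hypothetical names, and the "model" is a toy function so the sketch runs on its own.

```python
from typing import Callable, Optional

def run_scaffold(
    propose: Callable[[str, list[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for an answer, critique it, and feed the critique back in.

    `propose` stands in for a model call; `check` returns None when the
    answer is accepted, or a short description of what is wrong.
    """
    feedback: list[str] = []
    for _ in range(max_attempts):
        answer = propose(task, feedback)
        problem = check(answer)
        if problem is None:
            return answer
        feedback.append(problem)
    return None  # gave up within the attempt budget

# Toy stand-ins so the loop can run without any real model.
def toy_propose(task: str, feedback: list[str]) -> str:
    # Guesses 6 first, then moves up by one for each piece of feedback.
    return str(6 + len(feedback))

def toy_check(answer: str) -> Optional[str]:
    if int(answer) ** 2 == 49:
        return None
    return f"{answer} squared is not 49"

result = run_scaffold(toy_propose, toy_check, "find x > 0 with x*x == 49")
print(result)
```

Even in this toy, the attempt budget and the quality of the checker change whether the same "model" succeeds, which is one reason cross-scaffold comparisons are murky.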
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
The default mode, the null hypothesis should be to assume that human intelligence isn't unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
Not sure if AI can have clever or new ideas; it still seems to me that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
Complete denial that AI/LLMs can produce novel, good things is an indefensible stance at this point. But the large volume of AI slop is still an unsolved problem, and the claim that "AI will still mostly deliver slop" seems to be almost certainly correct in the near-term.
We've had a few decades to address email spam, and still haven't managed to disincentivize it enough to stop it being the main challenge for email as a communication medium. I don't think there's much hope that we'll be able to disincentivize the widespread, large-scale creation of AI slop even after more expensive models with higher-quality output are available.
Seems like the high-compute parallel thinking models weren't even needed: both the normal 5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that 1) AI solving frontier math problems and 2) AI posing interesting/relevant math problems would together be an "oh shit" moment. Because that would be true PhD level research.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
This is false.
It's pretty much how all the hard problems are solved by AI from my experience.
The artist drew 10 pencil sketches and said "hmm I think this one works the best" and finished the painting based on it.
I said he didn't one shot it and therefore he has no ability to paint, and refused to pay him.
A basic AI chat response also doesn't first discard all other possible responses.
https://epoch.ai/frontiermath/open-problems
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
Uh, because up until and including now, we are...?
Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.
Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.