Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster

(blog.skypilot.co)

95 points | by hopechong 5 hours ago

16 comments

kraddypatties 4 hours ago
I feel like most of this recent Autoresearch trend boils down to reinventing hyper-parameter tuning. Is the SOTA still Bayesian optimization when given a small cluster? It was ~3 years ago when I was doing this kind of work, haven't kept up since then.
Also, shoutout SkyPilot! It's been a huge help for going multi-cloud with our training and inference jobs (getting GPUs is still a nightmare...)!
- karpathy 3 hours ago
  Wrong and short-sighted take given that the LLM explores serially learning along the way, and can tool use and change code arbitrarily. It seems to currently default to something resembling hyperparameter tuning in absence of more specific instructions. I briefly considered calling the project “autotune” at first but I think “autoresearch” will prove to be the significantly more appropriate name.
  - achierius 2 hours ago
    Out of curiosity, what sort of things have you seen it do that better fit 'autoresearch' than 'autotune' thus far? Optimizations it made that wouldn't have been surfaced by an autotune system, I suppose.
    - karpathy 1 hour ago
      The most recent round of autoresearch (round 2), which decreased "time to GPT-2" from 1.8 hours to 1.65 hours, had some examples. I adjusted the program.md to "look at modded nanogpt project and draw inspirations from there for things to try" and it came back with a bunch of tuning, but it also tried and implemented new architecture changes, some of which actually helped, including the smear gate and the backout skip connection. These are not just hyperparameters, they are new PyTorch code. I'm now working on a more general system that can have a queue of ideas that could be sourced from arXiv papers, GitHub repos, etc.
    - jwilber 44 minutes ago
      I see this critique about autoresearch online often, but I think it’s misplaced.
      Here’s a use case that may illuminate the difference, from my own work at Nvidia. I’m currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as auxk, which I can certainly include, tuning the relevant params as you describe. However, I have several other ideas that are much different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, based on my ideas and other existing literature, Claude can try a number of new ideas, each potentially involving more code changes.
      This automated run-and-discover process is far beyond what’s possible with hyperparam search.
  - kraddypatties 3 hours ago
    I can believe that in the long run.
    Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.
    Anecdotally, we built a little MCP for arxiv to help with our internal research, noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.
  - corndoge 3 hours ago
    Would you say it's fair to describe autoresearch as a form of neural architecture search? I am curious what you think the core differences are between them.
  - westurner 2 hours ago
    Is there a cost to converge? And how much does it vary with the random seed?
    Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency
  - saberience 2 hours ago
    Have you actually used LLMs for non trivial tasks? They are still incredibly bad when it comes to actually hard engineering work and they still lie all the time, it's just gotten harder to notice, especially if you're just letting it run all night and generate reams of crap.
    Most people are optimizing for terrible benchmarks and then don't really understand what the model did anyway and just assume it did something good. It's the blind leading the blind basically, and a lot of people with an AI-psychosis or delusion.
    - nfg 2 hours ago
      Do you realise who you’re replying to?
      - emp17344 1 hour ago
        Why should we care that he’s famous?
        - nfg 59 minutes ago
          Fame doesn’t enter into it - the point is Karpathy has about as strong a claim as anyone to having “actually used LLMs for non trivial tasks”.
      - _menelaus 1 hour ago
        lolololol
- ipsum2 3 hours ago
  Hyperparam tuning that has better intuition and can incorporate architecture changes automatically. It won't invent something completely new though.
  - kraddypatties 3 hours ago
    Hm, that's fair. It does feel like there's low hanging fruit in combining "old school" methods for conducting a hyperparameter sweep efficiently _with_ the higher level architecture edit ability of Autoresearch.
    Probably would cut the number of runs down by a significant number (as far as I can tell it's doing a grid search once it decides to mess with a knob or section of the architecture).
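One "old school" method that fits the idea above is successive halving: train every candidate briefly, keep the best fraction, and give the survivors a larger budget. The sketch below is only an illustration under stated assumptions. The `train` function, its loss formula, and the eight integer configs are all hypothetical, and it assumes rankings at small budgets carry over to large ones. It is not how Autoresearch currently schedules runs.

```python
def train(config, budget):
    # Hypothetical stand-in for a training run: returns a validation loss
    # (lower is better). Config 5 is the best by construction.
    return 2.5 + 0.1 * abs(config - 5) + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    spent = 0  # total budget units consumed across all runs
    while True:
        scored = sorted(survivors, key=lambda c: train(c, budget))
        spent += budget * len(survivors)
        if len(scored) == 1:
            return scored[0], spent
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta

configs = list(range(8))
best, spent = successive_halving(configs)

# A plain grid sweep would train all 8 configs at the final budget of 8 units.
full_grid_cost = len(configs) * 8
```

In this toy setup the halving schedule spends 32 budget units against 64 for the full sweep and still lands on the same config. The saving disappears if short runs rank candidates differently from long ones.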
pbkhrv 2 hours ago
> How parallelism changed the agent’s research strategy
> With a single GPU, the agent is stuck doing greedy hill-climbing: try one thing, check the result, pick a direction, try the next thing. With 16 GPUs, the strategy shifts. ...skip... 12 experiments in a single 5-minute wave. This makes it much harder to get stuck in local optima and much easier to find interaction effects between parameters.
The agent can theoretically come up with a protocol to run those same 12 experiments one-by-one and only then decide which branch to explore next - which I think would lead to the same outcome?
But in this case, it just happened to have stumbled on this particular outcome only because it didn't get a chance to execute a greedy strategy after the first 1 or 2 results.
Worse experiment design + parallelism = better experiment design + serialized execution ?
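A toy sketch of the point above, assuming a made-up loss table with an interaction effect between two binary knobs (every number is invented for illustration). Greedy one-change-at-a-time search stalls at the start config, while evaluating the whole wave before deciding finds the better config whether the wave runs in parallel or one run at a time.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical validation losses (lower is better). Flipping either knob
# alone hurts; flipping both together helps.
LOSS = {(0, 0): 1.00, (1, 0): 1.05, (0, 1): 1.04, (1, 1): 0.90}

def run_experiment(config):
    return LOSS[config]

def greedy(start):
    """Flip one knob at a time and accept the first change that improves."""
    best = start
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            candidate = tuple(1 - v if j == i else v for j, v in enumerate(best))
            if run_experiment(candidate) < run_experiment(best):
                best, improved = candidate, True
                break
    return best

def wave(configs, parallel):
    """Evaluate every config first, then pick the best one."""
    if parallel:
        with ThreadPoolExecutor(max_workers=len(configs)) as pool:
            losses = list(pool.map(run_experiment, configs))
    else:
        losses = [run_experiment(c) for c in configs]
    return min(zip(losses, configs))[1]

configs = list(product((0, 1), repeat=2))
greedy_result = greedy((0, 0))
serial_result = wave(configs, parallel=False)
parallel_result = wave(configs, parallel=True)
```

Here `greedy_result` stays at `(0, 0)` while both wave variants return `(1, 1)`. The difference comes from deciding after the whole batch, not from the hardware.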
zhwu 4 hours ago
The most surprising part: the agent had access to both H100s and H200s. Without being told, it noticed H200s scored better and started screening ideas on H100s, then promoting winners to H200s for validation. That strategy emerged entirely on its own.
- rogerrogerr 4 hours ago
  Why do we think this emerged “on its own”? Surely this technique has been discussed in research papers that are in the training set.
  - fdghrtbrt 2 hours ago
    Why surely? Have you never seen an LLM try something new?
    - rogerrogerr 2 hours ago
      Is your assertion that no one has ever written "we tried some stuff on the small inexpensive platform first, then moved to the bigger more expensive platform with the more promising options" in a research paper or literally anywhere else?
      - fdghrtbrt 1 hour ago
        No, that's not my assertion. In fact I asserted nothing at all.
        - rogerrogerr 1 hour ago
          You're speaking in riddles; your communication would be more effective if you didn't do that.
          - fdghrtbrt 1 hour ago
            You said "surely", and I asked:
            > Why surely? Have you never seen an LLM try something new?
            I'm afraid I can't make it any simpler than this.
            And I still don't know the answer to how you're so sure. To me there are several explanations, and it seems to you there's only one.
            I'm pretty happy with my communication style.
            - frank_nitti 56 minutes ago
              Seems to me the commenter was asking what observations led to the original affirmative statement that “the AI did this entirely on its own”.
              Given that this is a common technique and not a novel invention, it’s probably present in the training set.
              The “surely” reads like it’s referring to the presence of that information in the training set. But your response casts it as saying “surely the AI has not invented something on its own”.
              The original question stands IMO: the burden of proof is on whoever is asserting that the AI has invented something on its own, with or without training data that surely already mentions this approach.
    - caconym_ 1 hour ago
      I honestly don't think I have.
      In this case, using a cheap(er) signal or heuristic as an initial filter before spending more resources on cases that pass the filter is a pattern that shows up all over the place, and LLMs are good at picking up on patterns like that and generalizing them. AFAICT.
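The filter-then-spend pattern described above can be sketched in a few lines. This is a generic illustration, not the agent's actual logic. The candidate names, both score tables, and the `keep` parameter are hypothetical, with the cheap evaluator standing in for the quicker screening run and the expensive one for the validation run.

```python
# Hypothetical scores (lower is better); all numbers are invented.
CHEAP_SCORE = {"a": 3.20, "b": 3.00, "c": 3.50, "d": 3.05}      # fast, rougher signal
EXPENSIVE_SCORE = {"a": 2.90, "b": 2.80, "c": 3.10, "d": 2.70}  # slow, trusted signal

calls = {"cheap": 0, "expensive": 0}

def cheap_eval(name):
    calls["cheap"] += 1
    return CHEAP_SCORE[name]

def expensive_eval(name):
    calls["expensive"] += 1
    return EXPENSIVE_SCORE[name]

def screen_then_promote(candidates, keep=2):
    """Rank everything with the cheap signal, then validate only the shortlist."""
    shortlist = sorted(candidates, key=cheap_eval)[:keep]
    return min(shortlist, key=expensive_eval)

winner = screen_then_promote(["a", "b", "c", "d"])
```

With these numbers the cheap pass ranks `b` first, but the expensive pass on the two-item shortlist picks `d`, using two expensive evaluations instead of four.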
- hhh 3 hours ago
  Why?… The experiment.yaml shows that it is calling h100/200 explicitly, it’s pretty common for humans to say “number bigger more gooder” for anything… Lie and reverse the values and see what happens. I would put money on a rabbit hole of complaining about it being misconfigured.
  - ed 3 hours ago
    Models are familiar with H100s. They even predate ChatGPT.
- Aboutplants 4 hours ago
  Yeah I thought that was a particularly neat part
herf 1 hour ago
This "early velocity only" approach seems like a problem - how do you know with 5-minute training runs that you aren't affecting the overall asymptote? e.g., what if the AI picks a quantizer that happens to be faster in the first five minutes, but has a big noise floor where it can't make more progress?
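A small worked example of the concern above, assuming (purely for illustration) that loss follows an exponential decay toward a floor. The two configs, their floors and rates, and the starting loss are invented. The point is only that the ranking at 5 minutes can reverse at a longer horizon.

```python
import math

def loss(t_minutes, floor, rate, start=6.0):
    # Assumed curve shape: exponential decay from `start` toward `floor`.
    return floor + (start - floor) * math.exp(-rate * t_minutes)

# Hypothetical configs: one converges quickly to a high noise floor,
# the other converges slowly to a lower floor.
fast_but_noisy = {"floor": 3.0, "rate": 0.5}
slow_but_clean = {"floor": 2.5, "rate": 0.1}

def better_at(t_minutes):
    a = loss(t_minutes, **fast_but_noisy)
    b = loss(t_minutes, **slow_but_clean)
    return "fast_but_noisy" if a < b else "slow_but_clean"

winner_at_5_min = better_at(5)
winner_at_60_min = better_at(60)
```

Under these assumptions the fast config leads at 5 minutes (about 3.25 versus 4.62) and loses at 60 minutes (about 3.00 versus 2.51), so a 5-minute screen would select the config with the worse asymptote.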
fabmilo 3 hours ago
I am fascinated by this example of using AI to improve AI. I won a small prize using this technique on helion kernels at a pytorch hackathon in SF.
The next steps are:
- give the agent the whole deep learning research literature and do tree search over the various ideas that have been proposed in the past.
- have some distributed notepad that any of these agents can read and improve upon.
covi 4 hours ago
This feels like the chimpanzee with a power drill. An agent is honestly just brute-force search, but guided.
- chaos_emergent 3 hours ago
  Human-driven research is also brute-force but with a more efficient search strategy. One can think of a parameter that represents research-search-space-navigation efficiency. RL-trained agents will inevitably optimize for that parameter. I agree with your statement insofar as the value of that efficiency parameter is lower for agents than humans today.
  It's really hard to imagine that they __won't__ exceed the human value for that efficiency parameter rather soon given that 1. there are plenty of scalar value functions that can represent research efficiency, of which a subset will result in robust training, and 2. that AI labs have a massive incentive to increase their research efficiency overall, along with billions of dollars and really good human researchers working on the problem.
- groby_b 3 hours ago
  Is there anything in the research space that doesn't fit "brute-force search, but guided"?
  All of science is "gather inputs, make hypothesis, test, analyse" on repeat.
  There's plenty to critique in the particular guidance approach, but the overall method is the same.
- gwern 2 hours ago
  Except the power drill isn't being used to make a better chimpanzee.
ipsum2 3 hours ago
A cluster is 2 nodes? That's technically true, but not very exciting.
saberience 2 hours ago
Wait, "Karpathy's Autoresearch", you mean a loop that prompts the agent to improve a thing given a benchmark?
People have been doing this for a year or more, Ralph loops etc.
I hate the strange Twitter world of hero-worship that seems to arise just out of large followings.
Joe no-followers does this six months ago, nobody cares. Karpathy writes a really basic loop and it's now a kind of AI miracle prompting tons of grifters, copy-cats, weird hype.
I do wonder if LLMs have just made everyone seriously, seriously dumber all of a sudden. Most of the "Autoresearch" posts I see are complete rubbish, with AI optimizing for nonsense benchmarks and people failing to understand the graphs they are looking at. So yes, the AI made itself better at a useless benchmark while also making the code worse in 10 other ways you don't actually understand.
- password54321 2 hours ago
  The number of refurbished mac minis that are available in my country has suddenly dramatically increased ever since the Clawdbot tweet. People never learn.