Taking on CUDA with ROCm: 'One Step After Another'

(eetimes.com)

106 points | by mindcrime 6 hours ago

18 comments

lrvick 5 hours ago
Just spent the last week or so porting TheRock to stagex in an effort to get ROCm built with a native musl/mimalloc toolchain and get it deterministic for high security/privacy workloads that cannot trust binaries only built with a single compiler.
It has been a bit of a nightmare and had to package like 30+ deps and their heavily customized LLVM, but got the runtime to build this morning finally.
Things are looking bright for high security workloads on AMD hardware due to them working fully in the open however much of a mess it may be.
- WhyNotHugo 3 hours ago
  I also attempted to package ROCm on musl. Specifically, packaging it for Alpine Linux.
  It truly is a nightmare to build the whole thing. I got past the custom LLVM fork and a dozen other packages, but eventually decided it had been too much of a time sink.
  I’m using llama.cpp with its Vulkan support and it’s good enough for my uses. Vulkan is already there and just works. It’s probably on your host too, since so many other things rely on it anyway.
  That said, I’d be curious to look at your build recipes. Maybe it can help power through the last bits of the Alpine port.
  - lrvick 2 hours ago
    Keep an eye out for a stable rocm PR to stagex in the next week or so if all goes well.
- 999900000999 2 hours ago
  Wait?
  You don't trust Nvidia because the drivers are closed source?
  I think Nvidia's pledged to work on the open source drivers to bring them closer to the proprietary ones.
  I'm hoping Intel can catch up; at 32GB of VRAM for around $1000 it's very accessible.
  - lrvick 2 hours ago
    Nvidia has been pledging that for years. If it ever actually happens, I am here for it.
    - shaklee3 26 minutes ago
      It happened 2 years ago:
      https://developer.nvidia.com/blog/nvidia-transitions-fully-t...
  - cmxch 1 hour ago
    > Intel
    For some workloads, the Arc Pro B70 actually does reasonably well when cached.
    With some reasonable bring-up, it also seems to be more usable than the 32GB R9700.
- jauntywundrkind 4 hours ago
  https://github.com/ROCm/TheRock/issues/3477 makes me quite sad for a variety of reasons. It shouldn't be like this. This work should be usable.
  - lrvick 4 hours ago
    Oh I fully abandoned TheRock in my stagex ROCm build stack. It is not worth salvaging, but it was an incredibly useful reference for me to rewrite it.
- salawat 50 minutes ago
  >Just spent the last week or so porting TheRock to stagex in an effort to get ROCm built with a native musl/mimalloc toolchain and get it deterministic for high security/privacy workloads that cannot trust binaries only built with a single compiler.
  ...I have a feeling you might not be at liberty to answer, but... Wat? The hell kind of "I must apparently resist Reflections on Trusting Trust" kind of workloads are you working on?
  And what do you mean "binaries only built using a single compiler"? Like, how would that even work? Compile the .o's with compiler specific suffixes then do a tortured linker invo to mix different .o's into a combined library/ELF? Are we talking like mixing two different C compilers? Same compiler, two different bootstraps? Regular/cross-mix?
  I'm sorry if I'm pushing for too much detail, but as someone who's actually bootstrapped compilers/user spaces from source, your use case intrigues me just by the phrasing.
0xbadcafebee 3 hours ago
AMD has years of catching up to do with ROCm just to get their devices to work well. They don't support all their own graphics cards that can do AI, and when it is supported, it's buggy. The AMDGPU graphics driver for Linux has had continued instability since 6.6. I don't understand why they can't hire better software engineers.
- xethos 1 hour ago
  > I don't understand why they can't hire better software engineers.
  Beyond the fact they're competing with the most valuable companies in the world for talent while being less than a decade past "Bet the company"-level financial distress?
- onlyrealcuzzo 2 hours ago
  Because they aren't willing to pay for them?
- oofbey 2 hours ago
  Years. They neglected ROCm for soooo long. I have friends who worked there 5+ years ago who tried desperately to convince execs to invest more in ROCm and failed. You had to have your head stuck pretty deep in the sand back then to not see that AI was becoming an important workload.
  I would love AMD to be competitive. The entire industry would be better off if NVIDIA was less dominant. But AMD did this to themselves. One hundred percent.
  - tux1968 1 hour ago
    It would be very helpful to deeply understand the truth behind this management failing. The actual players involved, and their thinking. Was it truly a blind spot? Or was it mistaken priorities? I mean, this situation has been so obvious and tragic, that I can't help feeling like there is some unknown story-behind-the-story. We'll probably never really know, but if we could, I wouldn't spend quite as much time wearing a tinfoil hat.
    - throwawayrgb 1 hour ago
      if you asked AMD execs they'd probably say they never had the money to build out a software team like NVIDIA's. that might only be part of the answer. the rest would be things like lack of vision, "can't turn a tanker on a dime", etc.
    - oofbey 1 hour ago
      My guess is it’s just incompetence. Imagine you’re in charge of ROCm and your boss asks you how it’s going. Do you say good things about your team and progress? Do you highlight the successes and say how you can do all the major things CUDA can? I think many people would. Or do you say to your boss “the project I’m in charge of is a total disaster and we are a joke in the industry”? That’s a hard thing to say.
      - Shitty-kitty 17 minutes ago
        a 10-year lead can't be closed overnight, but Intel had an even larger lead and look how the mighty have fallen.
      - throwawayrgb 35 minutes ago
        > My guess is it’s just incompetence.
        maybe on some level but not that level you're describing. pretty much everyone at AMD understands the situation, and has for a while.
suprjami 30 minutes ago
Just in time for Vulkan tg to be faster in almost all situations, and Vulkan pp to be faster in many situations with constant improvements on the way, making ROCm obsolete for inference.
- kimixa 3 minutes ago
  ROCm vs Vulkan has never been about performance - you should be able to represent the "same" shader code in either, and often they back onto the same compilers and optimizers anyway. If one is faster, that often means something has gone /wrong/.
  The advantages for ROCm would be integration into existing codebases/engineer skillsets (e.g. porting an existing C++ implementation of something to the GPU with a few attributes and API calls rather than rewriting the core kernel in something like GLSL and all the management vulkan implies).
jmward01 59 minutes ago
I really want to get to the point that I am looking online for a GPU and Nvidia isn't the requirement. I think we are really close to there. Maybe we are there and my level of trust just needs to bump up.
rdevilla 3 hours ago
ROCm is not supported on some very common consumer GPUs, e.g. the RX 580. Vulkan backends work just fine.
- daemonologist 1 hour ago
  ROCm usually only supports two generations of consumer GPUs, and sometimes the latest generation is slow to gain support. Currently only RDNA 3 and RDNA 4 (RX 7000 and 9000) are supported: https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
  It's not ideal. CUDA for comparison still supports Turing (two years older than RDNA 2) and if you drop down one version to CUDA 12 it has some support for Maxwell (~2014).
- chao- 2 hours ago
  I purchased my RX 580 in early 2018 and used it through late 2024.
  I am critical of AMD for not fully supporting all GPUs based on RDNA1 and RDNA2. While more backwards compatibility is always better for the consumer, the RX 580 was a lightly-updated RX 480, which came out in 2016. Yes, ROCm technically came out in 2016 as well, but I don't mind acknowledging that it is a different beast to support the GCN architecture than the RDNA/CDNA generations that followed (Vega feels like it is off on an island of its own, and I don't even know what to say about it).
  As cool as it would be to repurpose my RX 580, I am not at all surprised that GCN GPUs are not supported for new library versions in 2026.
  I would be MUCH more annoyed if I had any RDNA1 GPU, or one of the poorly-supported RDNA2 GPUs.
- maxloh 2 hours ago
  I have the same experience with my RX 5700. The supported ROCm version is too old to get Ollama running.
  The Vulkan backend of Ollama works fine for me, but it took a year or two for them to officially support it.
- BobbyTables2 3 hours ago
  Did it use to be different?
  A few years ago I thought I had used the ROCm drivers/libraries with hashcat on an RX 580.
  Now it’s obsolete?
- hurricanepootis 3 hours ago
  RX 580 is a GCN 4 GPU. I'm pretty sure the bare minimum for ROCm is GCN 5 (Vega) and up.
  - daemonologist 1 hour ago
    Among consumer cards, latest ROCm supports only RDNA 3 and RDNA 4 (RX 7000 and RX 9000 series). Most stuff will run on a slightly older version for now, so you can get away with RDNA 2 (6000 series).
bruce343434 3 hours ago
In my experience fiddling with compute shaders a long time ago, cuda and rocm and opencv are way too much hassle to set up. Usually it takes a few hours to get the toolkits and SDK up and running, that is, if you CAN get them up and running. The dependencies are way too big as well; cuda is 11GB??? Either way, just use Vulkan. Vulkan "just works" and doesn't lock you into Nvidia/AMD.
- cmovq 3 hours ago
  Vulkan is a pain for different reasons. Easier to install sure, but you need a few hundred lines of code to set up shader compilation and resources, and you’ll need extensions to deal with GPU addresses like you can with CUDA.
  - rdevilla 2 hours ago
    Ah yes, but those hundred lines of code are basically free to produce now with LLMs...
roenxi 4 hours ago
> Challenger AMD’s ability to take data center GPU share from market leader Nvidia will certainly depend on the success or failure of its AI software stack, ROCm.
I don't think this is true. CUDA is a huge advantage for Nvidia, but as far as I can tell it is more a set of R&D libraries than anything else, so all the Hot New Stuff keeps being Nvidia first and only (to start with), as the library ecosystem for the hotness doesn't exist yet. Then eventually new libraries are created that are CUDA independent and AMD turns out to make pretty good graphics cards.
I wouldn't be surprised if ROCm withered on the vine and AMD still did fine.
p1esk 4 hours ago
Someone from AMD posted this a few minutes ago, then deleted it:
"Anush's success is due more to opting out of internal bureaucracy than anything else. most Claude use at AMD goes through internal infrastructure that can take hundreds of seconds per response due to throttling. Anush got us an exemption to use Anthropic directly. he is also exempt from normal policies on open source and so I can directly contribute to projects to add AMD support. He's an effective leader and has turned ROCm into an internal startup based in California. Definitely worth joining the team even if you've heard bad things about AMD as a whole."
This kind of bullshit is why I don't want to join AMD, even if this particular team is temporarily exempt from it.
- nl 3 hours ago
  > he is also exempt from normal policies on open source and so I can directly contribute to projects to add AMD support.
  It's crazy that this is a big deal.
  I understand the need for some kind of governance around this but for it to require a special exemption just shows how far the AMD culture needs to shift.
  - 0xbadcafebee 3 hours ago
    Liability is always a big deal.
    - nl 1 hour ago
      Sure, but it's not like other large companies don't have policies that address this.
- noident 49 minutes ago
  Policies like these are widespread in most companies with >1000 employees
- brcmthrowaway 3 hours ago
  So join NVIDIA instead
hurricanepootis 4 hours ago
I've been using ROCm on my Radeon RX 6800 and my Ryzen AI 7 350 systems. I've only used it for GPU-accelerated rendering in Cycles, but I am glad that AMD has an option that isn't OpenCL now.
superkuh 5 hours ago
AMD hasn't signaled in behavior or words that they're going to actually support ROCm on $specificdevice for more than 4-5 years after release. Sometimes it's as little as the high 3.x years for shrinks like the consumer AMD RX 580. And often the ROCm support for consumer devices isn't out until a year after release, further cutting into that window.
Meanwhile nvidia just dropped CUDA/driver support for 1xxx series cards from their most recent drivers this year.
For me ROCm's mayfly lifetime is a dealbreaker.
- mindcrime 5 hours ago
  Last year, AMD ran a GitHub poll for ROCm complaints and received more than 1,000 responses. Many were around supporting older hardware, which is today supported either by AMD or by the community, and one year on, all 1,000 complaints have been addressed, Elangovan said. AMD has a team going through GitHub complaints, but Elangovan continues to encourage developers to reach out on X where he’s always happy to listen.
  Seems like they're making some effort in that direction at least. If you have specific concerns, maybe try hitting up Anush Elangovan on Twitter?
- SwellJoe 4 hours ago
  Is it really that short? This support matrix shows ROCm 7.2.1 supporting quite old generations of GPUs, going back at least five or six years. I consider longevity important, too, but if they're actively supporting stuff released in 2020 (CDNA), I can't fault them too much. With open drivers on Linux, where all the real AI work is happening, I feel like this is a better longevity story than nvidia...where you're dependent on nvidia for kernel drivers in addition to CUDA.
  https://rocm.docs.amd.com/en/latest/compatibility/compatibil...
- lrvick 5 hours ago
  ROCm is open source and TheRock is community maintained, and in a minute the first Linux distro will have native in-tree builds. It will be supported for the foreseeable future due to AMD's open development approach.
  It is Nvidia that has the track record of closed drivers and insisting on doing all software dev without community improvements to expected results.
  - KennyBlanken 4 hours ago
    > expected results
    The de facto GPU compute platform? With the best feature set?
    - lrvick 4 hours ago
      And the worst privacy, transparency, and FOSS integration due to their insistence on a heavily proprietary stack.
      Also pretty hard to beat a Strix Halo right now in TPS for the money and power consumption.
      Even that aside there exist plenty like me that demand high freedom and transparency and will pay double for it if we have to.
      - KennyBlanken 4 hours ago
        > And the worst privacy, transparency, and FOSS integration due to their insistence on a heavily proprietary stack.
        The market doesn't care about any of that. The consumer market doesn't care, and the commercial market definitely does not. The consumer market wants the most Fortnite frames per second per dollar. The commercial market cares about how much compute they can do per watt, per slot.
        > there exist plenty like me that demand high freedom and transparency and will pay double for it if we have to.
        The four percent share of the datacenter market and five percent of the desktop GPU market say (very strongly) otherwise.
        I have a 100% AMD system in front of me so I'm hardly an NVIDIA fanboy, but you thinking you represent the market is pretty nuts.
        - lrvick 4 hours ago
          I did not claim to represent the market as a whole, but I feel I likely represent a significant enough segment of it that AMD is going to be just fine.
          I think local power efficient LLMs are going to make those datacenter numbers less relevant in the long run.
- canpan 5 hours ago
  I was thinking to get 2x r9700 for a home workstation (mostly inference). It is much cheaper than a similar nvidia build. But still not sure if good value or more trouble.
  - stephlow 4 hours ago
    I own a single R9700 for the same reason you mentioned, looking into getting a second one. Was a lot of fiddling to get working on arch but RDNA4 and ROCm have come a long way. Every once in a while arch package updates break things but that’s not exclusive to ROCm.
    LLMs run great on it; it’s happily running gemma4 31b at the moment and I’m quite impressed. For the amount of VRAM you get it’s hard to beat, apart from the Intel cards maybe. But the driver support doesn’t seem to be that great there either.
    Had some trouble with running comfyui, but it’s not my main use case, so I have not spent a lot of time figuring that out yet.
    - canpan 4 hours ago
      Thanks for the answer. Brings my hope up. Looking in my local shops, I can get 3 cards for the price of one 5090.
      May I ask, what kind of tok/s you are getting with the r9700? I assume you got it fully in vram?
      - jhgorrell 2 hours ago
        Stock install, no tuning.
        $ uname -r
        6.8.0-107-generic
        $ ollama --version
        ollama version is 0.20.2
        $ ollama run "gemma4:31b" --verbose "write fizzbuzz in python."
        [...]
        total duration:       45.141599637s
        load duration:        143.633498ms
        prompt eval count:    21 token(s)
        prompt eval duration: 48.047609ms
        prompt eval rate:     437.07 tokens/s
        eval count:           1057 token(s)
        eval duration:        44.676612241s
        eval rate:            23.66 tokens/s
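For anyone comparing these numbers with their own runs: the two rates in that output are simply token count divided by duration. A minimal sketch that recomputes them from the figures quoted above (the helper name `tokens_per_second` is made up for illustration and is not part of ollama):

```python
# Figures copied from the ollama --verbose output quoted above.
EVAL_COUNT = 1057                  # generated tokens
EVAL_DURATION_S = 44.676612241     # seconds spent generating
PROMPT_COUNT = 21                  # prompt tokens
PROMPT_DURATION_S = 0.048047609    # 48.047609 ms spent on the prompt


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Hypothetical helper: throughput as tokens divided by wall time."""
    return tokens / seconds


gen_rate = tokens_per_second(EVAL_COUNT, EVAL_DURATION_S)
prompt_rate = tokens_per_second(PROMPT_COUNT, PROMPT_DURATION_S)

print(f"generation:  {gen_rate:.2f} tok/s")
print(f"prompt eval: {prompt_rate:.2f} tok/s")
```

The generation rate is the figure that matters for interactive use; the prompt rate here is measured over only 21 tokens, so it says little about long-context prefill.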
      - theoli 1 hour ago
        I have a dual R9700 machine, with both cards on PCIe gen4 x8 slots. The 256bit GDDR6 memory bandwidth is the main limiting factor and makes dense models above 9b fairly slow.
        The model that is currently loaded full time for all workloads on this machine is Unsloth's Q3_K_M quant of Qwen 3.5 122b, which has 10b active parameters. With almost no context usage it will generate 59 tok/sec. At 10,000 input tokens it will prefill at about 1500 tok/sec and generate at 51 tok/sec. At 110,000 input tokens it will prefill at about 950 tok/sec and generate at 30 tok/sec.
        Smaller MoE models with 3b active will push 70 tok/sec at 10,000 context. Dense models like Qwen 3.5 27b and Devstral Small 2 at 24b will only generate at around 13 - 15 tok/sec with 10,000 context.
        This is all on llama.cpp with the Vulkan backend. I didn't get too far in testing / using anything that requires ROCm because there is an outstanding ROCm bug where the GPU clock stays at 100% (and drawing like 60 watts) even when the model is not processing anything. The issue is now closed but multiple commenters indicate it is still a problem. Using the Vulkan backend my per-card idle draw is between 1 and 2 watts with the display outputs shut down and no kernel frame buffer.
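The memory-bandwidth point can be made concrete with a back-of-the-envelope bound: during generation every active weight has to be read once per token, so tokens per second cannot exceed bandwidth divided by the bytes of active weights. The sketch below is only an estimate under loud assumptions: 640 GB/s per card (a 256-bit bus at an assumed 20 Gbps per pin; only the bus width comes from the comment above), an assumed average of 4.5 bits per weight, and no accounting for KV cache reads, compute limits, or multi-GPU splits. The function name is hypothetical.

```python
# Assumptions, not figures from the thread:
ASSUMED_BANDWIDTH_GB_S = 640.0   # 256-bit bus * 20 Gbps per pin / 8 bits per byte
ASSUMED_BITS_PER_WEIGHT = 4.5    # rough average for a 4-bit-class quant


def decode_upper_bound_tok_s(active_params_billion: float,
                             bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT,
                             bandwidth_gb_s: float = ASSUMED_BANDWIDTH_GB_S) -> float:
    """Hypothetical helper: bandwidth-bound ceiling on generation speed.

    Each generated token reads all active weights once, so the ceiling is
    bandwidth / (bytes of active weights).
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


dense_27b = decode_upper_bound_tok_s(27)   # dense model: all 27b weights are active
moe_10b = decode_upper_bound_tok_s(10)     # MoE with 10b active parameters

print(f"dense 27b ceiling:      {dense_27b:.1f} tok/s")
print(f"MoE 10b-active ceiling: {moe_10b:.1f} tok/s")
```

Real throughput lands well below these ceilings, but the ratio between them is the useful part: with fewer active parameters per token, a MoE model reads proportionally less memory, which matches the direction of the dense-versus-MoE gap reported above.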
  - chao- 5 hours ago
    Talking to friends who have fought more homelab battles than I ever will, my sense is that (1) AMD has done a better job with RDNA4 than the past generations, and (2) it seems very workload-dependent whether AMD consumer gear is "good value", "more trouble", or both at the same time.
    Edit: I misread the "2x r9700" as "2 rx9700" which differs from the topic of this comment (about RDNA4 consumer SKUs). I'll keep my comment up, but anyone looking to get Radeon PRO cards can (should?) disregard.
    - KennyBlanken 4 hours ago
      Given RDNA3 was a pathetic joke, it wouldn't be hard for them to do a better job.
  - cyberax 5 hours ago
    I have this setup, with 2x 32GB cards. It's perfect for my needs, and cheaper than anything comparable from NV.
- hotstickyballs 5 hours ago
  Driver support eats directly into driver development
alecco 5 hours ago
Apple got it right with unified memory with wide bus. That's why Mac Minis are flying for local models. But they are 10x less powerful in AI TOPS. And you can't upgrade the memory.
I really wish AMD and Intel boards get replaced by competent people. They could do it in very short time. Both have integrated GPUs with main memory. AMD and Intel have (or at least used to have) serious know-how in data buses and interconnects, respectively. But I don't see any of that happening.
ROCm? It can't even support decent Attention. It lacks a lot of features and NVIDIA is adding more each year. Soon they will reach escape velocity and nobody will catch them for a decade. smh
- caycep 4 hours ago
  Granted, I feel like NVIDIA GPU pricing is such that Mac minis will be way less than 10x cheaper if not already, so one might still get ahead purchasing a bulk order of Mac minis....
  - KennyBlanken 4 hours ago
    A 5090 will cost you about the same amount of money as a Mac Studio M3 Ultra with eight times the RAM.
    It's pretty insane how overpriced NVIDIA hardware is.
    - corndoge 4 hours ago
      But the 5090 can run Crysis
    - LoganDark 4 hours ago
      Yes but the 5090 can run games.
      Running games on my loaded M4 Max is worse than on my 3090 despite the over-four-year generational gap.
      Like, Pacific Drive will reach maybe 30fps at less than 1080p whereas the 3090 will run it better even in 4K.
      That could just be CrossOver's issue with Unreal Engine games, but "just play different games" is not a solution I like.
- bsder 4 hours ago
  > I really wish AMD and Intel boards get replaced by competent people.
  Intel? Agreed. But AMD is making money hand over fist with enterprise AI stuff.
  Right now, any effort that AMD or NVIDIA expend on the consumer sector is a waste of money that they could be spending making 10x more at the enterprise level on AI.
ycui1986 3 hours ago
For many LLM loads, it seems ROCm is slower than Vulkan. What’s the point?
shmerl 5 hours ago
Side question, but why not advance something like Rust GPU instead as a general approach to GPU programming? https://github.com/Rust-GPU/rust-gpu/
From all the existing examples, it really looks the most interesting.
I.e. what I'm surprised about is lack of backing for it from someone like AMD. It doesn't have to immediately replace ROCm, but AMD would benefit from it advancing and replacing the likes of CUDA.
- LegNeato 2 hours ago
  One of the rust-gpu maintainers here. Haven't officially heard from anyone at AMD but we've had chats with many others. Happy to talk with whomever! I would imagine AMD is focusing on ROCm over Vulkan for compute right now as their pure datacenter play, which makes sense.
  We've started a company around Rust on the GPU btw (https://www.vectorware.com/), both CUDA and Vulkan (and ROCm eventually I guess?).
  Note that most platform developers in the GPU space are C++ folks (lots of LLVM!) and there isn't as much demand from customers for Rust on the GPU vs something like Python or Typescript. So Rust naturally gets less attention and is lower on the list...for now.
- MobiusHorizons 4 hours ago
  From the readme:
  > Note: This project is still heavily in development and is at an early stage.
  > Compiling and running simple shaders works, and a significant portion of the core library also compiles.
  > However, many things aren't implemented yet. That means that while being technically usable, this project is not yet production-ready.
  Also, projects like Rust GPU are built on top of projects like CUDA and ROCm; they aren’t alternatives, they are abstractions on top.
  - shmerl 4 hours ago
    I think Rust GPU is built on top of Vulkan + SPIR-V as their main foundation, not on top of CUDA or ROCm.
    What I meant more is the language of writing GPU programs themselves, not necessarily the machinery right below it. Vulkan is good to advance for that.
    I.e. CUDA and ROCm focus on C++ dialect as GPU language. Rust GPU does that with Rust and also relies on Vulkan without tying it to any specific GPU type.
    - markisus 2 hours ago
      The article mentions Triton for this purpose. I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.
      - shmerl 1 hour ago
        > I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.
        You could argue about CPU architectures the same, no? Yet compilers solve this pretty well most of the time.
- HarHarVeryFunny 4 hours ago
  If you don't want/need to program at the lowest level possible, then PyTorch seems the obvious option for AMD support, or maybe Mojo. The Triton compiler would be another option for kernel writing.
  - shmerl 4 hours ago
    I don't think that's something that can be pitched as a CUDA alternative. Just different level.
blovescoffee 5 hours ago
Naive question, could agents help speed up building code for ROCm parity with CUDA? Outside of code, what are the bottlenecks for reaching parity?
- WorldPeas 5 hours ago
  to be honest, outside of fullstack and basic MCU stuff, these agents aren't very good. Whenever a sufficiently interesting new model comes out I test it on a couple problems for android app development and OS porting for novel cpu targets and we still haven't gotten there yet. I'd be happy to see a day where it was possible however
  - catgary 2 hours ago
    I’ve found they’re quite good when you’re higher in the compiler stack, where it’s essentially a game of translating MLIR dialects.
- hypercube33 2 hours ago
  Maybe this is dumb, but at the moment through Windows (and WSL?) you get: ROCm, DirectML, Vulkan, OpenML?
- jiggawatts 5 hours ago
  Lack of focus from AMD management. See the sibling comment: https://news.ycombinator.com/item?id=47745611
  They just don't care enough to compete.
nnevatie 1 hour ago
Why is it called "ROCm" (with the strange capitalization) in the first place? This may sound silly, but in order to compete, every detail matters, including the name.
- slongfield 1 hour ago
  It used to stand for "[R]adeon [O]pen [C]o[m]pute", but since it's not affiliated with the Open Compute Project, they dropped the meaning of it a little while ago, and now it doesn't stand for anything.
- dnautics 1 hour ago
  presumably a reference to rocm/socm robots?
- WanderPanda 1 hour ago
  This is so true! Shows a lack of care that usually doesn’t stop at just the naming