8 comments

  • pandaforce 53 minutes ago
    The main targets for this are NLEs like Blender. Performance is a large part of the issue: most users still just create TIFF files per frame before importing them into a "real editor" like Resolve. Apple may have ASICs for ProRes decoding, and Resolve may be the standard editor that everyone uses.

    But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4K 16-bit FFv1 on a 4080, while reading only a few gigabits of video per second, rather than the tens or even hundreds of gigabits per second that SSDs can't keep up with. No image degradation when passing intermediate copies between CG programs and editors, either.

    • dagmx 22 minutes ago
      I don’t follow the train of thought in your post.

      The reason to create image sequences is not that you need to send them to other apps; it’s that you preserve quality and safeguard against crashes.

      A crash mid-write can corrupt a lengthy video render. With image sequences you only lose the current frame.

      People aren’t going to stop using image sequences even if they stayed in the same app.

      And I’m not sure why “this goes beyond what Apple has” applies, because they do have hardware support for decoding several compressed codecs (and I’ll note that ProRes is also compressed). Other than streaming, when are you going to need that kind of encode performance? And what other codecs do you expect will suddenly pop up once ASICs aren’t required?

      Also how does this remove degradation when going between apps? Are you envisioning this enables Blender to stream to an NLE without first writing a file to disk?

  • fhn 22 minutes ago
    This article assumes all GPUs are on a PCIe bus, but some are part of the CPU package, so the distance problem is minimal and offloading to the GPU might still be a net positive. Might, because I haven't tested this.
  • hirako2000 1 hour ago
    Vulkan Compute shaders make GPU acceleration practical for intensive codecs like FFv1, ProRes RAW, and DPX. Previous hybrid GPU+CPU approaches suffered from round-trip latency; these run fully hands-off on the GPU. A big deal for editing workflows.
  • kvbev 15 minutes ago
    Could this provide an AV1 decoder for low-power hardware that lacks GPU-accelerated AV1 decoding? For my N4020 laptop, say.

    Maybe a Raspberry Pi 4, too.

  • sylware 2 hours ago
    Well, the problem with hardware decoding is that it cannot handle all the variations in data corruption, which can result in a hardware crash, sometimes not recoverable with a soft reset of the hardware block.

    It is usually more reasonable to use software decoders for really complex formats, or to accelerate only the heavy parts of decoding where data corruption is easy to deal with or benign, or to aim for a middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.

    Sometimes the software cannot even tell that the hardware has 'crashed' and is spitting out nonsense data. It gets even worse: some hardware-block hot resets do not actually work and require a power cycle... So a media player able to use hardware decoding must always provide a clear, visible button letting the user switch to full software decoding.

    Then there is the next step of "corruption": some streams out there are "wrong", but that "wrong" decodes fine on some specific decoders and not on others, even though all of them follow the same specs.

    What a mess.

    I hope those compute shaders are not written in that abomination GLSL (or the DX equivalent), but are SPIR-V shaders generated from plain and simple C code.

    • pandaforce 1 hour ago
      These are all gripes you might have with Vulkan Video. Unlike with Vulkan Video, in Compute bounds checking is the norm: overreading a regular buffer will not result in a GPU hang or crash. If you use pointers it will, but if you use pointers, it's up to you to check whether overreads can happen.

      The bitstream reader in FFmpeg for Vulkan Compute codecs is ported from the C code, along with its bounds checking. The code that validates whether a block is corrupt or decodable is also taken from the C version. To date, I've never had a GPU hang while using the Compute codecs.

  • positron26 2 hours ago
    > Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.

    One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.

    Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.

    At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.

    • pandaforce 1 hour ago
      Yeah, Vulkan is shedding most of its abstractions. Buffers are no longer needed, just device addresses. Shaders don't need to be baked into a pipeline; you can use shader objects. Even images rarely provide any speedup over buffers, since the texel cache is no longer separate from the memory cache.

      GPUs these days have massive caches, often hundreds of megabytes, on top of an already absurd number of registers. A random read will often load a full cacheline into a register and keep it there, reusing it as needed between invocations.

    • mort96 1 hour ago
      These GPUs are still big SIMD devices at their core though, no?
      • pandaforce 1 hour ago
        Yes and no. No, in that these days GPUs are entirely scalar from the point of view of an invocation. Using vectors in shaders is pointless; they will be no faster than scalar variables (dual-issue on AMD GPUs being an exception).

        But yes, from the point of view that a collection of invocations all progressing in lockstep gets its arithmetic done by vector units. GPUs have just gotten really good at hiding what happens when paths diverge between invocations.

  • doctorpangloss 1 hour ago
    What is the use case? Okay, ultra-low-latency streaming; that is good. But if you are sending the frames over the network via some protocol like WebRTC, they will be touching the CPU anyway. Software encoding of 4K H.264 is realtime on a single thread on 65 W, decade-old CPUs, with low latency. The CPU encoders are much better quality and more flexible. So it's very difficult to justify the level of complexity needed for hardware video encoding. There is absolutely no need for it for TV streaming, for example. But people who have no need for it keep obsessing over it.

    IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.

    • chillfox 1 hour ago
      The article explains it. This is not for streaming over the web, but for editing professional grade video on consumer hardware.
      • doctorpangloss 1 hour ago
        DaVinci Resolve is the only commercial NLE with any kind of Vulkan support, and it is experimental.

        ProRes decodes faster than realtime single-threaded on a decade-old CPU, too.

        It doesn't make sense. It's much different with, say, a video game, where a texture is loaded once into VRAM and then, yes, all the work is done on the GPU. A video incurs CPU I/O every frame; you are still doing a ton of CPU work. I don't know why people are talking about power efficiency: in a pro editing context your CPU will be very, very busy with these I/O threads, including and especially in ffmpeg with hardware encoding/decoding, no less. It doesn't look anything like the video game workload this stack was designed for.

        • pandaforce 46 minutes ago
          The 6K ProRes streams that consumer cameras record are still too heavy for modern CPUs to decode in realtime. Not to mention the 12K ProRes that professional cameras output.
        • lostmsu 1 hour ago
          It reduces power consumption, so it should improve laptop battery life and help the environment a little.
    • pandaforce 1 hour ago
      The article explicitly mentions that mainstream codecs like H264 are not the target. This is for very high bitrate high resolution professional codecs.
    • jpc0 1 hour ago
      I'm not entirely sure that this is true.

      I haven't actually looked into this, but it may not be outside the realm of possibility. You are generating a frame on the GPU; if you can also encode it there (whether with NVENC or Vulkan doesn't matter), you can then DMA it to the NIC, using the CPU only to process the packet headers, assuming that can't also be handled by the GPU/NIC.

    • eptcyka 1 hour ago
      It will be more energy efficient. And the CPU is free to JIT half a gig of JavaScript in the meantime.
      • temp0826 1 hour ago
        It's hugely more efficient; if you're on a battery-powered device it could mean hours more playback time. It's pretty insane just how much better it is (I go through a bit of extra effort to make sure it's working for me; hardware decoding isn't included in some distros).
    • xattt 1 hour ago
      It’s a leftover mindset from the mid-2000s, when GPGPU became possible and additional performance was “unlocked” from otherwise under-utilized silicon.