Unlocking asynchronicity in continuous batching #3372
pcuenca
left a comment
Very nice, did a first pass!
There's a lot of ground to cover and you do a good job at introducing the problem and diving progressively into the details. I got a bit lost during the carry-over discussion (it's easy conceptually but I lost track of the positions lol), but it's looking good!
Happy to take another look later if you want.
- hub
- local: continuous_async
date: April 28, 2026
reminder to update before release
The figure above shows how this unfolds. The CPU prepares the batch, then quickly enqueues all the GPU work: the H2D transfer, the forward pass, the D2H transfer, with `record` and `wait` calls inserted between each stage. After that, the CPU is free. The GPU takes over, executing each stream in order as its dependency event is set. Notice the green annotation on the right: once the D2H transfer completes, the CPU comes back and reads the results. This final synchronization is the only point where the CPU blocks in the whole step. To implement it, we record a third event on the D2H stream after the output transfer, then call `d2h_done_event.synchronize()` on the CPU side. `synchronize` blocks the CPU until the D2H stream reaches that marker.
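To make the ordering described above concrete, here is a minimal CPU-only analogy in plain Python. It is a sketch, not the post's implementation: threads stand in for CUDA streams and `threading.Event` stands in for CUDA events, with `set()` playing the role of `record` and `wait()` playing the role of both `wait` and `synchronize`. Only `d2h_done_event` is named in the text; `h2d_done_event`, `compute_done_event`, `stage` and `run_step` are hypothetical names chosen for illustration.

```python
import threading


def run_step():
    """Simulate one step: three 'streams' chained by events, one CPU sync at the end."""
    log = []
    log_lock = threading.Lock()

    # One event per stage boundary (hypothetical names, except d2h_done_event).
    h2d_done_event = threading.Event()
    compute_done_event = threading.Event()
    d2h_done_event = threading.Event()

    def stage(name, wait_for, done):
        if wait_for is not None:
            wait_for.wait()      # analogous to making a stream wait on an event
        with log_lock:
            log.append(name)     # stand-in for the actual transfer or compute
        done.set()               # analogous to recording an event on this stream

    # The CPU enqueues everything up front. The threads are started in reverse
    # order on purpose: execution order comes from the events, not launch order.
    workers = [
        threading.Thread(target=stage, args=("d2h", compute_done_event, d2h_done_event)),
        threading.Thread(target=stage, args=("forward", h2d_done_event, compute_done_event)),
        threading.Thread(target=stage, args=("h2d", None, h2d_done_event)),
    ]
    for worker in workers:
        worker.start()

    # The only point where the CPU blocks in the whole step.
    d2h_done_event.wait()

    for worker in workers:
        worker.join()
    return log


order = run_step()
print(order)
```

Whatever the launch order, the stages run as H2D, forward, D2H, because each one waits on the event recorded by the previous one, and the caller blocks only on the last event.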
This is the key difference from synchronous batching. Before, the CPU blocked after every operation. Now it blocks once, at the very end, only to read results it genuinely needs at that moment. Everything else runs in the background.
Yes, but so far we are effectively using parallel computation to serialize tasks lol. Perhaps we could call this out a bit more explicitly so the reader is prepared to see the magic in a moment.
Changed the last paragraph and transition to:
This is the key difference from synchronous batching: before, the CPU blocked after every operation. Now, it is free to do "something" while the GPU works.
We need to figure out what that "something" is, because right now nothing changed from a GPU-utilization standpoint.
Filling the vacuum
The window where the CPU is available sits between dispatching batch N and dispatching batch N+1 to the GPU. Its natural use would be to prepare batch N+1's inputs, so we can dispatch them to the GPU and have them ready once batch N's compute is over. Let us see how we can do this.
What do you think?
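The overlap proposed in the "Filling the vacuum" paragraph can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the post: a single-worker `ThreadPoolExecutor` stands in for the in-order GPU queue, `prepare_batch` and `gpu_step` are hypothetical placeholders, and the sketch deliberately ignores any dependency of batch N+1's inputs on batch N's outputs.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_batch(n):
    """CPU work: build the inputs of batch n (hypothetical placeholder)."""
    return list(range(n, n + 3))


def gpu_step(inputs):
    """Stand-in for the H2D transfer, forward pass and D2H transfer."""
    return sum(inputs)


def run_pipelined(num_batches):
    results, trace = [], []
    # One worker models a single in-order device queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        inputs = prepare_batch(0)
        trace.append("prepare 0")
        for n in range(num_batches):
            in_flight = gpu.submit(gpu_step, inputs)  # dispatch batch n, returns immediately
            trace.append(f"dispatch {n}")
            if n + 1 < num_batches:
                # The CPU fills its free window by preparing batch n+1.
                inputs = prepare_batch(n + 1)
                trace.append(f"prepare {n + 1}")
            results.append(in_flight.result())        # single blocking point per step
            trace.append(f"read {n}")
    return results, trace


results, trace = run_pipelined(3)
print(results)
print(trace)
```

In the trace, each `prepare N+1` entry sits between `dispatch N` and `read N`, which is exactly the window the paragraph describes.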
This is the key difference from synchronous batching: before, the CPU blocked after every operation. Noe, it is free to do "something" while the GPU works.
I think "Noe" is a typo (just wanted to point it out)
## Conclusion
We started with three questions:
Perhaps we could do a higher-level recap rather than the lower-level problems we solved.
Changed the conclusion from the 2 paragraphs to:
We started with a synchronous workload where the CPU and GPU worked one after the other, leaving both underused. By moving from schedule-based dependencies to data-based dependencies and refining synchronization points, we managed to untangle the CPU and GPU workloads, so that both devices can execute in parallel. Hence, we were able to saturate the GPU work queue and ensure it is always running. This finally resulted in a large increase in generation speed while maintaining the accuracy of the model. Pretty much a slam dunk.
ariG23498
left a comment
I am done reviewing the first half of the blog post. I will be able to get the second half done by early tomorrow.
Initial verdict: I really like how the blog post is paced: while going really deep into technicalities, it never feels overwhelming. The ideas introduced are very advanced but the illustrations make them reasonable and intuitive. This is a really good job! I got to intuitively understand events and streams for the very first time.
Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
ariG23498
left a comment
I really like the second half. I should also mention that there were moments where I had to read the paragraphs quite a number of times to understand them completely. This was due to my first-time exposure to a lot of new terms. I predict that this will also be the case for a lot of readers. To counter this, I think we should describe technical terms (buffers, slots, graphs, etc.) better, and also be consistent with our wording.
This is a really well made blog post!
ariG23498
left a comment
I think ideating around Continuous Batching Series would be a nice addition. What do you all think?

*TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*
Continuous Batching series:
1. Continuous batching (_this blog post_)
2. [Unlocking asynchronicity in continuous batching](https://huggingface.co/blog/continuous_async)
*TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
This PR adds the "Unlocking asynchronicity in continuous batching" blog post.