Unlocking asynchronicity in continuous batching #3372

Merged
remi-or merged 11 commits into main from remi-or/continuous-async on May 14, 2026

Conversation

@remi-or (Collaborator) commented May 4, 2026

This PR adds the Unlocking asynchronicity in continuous batching blog post.

@remi-or force-pushed the remi-or/continuous-async branch from eda3def to 4f7d49e on May 4, 2026 07:45

@pcuenca (Member) left a comment

Very nice, did a first pass!

There's a lot of ground to cover and you do a good job at introducing the problem and diving progressively into the details. I got a bit lost during the carry-over discussion (it's easy conceptually but I lost track of the positions lol), but it's looking good!

Happy to take another look later if you want.

Comment thread _blog.yml Outdated
    - hub

- local: continuous_async
  date: April 28, 2026
Member

reminder to update before release

Comment thread continuous_async.md Outdated

The figure above shows how this unfolds. The CPU prepares the batch, then quickly enqueues all the GPU work: the H2D transfer, the forward pass, the D2H transfer, with `record` and `wait` calls inserted between each stage. After that, the CPU is free. The GPU takes over, executing each stream in order as its dependency event is set. Notice the green annotation on the right: once the D2H transfer completes, the CPU comes back and reads the results. This final synchronization is the only point where the CPU blocks in the whole step. To implement it, we record a third event on the D2H stream after the output transfer, then call `d2h_done_event.synchronize()` on the CPU side. `synchronize` blocks the CPU until the D2H stream reaches that marker.
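
The enqueue-then-synchronize pattern described above can be sketched without a GPU. The block below is only a CPU-side analogy, not the blog post's implementation: worker threads stand in for CUDA streams, `threading.Event` stands in for CUDA events, and the names `Stream`, `run_step`, and the doubling "forward pass" are hypothetical. In PyTorch, the analogous calls would be made on `torch.cuda.Stream` and `torch.cuda.Event` objects.

```python
import queue
import threading


class Stream:
    """Toy stand-in for a CUDA stream: a worker thread running tasks in FIFO order."""

    def __init__(self, name):
        self.name = name
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # shutdown sentinel
                return
            task()

    def enqueue(self, fn):
        """Queue work and return immediately, like an asynchronous launch."""
        self._tasks.put(fn)

    def wait_event(self, event):
        """Make later work on this stream wait until `event` is set."""
        self._tasks.put(event.wait)

    def record_event(self, event):
        """Set `event` once all earlier work on this stream has finished."""
        self._tasks.put(event.set)

    def close(self):
        self._tasks.put(None)
        self._thread.join()


def run_step(host_inputs):
    h2d, compute, d2h = Stream("h2d"), Stream("compute"), Stream("d2h")
    h2d_done, compute_done, d2h_done = (threading.Event() for _ in range(3))
    device, host_outputs = {}, {}

    # The CPU enqueues the whole step up front; none of these calls block.
    h2d.enqueue(lambda: device.update(inputs=list(host_inputs)))
    h2d.record_event(h2d_done)

    compute.wait_event(h2d_done)
    # Placeholder "forward pass": doubles every input value.
    compute.enqueue(lambda: device.update(outputs=[x * 2 for x in device["inputs"]]))
    compute.record_event(compute_done)

    d2h.wait_event(compute_done)
    d2h.enqueue(lambda: host_outputs.update(tokens=list(device["outputs"])))
    d2h.record_event(d2h_done)

    # The single blocking point of the step: wait for the D2H marker.
    d2h_done.wait()
    for stream in (h2d, compute, d2h):
        stream.close()
    return host_outputs["tokens"]
```

Calling `run_step([1, 2, 3])` returns the doubled inputs. Everything before `d2h_done.wait()` only enqueues work, so the caller blocks exactly once per step.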

This is the key difference from synchronous batching. Before, the CPU blocked after every operation. Now it blocks once, at the very end, only to read results it genuinely needs at that moment. Everything else runs in the background.
Member

Yes, but so far we are effectively using parallel computation to serialize tasks lol. Perhaps we could call this out a bit more explicitly so the reader is prepared to see the magic in a moment.

@remi-or (Collaborator, Author) commented May 11, 2026

Changed the last paragraph and transition to:


This is the key difference from synchronous batching: before, the CPU blocked after every operation. Now, it is free to do "something" while the GPU works.
We need to figure out what that "something" is, because right now nothing changed from a GPU-utilization standpoint.

Filling the vacuum

The window where the CPU is available sits between dispatching batch N and dispatching batch N+1 to the GPU. Its natural use would be to prepare batch N+1's inputs, so we can dispatch them to the GPU and have them ready once batch N's compute is over. Let us see how we can do this.
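
As a rough illustration of that overlap, here is a minimal sketch in which a single worker thread plays the role of the GPU work queue. The functions `prepare_batch`, `forward`, and `generate` are hypothetical placeholders, and the sketch assumes batch N+1 can be prepared without batch N's outputs, which a real decoding loop cannot take for granted.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_batch(step):
    """CPU-side work: build the inputs for one batch (hypothetical placeholder)."""
    return [step, step + 1]


def forward(batch):
    """Stand-in for the GPU forward pass (hypothetical placeholder)."""
    return sum(batch)


def generate(num_steps):
    results = []
    # One worker thread plays the role of the GPU work queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        next_batch = prepare_batch(0)
        for step in range(num_steps):
            in_flight = gpu.submit(forward, next_batch)  # dispatch batch N
            if step + 1 < num_steps:
                # While batch N runs, the CPU prepares batch N+1.
                next_batch = prepare_batch(step + 1)
            results.append(in_flight.result())  # block only to read batch N
    return results
```

The CPU blocks only in `in_flight.result()`, after it has already finished preparing the next batch.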


What do you think?

Contributor

This is the key difference from synchronous batching: before, the CPU blocked after every operation. Noe, it is free to do "something" while the GPU works.

I think "Noe" is a typo (just wanted to point it out)

Comment thread continuous_async.md Outdated

## Conclusion

We started with three questions:
Member

Perhaps we could do a higher-level recap rather than the lower-level problems we solved.

Collaborator Author

Changed the conclusion from the 2 paragraphs to:


We started with a synchronous workload where the CPU and GPU worked one after the other, leaving both underused. By moving from schedule-based dependencies to data-based dependencies and refining synchronization points, we managed to untangle the CPU and GPU workloads, making it possible for both devices to execute in parallel. Hence, we were able to saturate the GPU work queue and ensure it is always running. This finally resulted in a large increase in generation speed while maintaining the accuracy of the model. Pretty much a slam dunk.

@ariG23498 (Contributor) left a comment

I am done reviewing the first half of the blog post. I will be able to get the second half done by early tomorrow.

Initial verdict: I really like how the blog post is paced. While going really deep into technicalities, it never feels overwhelming. The ideas introduced are very advanced, but the illustrations make them reasonable and intuitive. This is a really good job! I got to intuitively understand events and streams for the very first time.

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
@ariG23498 (Contributor) left a comment

I really like the second half. I should also mention that there were moments where I had to read the paragraphs quite a number of times to understand them completely. This was due to my first exposure to a lot of new terms. I predict that this will also be the case for a lot of readers. To counter this, I think we should describe technical terms (buffers, slots, graphs, etc.) better, and also be consistent in our wording.

This is a really well made blog post!

@ariG23498 (Contributor) left a comment

Looks really good!

@ariG23498 (Contributor) left a comment

I think ideating around a Continuous Batching series would be a nice addition. What do you all think?

Comment thread continuous_batching.md

![Title card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/continuous_batching/banner.png)

*TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*
Contributor

Suggested change
Continuous Batching series:
1. Continuous batching (_this blog post_)
2. [Unlocking asynchronicity in continuous batching](https://huggingface.co/blog/continuous_async)
*TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*

@sergiopaniego (Member) left a comment

loved it!

@remi-or merged commit 23c18a1 into main on May 14, 2026
2 checks passed
@remi-or deleted the remi-or/continuous-async branch on May 14, 2026 14:14
4 participants