Unlocking asynchronicity in continuous batching by remi-or · Pull Request #3372 · huggingface/blog

remi-or · 2026-05-04T07:44:10Z

This PR adds the Unlocking asynchronicity in continuous batching blog post.

pcuenca

Very nice, did a first pass!

There's a lot of ground to cover and you do a good job at introducing the problem and diving progressively into the details. I got a bit lost during the carry-over discussion (it's easy conceptually but I lost track of the positions lol), but it's looking good!

Happy to take another look later if you want.

pcuenca · 2026-05-04T14:43:22Z

    - hub
+
+- local: continuous_async
+  date: April 28, 2026


reminder to update before release

pcuenca · 2026-05-04T15:40:20Z

+
+The figure above shows how this unfolds. The CPU prepares the batch, then quickly enqueues all the GPU work: the H2D transfer, the forward pass, the D2H transfer, with `record` and `wait` calls inserted between each stage. After that, the CPU is free. The GPU takes over, executing each stream in order as its dependency event is set. Notice the green annotation on the right: once the D2H transfer completes, the CPU comes back and reads the results. This final synchronization is the only point where the CPU blocks in the whole step. To implement it, we record a third event on the D2H stream after the output transfer, then call `d2h_done_event.synchronize()` on the CPU side. `synchronize` blocks the CPU until the D2H stream reaches that marker.
+
+This is the key difference from synchronous batching. Before, the CPU blocked after every operation. Now it blocks once, at the very end, only to read results it genuinely needs at that moment. Everything else runs in the background.
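
The event chain described in the quoted paragraph can be mimicked on the CPU alone. The sketch below is an analogy, not the post's implementation: `FakeEvent` and `run_step` are hypothetical names, Python threads stand in for the three streams, and `threading.Event` stands in for a CUDA event. With PyTorch, the corresponding calls would presumably be `torch.cuda.Event.record`, `torch.cuda.Stream.wait_event` and `torch.cuda.Event.synchronize`.

```python
import threading
import time


class FakeEvent:
    """Hypothetical CPU-side stand-in for a CUDA event."""

    def __init__(self):
        self._flag = threading.Event()

    def record(self):
        # Mark that the producing "stream" has reached this point.
        self._flag.set()

    def wait(self):
        # Called by a downstream "stream" before it starts its own work.
        self._flag.wait()

    def synchronize(self):
        # Called by the CPU: blocks until the event has been recorded.
        self._flag.wait()


def run_step(batch):
    log = []
    device, host_out = {}, {}
    h2d_done, compute_done, d2h_done = FakeEvent(), FakeEvent(), FakeEvent()

    def h2d():
        time.sleep(0.01)  # pretend the copy takes a while
        device["inputs"] = list(batch)
        log.append("h2d")
        h2d_done.record()

    def forward():
        h2d_done.wait()  # the forward pass depends on the inputs being on device
        device["outputs"] = [x * 2 for x in device["inputs"]]
        log.append("forward")
        compute_done.record()

    def d2h():
        compute_done.wait()  # the output copy depends on the forward pass
        host_out["tokens"] = list(device["outputs"])
        log.append("d2h")
        d2h_done.record()

    # The CPU "enqueues" all three stages without waiting for any of them.
    workers = [threading.Thread(target=f) for f in (h2d, forward, d2h)]
    for w in workers:
        w.start()

    # The CPU is free here; this is the window the post goes on to fill.

    d2h_done.synchronize()  # the only blocking point in the step
    for w in workers:
        w.join()
    return host_out["tokens"], log
```

The stages still run one after the other, which is the reviewer's point below: the gain so far is only that the CPU blocks once instead of after every operation.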


Yes, but so far we are effectively using parallel computation to serialize tasks lol. Perhaps we could call this out a bit more explicitly so the reader is prepared to see the magic in a moment.

Changed the last paragraph and transition to:

This is the key difference from synchronous batching: before, the CPU blocked after every operation. Now, it is free to do "something" while the GPU works.
We need to figure out what that "something" is, because right now nothing changed from a GPU-utilization standpoint.

Filling the vacuum

The window where the CPU is available sits between dispatching batch N and dispatching batch N+1 to the GPU. Its natural use would be to prepare batch N+1's inputs, so we can dispatch them to the GPU and have them ready once batch N's compute is over. Let us see how we can do this.

What do you think?

This is the key difference from synchronous batching: before, the CPU blocked after every operation. Noe, it is free to do "something" while the GPU works.

I think "Noe" is a typo (just wanted to point it out)
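
The overlap proposed in the reworked "Filling the vacuum" paragraph can be sketched without a GPU. This is a minimal illustration under stated assumptions: `prepare_inputs`, `gpu_forward` and `generate` are hypothetical names, a single-worker thread pool stands in for the in-order GPU work queue, and the sketch sidesteps any dependency of batch N+1 on batch N's outputs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def prepare_inputs(batch_id):
    """Hypothetical CPU-side batch preparation."""
    time.sleep(0.01)
    return {"batch_id": batch_id, "tokens": [batch_id, batch_id + 1]}


def gpu_forward(inputs):
    """Hypothetical stand-in for the H2D copy, forward pass and D2H copy."""
    time.sleep(0.02)
    return sum(inputs["tokens"])


def generate(num_batches):
    results = []
    # One worker behaves like a single in-order device queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        inputs = prepare_inputs(0)
        for n in range(num_batches):
            # Dispatch batch N; submit() returns immediately.
            in_flight = gpu.submit(gpu_forward, inputs)
            if n + 1 < num_batches:
                # The CPU prepares batch N+1 while batch N is running.
                inputs = prepare_inputs(n + 1)
            # Block once per step, only to read the results of batch N.
            results.append(in_flight.result())
    return results
```

For example, `generate(3)` prepares batches 1 and 2 while batches 0 and 1 are still in flight.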

pcuenca · 2026-05-04T15:56:44Z

+
+## Conclusion
+
+We started with three questions:


Perhaps we could do a higher-level recap rather than the lower-level problems we solved.

Changed the conclusion from the 2 paragraphs to:

We started with a synchronous workload where the CPU and GPU worked one after the other, leaving both underused. By moving from schedule-based dependencies to data-based dependencies and refining synchronization points, we managed to disentangle the CPU and GPU workloads, making it possible for both devices to execute in parallel. Hence, we were able to saturate the GPU work queue and ensure it is always running. This finally resulted in a large increase in generation speed while maintaining the accuracy of the model. Pretty much a slam dunk.

ariG23498

I am done reviewing the first half of the blog post. I will be able to get the second half done by early tomorrow.

Initial verdict: I really like how the blog post is paced. While going really deep into technicalities, it never feels overwhelming. The ideas introduced are very advanced, but the illustrations make them reasonable and intuitive. This is a really good job! I got to intuitively understand events and streams for the very first time.

Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

ariG23498

I really like the second half. I should also mention that there were moments where I had to read the paragraphs quite a number of times to understand them completely. This was due to my first-time exposure to a lot of new terms. I predict that this will also be the case for a lot of readers. To counter this, I think we should describe technical terms (buffers, slots, graphs, etc.) better, and also be consistent in our wording.

This is a really well made blog post!

ariG23498

Looks really good!

ariG23498

I think ideating around Continuous Batching Series would be a nice addition. What do you all think?

ariG23498 · 2026-05-11T09:30:38Z


 ![Title card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/continuous_batching/banner.png)

 *TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*


Suggested change

Continuous Batching series:

1. Continuous batching (_this blog post_)

2. [Unlocking asynchronicity in continuous batching](https://huggingface.co/blog/continuous_async)

*TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.*

sergiopaniego

loved it!

Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

First draft, every component in

4f7d49e

remi-or force-pushed the remi-or/continuous-async branch from eda3def to 4f7d49e Compare May 4, 2026 07:45

pcuenca reviewed May 4, 2026

ariG23498 reviewed May 6, 2026

Apply suggestions from code review

bef6424

Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

ariG23498 approved these changes May 7, 2026

remi-or added 3 commits May 11, 2026 05:00

Review compliance 1/n

c7e0724

Review compliance (rework conclusion) (2/n)

78ea62c

Add link between articles (review, 3/n)

acab8b1

ariG23498 approved these changes May 11, 2026

ariG23498 reviewed May 11, 2026

remi-or added 3 commits May 12, 2026 06:51

Typos and note

0130d1c

Added the gist

90e59ac

Remodel the intro

88b2863

sergiopaniego approved these changes May 13, 2026

pcuenca and others added 3 commits May 13, 2026 13:10

Merge branch 'main' into remi-or/continuous-async

e196675

Update continuous_async.md

c9fb86c

Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

Co-authors, acknoledgments, two nits

5acbfba

remi-or merged commit 23c18a1 into main May 14, 2026
2 checks passed

remi-or deleted the remi-or/continuous-async branch May 14, 2026 14:14

