Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment:
LGTM, but be careful: EP-only will fail if you divide the head dim by the TP size by default.
```python
# Account for TP: each KV head is dispatched to a different GPU, so the effective
# number of KV heads per GPU is simply divided by the TP size (number of GPUs)
if tp_size is not None and tp_size > 1:
```
Only if the attention k and v projections are targets of the TP plan, though.
What could the other targets be? Not familiar enough with the TP plan, tbh.
Ok, added a boolean `kv_is_tp = "layers.*.self_attn.k_proj" in config.tp_plan and "layers.*.self_attn.v_proj" in config.tp_plan` to condition this.
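For context, a minimal sketch of the resulting logic, assuming a Hugging Face config whose `tp_plan` maps module-name patterns to sharding strategies (the function name and variable names here are illustrative, not the PR's exact code):

```python
def effective_num_kv_heads(config, tp_size: int | None) -> int:
    """Illustrative sketch: shard the KV-head count only when the TP plan
    actually splits the k/v projections across GPUs."""
    kv_is_tp = (
        "layers.*.self_attn.k_proj" in config.tp_plan
        and "layers.*.self_attn.v_proj" in config.tp_plan
    )
    num_kv_heads = config.num_key_value_heads
    if kv_is_tp and tp_size is not None and tp_size > 1:
        # Each GPU holds an equal slice of the KV heads under TP.
        num_kv_heads //= tp_size
    return num_kv_heads
```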
```diff
  logit_processor: The [`ContinuousBatchingLogitsProcessorList`] object used to process the logits.
- input_queue: Queue for incoming requests
+ input_queue: Queue for incoming requests. Is None if this process is not a TP driver.
+ cancel_queue: Queue for cancellation request_ids. Is None if this process is not a TP driver.
```
Okay, will read the rest to see whether all of them are drivers or not.
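As a rough illustration of the driver notion (assumed semantics and names, not the PR's actual code), only the TP driver would own the queues while other ranks get `None`:

```python
# Hypothetical sketch: only the TP driver (here assumed to be rank 0) owns
# the request queues; non-driver ranks receive None for both.
import queue

import torch.distributed as dist

def make_request_queues() -> tuple[queue.Queue | None, queue.Queue | None]:
    is_driver = not dist.is_initialized() or dist.get_rank() == 0
    input_queue = queue.Queue() if is_driver else None
    cancel_queue = queue.Queue() if is_driver else None
    return input_queue, cancel_queue
```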
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45821&sha=b757f3
This PR adds support for TP in continuous batching. The major changes required to do this were:
- a `hash` which is salted depending on the process

It also adds a mechanism to the benchmark script to make sure the generation is coherent.
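As a hypothetical sketch of what such a coherence check could look like (the PR's actual hashing and salting scheme is not shown here), the benchmark could fingerprint the generated token ids and compare digests across runs:

```python
# Hypothetical sketch of a generation-coherence check: hash the generated
# token ids and compare digests across runs. The salt parameter is where a
# per-process value could be mixed in, as the description mentions.
import hashlib

def generation_fingerprint(token_ids: list[int], salt: str = "") -> str:
    payload = f"{salt}:{','.join(map(str, token_ids))}".encode()
    return hashlib.sha256(payload).hexdigest()
```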
Performance
No performance regression; TP is faster.
Tests
Added tests for TP; all tests pass.