
[CB] [Major] Add tensor parallelism #45821

Open
remi-or wants to merge 28 commits into main from cb-tp2

Conversation

@remi-or (Collaborator) commented May 7, 2026

This PR adds support for tensor parallelism (TP) in continuous batching. The major changes required were:

  • add inter-process communication for the request states
  • add per-TP-group seeding
  • add hints to prevent NCCL graph mixing
  • replace Python's built-in hash, which is salted per process, with a deterministic hash function
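The last point can be sketched as follows: a minimal, hypothetical stand-in using hashlib (the helper name and the digest truncation are illustrative, not the PR's actual implementation).

```python
import hashlib

def stable_hash(data: bytes) -> int:
    # Python's built-in hash() of str/bytes is salted per interpreter
    # start (PYTHONHASHSEED), so it differs between processes. SHA-256
    # is deterministic everywhere; truncate it to a 64-bit integer.
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
```

With such a helper, every TP rank computes the same hash for the same request, which built-in hash() does not guarantee across processes.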

It also adds a mechanism to the benchmark script to make sure the generation is coherent.
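The per-TP-group seeding mentioned above can be sketched like this. It is a hypothetical stand-in using Python's `random` module (the PR presumably seeds torch generators instead); the point is that all ranks in one TP group derive the same seed, so sampling agrees within the group and generation stays coherent.

```python
import random

def tp_group_rng(base_seed: int, rank: int, tp_size: int) -> random.Random:
    # Ranks in the same TP group (e.g. ranks 0 and 1 when tp_size=2)
    # get identical generators, so sampled tokens agree across the group;
    # different groups remain independently seeded.
    tp_group_id = rank // tp_size
    return random.Random(base_seed * 100_003 + tp_group_id)
```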

Performance

| Benchmark | TP1 main tok/s | TP1 tok/s | TP2 tok/s | Speedup | TP1 acc | TP2 acc |
|---|---:|---:|---:|---:|---:|---:|
| gsm8k_default | 2,454 | 2,454 | 3,657 | 1.49× | 0.822 | 0.819 |
| gsm8k_sampling | 1,940 | 1,942 | 2,616 | 1.35× | 0.792 | 0.775 |
| gsm8k_compile | 2,463 | 2,467 | 3,689 | 1.50× | 0.822 | 0.821 |
| gsm8k_no_fast_decode | 2,367 | 2,370 | 3,522 | 1.49× | 0.822 | 0.819 |
| gsm8k_bare_bones | 1,877 | 1,881 | 2,331 | 1.24× | 0.821 | 0.821 |
| ifeval_default | 7,890 | 7,898 | 15,135 | 1.92× | 0.442 | 0.455 |
| rollouts_1024 | 3,200 | 3,199 | 4,281 | 1.34× | | |
| rollouts_2048 | 3,048 | 3,049 | 4,194 | 1.38× | | |
| rollouts_4096 | 2,719 | 2,719 | 3,887 | 1.43× | | |
| rollouts_8192 | 2,209 | 2,211 | 3,345 | 1.51× | | |
| rollouts_16384 | 1,465 | 1,465 | 2,589 | 1.77× | | |
| few_blocks | 686 | 696 | 840 | 1.21× | | |
| multi_return_seq | 1,544 | 1,553 | 1,926 | 1.24× | | |

No performance regression at TP1, and TP2 is consistently faster.

Tests

Added tests for TP, and all tests pass.

@remi-or remi-or requested a review from ArthurZucker May 7, 2026 09:32
@remi-or remi-or self-assigned this May 7, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


LGTM. Be careful: EP-only will fail if you divide the head dim by the TP size by default.

Comment on lines 166 to 168
# Account for TP: each KV head is dispatched to a different GPU, so the effective number of KV heads per GPU is
# simply divided by the TP size (number of GPUs)
if tp_size is not None and tp_size > 1:
Collaborator:

Only if the attention k and v projections are targets of the TP plan, though.

Collaborator Author:

What could be the other targets? Not familiar enough with the TP plan tbh

Collaborator Author:

Ok, added a boolean `kv_is_tp = "layers.*.self_attn.k_proj" in config.tp_plan and "layers.*.self_attn.v_proj" in config.tp_plan` to condition this.
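For context, the guarded computation could look like this sketch (function name and signature are hypothetical; the actual code lives in the PR):

```python
def effective_num_kv_heads(num_kv_heads: int, tp_size, tp_plan: dict) -> int:
    # KV heads are sharded across GPUs only when k_proj and v_proj are
    # targets of the TP plan; otherwise every rank keeps the full set,
    # and dividing by tp_size would undercount the cache size.
    kv_is_tp = (
        "layers.*.self_attn.k_proj" in tp_plan
        and "layers.*.self_attn.v_proj" in tp_plan
    )
    if kv_is_tp and tp_size is not None and tp_size > 1:
        return num_kv_heads // tp_size
    return num_kv_heads
```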

logit_processor: The [`ContinuousBatchingLogitsProcessorList`] object used to process the logits.
input_queue: Queue for incoming requests. Is None if this process is not a TP driver.
cancel_queue: Queue for cancellation request_ids. Is None if this process is not a TP driver.
Collaborator:

Okay, will read the rest to see if all are drivers or not.

Collaborator Author:

?

Comment thread src/transformers/generation/continuous_batching/continuous_api.py Outdated
Comment thread src/transformers/generation/continuous_batching/utils.py Outdated
@github-actions (Contributor) commented

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45821&sha=b757f3



3 participants