Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment:
LGTM, but be careful: EP-only will fail if you divide the head dim by the TP size by default.
```python
# Account for TP: each KV head is dispatched to a different GPU, so the effective
# number of KV heads per GPU is simply divided by the TP size (number of GPUs)
if tp_size is not None and tp_size > 1:
```
Only if the attention k and v projections are targets of the TP plan, though.
What could the other targets be? Not familiar enough with the TP plan, tbh.
Ok, added a boolean `kv_is_tp = "layers.*.self_attn.k_proj" in config.tp_plan and "layers.*.self_attn.v_proj" in config.tp_plan` to condition this.
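For context, a minimal sketch of the resulting logic, assuming a Hugging Face config whose `tp_plan` maps module-name patterns to sharding strategies (the function name and variable names here are illustrative, not the PR's exact code):

```python
def effective_num_kv_heads(config, tp_size: int | None) -> int:
    """Illustrative sketch: shard the KV-head count only when the TP plan
    actually splits the k/v projections across GPUs."""
    kv_is_tp = (
        "layers.*.self_attn.k_proj" in config.tp_plan
        and "layers.*.self_attn.v_proj" in config.tp_plan
    )
    num_kv_heads = config.num_key_value_heads
    if kv_is_tp and tp_size is not None and tp_size > 1:
        # Each GPU holds an equal slice of the KV heads under TP.
        num_kv_heads //= tp_size
    return num_kv_heads
```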
```diff
  logit_processor: The [`ContinuousBatchingLogitsProcessorList`] object used to process the logits.
- input_queue: Queue for incoming requests
+ input_queue: Queue for incoming requests. Is None if this process is not a TP driver.
+ cancel_queue: Queue for cancellation request_ids. Is None if this process is not a TP driver.
```
Okay, will read the rest to see whether all of them are drivers or not.
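As a rough illustration of the driver notion (assumed semantics and names, not the PR's actual code), only the TP driver would own the queues while other ranks get `None`:

```python
# Hypothetical sketch: only the TP driver (here assumed to be rank 0) owns
# the request queues; non-driver ranks receive None for both.
import queue

import torch.distributed as dist

def make_request_queues() -> tuple[queue.Queue | None, queue.Queue | None]:
    is_driver = not dist.is_initialized() or dist.get_rank() == 0
    input_queue = queue.Queue() if is_driver else None
    cancel_queue = queue.Queue() if is_driver else None
    return input_queue, cancel_queue
```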
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45821&sha=b757f3
This PR adds support for TP in continuous batching. The major changes required to do this were:
- a `hash` which is salted depending on the process

It also adds a mechanism to the benchmark script to make sure the generation is coherent.
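As a hypothetical sketch of what such a coherence check could look like (the PR's actual hashing and salting scheme is not shown here), the benchmark could fingerprint the generated token ids and compare digests across runs:

```python
# Hypothetical sketch of a generation-coherence check: hash the generated
# token ids and compare digests across runs. The salt parameter is where a
# per-process value could be mixed in, as the description mentions.
import hashlib

def generation_fingerprint(token_ids: list[int], salt: str = "") -> str:
    payload = f"{salt}:{','.join(map(str, token_ids))}".encode()
    return hashlib.sha256(payload).hexdigest()
```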
Performance
No performance regression; TP is faster.
Tests
Added tests for TP; all tests pass.