Parallelization on RTX 4060 Ti cards. #1789
Replies: 1 comment
The error is expected because tensor parallelism requires the GPUs to access each other's memory directly. The RTX 4060 Ti does not support CUDA P2P, so when TensorRT-LLM attempts to enable peer access between the devices, it fails with:
peer access is not supported between these two devices
Unfortunately, there isn't a build flag or runtime option to instruct TensorRT-LLM to transparently route these communications through host (CPU) memory instead. Tensor parallel execution assumes that the participating GPUs can communicate efficiently using CUDA IPC/P2P.
In your specific example, you built the engine from a checkpoint configured for tensor parallelism (its directory name contains pp1-tp4):
trtllm-build \
--checkpoint_dir ... \
--output_dir ... \
--gemm_plugin auto
If that checkpoint itself was generated for tensor parallelism, the engine built from it inherits that configuration. So, with the current TensorRT-LLM implementation, there is no supported way to make a tensor-parallel engine run on GPUs that lack P2P support.
If this answer helped or pointed you in the right direction, I'd appreciate it if you could mark it as the accepted answer so it's easier for others with the same issue to find. Also, if you found my contribution useful, please consider checking out my GitHub profile, following me, and starring any repositories you find interesting. GitHub: https://github.com/Advait251206
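To see which parallelism a checkpoint was generated for before building from it, something like the following may help. This is only a sketch: it assumes the checkpoint directory contains a config.json with a "mapping" object holding "tp_size" and "pp_size", which is my understanding of the TensorRT-LLM checkpoint layout and may differ between versions. The function name read_parallel_mapping is made up for illustration.

```python
import json
from pathlib import Path


def read_parallel_mapping(checkpoint_dir):
    """Return (tp_size, pp_size) recorded in a checkpoint's config.json.

    Assumption: config.json has a "mapping" object with "tp_size" and
    "pp_size" keys. Missing keys are treated as 1 (no parallelism).
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    mapping = config.get("mapping", {})
    return int(mapping.get("tp_size", 1)), int(mapping.get("pp_size", 1))


# Example (the path is a placeholder):
# tp, pp = read_parallel_mapping("/path/to/checkpoint")
# print(f"tp_size={tp}, pp_size={pp}, ranks expected={tp * pp}")
```

If tp_size is greater than 1, the engine built from that checkpoint will expect the same number of tensor-parallel ranks at run time.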
I have four RTX 4060 Ti cards connected to a single PCI Express bridge. They are known not to support NVIDIA Direct P2P. I need to run a TensorRT-LLM engine, built with the library, on them. I built the engine with this command:
trtllm-build --checkpoint_dir /workspace/TensorRT-LLM/quantized-llama-3-70b-pp1-tp4-awq-w4a16-kvint8-gs64 --output_dir ./quantized-llama-3-70b --gemm_plugin auto
I am trying to run it with this command:
mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=40 --tokenizer_dir ./llama70b_hf/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/7129260dd854a80eb10ace5f61c20324b472b31c/ --engine_dir quantized-llama-3-70b --input_text "In Bash, how do I list all text files?"
I use a ready-made checkpoint.
When I run the engine, I get this error:
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 632, in <module>
    main(args)
  File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 478, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 222, in from_dir
    executor = trtllm.Executor(engine_dir, trtllm.ModelType.DECODER_ONLY,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices
Clearly, the cards cannot communicate with each other directly over the PCI Express bus.
How can I change the settings for building or launching the engine so that the cards communicate through host RAM instead?
Or do I need to rework the language model itself?
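For anyone who wants to confirm the P2P limitation on their own machine before launching, here is a rough sketch. It assumes PyTorch with CUDA support is installed and uses torch.cuda.can_device_access_peer. The helper names blocked_peer_pairs and _torch_can_access are made up for illustration, and the check is passed in as a parameter so the logic can be tried without a GPU.

```python
def _torch_can_access(src, dst):
    # Imported lazily so the helper below can be used without PyTorch
    # when a different check function is supplied.
    import torch
    return torch.cuda.can_device_access_peer(src, dst)


def blocked_peer_pairs(num_devices, can_access=_torch_can_access):
    """List (src, dst) device pairs for which peer access is unavailable."""
    return [
        (src, dst)
        for src in range(num_devices)
        for dst in range(num_devices)
        if src != dst and not can_access(src, dst)
    ]


# Example on a real machine (requires PyTorch and visible CUDA devices):
# import torch
# print(blocked_peer_pairs(torch.cuda.device_count()))
```

An empty list would mean every pair of GPUs reports peer access. On cards without P2P support, I would expect every pair to be listed.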