Parallelization on RTX 4060 Ti cards. #1789

AntonThai2022 · 2024-06-17T11:25:56Z

I have four RTX 4060 Ti video cards connected to a single PCI Express bridge. These cards are known not to support NVIDIA direct P2P. I need to run a TensorRT-LLM engine on them. I built the engine with this command:

trtllm-build --checkpoint_dir /workspace/TensorRT-LLM/quantized-llama-3-70b-pp1-tp4-awq-w4a16-kvint8-gs64 --output_dir ./quantized-llama-3-70b --gemm_plugin auto

I am trying to run it with this command:

mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=40 --tokenizer_dir ./llama70b_hf/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/7129260dd854a80eb10ace5f61c20324b472b31c/ --engine_dir quantized-llama-3-70b --input_text "In Bash, how do I list all text files?"

I use a ready-made checkpoint.

When I run the engine, I get this error:

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.

Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 632, in <module>
    main(args)
  File "/workspace/TensorRT-LLM/TensorRT-LLM/examples/llama/../run.py", line 478, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 222, in from_dir
    executor = trtllm.Executor(engine_dir, trtllm.ModelType.DECODER_ONLY,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in error: peer access is not supported between these two devices

Evidently the cards cannot communicate directly over the PCI Express bus.

How can I change the build or launch settings so that the cards communicate through host RAM instead?
Or do I need to rework the language model?

Advait251206 · 2026-07-04T10:36:42Z

The error is expected because tensor parallelism (tp4) requires GPU-to-GPU communication, and TensorRT-LLM relies on CUDA peer-to-peer (P2P) access (typically over NVLink or PCIe P2P) for exchanging activations and synchronizing computation.

The RTX 4060 Ti does not support CUDA P2P, so when TensorRT-LLM attempts to enable peer access between the devices, it fails with:

CUDA runtime error: peer access is not supported between these two devices
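To confirm which device pairs are affected on a given machine, the peer-access capability can be queried directly. The following is a minimal sketch, not part of TensorRT-LLM: the helper names `peer_access_matrix`, `blocked_pairs`, and `report_with_torch` are hypothetical, and it assumes PyTorch with CUDA support is installed so that `torch.cuda.can_device_access_peer` can serve as the probe.

```python
from itertools import permutations


def peer_access_matrix(num_devices, can_access):
    """Map every ordered pair of distinct devices to a peer-access flag.

    `can_access(src, dst)` is any callable returning a truthy value when
    device `src` can directly access memory on device `dst`.
    """
    return {
        (src, dst): bool(can_access(src, dst))
        for src, dst in permutations(range(num_devices), 2)
    }


def blocked_pairs(matrix):
    """Return the sorted list of (src, dst) pairs without peer access."""
    return sorted(pair for pair, ok in matrix.items() if not ok)


def report_with_torch():
    """Probe the local GPUs. Assumes PyTorch with CUDA is installed."""
    import torch  # imported lazily so the helpers above work without it

    count = torch.cuda.device_count()
    matrix = peer_access_matrix(count, torch.cuda.can_device_access_peer)
    for src, dst in blocked_pairs(matrix):
        print(f"Device {src} cannot access device {dst} directly")
    return matrix
```

Calling `report_with_torch()` on the machine in question should list every pair for which direct access is unavailable, which can then be compared against the warnings in the log.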

Unfortunately, there isn't a build flag or runtime option to instruct TensorRT-LLM to transparently route these communications through host (CPU) memory instead. Tensor parallel execution assumes that the participating GPUs can communicate efficiently using CUDA IPC/P2P.

Your options are therefore:

1. Run on a single GPU (tp=1), provided the model fits after quantization.
2. Use GPUs that support CUDA P2P (for example, data-center GPUs or GeForce cards/topologies where PCIe P2P is available).
3. Use another distributed inference framework that explicitly supports host-mediated communication, although this generally comes with a significant performance penalty and is not how TensorRT-LLM's tensor parallel runtime is designed.
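Whether the single-GPU option is realistic can be estimated with some rough arithmetic. The sketch below is a back-of-the-envelope calculation, not a TensorRT-LLM API. It assumes about 70 billion parameters, 4-bit weights with one 16-bit scale per group of 64 (read off the `w4a16` and `gs64` parts of the checkpoint name), and the 16 GB variant of the RTX 4060 Ti. It ignores zero-points, the KV cache, activations, and runtime overhead, so the real footprint would be higher.

```python
GIB = 2**30


def weight_memory_gib(num_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB for a group-quantized model."""
    total_bits = num_params * weight_bits
    if group_size:
        # One scale value per group of `group_size` weights.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / GIB


def fits_on_one_gpu(weights_gib, vram_gib):
    """True only if the weights alone are smaller than the card's memory."""
    return weights_gib < vram_gib


# Assumptions for illustration, not values taken from the engine itself.
LLAMA3_70B_PARAMS = 70e9
VRAM_GIB = 16

total = weight_memory_gib(LLAMA3_70B_PARAMS, weight_bits=4, group_size=64)
print(f"weights: ~{total:.1f} GiB total, ~{total / 4:.1f} GiB per GPU at tp=4")
print("fits on one card:", fits_on_one_gpu(total, VRAM_GIB))
```

Under these assumptions the weights alone come to more than twice the memory of a single 16 GB card, so tp=1 would only be viable with a considerably smaller model.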

In your specific example, you built the engine from a checkpoint configured for tp4:

trtllm-build \
  --checkpoint_dir ... \
  --output_dir ... \
  --gemm_plugin auto

If that checkpoint itself was generated for tensor parallelism (tp=4), rebuilding with different runtime options alone will not eliminate the dependency on inter-GPU communication. You would need a checkpoint and engine built for the desired parallel configuration (e.g., tp=1 if targeting a single GPU).

So, with the current TensorRT-LLM implementation, there is no supported way to make a tp4 engine fall back to CPU-memory-based communication on GPUs that do not support CUDA peer access.

If this answer helped or pointed you in the right direction, I'd appreciate it if you could mark it as the accepted answer so it's easier for others with the same issue to find.

Also, if you found my contribution useful, I'd appreciate it if you could check out my GitHub profile, follow me, and star any repositories you find interesting.

GitHub: https://github.com/Advait251206
