EGL backend internally remaps CUDA_VISIBLE_DEVICES, causing use of wrong physical GPU

### Intro

Hi!

I am a student working on embodied AI and robotics simulation. I use MuJoCo via LIBERO for large‑scale reinforcement learning in headless Docker containers with multiple NVIDIA GPUs.


### My setup

- MuJoCo 3.4.0, Python, x86_64, Ubuntu 22.04 Docker (--gpus all)
- 7× RTX 3090, Driver 550.163.01, CUDA 12.4

### What's happening? What did you expect?

When I restrict a process to a specific GPU using CUDA_VISIBLE_DEVICES and then create an
EGL context with MuJoCo (MUJOCO_GL=egl), the rendering load lands on a completely different
physical GPU. This happens for every GPU index I tried, not just a single one.

For example, if I set CUDA_VISIBLE_DEVICES=2, the rendering appears on GPU 0; if I set
CUDA_VISIBLE_DEVICES=3, it appears on GPU 6. The mapping is deterministic on a given
machine but differs across servers.

I expected CUDA_VISIBLE_DEVICES=2 to unambiguously select physical GPU 2, just as it does for CUDA computation. Instead, to render on physical GPU X, I must set CUDA_VISIBLE_DEVICES to a different "translated" value which, after EGL's internal remapping, ends up targeting the intended GPU. The mapping below was measured on a 7× RTX 3090 node:

| Original (desired physical GPU) | Translated (CUDA_VISIBLE_DEVICES to set) |
| --- | --- |
| 0 | 2 |
| 1 | 1 |
| 2 | 0 |
| 3 | 5 |
| 4 | 6 |
| 5 | 4 |
| 6 | 3 |

Without this translation table, there is no reliable way to target a specific GPU for EGL
rendering.
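
As a stopgap, the translation can be applied in code before a rendering worker is launched. The snippet is a minimal sketch that assumes the mapping measured on my node; `TRANSLATED`, `visible_devices_for` and `render_env` are hypothetical names for illustration, not part of MuJoCo or LIBERO, and the table would have to be re-measured on every other server.

```python
import os

# Measured on one 7x RTX 3090 node (see the table in this issue).
# Key: desired physical GPU (nvidia-smi index).
# Value: CUDA_VISIBLE_DEVICES value that makes EGL render on that GPU.
# Assumption: this mapping is machine-specific and must be re-measured elsewhere.
TRANSLATED = {0: 2, 1: 1, 2: 0, 3: 5, 4: 6, 5: 4, 6: 3}


def visible_devices_for(physical_gpu):
    """Return the CUDA_VISIBLE_DEVICES string to set for a desired physical GPU."""
    if physical_gpu not in TRANSLATED:
        raise ValueError(f"no translation measured for GPU {physical_gpu}")
    return str(TRANSLATED[physical_gpu])


def render_env(physical_gpu, base_env=None):
    """Build an environment dict for a subprocess that should render on physical_gpu."""
    env = dict(os.environ if base_env is None else base_env)
    env["MUJOCO_GL"] = "egl"
    env["CUDA_VISIBLE_DEVICES"] = visible_devices_for(physical_gpu)
    return env
```

The returned dict can be passed as `env=` to `subprocess.Popen` when starting an environment worker. This only works around the symptom; it does not explain why the indices differ.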

Example: inference workers use GPUs 0 and 1, and environment workers are assigned to GPUs 2 and 3.  
Actual GPU memory usage reported by nvidia-smi:

**OSMesa mode:**  
Only the inference workers appear, on GPUs 0 and 1.
<img width="1028" height="938" alt="Image" src="https://github.com/user-attachments/assets/1c2b45fa-8d31-4bfe-954f-133987d340b2" />

**EGL mode:**  
Environment workers incorrectly run on GPUs 0 and 6 instead of GPUs 2 and 3.
<img width="1023" height="940" alt="Image" src="https://github.com/user-attachments/assets/99d4eb6a-2cd6-430c-82f0-6273c82f4e10" />

### Steps for reproduction

1. On a multi-GPU machine, run the script below with CUDA_VISIBLE_DEVICES=2.
2. Monitor nvidia-smi in another terminal.
3. Observe that the GPU whose memory usage increases is not GPU 2 (on my machine it is GPU 0).
4. Repeat with other values to build the translation table.
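
Steps 2 and 3 can also be done programmatically instead of watching nvidia-smi by eye. The sketch below assumes `nvidia-smi` is on the PATH inside the container; the helper names (`parse_memory`, `snapshot`, `gpu_with_largest_increase`) are hypothetical and only illustrate the comparison.

```python
import subprocess

# Ask nvidia-smi for "index, memory.used" per GPU as plain CSV (values in MiB).
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse lines like '0, 1234' into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def snapshot():
    """Run nvidia-smi once and return the current per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


def gpu_with_largest_increase(before, after):
    """Return the GPU index whose used memory grew the most between snapshots."""
    return max(after, key=lambda gpu: after[gpu] - before.get(gpu, 0))
```

Take one `snapshot()` before starting the reproduction script and a second one while it sleeps, then compare them with `gpu_with_largest_increase`. The result is only meaningful if no other process allocates GPU memory in between.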


### Minimal model for reproduction

The issue is in EGL initialization, so no MJCF model is needed.

### Code required for reproduction

```python
import os
import time

# Index into the EGL device list; change this as needed.
DEVICE_ID = 2

# Set the environment before importing mujoco.egl.
os.environ["MUJOCO_GL"] = "egl"
os.environ["MUJOCO_EGL_DEVICE_ID"] = str(DEVICE_ID)

from mujoco.egl import egl_ext as EGL
import OpenGL.EGL as EGL2

# Enumerate the EGL devices and open a display on the selected one.
devices = EGL.eglQueryDevicesEXT()
print(f"Devices: {len(devices)}")

d = devices[DEVICE_ID]
disp = EGL.eglGetPlatformDisplayEXT(EGL.EGL_PLATFORM_DEVICE_EXT, d, None)
EGL.eglInitialize(disp, None, None)
print("Vendor:", EGL2.eglQueryString(disp, EGL2.EGL_VENDOR))

# Create a small pbuffer surface and make a context current on it.
attrs = [EGL.EGL_SURFACE_TYPE, EGL.EGL_PBUFFER_BIT,
         EGL.EGL_RENDERABLE_TYPE, EGL.EGL_OPENGL_BIT,
         EGL.EGL_RED_SIZE, 8, EGL.EGL_GREEN_SIZE, 8, EGL.EGL_BLUE_SIZE, 8, EGL.EGL_NONE]
cfg = (EGL.EGLConfig * 1)()
n = EGL.c_int()
EGL.eglChooseConfig(disp, attrs, cfg, 1, n)
ctx = EGL.eglCreateContext(disp, cfg[0], EGL.EGL_NO_CONTEXT, None)
surf = EGL.eglCreatePbufferSurface(disp, cfg[0], [EGL.EGL_WIDTH, 256, EGL.EGL_HEIGHT, 256, EGL.EGL_NONE])
EGL.eglMakeCurrent(disp, surf, surf, ctx)

# Keep the context alive long enough to inspect GPU memory usage.
print(f"Context bound to device {DEVICE_ID}. Check nvidia-smi for 30s...")
time.sleep(30)
EGL.eglTerminate(disp)
print("Done.")
```

### Confirmations

- [x] I searched the [latest documentation](https://mujoco.readthedocs.io/en/latest/overview.html) thoroughly before posting.
- [x] I searched previous [Issues](https://github.com/google-deepmind/mujoco/issues) and [Discussions](https://github.com/google-deepmind/mujoco/discussions), I am certain this has not been raised before.
