EGL backend internally remaps CUDA_VISIBLE_DEVICES, causing use of wrong physical GPU #3245

@zzzyishi

Intro

Hi!

I am a student working on embodied AI and robotics simulation. I use MuJoCo via LIBERO for large‑scale reinforcement learning in headless Docker containers with multiple NVIDIA GPUs.

My setup

  • MuJoCo 3.4.0, Python, x86_64, Ubuntu 22.04 Docker (--gpus all)
  • 7× RTX 3090, Driver 550.163.01, CUDA 12.4

What's happening? What did you expect?

When I restrict a process to a specific GPU using CUDA_VISIBLE_DEVICES and then create an
EGL context with MuJoCo (MUJOCO_GL=egl), the rendering load lands on a completely different
physical GPU. This happens for every GPU index I tried, not just a single one.

For example, if I set CUDA_VISIBLE_DEVICES=2, the rendering appears on GPU 0; if I set
CUDA_VISIBLE_DEVICES=3, it appears on GPU 6. The mapping is deterministic on a given
machine but differs across servers.

I expected CUDA_VISIBLE_DEVICES=2 to unambiguously select physical GPU 2, just as it does for CUDA computation. Instead, to actually use physical GPU X, I must set CUDA_VISIBLE_DEVICES to a different “Translated” value, which – after EGL’s internal remapping – ends up targeting the originally intended GPU. The mapping is shown below (measured on a 7×RTX 3090 node):

Original (desired physical GPU) → Translated (changed CUDA_VISIBLE_DEVICES)
0 → 2
1 → 1
2 → 0
3 → 5
4 → 6
5 → 4
6 → 3

Without this translation table, there is no reliable way to target a specific GPU for EGL
rendering.
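
As a stopgap, the measured table above can be wrapped in a small helper. This is only a sketch of the workaround, not a fix: the dictionary holds the values measured on this one 7×RTX 3090 node, and the helper name `egl_env_for_physical_gpu` is hypothetical (it is not part of MuJoCo or LIBERO).

```python
import os

# Measured on the 7x RTX 3090 node from the table above:
# desired physical GPU -> value to put in CUDA_VISIBLE_DEVICES.
# The mapping is machine-specific and must be re-measured on other servers.
PHYSICAL_TO_TRANSLATED = {0: 2, 1: 1, 2: 0, 3: 5, 4: 6, 5: 4, 6: 3}

# Inverse view: CUDA_VISIBLE_DEVICES value -> physical GPU that EGL renders on.
TRANSLATED_TO_PHYSICAL = {v: k for k, v in PHYSICAL_TO_TRANSLATED.items()}


def egl_env_for_physical_gpu(physical_gpu, base_env=None):
    """Return a copy of the environment that targets `physical_gpu` for EGL."""
    if physical_gpu not in PHYSICAL_TO_TRANSLATED:
        raise ValueError(f"no measured translation for GPU {physical_gpu}")
    env = dict(os.environ if base_env is None else base_env)
    env["MUJOCO_GL"] = "egl"
    env["CUDA_VISIBLE_DEVICES"] = str(PHYSICAL_TO_TRANSLATED[physical_gpu])
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when spawning an environment worker, so the parent process keeps its own GPU settings.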

Example: Inference workers use GPU 0 and 1, environment workers are assigned to GPU 2 and 3.
Actual GPU memory from nvidia-smi:

OSMesa mode:
Only inference workers on GPU 0,1
[Image: nvidia-smi screenshot]

EGL mode:
Environment workers incorrectly running on GPU 0,6
[Image: nvidia-smi screenshot]

Steps for reproduction

  1. On a multi-GPU machine, run the script below with CUDA_VISIBLE_DEVICES=2.
  2. Monitor nvidia-smi in another terminal.
  3. Observe that the GPU whose memory usage increases is not GPU 2 (on my machine it is GPU 0).
  4. Repeat with other values to build the translation table.
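
Steps 2 and 3 can be automated rather than read off by eye. The sketch below assumes `nvidia-smi` is on the PATH and that no other job changes GPU memory between the two snapshots; the function names are hypothetical helpers, not part of any library. It uses the `--query-gpu` CSV output of `nvidia-smi` and reports the GPU whose used memory grew the most.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def snapshot():
    """Query nvidia-smi once and return per-GPU used memory."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_memory_used(out.stdout)


def gpu_with_largest_increase(before, after):
    """Return the GPU index whose used memory grew the most."""
    return max(after, key=lambda i: after[i] - before.get(i, 0))
```

Take one `snapshot()` before starting the reproduction script and one while it sleeps, then pass both to `gpu_with_largest_increase` to fill in one row of the translation table.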

Minimal model for reproduction

The issue is in EGL initialization, so no MJCF model is needed.

Code required for reproduction

import ctypes
import os
import time

os.environ["MUJOCO_GL"] = "egl"
os.environ["MUJOCO_EGL_DEVICE_ID"] = "2"  # change this index as needed

from mujoco.egl import egl_ext as EGL
import OpenGL.EGL as EGL2

# Index into the EGL device list, kept in sync with MUJOCO_EGL_DEVICE_ID.
device_index = int(os.environ["MUJOCO_EGL_DEVICE_ID"])

devices = EGL.eglQueryDevicesEXT()
print(f"Devices: {len(devices)}")

# Open and initialize a display directly on the selected EGL device.
d = devices[device_index]
disp = EGL.eglGetPlatformDisplayEXT(EGL.EGL_PLATFORM_DEVICE_EXT, d, None)
EGL.eglInitialize(disp, None, None)
print("Vendor:", EGL2.eglQueryString(disp, EGL2.EGL_VENDOR))

# Request an 8-bit RGB pbuffer config that supports desktop OpenGL.
attrs = [
    EGL.EGL_SURFACE_TYPE, EGL.EGL_PBUFFER_BIT,
    EGL.EGL_RENDERABLE_TYPE, EGL.EGL_OPENGL_BIT,
    EGL.EGL_RED_SIZE, 8, EGL.EGL_GREEN_SIZE, 8, EGL.EGL_BLUE_SIZE, 8,
    EGL.EGL_NONE,
]
cfg = (EGL.EGLConfig * 1)()
n = ctypes.c_int()
EGL.eglChooseConfig(disp, attrs, cfg, 1, n)

# Create a context and a 256x256 offscreen surface, then make them current
# so that the driver allocates memory on the selected GPU.
ctx = EGL.eglCreateContext(disp, cfg[0], EGL.EGL_NO_CONTEXT, None)
surf = EGL.eglCreatePbufferSurface(
    disp, cfg[0], [EGL.EGL_WIDTH, 256, EGL.EGL_HEIGHT, 256, EGL.EGL_NONE])
EGL.eglMakeCurrent(disp, surf, surf, ctx)
print(f"Context bound to device {device_index}. Check nvidia-smi for 30s...")
time.sleep(30)
EGL.eglTerminate(disp)
print("Done.")
