ci(dgx): proven GB10 unified-memory safety harness (RMM cap + watchdog + preflight) by lmeyerov · Pull Request #1671 · graphistry/pygraphistry

lmeyerov · 2026-07-01T18:50:30Z

Why

We OOM-thrashed the shared dgx-spark box for ~9.5 h with a 1.8B-edge cudf/cugraph load. Root cause: the GB10 is unified memory — GPU memory is the 119 GB system RAM (16 GB swap). A cudf/cugraph over-allocation consumes host RAM and OOM-kills the OS's own services; the container then hangs and the box wedges.

What (proven on dgx, small safe scale, host flat throughout)

docker --memory is TRANSPARENT to cudf/unified allocations (reached 8 GB under a 4 GB cap) → useless as a guardrail.
RMM LimitingResourceAdaptor caps both cudf AND cugraph cleanly — the exact crash call (compute_cugraph('pagerank')) hit the limit and raised a caught MemoryError, host untouched. This is the real containment.
A host watchdog kills a runaway host-side (pandas/numpy) allocation once available RAM drops to a floor (exit 137, host recovers), which covers the path RMM doesn't.
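A minimal sketch of the RMM cap described above, in the style of a `sitecustomize.py` hook. This is not the PR's actual file. The helper names (`limit_bytes_from_env`, `apply_rmm_limit`, `install`) are hypothetical, and reading `GFQL_RMM_LIMIT_GB` as GiB is an assumption. The only RMM calls used are `rmm.mr.get_current_device_resource`, `rmm.mr.LimitingResourceAdaptor`, and `rmm.mr.set_current_device_resource`.

```python
import math
import os

GIB = 1024 ** 3  # assumption: the env var is read as GiB, not decimal GB


def limit_bytes_from_env(environ=os.environ, var="GFQL_RMM_LIMIT_GB"):
    """Return the configured cap in bytes, or None when unset or invalid."""
    raw = environ.get(var, "").strip()
    if not raw:
        return None
    try:
        gb = float(raw)
    except ValueError:
        return None
    if not math.isfinite(gb) or gb <= 0:
        return None
    return int(gb * GIB)


def apply_rmm_limit(limit_bytes):
    """Wrap the current RMM device resource so allocations past the cap fail."""
    import rmm  # lazy import so CPU-only interpreters still start

    upstream = rmm.mr.get_current_device_resource()
    limited = rmm.mr.LimitingResourceAdaptor(upstream, limit_bytes)
    rmm.mr.set_current_device_resource(limited)
    return limited


def install(environ=os.environ):
    """Apply the cap if configured; return True only when it was installed."""
    limit = limit_bytes_from_env(environ)
    if limit is None:
        return False
    try:
        apply_rmm_limit(limit)
    except ImportError:
        return False  # no RMM in this interpreter, nothing to cap
    return True
```

Calling `install()` from a `sitecustomize.py` on the container's Python path would apply the cap before any workload code runs, which is what makes the approach non-invasive. An over-limit cudf or cugraph allocation should then surface as a `MemoryError` in the workload rather than as host memory pressure.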

Files (`benchmarks/dgx/`)

sitecustomize.py — auto-applies GFQL_RMM_LIMIT_GB to any Python in the container (non-invasive, no workload edits).
preflight.py + test_preflight.py — peak_gb()/is_safe(); the test guards that friendster-1.8B is REFUSED and the 80M handoff run allowed.
safe_run.sh — wraps docker run with preflight-refuse + RMM inject + watchdog force-kill + hard timeout. All dgx GPU/big runs go through it.
README.md — usage + rationale.
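The preflight check can be pictured as a size estimate compared against a budget. The sketch below borrows the names `peak_gb()` and `is_safe()` from the file list, but their signatures and every number in it are assumptions for illustration: 16 bytes per edge (two int64 columns), a 4x working-set multiplier, and a 60 GB budget. The real `preflight.py` may estimate differently.

```python
GB = 1e9  # decimal gigabytes; the real harness's unit is not stated


def peak_gb(n_edges, bytes_per_edge=16, overhead=4.0):
    """Rough peak-memory estimate for an edge list (all parameters assumed)."""
    return n_edges * bytes_per_edge * overhead / GB


def is_safe(n_edges, budget_gb=60.0, **estimate_kwargs):
    """Allow the run only if the estimated peak fits within the budget."""
    return peak_gb(n_edges, **estimate_kwargs) <= budget_gb


if __name__ == "__main__":
    for name, n_edges in [("friendster-1.8B", 1_800_000_000), ("80M handoff", 80_000_000)]:
        verdict = "allowed" if is_safe(n_edges) else "REFUSED"
        print(f"{name}: ~{peak_gb(n_edges):.1f} GB estimated -> {verdict}")
```

Under these assumed parameters the 1.8B-edge load is refused and the 80M-edge run is allowed, matching the behavior the test is described as guarding.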

Standalone infra off master; benefits every branch that runs dgx benchmarks.

🤖 Generated with Claude Code

…g + preflight)

We OOM-thrashed the shared dgx-spark box for ~9.5h with a 1.8B-edge cudf/cugraph load: on the GB10, GPU memory IS system RAM (unified, 119GB, 16GB swap), so a cudf/cugraph over-allocation consumes host RAM and OOM-kills the OS.

Proven on dgx (small safe scale, host flat throughout):
- docker --memory is TRANSPARENT to cudf/unified allocs (reached 8GB under a 4GB cap) -> useless as a cap.
- RMM LimitingResourceAdaptor caps BOTH cudf AND cugraph cleanly (caught MemoryError, host untouched) -> the real containment.
- a host watchdog kills a runaway host (pandas/numpy) alloc at a RAM floor.

benchmarks/dgx/:
- sitecustomize.py: auto-applies GFQL_RMM_LIMIT_GB to any Python in the container (non-invasive; no workload edits).
- preflight.py + test_preflight.py: peak_gb()/is_safe() refuse over-budget runs; test guards that friendster-1.8B is REFUSED and the 80M handoff run allowed.
- safe_run.sh: wraps docker run with preflight-refuse + RMM inject + watchdog force-kill + hard timeout. ALL dgx GPU/big runs must go through it.
- README.md: usage + rationale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
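The watchdog and hard timeout from this commit message can be sketched as a polling loop around the child process. This is a Python illustration of the idea, not the contents of `safe_run.sh`. The function names, the polling interval, and the use of `MemAvailable` from `/proc/meminfo` as the RAM signal are all assumptions.

```python
import signal
import subprocess
import time


def mem_available_gb(meminfo_path="/proc/meminfo"):
    """Read MemAvailable (reported in kB on Linux) and return it in GiB."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in " + meminfo_path)


def run_with_watchdog(cmd, floor_gb, timeout_s, poll_s=0.5,
                      read_available=mem_available_gb):
    """Run cmd; SIGKILL it if available RAM drops below floor_gb or time runs out."""
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while proc.poll() is None:
        if read_available() < floor_gb or time.monotonic() > deadline:
            proc.kill()  # SIGKILL: the child gets no chance to ignore it
            proc.wait()
            break
        time.sleep(poll_s)
    # subprocess reports a signal death as a negative code (-9 for SIGKILL);
    # a shell or docker reports the same death as 128 + 9 = 137.
    return proc.returncode
```

Passing `read_available` as a parameter keeps the loop testable without actually exhausting memory, and the 137 exit code mentioned in the PR description corresponds to the SIGKILL path here.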

The LOCAL box is a ~31GB workstation, not dgx's 119GB, so a runaway OOM logs the user out (happened 2026-07-01 with a 5M/20M local bench). local_run.sh caps address space (ulimit -v, default 8GB) so a runaway dies with a clean MemoryError instead of taking down the desktop session. Tested: 2GB cap + 3GB alloc -> MemoryError, desktop survived. Benchmarks still go to dgx via safe_run.sh; this guards the tiny local work that remains.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
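The `ulimit -v` cap corresponds to the `RLIMIT_AS` resource limit, so the tested scenario (2GB cap, 3GB allocation) can be reproduced from Python. The sketch below is an illustration, not `local_run.sh` itself. The helper name `run_capped` is hypothetical, and it assumes a Linux host where `RLIMIT_AS` is enforced.

```python
import resource
import subprocess
import sys

GIB = 1024 ** 3


def run_capped(code, cap_bytes):
    """Run a Python snippet in a child process whose address space is capped."""
    def _apply_cap():
        # Runs in the child before exec; lower only the soft limit and keep
        # it within whatever hard limit is already in force.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = cap_bytes if hard == resource.RLIM_INFINITY else min(cap_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=_apply_cap,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_capped("buf = bytearray(3 * 1024**3)", cap_bytes=2 * GIB)
    print("exit code:", result.returncode)
    lines = result.stderr.strip().splitlines()
    print(lines[-1] if lines else "(no stderr)")
```

Because the limit is applied only in the child, the parent shell and desktop session keep their normal limits, and the oversized allocation fails inside the capped process instead of pushing the whole machine into OOM.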


lmeyerov and others added 4 commits July 1, 2026 11:49

Merge remote-tracking branch 'origin/master' into ci/dgx-safety-harness

bbf38a3

docs(changelog): dgx GB10 benchmark safety harness (#1671)

4de7856

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov merged commit 3ffc2d8 into master Jul 3, 2026
69 checks passed

lmeyerov merged 4 commits into master from ci/dgx-safety-harness
