
ci(dgx): proven GB10 unified-memory safety harness (RMM cap + watchdog + preflight) #1671

Merged
lmeyerov merged 4 commits into master from ci/dgx-safety-harness on Jul 3, 2026

lmeyerov commented Jul 1, 2026

Contributor

Why

We OOM-thrashed the shared dgx-spark box for ~9.5 h with a 1.8B-edge cudf/cugraph load. Root cause: the GB10 is unified memory — GPU memory is the 119 GB system RAM (16 GB swap). A cudf/cugraph over-allocation consumes host RAM and OOM-kills the OS's own services; the container then hangs and the box wedges.

What (proven on dgx, small safe scale, host flat throughout)

  • docker --memory is TRANSPARENT to cudf/unified allocations (reached 8 GB under a 4 GB cap) → useless as a guardrail.
  • RMM LimitingResourceAdaptor caps both cudf AND cugraph cleanly — the exact crash call (compute_cugraph('pagerank')) hit the limit and raised a caught MemoryError, host untouched. This is the real containment.
  • A host watchdog kills a runaway host (pandas/numpy) allocation at a RAM floor (exit 137, host recovers) — covers the path RMM doesn't.
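The watchdog idea in the last bullet can be sketched roughly as follows. This is an illustrative sketch, not the watchdog shipped in safe_run.sh: the function names, the polling interval, and the choice of `MemAvailable` from `/proc/meminfo` as the "RAM floor" signal are all assumptions. `docker kill` sends SIGKILL by default, which is consistent with the exit code 137 reported above.

```python
import re
import subprocess
import time


def mem_available_gb(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) from /proc/meminfo-style text; return GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) / (1024 * 1024)


def below_floor(meminfo_text: str, floor_gb: float) -> bool:
    """True when available host RAM has dropped under the floor."""
    return mem_available_gb(meminfo_text) < floor_gb


def watch(container: str, floor_gb: float, poll_s: float = 1.0) -> None:
    """Poll host memory and force-kill the container once the floor is crossed.

    Hypothetical helper: the container name, floor, and poll interval are
    supplied by the caller and are not taken from the PR.
    """
    while True:
        with open("/proc/meminfo") as f:
            text = f.read()
        if below_floor(text, floor_gb):
            # SIGKILL by default; the container exits with status 137.
            subprocess.run(["docker", "kill", container], check=False)
            return
        time.sleep(poll_s)
```

Keeping the parsing and the threshold check separate from the polling loop makes the decision logic testable without touching a real host or container.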

Files (benchmarks/dgx/)

  • sitecustomize.py — auto-applies GFQL_RMM_LIMIT_GB to any Python in the container (non-invasive, no workload edits).
  • preflight.py + test_preflight.py: peak_gb()/is_safe() refuse over-budget runs; the test guards that friendster-1.8B is REFUSED and the 80M handoff run is allowed.
  • safe_run.sh — wraps docker run with preflight-refuse + RMM inject + watchdog force-kill + hard timeout. All dgx GPU/big runs go through it.
  • README.md — usage + rationale.
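For readers unfamiliar with the sitecustomize mechanism, here is a minimal sketch of how an environment-driven RMM cap could look. It is not the contents of benchmarks/dgx/sitecustomize.py. The helper names are hypothetical, and reading GFQL_RMM_LIMIT_GB as GiB (2^30 bytes) is an assumption. The RMM calls used (`rmm.mr.get_current_device_resource`, `rmm.mr.LimitingResourceAdaptor`, `rmm.mr.set_current_device_resource`) are the library's public memory-resource API.

```python
import os


def limit_bytes_from_env(environ=os.environ):
    """Return the cap in bytes, or None if GFQL_RMM_LIMIT_GB is unset or invalid.

    Assumption: the variable is a positive number interpreted as GiB.
    """
    raw = environ.get("GFQL_RMM_LIMIT_GB", "").strip()
    if not raw:
        return None
    try:
        gb = float(raw)
    except ValueError:
        return None
    if gb <= 0:
        return None
    return int(gb * 1024**3)


def apply_rmm_limit(environ=os.environ) -> bool:
    """Wrap the current RMM device resource in a LimitingResourceAdaptor.

    Returns True if a cap was installed, False if no limit was requested
    or RMM is not importable in this interpreter.
    """
    limit = limit_bytes_from_env(environ)
    if limit is None:
        return False
    try:
        import rmm  # only present in RAPIDS environments
    except ImportError:
        return False
    upstream = rmm.mr.get_current_device_resource()
    rmm.mr.set_current_device_resource(
        rmm.mr.LimitingResourceAdaptor(upstream, limit)
    )
    return True


# Python imports a module named sitecustomize automatically at startup when it
# is on sys.path, so calling this at module level applies the cap before any
# workload code runs.
apply_rmm_limit()
```

With the cap in place, an allocation that would exceed the limit fails inside RMM and surfaces as a Python MemoryError, which matches the behaviour reported for compute_cugraph('pagerank') above.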

Standalone infra off master; benefits every branch that runs dgx benchmarks.
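To make the preflight check concrete, here is a hedged sketch of what a peak_gb()/is_safe() pair might compute. Only the two function names come from this PR. The signatures, the bytes-per-edge figure, the working-set multiplier, and the 60 GiB budget are illustrative assumptions, as is reading "80M" as an edge count. The real preflight.py may estimate memory differently.

```python
# All three constants are assumptions for illustration, not values from preflight.py.
BYTES_PER_EDGE = 16        # assumed: two int64 columns (source, destination)
WORKING_SET_FACTOR = 4.0   # assumed: copies and intermediates during load + algorithm
BUDGET_GB = 60.0           # assumed: budget kept well under the 119 GB unified pool


def peak_gb(n_edges: int,
            bytes_per_edge: int = BYTES_PER_EDGE,
            factor: float = WORKING_SET_FACTOR) -> float:
    """Rough peak-memory estimate in GiB for an edge list of n_edges rows."""
    return n_edges * bytes_per_edge * factor / 1024**3


def is_safe(n_edges: int, budget_gb: float = BUDGET_GB) -> bool:
    """True when the estimated peak fits inside the budget."""
    return peak_gb(n_edges) <= budget_gb


if __name__ == "__main__":
    for name, edges in [("friendster-1.8B", 1_800_000_000), ("handoff-80M", 80_000_000)]:
        verdict = "allowed" if is_safe(edges) else "REFUSED"
        print(f"{name}: ~{peak_gb(edges):.1f} GiB estimated peak -> {verdict}")
```

Under these assumed numbers the 1.8B-edge load estimates at roughly 107 GiB and is refused, while the 80M run estimates under 5 GiB and passes, mirroring the two cases that test_preflight.py is described as guarding.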

🤖 Generated with Claude Code

lmeyerov and others added 4 commits July 1, 2026 11:49
…g + preflight)

We OOM-thrashed the shared dgx-spark box for ~9.5h with a 1.8B-edge cudf/cugraph
load: on the GB10, GPU memory IS system RAM (unified, 119GB, 16GB swap), so a
cudf/cugraph over-allocation consumes host RAM and OOM-kills the OS.

Proven on dgx (small safe scale, host flat throughout):
  - docker --memory is TRANSPARENT to cudf/unified allocs (reached 8GB under a 4GB
    cap) -> useless as a cap.
  - RMM LimitingResourceAdaptor caps BOTH cudf AND cugraph cleanly (caught
    MemoryError, host untouched) -> the real containment.
  - a host watchdog kills a runaway host (pandas/numpy) alloc at a RAM floor.

benchmarks/dgx/:
  - sitecustomize.py: auto-applies GFQL_RMM_LIMIT_GB to any Python in the container
    (non-invasive; no workload edits).
  - preflight.py + test_preflight.py: peak_gb()/is_safe() refuse over-budget runs;
    test guards that friendster-1.8B is REFUSED and the 80M handoff run allowed.
  - safe_run.sh: wraps docker run with preflight-refuse + RMM inject + watchdog
    force-kill + hard timeout. ALL dgx GPU/big runs must go through it.
  - README.md: usage + rationale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The LOCAL box is a ~31GB workstation, not dgx's 119GB — a runaway OOM logs the
user out (happened 2026-07-01 with a 5M/20M local bench). local_run.sh caps
address space (ulimit -v, default 8GB) so a runaway dies with a clean MemoryError
instead of the desktop session. Tested: 2GB cap + 3GB alloc -> MemoryError,
desktop survived. Benchmarks still go to dgx via safe_run.sh; this guards the
tiny local work that remains.
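
The address-space cap described above can also be reproduced from Python on Linux. This is a sketch of the same technique, not the contents of local_run.sh: `run_capped` is a hypothetical helper, and it uses `resource.setrlimit(RLIMIT_AS, ...)`, the limit that `ulimit -v` sets in the shell. The cap is applied only in the child process, so the parent session is unaffected.

```python
import resource
import subprocess
import sys


def run_capped(cmd, limit_gb: float = 8.0) -> subprocess.CompletedProcess:
    """Run cmd with its virtual address space capped at limit_gb GiB.

    Hypothetical helper mirroring `ulimit -v`; the 8 GiB default follows the
    default stated in the commit message.
    """
    limit = int(limit_gb * 1024**3)

    def cap() -> None:
        # Runs in the child between fork and exec, so only the child is limited.
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    return subprocess.run(cmd, preexec_fn=cap, capture_output=True, text=True)


if __name__ == "__main__":
    # Same shape as the test in the commit message: 2 GB cap, 3 GB allocation.
    result = run_capped([sys.executable, "-c", "bytearray(3 * 1024**3)"], limit_gb=2.0)
    print(result.returncode)
    print(result.stderr.strip().splitlines()[-1] if result.stderr else "")
```

Because the allocation fails inside the capped process, the failure is a Python MemoryError in that process rather than a system-wide OOM kill.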

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov merged commit 3ffc2d8 into master Jul 3, 2026
69 checks passed