fix(nvidia-smi): exit gracefully instead of panicking on errors (#206) #210

Merged
benamizilberNvidia merged 1 commit into main from bzilber/RUN-39984-nvidia-smi-error-handling on Jun 3, 2026

Conversation

@benamizilberNvidia benamizilberNvidia commented Jun 3, 2026

What

The fake nvidia-smi (cmd/nvidia-smi/main.go) panic()d on essentially every error path, so failures surfaced in user pods as a cryptic Go stack trace instead of a useful message. This replaces the panics with error accumulation and a graceful exit.

Behavior

| Failure | Handling |
| --- | --- |
| http.Get / non-2xx status / JSON decode | fatal-for-devices: print `no devices found` + accumulated errors to stderr, exit 1 (no bogus table) |
| bad RUNAI_NUM_OF_GPUS | non-fatal: keep default portion (1.0); table still renders, error logged to stderr |
| readProcessName (/proc/1/cmdline) | non-fatal: only blanks the process-name column |
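
The fatal versus non-fatal split described above can be sketched roughly as follows. This is an illustration, not the code in cmd/nvidia-smi/main.go: the `render` helper and its signature are hypothetical, and returning exit code 0 when only non-fatal errors occurred is an assumption, since the description only states exit 1 for the fatal case.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// render is a hypothetical helper. Given the rendered table, whether a
// fatal-for-devices failure occurred, and the accumulated errors, it returns
// what to write to stdout and stderr plus the process exit code.
func render(table string, fatal bool, errs []error) (stdout, stderr string, code int) {
	if fatal {
		// No trustworthy device data: skip the table entirely.
		stderr = "no devices found\n"
		code = 1
	} else {
		// Non-fatal errors do not suppress the table.
		// Assumption: the exit code stays 0 in this case.
		stdout = table
	}
	// errors.Join (Go 1.20+) returns nil when there is nothing to report.
	if joined := errors.Join(errs...); joined != nil {
		stderr += joined.Error() + "\n"
	}
	return stdout, stderr, code
}

func main() {
	// Hypothetical failure: the topology request failed, so there are no devices.
	errs := []error{errors.New("fetching topology: connection refused")}
	stdout, stderr, code := render("", true, errs)
	fmt.Fprint(os.Stdout, stdout)
	fmt.Fprint(os.Stderr, stderr)
	os.Exit(code)
}
```

Keeping the decision in one function that returns strings means each error path only has to append to a slice, and the output policy can be checked without capturing real process output.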

Key fixes:

  • New HTTP status check before decoding — a non-2xx text body (the unset-NODE_NAME case) no longer reaches the JSON decoder and panics.
  • A bad RUNAI_NUM_OF_GPUS is parsed into a temp var, so it no longer zeroes reported memory.
  • Added the previously-missing topology response body Close.
  • Dropped the two os.Setenv(TOPOLOGY_CM_*) calls — this binary has fetched topology over a hardcoded HTTP URL since inception and never reads those env vars (dead boilerplate; confirmed via history back to the original commit).
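
The first and third fixes (status check before decoding, and closing the response body) might look something like the sketch below. The `nodeTopology` type, its JSON shape, the `decodeTopology` and `fakeResponse` helpers, and the 404 status used in the demo are all hypothetical stand-ins; the real types and URL live in the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// gpu and nodeTopology are hypothetical stand-ins for the real topology types.
type gpu struct {
	ID string `json:"id"`
}

type nodeTopology struct {
	GPUs []gpu `json:"gpus"`
}

// decodeTopology closes the body on every path and refuses to decode
// anything that did not come back with a 2xx status.
func decodeTopology(resp *http.Response) (*nodeTopology, error) {
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		// Include a bounded snippet of the text body in the error for context;
		// a read failure here just yields an empty snippet.
		snippet, _ := io.ReadAll(io.LimitReader(resp.Body, 256))
		return nil, fmt.Errorf("topology request failed: %s: %q", resp.Status, snippet)
	}
	var topo nodeTopology
	if err := json.NewDecoder(resp.Body).Decode(&topo); err != nil {
		return nil, fmt.Errorf("decoding topology: %w", err)
	}
	return &topo, nil
}

// fakeResponse builds an in-memory response so the sketch runs without a server.
func fakeResponse(code int, body string) *http.Response {
	return &http.Response{
		StatusCode: code,
		Status:     fmt.Sprintf("%d %s", code, http.StatusText(code)),
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	// A plain-text error body is reported as an error instead of reaching the decoder.
	if _, err := decodeTopology(fakeResponse(http.StatusNotFound, "node not found")); err != nil {
		fmt.Println("error:", err)
	}
	if topo, err := decodeTopology(fakeResponse(http.StatusOK, `{"gpus":[{"id":"gpu-0"}]}`)); err == nil {
		fmt.Println("devices:", len(topo.GPUs))
	}
}
```

In the real binary the response would come from http.Get, whose own error would be accumulated before this function is ever called.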

Happy-path output is unchanged, so the existing nvidia-smi e2e specs (device_plugin_test.go, e2e_test.go) continue to hold.

Closes RUN-39984. Fixes #206.

The fake nvidia-smi panicked on essentially every error path, surfacing a
Go stack trace in user pods.

Now errors are accumulated rather than panicked:
- topology fetch / status / decode failures are fatal-for-devices: print
  `no devices found` plus the accumulated errors to stderr and exit 1,
  instead of rendering a bogus table.
- a bad RUNAI_NUM_OF_GPUS keeps the default portion (no longer zeroes the
  reported memory) and readProcessName failures only blank one column —
  both non-fatal; the table still renders with the errors logged to stderr.
- add the missing topology response body Close.
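
The RUNAI_NUM_OF_GPUS point can be illustrated with a small sketch. It assumes the portion is a float parsed with strconv.ParseFloat and that an empty value means "use the default"; neither detail is confirmed by this PR, and `parsePortion` and `defaultPortion` are hypothetical names. The underlying hazard is that ParseFloat returns 0 on a syntax error, so assigning its result directly to the live variable would zero the portion.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultPortion mirrors the default of 1.0 named in the PR description.
const defaultPortion = 1.0

// parsePortion parses raw into a temporary variable and only replaces the
// default when parsing succeeds. On failure it returns the default together
// with an error the caller can accumulate.
func parsePortion(raw string) (float64, error) {
	if raw == "" {
		// Assumption: an unset variable silently means the default.
		return defaultPortion, nil
	}
	parsed, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		// parsed is 0 here; returning it would zero the reported memory.
		return defaultPortion, fmt.Errorf("invalid RUNAI_NUM_OF_GPUS %q: %w", raw, err)
	}
	return parsed, nil
}

func main() {
	portion, err := parsePortion(os.Getenv("RUNAI_NUM_OF_GPUS"))
	if err != nil {
		// Non-fatal: log and carry on with the default.
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Printf("portion: %.2f\n", portion)
}
```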

Also drop the two os.Setenv(TOPOLOGY_CM_*) calls: this binary has fetched
topology over a hardcoded HTTP URL since inception and never reads those
env vars — dead boilerplate.
@benamizilberNvidia benamizilberNvidia requested a review from a team as a code owner June 3, 2026 12:46
@benamizilberNvidia benamizilberNvidia merged commit 8666247 into main Jun 3, 2026
11 checks passed
@benamizilberNvidia benamizilberNvidia deleted the bzilber/RUN-39984-nvidia-smi-error-handling branch June 3, 2026 12:53

Development

Successfully merging this pull request may close these issues.

Improve fake nvidia-smi error handling: avoid panics, exit gracefully with accumulated errors
