fix(nvidia-smi): exit gracefully instead of panicking on errors (#206) by benamizilberNvidia · Pull Request #210 · run-ai/fake-gpu-operator

benamizilberNvidia · 2026-06-03T12:46:21Z

What

The fake nvidia-smi (cmd/nvidia-smi/main.go) panic()d on essentially every error path, so failures surfaced in user pods as a cryptic Go stack trace instead of a useful message. This replaces the panics with error accumulation and a graceful exit.

Behavior

| Failure | Handling |
| --- | --- |
| `http.Get` / non-2xx status / JSON decode | fatal-for-devices: print `no devices found` plus the accumulated errors to stderr, exit 1 (no bogus table) |
| bad `RUNAI_NUM_OF_GPUS` | non-fatal: keep the default portion (`1.0`); the table still renders, error logged to stderr |
| `readProcessName` (`/proc/1/cmdline`) | non-fatal: only blanks the process-name column |

Key fixes:

- New HTTP status check before decoding: a non-2xx text body (the unset-NODE_NAME case) no longer reaches the JSON decoder and panics.
- RUNAI_NUM_OF_GPUS is now parsed into a temporary variable, so a bad value no longer zeroes the reported memory.
- Added the previously missing Close of the topology response body.
- Dropped the two os.Setenv(TOPOLOGY_CM_*) calls: this binary has fetched topology over a hardcoded HTTP URL since inception and never reads those env vars (dead boilerplate; confirmed via history back to the original commit).

Happy-path output is unchanged, so the existing nvidia-smi e2e specs (device_plugin_test.go, e2e_test.go) continue to hold.

Closes RUN-39984. Fixes #206.

The fake nvidia-smi panicked on essentially every error path, surfacing a Go stack trace in user pods. Now errors are accumulated rather than panicked:

- topology fetch / status / decode failures are fatal-for-devices: print `no devices found` plus the accumulated errors to stderr and exit 1, instead of rendering a bogus table.
- a bad RUNAI_NUM_OF_GPUS keeps the default portion (no longer zeroes the reported memory) and readProcessName failures only blank one column. Both are non-fatal; the table still renders with the errors logged to stderr.
- add the missing topology response body Close.

Also drop the two os.Setenv(TOPOLOGY_CM_*) calls: this binary has fetched topology over a hardcoded HTTP URL since inception and never reads those env vars (dead boilerplate).

benamizilberNvidia requested a review from a team as a code owner June 3, 2026 12:46

benamizilberNvidia merged commit 8666247 into main Jun 3, 2026
11 checks passed

benamizilberNvidia deleted the bzilber/RUN-39984-nvidia-smi-error-handling branch June 3, 2026 12:53

benamizilberNvidia mentioned this pull request Jun 3, 2026

fix(nvidia-smi): use nvidia-smi-style failure messages per error path (#206) #211

Merged
