Skip to content

Port selected 2.8 fixes to main#4595

Merged
pcnudde merged 10 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/port-2-8-parity-to-main-ordered
May 13, 2026
Merged

Port selected 2.8 fixes to main#4595
pcnudde merged 10 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/port-2-8-parity-to-main-ordered

Conversation

@YuanTingHsieh
Copy link
Copy Markdown
Collaborator

@YuanTingHsieh YuanTingHsieh commented May 13, 2026

Summary

Port the selected 2.8 fixes back to main in 2.8 merge order:

- Updates Docker and Kubernetes launcher docs to use `nvflare deploy
prepare` instead of provisioning-time runtime builders.
- Reworks the Helm chart docs/design around the same default generated
chart, with chart generation moved to deploy preparation.
- Updates Brev helper docs and the Docker example to consume prepared
runtime kits.
- Adds `nvflare deploy prepare` CLI reference docs and links it from the
CLI/deployment docs.
- Removes stale user-facing references to deprecated runtime builder
workflows while keeping developer-facing design boundary language where
it is useful.

- `git diff --check`
- `bash -n
docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh
docs/user_guide/admin_guide/deployment/brev_scripts/launch_brev_nvflare.sh`
- `rg -n
'HelmChartBuilder|DockerLauncherBuilder|HelmCharBuilder|K8sLauncherBuilder|docker_k8s_job_launchers|job_launchers'
docs examples web`
- `rg -n 'Out of Scope|out of scope|outside the scope|not an? [A-Z]|not
.* guide|separate guide|should live|if
included|Provider-specific|cloud-provider-specific' docs/user_guide
docs/examples docs/security_faq.rst docs/programming_guide examples web`

(cherry picked from commit 08c5f25)
- Backport of NVIDIA#4541 to the `2.8` branch, adjusted per review to use the
existing `clean_up` spelling.
- Adds a `clean_up: bool = True` parameter to `Run.get_result()` and
forwards it to `exec_env.stop(clean_up=clean_up)` instead of hard-coding
cleanup.
- Preserves existing behavior by default while allowing callers to keep
the workspace with `run.get_result(clean_up=False)` for debugging and
log inspection.
- Adds unit coverage for the `clean_up=False` propagation path,
including result and cached-state assertions.

- `git diff --check`
- `python3 -m py_compile nvflare/recipe/run.py
tests/unit_test/recipe/run_test.py`

(cherry picked from commit ac7d6e0)
## What changed
- Hide the legacy interactive-console `remove_client` command from
normal help.
- Clarify docs and API/server docstrings that `remove_client` only
releases the active client token and allows the client to register
again.
- Remove the command from the user-facing admin operation table.
- Add/adjust focused tests for the hidden command metadata and legacy
Session API routing.

## Why
`remove_client` sounds like it removes or disables a client, but the
implementation only clears the current active token. During code freeze
this keeps behavior/API stable while reducing user-facing confusion.

## Validation
- `/private/tmp/nvflare-deploy-test-venv/bin/python -m pytest
tests/unit_test/private/fed/server/training_cmds_test.py
tests/unit_test/fuel/flare_api/session_new_methods_test.py
tests/unit_test/tool/system/system_status_test.py::TestSystemStatus::test_system_parser_rejects_unsupported_remove_client_command
-q`
- `python3 -m compileall -q nvflare/private/fed/server/training_cmds.py
nvflare/fuel/flare_api/api_spec.py nvflare/fuel/flare_api/flare_api.py
nvflare/private/fed/server/client_manager.py
nvflare/private/fed/server/server_engine.py
nvflare/private/fed/server/server_engine_internal_spec.py
tests/unit_test/private/fed/server/training_cmds_test.py
tests/unit_test/fuel/flare_api/session_new_methods_test.py`
- `git diff upstream/2.8 --check`
- GitHub reports the PGP-signed commit as verified.

(cherry picked from commit 12cbce3)
## Summary
- Restrict GPUResourceManager host validation to integer GPU IDs
selected by CUDA_VISIBLE_DEVICES.
- Preserve CUDA documented semantics for unset, empty, and invalid
integer CUDA_VISIBLE_DEVICES entries.
- Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty
visibility, and invalid-index stopping behavior.

## Why
POC clients started with `nvflare poc start -gpu ...` receive
`CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory
against every physical GPU returned by nvidia-smi. On hosts with a small
display GPU plus large training GPUs, the startup check failed on the
unused display GPU.

## Validation
- `python3 -m compileall -q
nvflare/app_common/resource_managers/gpu_resource_manager.py
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py`
- `git diff --check`
- Manual stubbed GPUResourceManager checks for mixed-memory visible GPU
scenarios

(cherry picked from commit a54949c)
## Summary

- make the Docker job launcher workspace tmpfs writable with sticky
`1777` permissions
- keep `startup/` and `local/` read-only and the current job directory
as the only host-backed writable mount
- update the Docker launcher unit expectation and design doc to describe
the ephemeral SJ/CJ storage behavior
- make the Docker example startup commands copy/pasteable by
backgrounding parent processes with per-site logs
- update the Docker example job submit command to use `--startup-kit`
- make Docker example paths consistently relative to `examples/docker`

## Root Cause

`deploy prepare` rewrites server storage paths to
`/var/tmp/nvflare/workspace/snapshot-storage` and
`/var/tmp/nvflare/workspace/jobs-storage`. Server job containers inherit
those resources, but the Docker job launcher mounted
`/var/tmp/nvflare/workspace` as a read-only-style `0555` tmpfs. During
SJ boot, `snapshot_persistor` and `job_manager` both try to create their
storage directories under the tmpfs root and fail with `EACCES`.

## Impact

SJ/CJ containers still do not receive the parent site workspace bind
mount. The top-level workspace root is now writable only as ephemeral
tmpfs so startup-time storage initialization can succeed, matching the
K8s emptyDir behavior more closely. Host-backed writes remain limited to
`/var/tmp/nvflare/workspace/<job_id>`.

(cherry picked from commit 949dcf3)
## Summary

Narrow the client job-failure reporting path added for 2.8 so generic
launcher `JobReturnCode.EXECUTION_ERROR` is not promoted into an
authoritative server-side job failure.

## Why

`client_api_qa` can complete the FL workflow successfully and then hit
local worker teardown noise that surfaces as a generic launcher
execution error. Reporting that generic code to the server causes the
server to fail an already-finished job.

This keeps the explicit failure cases used by the recent job-timeout
status fix:

- `ProcessExitCode.EXCEPTION`
- `ProcessExitCode.CONFIG_ERROR`
- `ProcessExitCode.UNSAFE_COMPONENT`
- `JobReturnCode.ABORTED`

The K8s pending-timeout path still reports `JobReturnCode.EXCEPTION`, so
it should continue to show `FINISHED:EXECUTION_EXCEPTION` rather than
`RUNNING`.

(cherry picked from commit 510150c)
## Summary
- Replace the dynamic `_exp_tracking_model` import in the experiment
tracking recipe smoke tests with the documented dict model config.
- Bundle the example `model.py` into the server app so
`model.SimpleNetwork` resolves when the job runs.
- Add a dedicated `tracking` integration backend and route it through
the PyTorch CI setup.
- Refresh the integration test README/backend list and stale manual
pytest command.

## Why
The tests loaded the example model under a synthetic module name and
then passed that live `nn.Module` instance into `FedAvgRecipe`. During
job export, Python source inspection could not resolve the synthetic
module and raised:

```text
TypeError: <class '_exp_tracking_model.SimpleNetwork'> is a built-in class
```

Using `model={"class_path": "model.SimpleNetwork", "args": {}}` matches
the recipe API and avoids relying on importlib state.

Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
(cherry picked from commit 5bade21)
@YuanTingHsieh YuanTingHsieh marked this pull request as ready for review May 13, 2026 20:10
Copilot AI review requested due to automatic review settings May 13, 2026 20:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Backports eight selected fixes from the 2.8 branch into main in merge order. These cover study-data warning diagnostics, deploy prepare documentation, recipe Run.get_result(clean_up=...) parameter alignment, hidden remove_client token-cleanup semantics, GPU resource manager respecting CUDA_VISIBLE_DEVICES, Docker SJ workspace tmpfs permission fix, narrowed client failure reporting, and a tracking recipe integration test fix.

Changes:

  • Code fixes: GPU resource manager visibility filtering, Docker tmpfs permissions (0o555→0o1777), removal of JobReturnCode.EXECUTION_ERROR from authoritative client failure reporting, clean_up parameter rename, hidden remove_client command with clarified docstrings, and study-data warning logging.
  • Docs/design updates around nvflare deploy prepare, Helm chart, launcher_spec/resource_spec separation, and removed Docker example references.
  • Test updates and additions covering each of the above.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated no comments.

Show a summary per file
File Description
nvflare/recipe/run.py Rename cleanupclean_up; updated docstring.
nvflare/private/fed/server/training_cmds.py Hide remove_client; clarify description.
nvflare/private/fed/server/server_engine.py, server_engine_internal_spec.py, client_manager.py Docstring clarifications for token-only removal.
nvflare/private/fed/server/fed_server.py Stop treating generic EXECUTION_ERROR as authoritative job failure.
nvflare/private/fed/client/client_executor.py Drop EXECUTION_ERROR from REPORTABLE_JOB_FAILURES.
nvflare/fuel/flare_api/{flare_api,api_spec}.py Clarify remove_client semantics.
nvflare/app_opt/job_launcher/{docker_launcher,k8s_launcher,study_data}.py Tmpfs 1777; pass logger; warn on missing study mappings.
nvflare/app_common/resource_managers/gpu_resource_manager.py Respect CUDA_VISIBLE_DEVICES integer indices for memory/count checks.
examples/docker/* Update README, add docker.yaml, move job image config to launcher_spec with GPUs in resource_spec.
docs/** Extensive doc/design updates aligning with the above.
tests/unit_test/** and tests/integration_test/** Add/adjust test coverage for each fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR ports seven targeted fixes from the 2.8 branch to main, covering GPU resource scheduling, Docker workspace permissions, study data warnings, token-cleanup semantics, and launcher error handling.

  • GPU resource manager (gpu_resource_manager.py): CUDA_VISIBLE_DEVICES is now parsed at init time; the manager restricts itself to visible nvidia-smi indices and uses actual GPU IDs as resource keys, with correct per-GPU memory validation. An autouse fixture in the test file ensures CUDA_VISIBLE_DEVICES is always cleared between test cases.
  • Docker tmpfs mode (docker_launcher.py): Root workspace tmpfs mode changed from 0o555 (read-only for non-root) to 0o1777 (sticky world-writable) so non-root job-container users can create ephemeral dirs like snapshot-storage.
  • Study data warnings (study_data.py): load_study_data_file and resolve_study_dataset_mounts now accept an optional logger and emit WARNING-level messages when the data file is absent, empty, or missing a requested study entry — making silent mount omissions visible in logs.

Confidence Score: 5/5

Safe to merge; all changes are well-scoped bug fixes with matching unit tests and no regressions in the reviewed paths.

Every code change has direct test coverage. The GPU resource manager correctly maps visible device indices to host GPU IDs and validates memory only for selected GPUs. The tmpfs mode fix is straightforward. The narrowing of EXECUTION_ERROR reporting is consistent between client and server and tested with an explicit new test case. No untested paths or missing guards were found in the changed code.

No files require special attention.

Important Files Changed

Filename Overview
nvflare/app_common/resource_managers/gpu_resource_manager.py Added CUDA_VISIBLE_DEVICES support: resource IDs now reflect actual nvidia-smi GPU indices rather than 0..n-1, with correct memory mapping and boundary handling
nvflare/app_opt/job_launcher/docker_launcher.py Changed tmpfs root mode from 0o555 to 0o1777 (sticky world-writable) so non-root job containers can create ephemeral dirs; passes logger to study data helpers
nvflare/app_opt/job_launcher/study_data.py Added optional logger parameter to load_study_data_file and resolve_study_dataset_mounts; emits warnings when study file is missing, empty, or lacks a requested study entry
nvflare/private/fed/client/client_executor.py Removed JobReturnCode.EXECUTION_ERROR from REPORTABLE_JOB_FAILURES so generic launcher errors are no longer escalated to the server as job failures
nvflare/private/fed/server/fed_server.py Removed JobReturnCode.EXECUTION_ERROR from the server-side condition that triggers fail_run(), aligned with the client-side change
nvflare/recipe/run.py Renamed get_result parameter from cleanup to clean_up to match exec_env.stop(); existing keyword-argument callers with cleanup= will receive TypeError
nvflare/fuel/flare_api/flare_api.py Clarified remove_client docstring: releases active token only, does not stop the client or prevent reconnect
nvflare/private/fed/server/training_cmds.py Hid remove_client admin command (visible=False) and updated its description to reflect token-release semantics
tests/integration_test/experiment_tracking_recipes_test.py Fixed recipe integration test: replaced direct importlib model loading with class_path config dict and file-to-apps distribution, fixing the recipe job dispatch flow
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py Added extensive tests for CUDA_VISIBLE_DEVICES filtering: visible GPU count, memory checks by actual GPU ID, stop-at-invalid behavior, and empty-string handling

Sequence Diagram

sequenceDiagram
    participant Env as Environment
    participant GRM as GPUResourceManager.__init__
    participant Helper as _get_cuda_visible_device_indices
    participant NvSMI as nvidia-smi helpers

    GRM->>NvSMI: get_host_gpu_ids()
    NvSMI-->>GRM: host_gpu_ids (e.g. [0,1,2,3])
    GRM->>Helper: _get_cuda_visible_device_indices(host_gpu_ids)
    Helper->>Env: os.environ.get("CUDA_VISIBLE_DEVICES")
    alt Not set
        Env-->>Helper: None
        Helper-->>GRM: None → use all host_gpu_ids
    else Set to integers (e.g. "2,3")
        Env-->>Helper: "2,3"
        Helper-->>GRM: [2, 3]
    else Set to UUID / NoDevFiles
        Env-->>Helper: "GPU-abc..."
        Helper-->>GRM: None → fall back to all host_gpu_ids
    else Set to empty string
        Env-->>Helper: ""
        Helper-->>GRM: [] → zero GPUs available
    end
    GRM->>NvSMI: get_host_gpu_memory_total()
    NvSMI-->>GRM: memory list (indexed by host position)
    GRM->>GRM: build host_gpu_mem_by_id
    GRM->>GRM: "resource_gpu_ids = managed[:num_of_gpus]"
    GRM->>GRM: validate memory per selected GPU ID
    GRM->>GRM: "resources = {gpu_id: GPUResource(gpu_id, mem)}"
Loading

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/port-2-8-..." | Re-trigger Greptile

Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py
Comment thread nvflare/recipe/run.py
@pcnudde pcnudde merged commit 03b56e8 into NVIDIA:main May 13, 2026
24 checks passed
@YuanTingHsieh YuanTingHsieh deleted the codex/port-2-8-parity-to-main-ordered branch May 13, 2026 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants