Port selected 2.8 fixes to main by YuanTingHsieh · Pull Request #4595 · NVIDIA/NVFlare

YuanTingHsieh · 2026-05-13T20:05:40Z

Summary

Port the selected 2.8 fixes back to main in 2.8 merge order:

[2.8] Add warnings for missing study data mappings #4528 Add warnings for missing study data mappings
[2.8] Update deploy prepare launcher docs #4538 Update deploy prepare launcher docs
[2.8] Add clean_up parameter to recipe Run.get_result() #4550 Align Run.get_result() with the clean_up parameter spelling
[2.8] Clarify remove_client token cleanup semantics #4561 Clarify remove_client token cleanup semantics
[2.8] Respect visible GPUs in resource manager #4563 Respect CUDA_VISIBLE_DEVICES in the GPU resource manager
[2.8] Fix Docker SJ workspace tmpfs permissions #4574 Fix Docker SJ workspace tmpfs permissions
[2.8] Narrow client failure reporting #4576 Narrow client failure reporting for generic launcher execution errors
[2.8] Fix tracking recipe integration test #4583 Fix tracking recipe integration test

(cherry picked from commit 674a441)

- Updates Docker and Kubernetes launcher docs to use `nvflare deploy prepare` instead of provisioning-time runtime builders. - Reworks the Helm chart docs/design around the same default generated chart, with chart generation moved to deploy preparation. - Updates Brev helper docs and the Docker example to consume prepared runtime kits. - Adds `nvflare deploy prepare` CLI reference docs and links it from the CLI/deployment docs. - Removes stale user-facing references to deprecated runtime builder workflows while keeping developer-facing design boundary language where it is useful. - `git diff --check` - `bash -n docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh docs/user_guide/admin_guide/deployment/brev_scripts/launch_brev_nvflare.sh` - `rg -n 'HelmChartBuilder|DockerLauncherBuilder|HelmCharBuilder|K8sLauncherBuilder|docker_k8s_job_launchers|job_launchers' docs examples web` - `rg -n 'Out of Scope|out of scope|outside the scope|not an? [A-Z]|not .* guide|separate guide|should live|if included|Provider-specific|cloud-provider-specific' docs/user_guide docs/examples docs/security_faq.rst docs/programming_guide examples web` (cherry picked from commit 08c5f25)

- Backport of NVIDIA#4541 to the `2.8` branch, adjusted per review to use the existing `clean_up` spelling. - Adds a `clean_up: bool = True` parameter to `Run.get_result()` and forwards it to `exec_env.stop(clean_up=clean_up)` instead of hard-coding cleanup. - Preserves existing behavior by default while allowing callers to keep the workspace with `run.get_result(clean_up=False)` for debugging and log inspection. - Adds unit coverage for the `clean_up=False` propagation path, including result and cached-state assertions. - `git diff --check` - `python3 -m py_compile nvflare/recipe/run.py tests/unit_test/recipe/run_test.py` (cherry picked from commit ac7d6e0)

## What changed - Hide the legacy interactive-console `remove_client` command from normal help. - Clarify docs and API/server docstrings that `remove_client` only releases the active client token and allows the client to register again. - Remove the command from the user-facing admin operation table. - Add/adjust focused tests for the hidden command metadata and legacy Session API routing. ## Why `remove_client` sounds like it removes or disables a client, but the implementation only clears the current active token. During code freeze this keeps behavior/API stable while reducing user-facing confusion. ## Validation - `/private/tmp/nvflare-deploy-test-venv/bin/python -m pytest tests/unit_test/private/fed/server/training_cmds_test.py tests/unit_test/fuel/flare_api/session_new_methods_test.py tests/unit_test/tool/system/system_status_test.py::TestSystemStatus::test_system_parser_rejects_unsupported_remove_client_command -q` - `python3 -m compileall -q nvflare/private/fed/server/training_cmds.py nvflare/fuel/flare_api/api_spec.py nvflare/fuel/flare_api/flare_api.py nvflare/private/fed/server/client_manager.py nvflare/private/fed/server/server_engine.py nvflare/private/fed/server/server_engine_internal_spec.py tests/unit_test/private/fed/server/training_cmds_test.py tests/unit_test/fuel/flare_api/session_new_methods_test.py` - `git diff upstream/2.8 --check` - GitHub reports the PGP-signed commit as verified. (cherry picked from commit 12cbce3)

## Summary - Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES. - Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries. - Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior. ## Why POC clients started with `nvflare poc start -gpu ...` receive `CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU. ## Validation - `python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py` - `git diff --check` - Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios (cherry picked from commit a54949c)

## Summary - make the Docker job launcher workspace tmpfs writable with sticky `1777` permissions - keep `startup/` and `local/` read-only and the current job directory as the only host-backed writable mount - update the Docker launcher unit expectation and design doc to describe the ephemeral SJ/CJ storage behavior - make the Docker example startup commands copy/pasteable by backgrounding parent processes with per-site logs - update the Docker example job submit command to use `--startup-kit` - make Docker example paths consistently relative to `examples/docker` ## Root Cause `deploy prepare` rewrites server storage paths to `/var/tmp/nvflare/workspace/snapshot-storage` and `/var/tmp/nvflare/workspace/jobs-storage`. Server job containers inherit those resources, but the Docker job launcher mounted `/var/tmp/nvflare/workspace` as a read-only-style `0555` tmpfs. During SJ boot, `snapshot_persistor` and `job_manager` both try to create their storage directories under the tmpfs root and fail with `EACCES`. ## Impact SJ/CJ containers still do not receive the parent site workspace bind mount. The top-level workspace root is now writable only as ephemeral tmpfs so startup-time storage initialization can succeed, matching the K8s emptyDir behavior more closely. Host-backed writes remain limited to `/var/tmp/nvflare/workspace/<job_id>`. (cherry picked from commit 949dcf3)

## Summary Narrow the client job-failure reporting path added for 2.8 so generic launcher `JobReturnCode.EXECUTION_ERROR` is not promoted into an authoritative server-side job failure. ## Why `client_api_qa` can complete the FL workflow successfully and then hit local worker teardown noise that surfaces as a generic launcher execution error. Reporting that generic code to the server causes the server to fail an already-finished job. This keeps the explicit failure cases used by the recent job-timeout status fix: - `ProcessExitCode.EXCEPTION` - `ProcessExitCode.CONFIG_ERROR` - `ProcessExitCode.UNSAFE_COMPONENT` - `JobReturnCode.ABORTED` The K8s pending-timeout path still reports `JobReturnCode.EXCEPTION`, so it should continue to show `FINISHED:EXECUTION_EXCEPTION` rather than `RUNNING`. (cherry picked from commit 510150c)

## Summary - Replace the dynamic `_exp_tracking_model` import in the experiment tracking recipe smoke tests with the documented dict model config. - Bundle the example `model.py` into the server app so `model.SimpleNetwork` resolves when the job runs. - Add a dedicated `tracking` integration backend and route it through the PyTorch CI setup. - Refresh the integration test README/backend list and stale manual pytest command. ## Why The tests loaded the example model under a synthetic module name and then passed that live `nn.Module` instance into `FedAvgRecipe`. During job export, Python source inspection could not resolve the synthetic module and raised: ```text TypeError: <class '_exp_tracking_model.SimpleNetwork'> is a built-in class ``` Using `model={"class_path": "model.SimpleNetwork", "args": {}}` matches the recipe API and avoids relying on importlib state. Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com> (cherry picked from commit 5bade21)

Copilot

Pull request overview

Backports eight selected fixes from the 2.8 branch into main in merge order. These cover study-data warning diagnostics, deploy prepare documentation, recipe Run.get_result(clean_up=...) parameter alignment, hidden remove_client token-cleanup semantics, GPU resource manager respecting CUDA_VISIBLE_DEVICES, Docker SJ workspace tmpfs permission fix, narrowed client failure reporting, and a tracking recipe integration test fix.

Changes:

Code fixes: GPU resource manager visibility filtering, Docker tmpfs permissions (0o555→0o1777), removal of JobReturnCode.EXECUTION_ERROR from authoritative client failure reporting, clean_up parameter rename, hidden remove_client command with clarified docstrings, and study-data warning logging.
Docs/design updates around nvflare deploy prepare, Helm chart, launcher_spec/resource_spec separation, and removed Docker example references.
Test updates and additions covering each of the above.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
nvflare/recipe/run.py	Rename `cleanup`→`clean_up`; updated docstring.
nvflare/private/fed/server/training_cmds.py	Hide `remove_client`; clarify description.
nvflare/private/fed/server/server_engine.py, server_engine_internal_spec.py, client_manager.py	Docstring clarifications for token-only removal.
nvflare/private/fed/server/fed_server.py	Stop treating generic `EXECUTION_ERROR` as authoritative job failure.
nvflare/private/fed/client/client_executor.py	Drop `EXECUTION_ERROR` from `REPORTABLE_JOB_FAILURES`.
nvflare/fuel/flare_api/{flare_api,api_spec}.py	Clarify `remove_client` semantics.
nvflare/app_opt/job_launcher/{docker_launcher,k8s_launcher,study_data}.py	Tmpfs `1777`; pass logger; warn on missing study mappings.
nvflare/app_common/resource_managers/gpu_resource_manager.py	Respect `CUDA_VISIBLE_DEVICES` integer indices for memory/count checks.
examples/docker/*	Update README, add `docker.yaml`, move job image config to `launcher_spec` with GPUs in `resource_spec`.
docs/**	Extensive doc/design updates aligning with the above.
tests/unit_test/ and tests/integration_test/	Add/adjust test coverage for each fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

greptile-apps · 2026-05-13T20:17:19Z

Greptile Summary

This PR ports seven targeted fixes from the 2.8 branch to main, covering GPU resource scheduling, Docker workspace permissions, study data warnings, token-cleanup semantics, and launcher error handling.

GPU resource manager (gpu_resource_manager.py): CUDA_VISIBLE_DEVICES is now parsed at init time; the manager restricts itself to visible nvidia-smi indices and uses actual GPU IDs as resource keys, with correct per-GPU memory validation. An autouse fixture in the test file ensures CUDA_VISIBLE_DEVICES is always cleared between test cases.
Docker tmpfs mode (docker_launcher.py): Root workspace tmpfs mode changed from 0o555 (read-only for non-root) to 0o1777 (sticky world-writable) so non-root job-container users can create ephemeral dirs like snapshot-storage.
Study data warnings (study_data.py): load_study_data_file and resolve_study_dataset_mounts now accept an optional logger and emit WARNING-level messages when the data file is absent, empty, or missing a requested study entry — making silent mount omissions visible in logs.

Confidence Score: 5/5

Safe to merge; all changes are well-scoped bug fixes with matching unit tests and no regressions in the reviewed paths.

Every code change has direct test coverage. The GPU resource manager correctly maps visible device indices to host GPU IDs and validates memory only for selected GPUs. The tmpfs mode fix is straightforward. The narrowing of EXECUTION_ERROR reporting is consistent between client and server and tested with an explicit new test case. No untested paths or missing guards were found in the changed code.

No files require special attention.

Important Files Changed

Filename	Overview
nvflare/app_common/resource_managers/gpu_resource_manager.py	Added CUDA_VISIBLE_DEVICES support: resource IDs now reflect actual nvidia-smi GPU indices rather than 0..n-1, with correct memory mapping and boundary handling
nvflare/app_opt/job_launcher/docker_launcher.py	Changed tmpfs root mode from 0o555 to 0o1777 (sticky world-writable) so non-root job containers can create ephemeral dirs; passes logger to study data helpers
nvflare/app_opt/job_launcher/study_data.py	Added optional logger parameter to load_study_data_file and resolve_study_dataset_mounts; emits warnings when study file is missing, empty, or lacks a requested study entry
nvflare/private/fed/client/client_executor.py	Removed JobReturnCode.EXECUTION_ERROR from REPORTABLE_JOB_FAILURES so generic launcher errors are no longer escalated to the server as job failures
nvflare/private/fed/server/fed_server.py	Removed JobReturnCode.EXECUTION_ERROR from the server-side condition that triggers fail_run(), aligned with the client-side change
nvflare/recipe/run.py	Renamed get_result parameter from cleanup to clean_up to match exec_env.stop(); existing keyword-argument callers with cleanup= will receive TypeError
nvflare/fuel/flare_api/flare_api.py	Clarified remove_client docstring: releases active token only, does not stop the client or prevent reconnect
nvflare/private/fed/server/training_cmds.py	Hid remove_client admin command (visible=False) and updated its description to reflect token-release semantics
tests/integration_test/experiment_tracking_recipes_test.py	Fixed recipe integration test: replaced direct importlib model loading with class_path config dict and file-to-apps distribution, fixing the recipe job dispatch flow
tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py	Added extensive tests for CUDA_VISIBLE_DEVICES filtering: visible GPU count, memory checks by actual GPU ID, stop-at-invalid behavior, and empty-string handling

Sequence Diagram

sequenceDiagram
    participant Env as Environment
    participant GRM as GPUResourceManager.__init__
    participant Helper as _get_cuda_visible_device_indices
    participant NvSMI as nvidia-smi helpers

    GRM->>NvSMI: get_host_gpu_ids()
    NvSMI-->>GRM: host_gpu_ids (e.g. [0,1,2,3])
    GRM->>Helper: _get_cuda_visible_device_indices(host_gpu_ids)
    Helper->>Env: os.environ.get("CUDA_VISIBLE_DEVICES")
    alt Not set
        Env-->>Helper: None
        Helper-->>GRM: None → use all host_gpu_ids
    else Set to integers (e.g. "2,3")
        Env-->>Helper: "2,3"
        Helper-->>GRM: [2, 3]
    else Set to UUID / NoDevFiles
        Env-->>Helper: "GPU-abc..."
        Helper-->>GRM: None → fall back to all host_gpu_ids
    else Set to empty string
        Env-->>Helper: ""
        Helper-->>GRM: [] → zero GPUs available
    end
    GRM->>NvSMI: get_host_gpu_memory_total()
    NvSMI-->>GRM: memory list (indexed by host position)
    GRM->>GRM: build host_gpu_mem_by_id
    GRM->>GRM: "resource_gpu_ids = managed[:num_of_gpus]"
    GRM->>GRM: validate memory per selected GPU ID
    GRM->>GRM: "resources = {gpu_id: GPUResource(gpu_id, mem)}"

_{Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/port-2-8-..." | Re-trigger Greptile}

YuanTingHsieh added 8 commits May 13, 2026 13:04

[2.8] Add warnings for missing study data mappings (NVIDIA#4528)

62009bd

(cherry picked from commit 674a441)

YuanTingHsieh marked this pull request as ready for review May 13, 2026 20:10

Copilot AI review requested due to automatic review settings May 13, 2026 20:10

Copilot started reviewing on behalf of YuanTingHsieh May 13, 2026 20:12 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

Comment thread nvflare/app_common/resource_managers/gpu_resource_manager.py

Comment thread nvflare/recipe/run.py

YuanTingHsieh added 2 commits May 13, 2026 13:42

Merge branch 'main' into codex/port-2-8-parity-to-main-ordered

589a2ae

Merge branch 'main' into codex/port-2-8-parity-to-main-ordered

e08a330

YuanTingHsieh requested review from IsaacYangSLA, nvidianz and pcnudde May 13, 2026 22:11

pcnudde approved these changes May 13, 2026

View reviewed changes

pcnudde merged commit 03b56e8 into NVIDIA:main May 13, 2026
24 checks passed

YuanTingHsieh deleted the codex/port-2-8-parity-to-main-ordered branch May 13, 2026 23:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port selected 2.8 fixes to main#4595

Port selected 2.8 fixes to main#4595
pcnudde merged 10 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/port-2-8-parity-to-main-ordered

YuanTingHsieh commented May 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

greptile-apps Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

YuanTingHsieh commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

greptile-apps Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YuanTingHsieh commented May 13, 2026 •

edited

Loading

greptile-apps Bot commented May 13, 2026 •

edited

Loading