Port selected 2.8 fixes to main#4595
Conversation
(cherry picked from commit 674a441)
- Updates Docker and Kubernetes launcher docs to use `nvflare deploy prepare` instead of provisioning-time runtime builders. - Reworks the Helm chart docs/design around the same default generated chart, with chart generation moved to deploy preparation. - Updates Brev helper docs and the Docker example to consume prepared runtime kits. - Adds `nvflare deploy prepare` CLI reference docs and links it from the CLI/deployment docs. - Removes stale user-facing references to deprecated runtime builder workflows while keeping developer-facing design boundary language where it is useful. - `git diff --check` - `bash -n docs/user_guide/admin_guide/deployment/brev_scripts/prepare_brev_startup_kits.sh docs/user_guide/admin_guide/deployment/brev_scripts/launch_brev_nvflare.sh` - `rg -n 'HelmChartBuilder|DockerLauncherBuilder|HelmCharBuilder|K8sLauncherBuilder|docker_k8s_job_launchers|job_launchers' docs examples web` - `rg -n 'Out of Scope|out of scope|outside the scope|not an? [A-Z]|not .* guide|separate guide|should live|if included|Provider-specific|cloud-provider-specific' docs/user_guide docs/examples docs/security_faq.rst docs/programming_guide examples web` (cherry picked from commit 08c5f25)
- Backport of NVIDIA#4541 to the `2.8` branch, adjusted per review to use the existing `clean_up` spelling. - Adds a `clean_up: bool = True` parameter to `Run.get_result()` and forwards it to `exec_env.stop(clean_up=clean_up)` instead of hard-coding cleanup. - Preserves existing behavior by default while allowing callers to keep the workspace with `run.get_result(clean_up=False)` for debugging and log inspection. - Adds unit coverage for the `clean_up=False` propagation path, including result and cached-state assertions. - `git diff --check` - `python3 -m py_compile nvflare/recipe/run.py tests/unit_test/recipe/run_test.py` (cherry picked from commit ac7d6e0)
## What changed - Hide the legacy interactive-console `remove_client` command from normal help. - Clarify docs and API/server docstrings that `remove_client` only releases the active client token and allows the client to register again. - Remove the command from the user-facing admin operation table. - Add/adjust focused tests for the hidden command metadata and legacy Session API routing. ## Why `remove_client` sounds like it removes or disables a client, but the implementation only clears the current active token. During code freeze this keeps behavior/API stable while reducing user-facing confusion. ## Validation - `/private/tmp/nvflare-deploy-test-venv/bin/python -m pytest tests/unit_test/private/fed/server/training_cmds_test.py tests/unit_test/fuel/flare_api/session_new_methods_test.py tests/unit_test/tool/system/system_status_test.py::TestSystemStatus::test_system_parser_rejects_unsupported_remove_client_command -q` - `python3 -m compileall -q nvflare/private/fed/server/training_cmds.py nvflare/fuel/flare_api/api_spec.py nvflare/fuel/flare_api/flare_api.py nvflare/private/fed/server/client_manager.py nvflare/private/fed/server/server_engine.py nvflare/private/fed/server/server_engine_internal_spec.py tests/unit_test/private/fed/server/training_cmds_test.py tests/unit_test/fuel/flare_api/session_new_methods_test.py` - `git diff upstream/2.8 --check` - GitHub reports the PGP-signed commit as verified. (cherry picked from commit 12cbce3)
## Summary - Restrict GPUResourceManager host validation to integer GPU IDs selected by CUDA_VISIBLE_DEVICES. - Preserve CUDA documented semantics for unset, empty, and invalid integer CUDA_VISIBLE_DEVICES entries. - Add unit coverage for mixed-memory GPUs, selected GPU IDs, empty visibility, and invalid-index stopping behavior. ## Why POC clients started with `nvflare poc start -gpu ...` receive `CUDA_VISIBLE_DEVICES`, but GPUResourceManager was validating memory against every physical GPU returned by nvidia-smi. On hosts with a small display GPU plus large training GPUs, the startup check failed on the unused display GPU. ## Validation - `python3 -m compileall -q nvflare/app_common/resource_managers/gpu_resource_manager.py tests/unit_test/app_common/resource_managers/gpu_resource_manager_test.py` - `git diff --check` - Manual stubbed GPUResourceManager checks for mixed-memory visible GPU scenarios (cherry picked from commit a54949c)
## Summary - make the Docker job launcher workspace tmpfs writable with sticky `1777` permissions - keep `startup/` and `local/` read-only and the current job directory as the only host-backed writable mount - update the Docker launcher unit expectation and design doc to describe the ephemeral SJ/CJ storage behavior - make the Docker example startup commands copy/pasteable by backgrounding parent processes with per-site logs - update the Docker example job submit command to use `--startup-kit` - make Docker example paths consistently relative to `examples/docker` ## Root Cause `deploy prepare` rewrites server storage paths to `/var/tmp/nvflare/workspace/snapshot-storage` and `/var/tmp/nvflare/workspace/jobs-storage`. Server job containers inherit those resources, but the Docker job launcher mounted `/var/tmp/nvflare/workspace` as a read-only-style `0555` tmpfs. During SJ boot, `snapshot_persistor` and `job_manager` both try to create their storage directories under the tmpfs root and fail with `EACCES`. ## Impact SJ/CJ containers still do not receive the parent site workspace bind mount. The top-level workspace root is now writable only as ephemeral tmpfs so startup-time storage initialization can succeed, matching the K8s emptyDir behavior more closely. Host-backed writes remain limited to `/var/tmp/nvflare/workspace/<job_id>`. (cherry picked from commit 949dcf3)
## Summary Narrow the client job-failure reporting path added for 2.8 so generic launcher `JobReturnCode.EXECUTION_ERROR` is not promoted into an authoritative server-side job failure. ## Why `client_api_qa` can complete the FL workflow successfully and then hit local worker teardown noise that surfaces as a generic launcher execution error. Reporting that generic code to the server causes the server to fail an already-finished job. This keeps the explicit failure cases used by the recent job-timeout status fix: - `ProcessExitCode.EXCEPTION` - `ProcessExitCode.CONFIG_ERROR` - `ProcessExitCode.UNSAFE_COMPONENT` - `JobReturnCode.ABORTED` The K8s pending-timeout path still reports `JobReturnCode.EXCEPTION`, so it should continue to show `FINISHED:EXECUTION_EXCEPTION` rather than `RUNNING`. (cherry picked from commit 510150c)
## Summary
- Replace the dynamic `_exp_tracking_model` import in the experiment
tracking recipe smoke tests with the documented dict model config.
- Bundle the example `model.py` into the server app so
`model.SimpleNetwork` resolves when the job runs.
- Add a dedicated `tracking` integration backend and route it through
the PyTorch CI setup.
- Refresh the integration test README/backend list and stale manual
pytest command.
## Why
The tests loaded the example model under a synthetic module name and
then passed that live `nn.Module` instance into `FedAvgRecipe`. During
job export, Python source inspection could not resolve the synthetic
module and raised:
```text
TypeError: <class '_exp_tracking_model.SimpleNetwork'> is a built-in class
```
Using `model={"class_path": "model.SimpleNetwork", "args": {}}` matches
the recipe API and avoids relying on importlib state.
Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
(cherry picked from commit 5bade21)
There was a problem hiding this comment.
Pull request overview
Backports eight selected fixes from the 2.8 branch into main in merge order. These cover study-data warning diagnostics, deploy prepare documentation, recipe Run.get_result(clean_up=...) parameter alignment, hidden remove_client token-cleanup semantics, GPU resource manager respecting CUDA_VISIBLE_DEVICES, Docker SJ workspace tmpfs permission fix, narrowed client failure reporting, and a tracking recipe integration test fix.
Changes:
- Code fixes: GPU resource manager visibility filtering, Docker tmpfs permissions (0o555→0o1777), removal of
JobReturnCode.EXECUTION_ERRORfrom authoritative client failure reporting,clean_upparameter rename, hiddenremove_clientcommand with clarified docstrings, and study-data warning logging. - Docs/design updates around
nvflare deploy prepare, Helm chart, launcher_spec/resource_spec separation, and removed Docker example references. - Test updates and additions covering each of the above.
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| nvflare/recipe/run.py | Rename cleanup→clean_up; updated docstring. |
| nvflare/private/fed/server/training_cmds.py | Hide remove_client; clarify description. |
| nvflare/private/fed/server/server_engine.py, server_engine_internal_spec.py, client_manager.py | Docstring clarifications for token-only removal. |
| nvflare/private/fed/server/fed_server.py | Stop treating generic EXECUTION_ERROR as authoritative job failure. |
| nvflare/private/fed/client/client_executor.py | Drop EXECUTION_ERROR from REPORTABLE_JOB_FAILURES. |
| nvflare/fuel/flare_api/{flare_api,api_spec}.py | Clarify remove_client semantics. |
| nvflare/app_opt/job_launcher/{docker_launcher,k8s_launcher,study_data}.py | Tmpfs 1777; pass logger; warn on missing study mappings. |
| nvflare/app_common/resource_managers/gpu_resource_manager.py | Respect CUDA_VISIBLE_DEVICES integer indices for memory/count checks. |
| examples/docker/* | Update README, add docker.yaml, move job image config to launcher_spec with GPUs in resource_spec. |
| docs/** | Extensive doc/design updates aligning with the above. |
| tests/unit_test/** and tests/integration_test/** | Add/adjust test coverage for each fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR ports seven targeted fixes from the 2.8 branch to
Confidence Score: 5/5Safe to merge; all changes are well-scoped bug fixes with matching unit tests and no regressions in the reviewed paths. Every code change has direct test coverage. The GPU resource manager correctly maps visible device indices to host GPU IDs and validates memory only for selected GPUs. The tmpfs mode fix is straightforward. The narrowing of EXECUTION_ERROR reporting is consistent between client and server and tested with an explicit new test case. No untested paths or missing guards were found in the changed code. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Env as Environment
participant GRM as GPUResourceManager.__init__
participant Helper as _get_cuda_visible_device_indices
participant NvSMI as nvidia-smi helpers
GRM->>NvSMI: get_host_gpu_ids()
NvSMI-->>GRM: host_gpu_ids (e.g. [0,1,2,3])
GRM->>Helper: _get_cuda_visible_device_indices(host_gpu_ids)
Helper->>Env: os.environ.get("CUDA_VISIBLE_DEVICES")
alt Not set
Env-->>Helper: None
Helper-->>GRM: None → use all host_gpu_ids
else Set to integers (e.g. "2,3")
Env-->>Helper: "2,3"
Helper-->>GRM: [2, 3]
else Set to UUID / NoDevFiles
Env-->>Helper: "GPU-abc..."
Helper-->>GRM: None → fall back to all host_gpu_ids
else Set to empty string
Env-->>Helper: ""
Helper-->>GRM: [] → zero GPUs available
end
GRM->>NvSMI: get_host_gpu_memory_total()
NvSMI-->>GRM: memory list (indexed by host position)
GRM->>GRM: build host_gpu_mem_by_id
GRM->>GRM: "resource_gpu_ids = managed[:num_of_gpus]"
GRM->>GRM: validate memory per selected GPU ID
GRM->>GRM: "resources = {gpu_id: GPUResource(gpu_id, mem)}"
Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/port-2-8-..." | Re-trigger Greptile |
Summary
Port the selected 2.8 fixes back to
mainin 2.8 merge order:Run.get_result()with theclean_upparameter spellingremove_clienttoken cleanup semanticsCUDA_VISIBLE_DEVICESin the GPU resource manager