Commit bce7c71

Clarify simulator and API doc signposts
1 parent 77b5934 commit bce7c71

6 files changed

Lines changed: 67 additions & 36 deletions


docs/programming_guide/execution_api_type/executor.rst

Lines changed: 7 additions & 0 deletions

@@ -10,6 +10,13 @@ An :class:`Executor<nvflare.apis.executor.Executor>` is an FLComponent for FL cl
 wherein the ``execute`` method receives and returns a Shareable object given a task name,
 ``FLContext``, and ``abort_signal``.

+.. note::
+
+   The Executor API is the low-level client task API. Most new ML training
+   examples should start with the :ref:`client_api` and :ref:`job_recipe`, and
+   use Executor directly only when they need a custom task contract or framework
+   integration.
+
 .. literalinclude:: ../../../nvflare/apis/executor.py
    :language: python
    :lines: 24-
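
For orientation, the task contract described in this file can be sketched in plain Python. This is a toy stand-in, not the real ``Executor`` or ``Shareable`` types from ``nvflare.apis``; only the shape of the ``execute`` call (task name, Shareable in/out, ``FLContext``, ``abort_signal``) is taken from the doc text above.

```python
# Toy sketch of the Executor task contract (illustrative stand-ins only; the
# real classes live in nvflare.apis and are not reproduced here).

class Shareable(dict):
    """Stand-in for NVFlare's Shareable: a dict-like message payload."""

class ToyExecutor:
    """Receives a task name plus a Shareable and returns a reply Shareable."""

    def execute(self, task_name, shareable, fl_ctx=None, abort_signal=None):
        if task_name == "train":
            # A real Executor would run local training here.
            return Shareable(weights=[w + 1 for w in shareable["weights"]])
        # Unknown task: reply with an empty Shareable rather than raising.
        return Shareable()

reply = ToyExecutor().execute("train", Shareable(weights=[1, 2, 3]))
print(reply["weights"])  # [2, 3, 4]
```

The dispatch-on-task-name shape is why the note above steers most users to the higher-level Client API: the contract is flexible but entirely manual.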

docs/programming_guide/filters.rst

Lines changed: 8 additions & 0 deletions

@@ -44,6 +44,14 @@ purpose of security. In fact, privacy and homomorphic encryption techniques are
 - SVTPrivacy for differential privacy through sparse vector techniques (:mod:`nvflare.app_common.filters.svt_privacy`)
 - Homomorphic encryption filters to encrypt data before sharing (:mod:`nvflare.app_common.homomorphic_encryption.he_model_encryptor` and :mod:`nvflare.app_common.homomorphic_encryption.he_model_decryptor`)

+Model update compression should use the same filter boundary: compress before
+sending and decompress after receiving so trainer and aggregator code can
+exchange normal model updates. Use :ref:`message_quantization` for built-in
+model quantization. For custom schemes, implement a ``DXOFilter`` for
+``DataKind.WEIGHTS`` or ``DataKind.WEIGHT_DIFF`` and register it as a task
+result filter for client-to-server updates, or as a task data filter for
+server-to-client model messages.
+
 DXO - Data Exchange Object
 ===========================
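
The compress-before-send / decompress-after-receive symmetry the added paragraph describes can be sketched with naive fixed-point quantization. This is plain illustrative Python, not the actual ``DXOFilter`` API or the built-in message quantization, and the function names are hypothetical.

```python
# Illustrative compress-before-send / decompress-after-receive pair using
# naive int8-style fixed-point quantization. Plain Python only: the real
# NVFlare mechanism is a DXOFilter (or built-in message quantization), and
# these function names are hypothetical.

def quantize(weights, scale=127.0):
    """Task-result filter side: map floats in [-1, 1] to ints in [-127, 127]."""
    return [round(w * scale) for w in weights]

def dequantize(qweights, scale=127.0):
    """Receiving side: invert the mapping so aggregation sees normal floats."""
    return [q / scale for q in qweights]

weights = [0.5, -0.25, 1.0]
restored = dequantize(quantize(weights))
print(restored)  # close to the inputs; error is about half a step (~0.004)
```

Because both directions live at the filter boundary, trainer and aggregator code on either side continue to exchange ordinary float model updates.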

docs/user_guide/data_scientist_guide/job_recipe.rst

Lines changed: 8 additions & 3 deletions

@@ -163,7 +163,7 @@ Execution Environments

 A **Job Recipe** defines *what* to run in a federated learning setting, but it also needs to know *where* to run. NVFlare provides several **execution environments** that allow the same recipe to be executed in different contexts:

-* **Simulation (** ``SimEnv`` **)** – For local testing and experimentation on a single machine
+* **Simulation (** ``SimEnv`` **)** – For local testing and experimentation on a single machine or in one batch job
 * **Proof-of-Concept (** ``PocEnv`` **)** – For small-scale, multi-process setups that mimic real-world deployment on a single machine
 * **Production (** ``ProdEnv`` **)** – For full-scale distributed deployments across multiple organizations and sites

@@ -172,17 +172,22 @@ This separation enables users to **prototype once and deploy anywhere** without
 SimEnv – Simulation Environment
 -------------------------------

-Runs all clients and the server as **threads** within a single process. This is lightweight and easy to set up with no networking required. Best suited for:
+Runs the job with the local FL simulator backend: no provisioned project or
+long-running server/client daemons. Simulated clients use local worker
+processes; ``num_threads`` is the historical name for the worker-process
+concurrency. Best suited for:

 * Quick experiments
 * Debugging scripts and models
 * Educational use cases
+* Batch-scheduled experiments where one submitted job should run the complete
+  federated workflow and then exit

 **Arguments:**

 * ``num_clients`` (int): Number of simulated clients
 * ``clients``: A list of client names (length needs to match ``num_clients`` if both are provided)
-* ``num_threads``: Number of threads to use to run simulated clients
+* ``num_threads``: Number of concurrent simulated client worker processes
 * ``gpu_config`` (str): List of GPU device IDs, comma separated
 * ``log_config`` (str): Log config mode (``'concise'``, ``'full'``, ``'verbose'``), filepath, or level
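
The argument rule above (the ``clients`` list length must match ``num_clients`` when both are provided) can be sketched as a small validator. This is a hypothetical illustration, not the real ``SimEnv`` code, and the generated default client names are made up.

```python
# Hypothetical validator for the SimEnv argument rule stated above; NOT the
# real SimEnv implementation. Default client names are invented for the demo.

def resolve_clients(num_clients=None, clients=None):
    """Return the final client-name list, generating names when only a count is given."""
    if clients is not None:
        if num_clients is not None and len(clients) != num_clients:
            raise ValueError("len(clients) must match num_clients")
        return list(clients)
    return [f"site-{i + 1}" for i in range(num_clients or 0)]

print(resolve_clients(clients=["hospital-a", "hospital-b"]))
# ['hospital-a', 'hospital-b']
```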

docs/user_guide/nvflare_cli/fl_simulator.rst

Lines changed: 39 additions & 29 deletions

@@ -11,10 +11,11 @@ The FL Simulator is a lightweight simulator of a running NVFLARE FL deployment,
 and it can allow researchers to test and debug their application without
 provisioning a real project.

-The FL jobs run on a server and
-multiple clients in the same process but in a similar way to how it would run
-in a real deployment so researchers can more quickly build out new components
-and jobs that can then be directly used in a real production deployment.
+The FL jobs run on a local simulator-managed server and simulated clients,
+without provisioning a real project or starting long-running server/client
+daemons. Use the simulator for single-machine development, tests, and
+batch-scheduled experiments where one Python command should start, run, and
+exit. Use POC or production modes for the provisioned deployment model.

 ***********************
 Command Usage
@@ -846,56 +847,65 @@ application run.
 Processes, Clients, and Events
 ******************************

-Specifying number of processes
-==============================
-The simulator ``-t`` option provides the ability to specify how many processes to run the simulator with.
+Specifying Client Worker Processes
+==================================
+The simulator ``-t`` option provides the ability to specify how many simulated
+client worker processes can run concurrently.

 .. note::

-   The ``-t`` and ``--threads`` option for simulator was originally due to clients running in separate threads.
-   However each client now actually runs in a separate process. This distinction will not affect the user experience.
+   The ``-t`` and ``--threads`` option name is historical. Simulated client
+   execution now uses separate worker processes, and the option controls worker
+   process concurrency.

 - N = number of clients (``-n``)
-- T = number of processes (``-t``)
+- T = number of concurrent client worker processes (``-t``)

-When running the simulator with fewer processes than clients (T < N)
-the simulator will need to swap-in/out the clients for the processes, resulting in some of the clients running sequentially as processes are available.
-This also will cause the ClientRunner/learner objects to go through setup and teardown in every round.
-Using T < N is only needed when trying to simulate of large number of clients using a single machine with limited resources.
+When running the simulator with fewer worker processes than clients (T < N),
+the simulator swaps clients in and out as worker processes become available.
+This also causes the ClientRunner/learner objects to go through setup and
+teardown in every round. Using T < N is only needed when simulating many clients
+on a single machine with limited resources.

-In most cases, run the simulator with the same number of processes as clients (T = N). The simulator will run the number of clients in separate processes at the same time. Each
-client will always be running in memory with no swap-in/out, but it will require more resources available.
+In most cases, run the simulator with the same number of worker processes as
+clients (T = N). Each client stays in memory with no swap-in/out, but this
+requires more available resources.

 For the dataset / tensorboard initialization, you could make use of EventType.SWAP_IN and EventType.SWAP_OUT
 in the application.

 SWAP_IN and SWAP_OUT events
 ===========================
-During FLARE simulator execution, the client Apps are executed in turn in the same execution thread. Each executing client App will go
-fetch the task from the controller on the server, execute the task, and then submit the task results to the controller. Once finished submitting
-results, the current client App will yield the executing thread to the next client App to execute.
+During FLARE simulator execution, simulated client Apps fetch tasks from the
+controller, execute the tasks, and submit results back to the controller. When
+T < N, multiple simulated clients share a smaller pool of worker processes and
+may be swapped in and out as worker processes become available.

-If the client App needs to preserve some states for the next "execution turn" to continue, the client executor can make use of the ``SWAP_OUT``
-event fired by the simulator engine to save the current states. When the client App gets the turn to execute again, use the ``SWAP_IN``
-event to recover the previous saved states.
+If the client App needs to preserve state for the next execution turn, the
+client executor can use the ``SWAP_OUT`` event fired by the simulator engine to
+save the current state. When the client App gets another turn to execute, use
+the ``SWAP_IN`` event to recover the previous saved state.
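
The T < N swapping just described can be modeled as a toy scheduler: clients share a smaller pool of worker slots, saving state at SWAP_OUT and restoring it at SWAP_IN. This is a hedged sketch of the behavior, not the simulator engine's actual code.

```python
# Toy model of T < N scheduling: N simulated clients share T worker slots,
# saving state at SWAP_OUT and restoring it at SWAP_IN. Illustrative only,
# not the simulator engine's actual implementation.

def run_rounds(clients, num_workers, num_rounds):
    """Run clients in groups of num_workers per round, persisting state across swaps."""
    saved = {c: 0 for c in clients}                    # state kept across swaps
    for _ in range(num_rounds):
        for i in range(0, len(clients), num_workers):
            for client in clients[i:i + num_workers]:  # swapped-in batch
                state = saved[client]                  # SWAP_IN: restore state
                state += 1                             # the client's training turn
                saved[client] = state                  # SWAP_OUT: save state
    return saved

print(run_rounds(["c1", "c2", "c3"], num_workers=2, num_rounds=3))
# {'c1': 3, 'c2': 3, 'c3': 3}
```

With T = N the inner grouping degenerates to a single batch per round, which is the no-swap case recommended above.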

 Multi-GPU and Separate Client Process with Simulator
 ====================================================
-The simulator runs within the same process, and it will make use of a single GPU if it is detected with ``nvidia-smi``.
-If there are multiple GPUs available and you want to make use of them all for the simulator run, you can use the
-``-gpu`` option for this. The ``-gpu`` option provides the list of GPUs for the simulator to run on. The
-clients list will be distributed among the GPUs.
+The simulator uses separate client worker processes and assigns GPUs to those
+workers. If there are multiple GPUs available and you want to make use of them
+all for the simulator run, you can use the ``-gpu`` option for this. The
+``-gpu`` option provides the list of GPUs for the simulator to run on. The
+clients list will be distributed among the GPU groups.

 For example:

 .. code-block:: shell

    -c c1,c2,c3,c4,c5 -gpu 0,1

-The clients c1, c3, and c5 will run on GPU 0 in one process, and clients c2 and c4 will run on GPU 1 in another process.
+The clients c1, c3, and c5 will be assigned to GPU 0, and clients c2 and c4
+will be assigned to GPU 1.

-The GPU numbers do not have to be unique. If you use ``-gpu 0,0``, this will run 2 separate client processes on GPU 0, assuming this GPU will have
-enough memory to support the applications.
+The GPU numbers do not have to be unique. If you use ``-gpu 0,0``, this will
+create two client worker slots assigned to GPU 0, assuming this GPU has enough
+memory to support the applications.

 .. note::
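
The client-to-GPU distribution in the example above (c1, c3, c5 on GPU 0; c2, c4 on GPU 1) is consistent with simple round-robin assignment over the ``-gpu`` list. A hypothetical sketch of that rule, not the simulator's actual assignment code:

```python
# Hypothetical round-robin assignment consistent with the example above
# (-c c1,c2,c3,c4,c5 -gpu 0,1 => c1, c3, c5 on GPU 0 and c2, c4 on GPU 1).
# Not the simulator's actual assignment code.

def assign_clients_to_gpus(clients, gpus):
    """Distribute clients round-robin over the -gpu list; duplicate ids allowed."""
    slots = [[] for _ in gpus]                   # one worker slot per -gpu entry
    for idx, client in enumerate(clients):
        slots[idx % len(gpus)].append(client)
    return list(zip(gpus, slots))

print(assign_clients_to_gpus(["c1", "c2", "c3", "c4", "c5"], ["0", "1"]))
# [('0', ['c1', 'c3', 'c5']), ('1', ['c2', 'c4'])]
```

Keying slots by position rather than by GPU id is what lets ``-gpu 0,0`` create two distinct worker slots on the same device.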

examples/tutorials/flare_simulator.ipynb

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@
 "source": [
 "## Intro to the FL Simulator\n",
 "\n",
-"The [FL Simulator](https://nvflare.readthedocs.io/en/latest/user_guide/nvflare_cli/fl_simulator.html) runs a local simulation of a running NVFLARE FL deployment. This allows researchers to test and debug an application without provisioning a real, distributed FL project. The FL Simulator runs a server and multiple clients in the same local process, with communication that mimics a real deployment. This allows researchers to more quickly build out new components and jobs that can be directly used in a production deployment.\n",
+"The [FL Simulator](https://nvflare.readthedocs.io/en/latest/user_guide/nvflare_cli/fl_simulator.html) runs the same job concepts with a local simulator-managed server and simulated clients. It is useful for local development and batch jobs that should start, run, and exit without provisioned daemons.\n",
 "\n",
 "### Setup\n",
 "The NVFlare [Getting Started Guide](https://nvflare.readthedocs.io/en/main/getting_started.html) provides instructions for setting up NVFlare on a local system or in a Docker image. We've also cloned the NVFlare GitHub in our top-level working directory."

examples/tutorials/job_recipe.ipynb

Lines changed: 4 additions & 3 deletions

@@ -185,16 +185,17 @@
 "source": [
 "### SimEnv – Simulation Environment\n",
 "\n",
-"Runs all clients and the server as **threads** within a single process. This is lightweight and easy to set up; no networking required. Best suited for:\n",
+"Runs the job with the local FL simulator backend: no provisioned project or long-running server/client daemons. Simulated clients use local worker processes; `num_threads` is the historical name for the worker-process concurrency. Best suited for:\n",
 "\n",
 "* Quick experiments\n",
 "* Debugging scripts and models\n",
 "* Educational use cases\n",
+"* Batch-scheduled experiments where one submitted job should run the complete federated workflow and then exit\n",
 "\n",
 "**Arguments:**\n",
 "* `num_clients` (int): number of simulated clients\n",
 "* `clients`: a list of client names (length needs to match num_clients if both are provided)\n",
-"* `num_threads`: number of threads to use to run simulated clients\n",
+"* `num_threads`: number of concurrent simulated client worker processes\n",
 "* `gpu_config` (str): list of GPU Device Ids, comma separated\n",
 "* `log_config` (str): \"log config mode ('concise', 'full', 'verbose'), filepath, or level\"\n",
 "\n",

@@ -468,4 +469,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
