Commit d8b1df1

Merge branch 'develop' into nshekhaw/fix_mu_pytorch_fedprox
2 parents 5c07782 + 333049f commit d8b1df1

12 files changed

Lines changed: 473 additions & 58 deletions

File tree

docs/about/features.rst

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -109,4 +109,17 @@ In Federated Learning (FL), Secure Aggregation (SecAgg) is a technique that allo
109109
:hidden:
110110

111111
features_index/secure_aggregation
112-
112+
113+
.. _federated_analytics:
114+
115+
---------------------
116+
Federated Analytics
117+
---------------------
118+
119+
Federated Analytics enables the collection and analysis of data insights across decentralized nodes without compromising data privacy. This feature allows organizations to perform analytics on distributed data while ensuring compliance with privacy regulations. For more information, see :doc:`features_index/fed_analytics`.
120+
121+
.. toctree::
122+
:hidden:
123+
124+
features_index/fed_analytics
125+
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
1+
.. # Copyright (C) 2020-2024 Intel Corporation
2+
.. # SPDX-License-Identifier: Apache-2.0
3+
4+
Federated Analytics
5+
=======================================
6+
7+
Introduction to Federated Analytics
8+
-------------------------------------
9+
10+
Federated Analytics is a privacy-preserving approach to compute statistics or perform data analysis on distributed datasets without aggregating raw data into a centralized location. This method ensures data security while enabling insights to be drawn from decentralized data sources. For instance, one can compute the mean, frequency distributions, or other statistical measures across datasets located on multiple devices. Federated Analytics is particularly valuable in scenarios where data sharing is restricted due to privacy concerns or regulatory constraints.
11+
12+
OpenFL's Support for Federated Analytics
13+
------------------------------------------
14+
15+
OpenFL, a flexible framework for Federated Learning, extends its capabilities to support Federated Analytics. By leveraging the federation plan and task runner API, OpenFL enables users to perform analytics tasks across collaborators. These tasks are defined in the ``plan.yaml`` file and distributed to collaborators for execution. The results are then aggregated by the aggregator to provide global insights.
16+
17+
18+
Example Workspace: Histogram Calculation using sklearn IRIS Dataset
19+
------------------------------------------------------------------------------
20+
21+
The Federated Analytics workspace for histogram calculation demonstrates how to compute frequency distributions of specific features across distributed datasets. This workspace leverages the OpenFL framework to ensure privacy-preserving analytics while providing global insights into the data.
22+
23+
**Task Configuration:**
24+
25+
The analytics tasks are defined in the ``plan.yaml`` file. For example:
26+
27+
.. code-block:: yaml
   :emphasize-lines: 6,41,43,45

   aggregator:
     defaults: plan/defaults/aggregator.yaml
     template: openfl.component.Aggregator
     settings:
       last_state_path: save/result.json
       rounds_to_train: 1 # Number of training rounds (set to 1 for Federated Analytics).

   collaborator:
     defaults: plan/defaults/collaborator.yaml
     template: openfl.component.Collaborator
     settings:
       use_delta_updates: false
       opt_treatment: RESET

   data_loader:
     defaults: plan/defaults/data_loader.yaml
     template: src.dataloader.IRISInMemory
     settings:
       collaborator_count: 2
       data_group_name: iris
       batch_size: 150

   task_runner:
     defaults: plan/defaults/task_runner.yaml
     template: src.taskrunner.IrisHistogram

   network:
     defaults: plan/defaults/network.yaml

   assigner:
     template: openfl.component.RandomGroupedAssigner
     settings:
       task_groups:
         - name: analytics
           percentage: 1.0
           tasks:
             - analytics

   tasks:
     analytics:
       function: analytics
       aggregation_type:
         template: src.aggregatehistogram.AggregateHistogram
       kwargs:
         columns: ['sepal length (cm)', 'sepal width (cm)']
76+
**Note:** The ``function`` and ``aggregation_type.template`` fields in the configuration can be replaced with custom implementations to suit specific use cases. This flexibility allows users to define their own analytics logic and aggregation methods tailored to their requirements.
77+
78+
**Data Distribution**: The dataset is distributed across collaborators, with each collaborator holding a local shard of the data.
79+
80+
**Local Computation**: Each collaborator computes the histogram for the specified feature(s) on its local data shard. This ensures that raw data never leaves the collaborator's environment.
81+
82+
**Aggregation**: The aggregator collects the histograms from all collaborators and combines them to compute the global histogram. The aggregated results are saved in ``save/result.json``. This file provides a global view of the frequency distribution for the selected feature, computed in a privacy-preserving manner.
83+
84+
85+
By following this structured approach, the Federated Analytics workspace enables secure and efficient computation of histograms across distributed datasets.
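
The local-computation and aggregation steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code of the ``IrisHistogram`` task runner or the ``AggregateHistogram`` aggregation function: the helper names, the bin edges, and the sample values are all hypothetical. The one requirement it illustrates is that every collaborator uses the same bin edges, so that per-bin counts can simply be added.

```python
from bisect import bisect_right


def local_histogram(values, edges):
    """Count the values that fall into each bin defined by `edges`.

    Bins are half-open [edges[i], edges[i + 1]) except the last one,
    which also includes its right edge. Values outside the range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for value in values:
        if edges[0] <= value <= edges[-1]:
            index = min(bisect_right(edges, value) - 1, len(counts) - 1)
            counts[index] += 1
    return counts


def aggregate_histograms(histograms):
    """Sum per-bin counts across collaborators (hypothetical aggregator step)."""
    return [sum(bin_counts) for bin_counts in zip(*histograms)]


# Hypothetical shared bin edges and two local shards of one feature.
edges = [0.0, 2.0, 4.0, 6.0, 8.0]
shard_1 = [5.1, 4.9, 6.3]
shard_2 = [7.0, 5.8, 1.5]

# Each collaborator shares only its counts, never the raw values.
local_counts = [local_histogram(shard_1, edges), local_histogram(shard_2, edges)]
global_counts = aggregate_histograms(local_counts)
print(global_counts)
```

Only the per-bin counts cross the collaborator boundary in this sketch, which is the property the workspace relies on.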
86+
87+
Detailed Instructions
88+
---------------------
89+
90+
Workspace Setup and Federation Run
91+
92+
Create a workspace for analytics (for example, using the federated_analytics/histogram template):
93+
94+
.. code-block:: bash

   fx workspace create --prefix ./analytics_workspace --template federated_analytics/histogram
   cd analytics_workspace
   fx workspace certify
   fx aggregator generate-cert-request
   fx aggregator certify --silent
101+
102+
Initialize the plan normally:
103+
104+
.. code-block:: bash

   fx plan initialize
107+
108+
Run the federation using your collaborators. For example:
109+
110+
.. code-block:: bash

   fx collaborator create -n collaborator1 -d 1
   fx collaborator generate-cert-request -n collaborator1
   fx collaborator certify -n collaborator1 --silent

   fx collaborator create -n collaborator2 -d 2
   fx collaborator generate-cert-request -n collaborator2
   fx collaborator certify -n collaborator2 --silent

   fx aggregator start > ~/fx_aggregator.log 2>&1 &
   fx collaborator start -n collaborator1 > ~/collab1.log 2>&1 &
   fx collaborator start -n collaborator2 > ~/collab2.log 2>&1 &
123+
124+
Once the federation run is complete, the results will be saved.
125+
126+
The result file ``save/result.json`` contains the aggregated histogram data. For example:
127+
128+
.. code-block:: json

   {
     "sepal length (cm) histogram": [0.0, 0.0, 9.0, 50.0, 56.0, 28.0, 7.0, 0.0, 0.0],
     "sepal length (cm) bins": [
       4.0, 5.777777671813965, 7.55555534362793, 9.333333015441895,
       11.11111068725586, 12.88888931274414, 14.666666984558105,
       16.44444465637207, 18.22222137451172, 20.0
     ],
     "sepal width (cm) histogram": [47.0, 91.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     "sepal width (cm) bins": [
       4.0, 5.777777671813965, 7.55555534362793, 9.333333015441895,
       11.11111068725586, 12.88888931274414, 14.666666984558105,
       16.44444465637207, 18.22222137451172, 20.0
     ]
   }
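
To read such a result, each count can be paired with the two bin edges that enclose it: a list of N counts comes with N + 1 edges. The sketch below is a hypothetical helper, not part of the workspace. It embeds the ``sepal length (cm)`` values from the example above as a string so that it is self-contained; in practice the same dictionary would come from loading ``save/result.json``.

```python
import json

# Values copied from the example result above; a real run would instead use
# json.load() on the save/result.json file.
result_text = """
{
  "sepal length (cm) histogram": [0.0, 0.0, 9.0, 50.0, 56.0, 28.0, 7.0, 0.0, 0.0],
  "sepal length (cm) bins": [
    4.0, 5.777777671813965, 7.55555534362793, 9.333333015441895,
    11.11111068725586, 12.88888931274414, 14.666666984558105,
    16.44444465637207, 18.22222137451172, 20.0
  ]
}
"""
result = json.loads(result_text)


def bin_table(result, column):
    """Return (left_edge, right_edge, count) rows for one column (hypothetical helper)."""
    counts = result[f"{column} histogram"]
    edges = result[f"{column} bins"]
    return [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]


rows = bin_table(result, "sepal length (cm)")
for left, right, count in rows:
    print(f"[{left:6.2f}, {right:6.2f})  {count:5.0f}")

# The counts add up to the number of samples seen across all collaborators.
total = sum(count for _, _, count in rows)
print("total samples:", total)
```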
178+
179+
180+
Conclusion
181+
----------
182+
Federated Analytics in OpenFL enables privacy-preserving data analysis on distributed datasets. By leveraging the task runner API and predefined analytics tasks, users can seamlessly compute global statistics without compromising data privacy. This feature simplifies the workflow for distributed data analysis and ensures compliance with privacy regulations.

docs/developer_guide/utilities.rst

Lines changed: 6 additions & 1 deletion
@@ -6,6 +6,9 @@ The following are utilities available in Open Federated Learning (OpenFL).
66
:doc:`utilities/pki`
77
Use the Public Key Infrastructure (PKI) solution workflows to certify the nodes in your federation.
88

9+
:doc:`utilities/verifiable_datasets`
10+
Build and verify datasets composed of multiple data sources.
11+
912
:doc:`utilities/splitters_data`
1013
Split your data to run your federation from a single dataset.
1114

@@ -17,5 +20,7 @@ The following are utilities available in Open Federated Learning (OpenFL).
1720
:hidden:
1821

1922
utilities/pki
23+
utilities/verifiable_datasets
2024
utilities/splitters_data
21-
utilities/timeouts
25+
utilities/timeouts
26+
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
.. # Copyright (C) 2025 Intel Corporation
2+
.. # SPDX-License-Identifier: Apache-2.0
3+
4+
*************************************
5+
Verifiable Datasets and Data Sources
6+
*************************************
7+
8+
.. _verifiable_datasets_overview:
9+
10+
To accommodate the proliferation of data sources and the need for trusted datasets, OpenFL provides a hierarchy of utility classes to build and verify datasets.
11+
This includes an extensible class hierarchy that enables the creation of datasets from various data sources, such as the local file system, object storage, and others.
12+
13+
The central abstraction is the :code:`VerifiableDatasetInfo` class that encapsulates the dataset's metadata and provides a method for verifying the integrity of the dataset.
14+
A dataset can be built from multiple data sources (not necessarily of the same type):
15+
16+
.. mermaid:: ../../mermaid/verifiable_dataset_info.mmd
17+
:caption: Verifiable Dataset with Multiple Data Sources
18+
:align: center
19+
20+
The :code:`VerifiableDatasetInfo` class can then be used to create higher-order dataset classes that enable iterating through multiple data sources, while verifying integrity if required.
21+
The :code:`root_hash` is used as a reference for integrity when loading items from the data sources in the :code:`VerifiableDatasetInfo` object.
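
The idea of checking data against a reference root hash can be illustrated with a short, self-contained sketch. This is not OpenFL's implementation: how :code:`VerifiableDatasetInfo` actually derives its :code:`root_hash` is not described here, so the scheme below (SHA-256 per file, then SHA-256 over the sorted path and hash pairs) and all function names are assumptions chosen for illustration.

```python
import hashlib


def hash_blob(blob: bytes) -> str:
    """Hash the raw bytes of one file (hypothetical helper)."""
    return hashlib.sha256(blob).hexdigest()


def compute_root_hash(files: dict) -> str:
    """Combine per-file hashes into one root hash.

    Assumed scheme: iterate over paths in sorted order so that the result
    does not depend on enumeration order, and feed each path together with
    its file hash into a single SHA-256 digest.
    """
    root = hashlib.sha256()
    for path in sorted(files):
        root.update(path.encode("utf-8"))
        root.update(hash_blob(files[path]).encode("ascii"))
    return root.hexdigest()


def verify_files(files: dict, expected_root_hash: str) -> bool:
    """Return True only if the files reproduce the reference root hash."""
    return compute_root_hash(files) == expected_root_hash


# Hypothetical in-memory stand-ins for blobs read from a data source.
files = {"images/a.png": b"first blob", "images/b.png": b"second blob"}
reference = compute_root_hash(files)

tampered = {**files, "images/b.png": b"modified blob"}
print(verify_files(files, reference))     # unchanged data matches
print(verify_files(tampered, reference))  # any modified file changes the root
```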
22+
23+
OpenFL comes with a toolbox of dataset layout classes per ML framework. For PyTorch's :code:`torch.utils.data.Dataset`, OpenFL currently provides:
24+
25+
- :code:`FolderDataset` - represents an iterable folder-layout dataset from a single data source, by implementing the :code:`__getitem__` method.
26+
- :code:`ImageFolder` - a specialization of the :code:`FolderDataset` that is able to load binary images from a folder-like structure.
27+
- :code:`VerifiableMapStyleDataset` - a base class for map-style datasets that can be built from multiple data sources (as specified by a :code:`VerifiableDatasetInfo` object), including integrity checks.
28+
- :code:`VerifiableImageFolder` - a specialization of the :code:`VerifiableMapStyleDataset` encapsulating a collection of :code:`ImageFolder` datasets
29+
30+
Note that all of these classes (directly or indirectly) extend :code:`torch.utils.data.Dataset`, and are therefore compatible with PyTorch utilities for loading and pre-processing datasets, such as :code:`torch.utils.data.DataLoader`.
31+
A similar class hierarchy can be created for other ML frameworks that offer dataset utilities, such as TensorFlow.
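
A map-style dataset is defined by two methods, :code:`__len__` and :code:`__getitem__`. The sketch below shows one way a single map-style view over several per-source datasets could be indexed. It is a hypothetical illustration, not the code of :code:`VerifiableMapStyleDataset`: the class name is invented, it skips integrity checks, and it deliberately avoids importing PyTorch so that it runs on its own.

```python
from bisect import bisect_right
from itertools import accumulate


class MultiSourceDataset:
    """Expose several per-source item lists as one indexable dataset (hypothetical)."""

    def __init__(self, sources):
        self._sources = [list(source) for source in sources]
        # Running totals of source sizes, e.g. sizes [2, 1, 2] -> [2, 3, 5].
        self._offsets = list(accumulate(len(source) for source in self._sources))

    def __len__(self):
        return self._offsets[-1] if self._offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first source whose running total exceeds the index.
        source_index = bisect_right(self._offsets, index)
        start = self._offsets[source_index - 1] if source_index else 0
        return self._sources[source_index][index - start]


# Three hypothetical data sources holding file names.
dataset = MultiSourceDataset([["a.png", "b.png"], ["c.png"], ["d.png", "e.png"]])
print(len(dataset))
print([dataset[i] for i in range(len(dataset))])
```

Because the class only relies on :code:`__len__` and :code:`__getitem__`, the same indexing logic would carry over to a subclass of :code:`torch.utils.data.Dataset`.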
32+
33+
.. mermaid:: ../../mermaid/verifiable_image_folder.mmd
34+
:caption: Dataset hierarchy
35+
:align: center
36+
37+
A practical example for the :code:`VerifiableImageFolder` backed by :code:`S3DataSource` is provided in the `s3_histology <https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/torch/histology_s3>`_ workspace template.
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
%% Copyright 2025 Intel Corporation
2+
%% SPDX-License-Identifier: Apache-2.0
3+
4+
classDiagram
5+
class VerifiableDatasetInfo {
6+
+label: str
7+
+data_sources: DataSource[]
8+
+metadata: dict[str, str]
9+
+root_hash: HASH
10+
11+
+verify_dataset(root_hash: HASH)
12+
+verify_single_file(file_path: str, file_hash: HASH)
13+
+to_json() str
14+
+from_json(json_str: str) VerifiableDatasetInfo
15+
}
16+
17+
class DataSource {
18+
<<abstract>>
19+
+name: str
20+
+type: DataSourceType
21+
+compute_file_hash(path: str) str
22+
+enumerate_files() Generator~str~
23+
+read_blob(path: str) bytes
24+
+from_dict(ds_dict: dict) DataSource
25+
+is_valid_hash_function(func) bool
26+
+to_dict() dict
27+
}
28+
29+
class LocalDataSource {
30+
+base_path: str
31+
...
32+
}
33+
34+
class S3DataSource {
35+
+uri: str
36+
+endpoint: str
37+
...
38+
}
39+
40+
class AzureBlobDataSource {
41+
+name: str
42+
+container_string: str
43+
+folder_prefix: str
44+
...
45+
}
46+
47+
VerifiableDatasetInfo "1" o-- "*" DataSource
48+
DataSource <|-- LocalDataSource
49+
DataSource <|-- S3DataSource
50+
DataSource <|-- AzureBlobDataSource
51+
52+
style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px
53+
style DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
54+
style LocalDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
55+
style S3DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
56+
style AzureBlobDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
%% Copyright 2025 Intel Corporation
2+
%% SPDX-License-Identifier: Apache-2.0
3+
4+
classDiagram
5+
class torch_utils_data_Dataset {
6+
+__len__() int
7+
+__getitem__(index: int) Any
8+
}
9+
10+
class VerifiableDatasetInfo {
11+
+verify_dataset(root_hash: HASH)
12+
+verify_single_file(file_path: str, file_hash: HASH)
13+
+from_json(json_str: str) VerifiableDatasetInfo
14+
}
15+
16+
class VerifiableMapStyleDataset {
17+
<<abstract>>
18+
+__len__() int
19+
+__getitem__(index: int) Any
20+
+create_datasets() void*
21+
}
22+
23+
class VerifiableImageFolder {
24+
+__len__() int
25+
+__getitem__(index: int) Any
26+
+create_datasets() void
27+
}
28+
29+
class FolderDataset {
30+
<<abstract>>
31+
+__len__() int
32+
+__getitem__(index: int) Any
33+
+load_file(file_path: str) void*
34+
}
35+
36+
class ImageFolder {
37+
+__len__() int
38+
+__getitem__(index: int) Any
39+
+load_file(file_path: str) void
40+
}
41+
42+
torch_utils_data_Dataset <|.. VerifiableMapStyleDataset
43+
torch_utils_data_Dataset <|.. FolderDataset
44+
VerifiableMapStyleDataset o-- VerifiableDatasetInfo
45+
VerifiableMapStyleDataset <|-- VerifiableImageFolder
46+
VerifiableMapStyleDataset o-- FolderDataset
47+
FolderDataset <|-- ImageFolder
48+
49+
style torch_utils_data_Dataset fill:#D3D3D3,stroke:#000,stroke-width:1px
50+
style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px
