Many real-world Federated Learning (FL) applications will rely on silos that are not in the same Azure tenant as the orchestrator. This is the case when the silos are owned by different companies. Furthermore, those silos might not even be in Azure at all - they might be on different cloud platforms, or on-premises.
We refer to those types of silos as external silos. The goal of this document is to provide guidance on how to provision an FL setup with external silos.
The Contoso corporation wants to train a model using an FL scheme. The underlying data belong to partner companies and reside on-premises. One such company is Fabrikam.
At Contoso, one person is responsible for provisioning the FL setup and ensuring the security. We'll call that person the FL Admin.
At Fabrikam (one of the silos), one person owns the compute and the data. We'll call that person the Silo Admin.
Both the FL Admin and the Silo Admin have prerequisites to meet. The Prerequisites section explains what is required of whom. After that, the title of every subsection in the Procedure section indicates whether the FL Admin or the Silo Admin should perform the tasks. The Silo Admin is only involved in step C.
In all that follows, when we talk about an Azure subscription we mean the subscription where the Azure ML workspace and the orchestrator will be deployed. This subscription belongs to Contoso.
For common FL terms such as silo or orchestrator, please refer to the glossary.
- At least one Kubernetes (k8s) cluster with version <= 1.24.6, either on-premises or in Azure (in a different tenant from that of the orchestrator). The cluster should have a minimum of 4 vCPU cores and 8 GB of memory.
  - For creating a k8s cluster on-premises, one can use Kind, for instance.
    - If you want your k8s cluster to have access to local data that reside on the same machine, you can create your cluster following this tutorial.
    - Alternatively, if you are not familiar with Kind/Kubernetes, you can use AKS Edge Essentials to create your on-premises k8s cluster. Start by setting up your machine, then create a single machine deployment. After that, if you need your cluster to have access to local data, you need to add a local storage binding.
  - For creating a k8s cluster in Azure (in a different tenant, otherwise we'd be dealing with internal silos), one can use Azure Kubernetes Service (AKS).
- The FL Admin needs the Owner role in the subscription, or at least in the resource group where the Azure ML workspace will be created, because some role assignments will have to be created.
- The Silo Admin must be given the Contributor role (possibly temporarily) in the subscription, or at least the Azure Arc Onboarding built-in role. This is because one step requires access to both the orchestrator subscription and the k8s cluster. It is assumed that the FL Admin shouldn't have direct access to the k8s cluster.
- An identity that can be used to log in to the Azure CLI and connect the k8s cluster to Azure Arc (that would be the Silo Admin's identity).
- Azure CLI with version >= 2.24.0.
- FL Admin will need the following Azure CLI extensions.
  - The ml Azure CLI extension (a.k.a. Azure ML CLI v2).
    - Install via az extension add --name ml. See over there for more details on installation.
  - The k8s-extension Azure CLI extension with version >= 1.2.3.
    - Install via az extension add --name k8s-extension.
- Silo Admin will need the following:
  - The connectedk8s Azure CLI extension with version >= 1.2.0.
    - Install via az extension add --name connectedk8s.
  - A kubeconfig file and context pointing to the k8s cluster.
  - Helm 3 with version < 3.7.0.
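The version constraints listed above can be checked mechanically. The following is a minimal sketch, not part of the accelerator: the names `REQUIREMENTS`, `parse_version` and `unmet_requirements` are hypothetical, and it assumes plain dotted numeric version strings (optionally prefixed with `v`) that you have collected yourself, for instance from `az version` or `helm version`.

```python
import operator

# Comparison symbols used in the prerequisites list.
COMPARATORS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt}

# Bounds copied from the prerequisites; the dictionary keys are illustrative labels.
REQUIREMENTS = {
    "kubernetes": ("<=", "1.24.6"),
    "azure-cli": (">=", "2.24.0"),
    "k8s-extension": (">=", "1.2.3"),
    "connectedk8s": (">=", "1.2.0"),
    "helm": ("<", "3.7.0"),
}


def parse_version(text):
    """Turn '1.24.6' or 'v3.6.3' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def unmet_requirements(installed):
    """Return the tools in `installed` whose version violates its bound.

    Tools missing from `installed` are skipped rather than reported.
    """
    unmet = []
    for tool, (symbol, bound) in REQUIREMENTS.items():
        if tool not in installed:
            continue
        if not COMPARATORS[symbol](parse_version(installed[tool]), parse_version(bound)):
            unmet.append(tool)
    return unmet
```

For example, `unmet_requirements({"helm": "3.7.0"})` reports `helm`, because the prerequisite asks for a version strictly below 3.7.0.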
This is all explained in the first sections of the cookbook but repeated here for convenience.
Create an open Azure ML workspace named <workspace-name>. Owner permissions in <workspace-resource-group> are required, since Role Assignments will need to be created later on. (The <base-name> value will be used when creating associated resources and can be chosen arbitrarily, but note that it should be unique in the subscription.)
az deployment group create --template-file ./mlops/bicep/modules/azureml/open_azureml_workspace.bicep --resource-group <workspace-resource-group> --parameters machineLearningName=<workspace-name> baseName=<base-name>
After that, create the compute and storage corresponding to the orchestrator (the value of the pairBaseName parameter will need to be adjusted if you have already created an orchestrator with this base name in the subscription).
az deployment group create --template-file ./mlops/bicep/modules/fl_pairs/open_compute_storage_pair.bicep --resource-group <workspace-resource-group> --parameters pairBaseName="orch" machineLearningName=<workspace-name>
Detailed instructions for this phase, including steps for verification or for slightly different use cases, can be found there. Here is a summary, which should be all you need.
- Register providers for Azure Arc-enabled Kubernetes. It only needs to be performed once, not for every silo.
  - Enter the following commands.
    az provider register --namespace Microsoft.Kubernetes
    az provider register --namespace Microsoft.KubernetesConfiguration
    az provider register --namespace Microsoft.ExtendedLocation
  - Monitor the registration process. Registration may take up to 10 minutes.
    az provider show -n Microsoft.Kubernetes -o table
    az provider show -n Microsoft.KubernetesConfiguration -o table
    az provider show -n Microsoft.ExtendedLocation -o table
  - Once registered, you should see the RegistrationState state for these namespaces change to Registered.
- Create a <connected-cluster-resource-group> resource group for the connected clusters. Several connected clusters pointing to different k8s clusters can be added to this group - no need to create a separate group for each silo that will be created in the future. The location of this group, <connected-cluster-resource-group-location>, is not critical, but should preferably be the same as that of the orchestrator workspace.
  - Enter the following command.
    az group create --name <connected-cluster-resource-group> --location <connected-cluster-resource-group-location>
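If you would rather script the registration check than read the tables by eye, one option is to request JSON output (`-o json` instead of `-o table`) and inspect the `registrationState` field. The sketch below is illustrative only: `pending_namespaces` is a hypothetical helper, and it assumes you pass it the raw JSON text returned by `az provider show` for each namespace.

```python
import json

# The three namespaces registered in the step above.
REQUIRED_NAMESPACES = (
    "Microsoft.Kubernetes",
    "Microsoft.KubernetesConfiguration",
    "Microsoft.ExtendedLocation",
)


def pending_namespaces(provider_json_by_namespace):
    """Return the required namespaces that are not yet in the 'Registered' state.

    `provider_json_by_namespace` maps a namespace to the JSON text printed by
    `az provider show -n <namespace> -o json`. A missing entry counts as pending.
    """
    pending = []
    for namespace in REQUIRED_NAMESPACES:
        raw = provider_json_by_namespace.get(namespace)
        state = json.loads(raw).get("registrationState") if raw else None
        if state != "Registered":
            pending.append(namespace)
    return pending
```

An empty result means all three providers are registered and you can move on to connecting the cluster.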
The connection is established by creating an Azure Arc-enabled Kubernetes resource named <Azure-Arc-enabled-k8s-resource-name>. This step should be performed by the Silo Admin, since it requires access to the k8s cluster (happening implicitly via the kube config file). It also requires Contributor role (at least) in the resource group <connected-cluster-resource-group> created in the previous step.
- Enter the following command.
  az connectedk8s connect --name <Azure-Arc-enabled-k8s-resource-name> --resource-group <connected-cluster-resource-group>
- If the default kube config and context do not point to the k8s cluster, then the --kube-config and --kube-context parameters can be used to specify the correct values.
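For Silo Admins who script this step, the optional parameters can be appended only when needed. This is a small sketch under the assumption that the command is later run with something like `subprocess.run(args, check=True)`; the function name `connect_command` is hypothetical, and only the flags mentioned above are used.

```python
def connect_command(arc_name, resource_group, kube_config=None, kube_context=None):
    """Build the `az connectedk8s connect` argument list for one k8s cluster.

    `kube_config` and `kube_context` are only added when the defaults do not
    already point to the target cluster.
    """
    args = [
        "az", "connectedk8s", "connect",
        "--name", arc_name,
        "--resource-group", resource_group,
    ]
    if kube_config:
        args += ["--kube-config", kube_config]
    if kube_context:
        args += ["--kube-context", kube_context]
    return args
```

Building the command as a list rather than a single string avoids shell quoting problems when a kubeconfig path contains spaces.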
Detailed instructions for this phase, including current limitations, steps for verification or for slightly different use cases can be found there. Here is a summary, which should be all you need.
To deploy the Azure ML extension on the k8s cluster, enter the following command.
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name <Azure-Arc-enabled-k8s-resource-name> --resource-group <connected-cluster-resource-group> --scope cluster
where <extension-name> can be chosen arbitrarily.
Note: if you're using an AKS cluster (as opposed to a local k8s cluster), you'll need to change the --cluster-type parameter value from connectedClusters to managedClusters.
The deployment can be verified by the following.
az k8s-extension show --name <extension-name> --cluster-type connectedClusters --cluster-name <Azure-Arc-enabled-k8s-resource-name> --resource-group <connected-cluster-resource-group>
In the response, look for "name" and "provisioningState": "Succeeded". Note that this step can take 10-15 minutes and will show "provisioningState": "Pending" at first.
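Since the deployment can stay in the Pending state for several minutes, you may prefer to poll instead of re-running the command by hand. The sketch below is one possible approach, not the accelerator's own tooling: `wait_for_extension` is a hypothetical helper, `fetch_show_output` is any callable you provide that returns the JSON text of the `az k8s-extension show` command above, and treating a `Failed` state as terminal is an assumption.

```python
import json
import time


def wait_for_extension(fetch_show_output, attempts=20, delay_seconds=60, sleep=time.sleep):
    """Poll the extension until its provisioningState is 'Succeeded'.

    Returns True on success, and False if the state becomes 'Failed'
    (assumed terminal) or if all attempts are used up.
    """
    for attempt in range(attempts):
        state = json.loads(fetch_show_output()).get("provisioningState")
        if state == "Succeeded":
            return True
        if state == "Failed":
            return False
        # Wait between polls, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The default of 20 attempts one minute apart comfortably covers the 10-15 minutes mentioned above; the `sleep` parameter exists so the loop can be exercised without real waiting.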
(Detailed instructions for this phase, including steps for verification or for slightly different use cases can be found there.)
1. Create a user-assigned identity (UAI) that will later be assigned to the Azure ML attached compute:
   az identity create --name uai-<azureml-compute-name> --resource-group <workspace-resource-group>
2. Attach the Arc cluster to the orchestrator workspace, or in other words create an Azure ML attached compute pointing to the Arc cluster:
   az ml compute attach --resource-group <workspace-resource-group> --workspace-name <workspace-name> --type Kubernetes --name <azureml-compute-name> --resource-id "/subscriptions/<subscription-id>/resourceGroups/<connected-cluster-resource-group>/providers/Microsoft.Kubernetes/connectedClusters/<Azure-Arc-enabled-k8s-resource-name>" --identity-type UserAssigned --user-assigned-identities "subscriptions/<subscription-id>/resourceGroups/<workspace-resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/uai-<azureml-compute-name>" --no-wait
   where:
   - <workspace-resource-group> is the resource group of the orchestrator workspace that you used in step A;
   - <workspace-name> is the name of the orchestrator workspace that was created during step A;
   - <azureml-compute-name> is the name that you choose for the silo compute in the orchestrator workspace (arbitrary);
   - <subscription-id> is the Id of the orchestrator subscription;
   - <connected-cluster-resource-group> and <Azure-Arc-enabled-k8s-resource-name> have been defined in steps B and C;
   - uai-<azureml-compute-name> should be the name of the user-assigned identity you just created.
3. Create a storage account for this external silo.
   az deployment group create --template-file ./mlops/bicep/modules/storages/new_blob_storage_datastore.bicep --resource-group <workspace-resource-group> --parameters machineLearningName=<workspace-name> storageName=<storage-account-name> storageRegion=<workspace-region> publicNetworkAccess="Enabled" tags={}
   where:
   - <storage-account-name> is the name that you choose for the silo storage account (arbitrary, but will be automatically formatted to remove forbidden characters);
   - <workspace-region> is the region of the orchestrator workspace (we recommend creating this storage account in the same region as the orchestrator workspace).
4. Grant the silo's compute read/write access to the orchestrator and silo storage accounts.
   4.1. In the Azure portal, navigate to your workspace resource group.
   4.2. Look for a resource of type Managed Identity named like uai-<azureml-compute-name>. It should have been created by the instructions above.
   4.3. Open this identity and click on Azure role assignments.
   4.4. Click on Add role assignment and add the 3 roles below, scoped to the silo storage account, which should be named <storage-account-name> (or something slightly different if you used any forbidden characters; in any case, you should be able to easily locate it from the Azure portal).
        - Storage Blob Data Contributor
        - Reader and Data Access
        - Storage Account Key Operator Service Role
   4.5. Repeat the same steps for the storage account of your orchestrator (this storage account should be named storch if you kept the default value for the pairBaseName parameter suggested in step A, otherwise it will be the value you chose for pairBaseName, appended to st).
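The role assignments above can presumably also be created from the command line instead of the portal. The following sketch only builds the `az role assignment create` commands for the three roles; it has not been validated against this setup. The helper names are hypothetical, and `principal_id` is assumed to be the principal (object) ID of the uai-<azureml-compute-name> identity, which you would have to look up yourself (for instance with `az identity show`).

```python
# The three roles listed in step 4.4.
SILO_STORAGE_ROLES = (
    "Storage Blob Data Contributor",
    "Reader and Data Access",
    "Storage Account Key Operator Service Role",
)


def storage_account_id(subscription_id, resource_group, account_name):
    """Return the ARM resource ID of a storage account, used as the role scope."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_commands(principal_id, scope):
    """Build one `az role assignment create` argument list per required role."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", principal_id,
         "--role", role,
         "--scope", scope]
        for role in SILO_STORAGE_ROLES
    ]
```

You would call `role_assignment_commands` twice, once with the silo storage account as scope and once with the orchestrator storage account, mirroring steps 4.4 and 4.5.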
To validate that everything is wired properly, we are going to run a degenerate (single-silo) HELLOWORLD-type FL job.
First, open the config.yaml file located in the examples/pipelines/fl_cross_silo_literal directory, and do the following.
- In the aml section, adjust the values of subscription_id, resource_group_name, and workspace_name to the proper values corresponding to your workspace.
- In the orchestrator section, adjust the values of compute and datastore to those corresponding to the orchestrator you created in step A.
  - The compute value should be cpu-cluster-orch if you kept the default value for the pairBaseName, otherwise it will be the value you chose for pairBaseName, appended to cpu-cluster-.
  - The datastore value should be datastore_orch if you kept the default value for the pairBaseName, otherwise it will be the value you chose for pairBaseName, appended to datastore_ (if you had '-' characters in pairBaseName, they will be replaced by '_').
- In the silo section:
  - Just keep one silo, by deleting or commenting out all entries but one (3 entries to start with, each with a compute, datastore, training_data, and testing_data parameter).
  - Adjust the remaining values of compute and datastore to those corresponding to the silo you created in step E. The value for compute will be <azureml-compute-name>, and the value for datastore will be datastore_<storage-account-name> (with '-' characters replaced by '_' if applicable).
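The naming rules described above are easy to get wrong by hand, so here is a minimal sketch that derives the expected compute and datastore values. The function names are hypothetical, and the sketch only encodes the rules stated in this document; if the storage account name was reformatted to remove forbidden characters, use the actual name shown in the Azure portal.

```python
def orchestrator_names(pair_base_name="orch"):
    """Expected orchestrator values for config.yaml, given the pairBaseName used in step A."""
    return {
        "compute": f"cpu-cluster-{pair_base_name}",
        # '-' characters in pairBaseName are replaced by '_' in the datastore name.
        "datastore": "datastore_" + pair_base_name.replace("-", "_"),
    }


def silo_names(azureml_compute_name, storage_account_name):
    """Expected silo values for config.yaml, given the names chosen in step E."""
    return {
        "compute": azureml_compute_name,
        "datastore": "datastore_" + storage_account_name.replace("-", "_"),
    }
```

With the default base name, `orchestrator_names()` gives cpu-cluster-orch and datastore_orch, matching the values quoted above.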
Submit the job by running the following command.
python ./examples/pipelines/fl_cross_silo_literal/submit.py --example HELLOWORLD
Note: You can use the --offline flag when running the job to just build and validate the pipeline without submitting it.