79 changes: 79 additions & 0 deletions content/patterns/multicloud-federated-learning/_index.adoc
@@ -0,0 +1,79 @@
---
title: Multicloud Federated Learning
date: 2025-05-23
summary: This pattern helps you develop and deploy federated learning applications on an open hybrid cloud via Open Cluster Management.
rh_products:
- Red Hat Advanced Cluster Management
- Red Hat OpenShift Container Platform
industries:
- General
aliases: /multicloud-federated-learning/
links:
install: getting-started
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
bugs: https://github.com/open-cluster-management-io/addon-contrib/issues
ci: multicloudfederatedlearning
Reviewer comment (Collaborator): I'm not sure if we have CI set up for this. Kindly check this with the Validated Patterns engineering team. cc: @day0hero

Reviewer comment (Contributor): @mlabonte-rh @yukinchan - what do we need to do to get this into 'testing'?

---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

== Multicloud Federated Learning

=== Background

As machine learning (ML) evolves, protecting data privacy becomes increasingly important. Since ML depends on large volumes of data, it is essential to secure that data without disrupting the learning process.

Federated learning addresses this by allowing multiple clusters or organizations to collaboratively train models without sharing sensitive data. Computation happens where the data lives, ensuring privacy, regulatory compliance, and efficiency.

By integrating federated learning with {rh-rhacm-first}, this pattern provides an automated and scalable solution for deploying FL workloads across hybrid and multicluster environments.

==== Technologies
* Open Cluster Management (OCM)
- ManagedCluster
- ManifestWork
- Placement
...
* Federated Learning frameworks
- Flower
- OpenFL
...
* Grafana
* OpenTelemetry

=== Why Use Advanced Cluster Management for Federated Learning?
Suggested change (Collaborator): === Why use Advanced Cluster Management for Federated Learning?
Suggested change (Collaborator): === Why Use Red Hat Advanced Cluster Management for federated learning?


**Advanced Cluster Management (ACM)** simplifies and automates the deployment and orchestration of Federated Learning (FL) workloads across clusters:
Suggested change (Collaborator): **{rh-rhacm-first}** simplifies and automates the deployment and orchestration of federated learning workloads across clusters:

Reviewer comment (Collaborator): I think the bullet got mistakenly deleted from my earlier comment.
Suggested change: - **Advanced Cluster Management (ACM)** simplifies and automates the deployment and orchestration of Federated Learning (FL) workloads across clusters:


- **Automatic Deployment & Simplified Operations**: ACM provides a unified and automated approach to running FL workflows across different runtimes (e.g., Flower, OpenFL). Its controller manages the entire FL lifecycle—including setup, coordination, status tracking, and teardown—across multiple clusters in a multicloud environment. This eliminates repetitive manual configurations, significantly reduces operational overhead, and ensures consistent, scalable FL deployments.
Reviewer comment (Collaborator): Use sentence style capitalization.
Suggested change: - **Automatic deployment and simplified operations**: ACM provides a unified and automated approach to running FL workflows across different runtimes (e.g., Flower, OpenFL). Its controller manages the entire FL lifecycle—including setup, coordination, status tracking, and teardown—across multiple clusters in a multicloud environment. This eliminates repetitive manual configurations, significantly reduces operational overhead, and ensures consistent, scalable FL deployments.


- **Dynamic Client Selection**: ACM's scheduling capabilities allow FL clients to be selected not only based on where the data resides, but also dynamically based on cluster labels, resource availability, and governance criteria. This enables a more adaptive and intelligent approach to client participation.
Suggested change (Collaborator): - **Dynamic client selection**: {rh-rhacm} scheduling capabilities allow federated learning clients to be selected not only based on where the data resides, but also dynamically based on cluster labels, resource availability, and governance criteria. This enables a more adaptive and intelligent approach to client participation.


Together, these capabilities support a **flexible FL client model**, where clusters can join or exit the training process dynamically, without requiring static or manual configuration.
Suggested change (Collaborator): Together, these capabilities support a flexible federated learning client model, where clusters can join or exit the training process dynamically, without requiring static or manual configuration.
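To make the dynamic client selection point above more concrete, the following is an illustrative sketch (not part of this pattern's manifests) of an OCM `Placement` that selects up to two clusters carrying a hypothetical `fl-role=client` label and ranks them by allocatable CPU; it assumes the namespace is already bound to a ManagedClusterSet:

[source,yaml]
----
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: fl-clients            # hypothetical name
  namespace: open-cluster-management
spec:
  numberOfClusters: 2
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            fl-role: client   # hypothetical label
  prioritizerPolicy:
    mode: Additive
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableCPU
        weight: 1
----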


=== Benefits

- 🔒 Privacy-preserving training without moving sensitive data
Suggested change (Collaborator): - Privacy-preserving training without moving sensitive data

Reviewer comment (Collaborator): Please address all the comments about removing these small icons/images.


- ⚙️ Automated dynamic FL client orchestration across distributed clusters
Suggested change (Collaborator): - Automated dynamic FL client orchestration across distributed clusters


- 🧩 Adaptable to different FL frameworks, such as OpenFL and Flower
Suggested change (Collaborator): - Adaptable to different federated learning frameworks, such as OpenFL and Flower


- 🌍 Scalability across hybrid and edge clusters
Suggested change (Collaborator): - Scalability across hybrid and edge clusters


- 📉 Lower infrastructure and operational costs
Suggested change (Collaborator): - Lower infrastructure and operational costs


This approach empowers organizations to build smarter, privacy-first AI solutions with less complexity and more flexibility.
Reviewer comment (Collaborator): Global comment: Avoid anthropomorphism. Do not assign human characteristics to inanimate objects, which are incapable of human behaviors and emotions. As much as possible, focus technical information on users and their actions, not on a product and its actions. Some examples are enables, allows, lets, expects, permits.
Suggested change: Organizations can use this approach to build smarter, privacy-first AI solutions with less complexity and more flexibility.


=== Architecture

image::/images/multicloud-federated-learning/multicluster-federated-learning-workflow.png[multicloud-federated-learning-workflow]

- In this architecture, a central **Hub Cluster** acts as the aggregator, running the Federated Learning (FL) controller and scheduling workloads using ACM APIs like `Placement` and `ManifestWork`.
Suggested change (Collaborator): - In this architecture, a central hub cluster acts as the aggregator, running the federated learning controller and scheduling workloads using {rh-rhacm} APIs like `Placement` and `ManifestWork`.


- Multiple **Managed Clusters**, potentially across different clouds, serve as FL clients—each holding private data. These clusters pull the global model from the hub, train it locally, and push model updates back.
Reviewer comment (Collaborator): Use emphasis sparingly so that it does not lose its impact.
Suggested change: - Multiple managed clusters, potentially across different clouds, serve as FL clients—each holding private data. These clusters pull the global model from the hub, train it locally, and push model updates back.


- The controller manages this lifecycle using custom resources and supports runtimes like Flower and OpenFL. This setup enables scalable, multi-cloud model training with **data privacy preserved by design**, requiring no changes to existing FL training code.
Suggested change (Collaborator): - The controller manages this lifecycle using custom resources and supports runtimes like Flower and OpenFL. This setup enables scalable, multi-cloud model training with data privacy preserved by design, requiring no changes to existing federated learning training code.
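The following is a minimal, illustrative sketch of the kind of `ManifestWork` the hub could use to deliver a client workload to a managed cluster; the names, namespace, and the choice of a Job are hypothetical, since the actual manifests are generated by the federated learning controller:

[source,yaml]
----
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: federated-learning-client   # hypothetical name
  namespace: cluster1               # ManifestWork is created in the managed cluster's namespace on the hub
spec:
  workload:
    manifests:
      - apiVersion: batch/v1
        kind: Job
        metadata:
          name: fl-client
          namespace: open-cluster-management
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: client
                  image: quay.io/myan/flower-app-torch:latest   # sample image from the getting-started guide
----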


233 changes: 233 additions & 0 deletions content/patterns/multicloud-federated-learning/getting-started.adoc
@@ -0,0 +1,233 @@
---
title: Getting Started
weight: 10
aliases: /multicloud-federated-learning/getting-started/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

== Deploying the Multicloud Federated Learning Pattern
Suggested change (Collaborator): == Deploying the Multicloud Federated Learning pattern


=== Prerequisites

==== Ensure the following tools are installed:
Suggested change (Collaborator): * Install the following tools:


- link:https://kubernetes.io/docs/reference/kubectl/[`kubectl`]
- link:https://kubectl.docs.kubernetes.io/installation/kustomize/[`kustomize`]
- link:https://kind.sigs.k8s.io/[`kind`] (recommended version > v0.9.0)
- link:https://www.gnu.org/software/make/[`make`] (for build automation)

==== Optional (for container image building):
Suggested change (Collaborator): *(Optional) Install the following for building a container image:


- link:https://podman.io/[Podman] or link:https://www.docker.com/[Docker]
- link:https://go.dev/doc/install[Go] (version 1.19 or higher)

===== Advanced Cluster Management Environment
Suggested change (Collaborator): * {rh-rhacm-first}


Prepare at least three clusters: one hub cluster and two managed clusters.
Suggested change (Collaborator): * Prepare at least three clusters: one hub cluster and two managed clusters.
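One way to stand up such an environment locally, for testing only, is with kind and the OCM `clusteradm` CLI. The commands below are a sketch under the assumption that `clusteradm` is installed; the token and API server URL come from the output of `clusteradm init`:

[source,terminal]
----
$ kind create cluster --name hub
$ kind create cluster --name cluster1
$ kind create cluster --name cluster2

$ clusteradm init --wait --context kind-hub

$ clusteradm join --hub-token <token> --hub-apiserver <hub-api-url> \
    --cluster-name cluster1 --context kind-cluster1
$ clusteradm join --hub-token <token> --hub-apiserver <hub-api-url> \
    --cluster-name cluster2 --context kind-cluster2

$ clusteradm accept --clusters cluster1,cluster2 --context kind-hub
----

For kind-based clusters, the join command may also need the `--force-internal-endpoint-lookup` flag.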


Verify the managed clusters are registered on the hub by running:

[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl get mcl
----

Example output:
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
NAME       HUB ACCEPTED   MANAGED CLUSTER URLS        JOINED   AVAILABLE   AGE
cluster1   true           https://api.***.com:6443    True     True        5m
cluster2   true           https://api.***.com:6443    True     True        5m
----
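Here `mcl` is a short name for the `managedclusters` resource, so `kubectl get managedclusters` should return the same listing.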

=== Deploy Federated Learning Controller

. Clone and navigate to the repository:
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ git clone git@github.com:open-cluster-management-io/addon-contrib.git
$ cd addon-contrib/federated-learning-controller
----

. Build and push the controller image (or use pre-built `quay.io/myan/federated-learning-controller:latest`):
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ make docker-build docker-push IMG=<IMG>
----

. Deploy the controller to the hub cluster:
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl config use-context kind-hub
$ make deploy IMG=<controller-image> NAMESPACE=<controller-namespace(default is open-cluster-management)>
----
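+
For example, using the pre-built controller image mentioned above and the default namespace, this might look like:
+
[source,terminal]
----
$ make deploy IMG=quay.io/myan/federated-learning-controller:latest NAMESPACE=open-cluster-management
----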

. Verify the deployment:
+
The federated learning controller is running in the open-cluster-management namespace by default.
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl get pods -n open-cluster-management
----
+
Example output
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-manager-d9db64db5-c7kfj                 1/1     Running   0          3d22h
cluster-manager-d9db64db5-t7grh                 1/1     Running   0          3d22h
cluster-manager-d9db64db5-wndd8                 1/1     Running   0          3d22h
federated-learning-controller-d7df846c9-nb4wc   1/1     Running   0          3d22h
----

=== Deploy the Federated Learning Instance
Suggested change (Collaborator): === Deploy the federated learning instance


. Build the Application Image
Suggested change (Collaborator): . Build the application image

+
*Note*: You can skip this step by using the pre-built image `quay.io/myan/flower-app-torch:latest`.
Suggested change (Collaborator): replace the "*Note*:" line with a NOTE admonition:
[NOTE]
====
You can skip this step by using the pre-built image `quay.io/myan/flower-app-torch:latest`.
====

+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ cd examples/flower

$ export REGISTRY=<your-registry>
$ export IMAGE_TAG=<your-tag>
$ make build-app-image
$ make push-app-image
----
+
Image format: `<REGISTRY>/flower-app-torch:<IMAGE_TAG>`
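+
For example, hypothetical values such as `REGISTRY=quay.io/example` and `IMAGE_TAG=v1` would produce the image reference `quay.io/example/flower-app-torch:v1`.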

. Deploy a Federated Learning Instance
Suggested change (Collaborator): . Deploy a federated learning instance

+
In this example, both the server and clients use the same image—either the one built above or the pre-built `quay.io/myan/flower-app-torch:latest`. Once the resource is created, the server is deployed to the hub cluster, and the clients are prepared for deployment to the managed clusters.
Suggested change (Collaborator): In this example, both the server and clients use the same image—either the one built above or the prebuilt `quay.io/myan/flower-app-torch:latest`. After the resource is created, the server is deployed to the hub cluster, and the clients are prepared for deployment to the managed clusters.

+
Create a `FederatedLearning` resource in the controller namespace on the hub cluster:
+
[source,yaml]
----
apiVersion: federation-ai.open-cluster-management.io/v1alpha1
kind: FederatedLearning
metadata:
  name: federated-learning-sample
spec:
  framework: flower
  server:
    image: <REGISTRY>/flower-app-torch:<IMAGE_TAG>
    rounds: 3
    minAvailableClients: 2
    listeners:
      - name: server-listener
        port: 8080
        type: LoadBalancer
    storage:
      type: PersistentVolumeClaim
      name: model-pvc
      path: /data/models
      size: 2Gi
  client:
    image: <REGISTRY>/flower-app-torch:<IMAGE_TAG>
    placement:
      clusterSets:
        - global
      predicates:
        - requiredClusterSelector:
            claimSelector:
              matchExpressions:
                - key: federated-learning-sample.client-data
                  operator: Exists
----
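+
For example, assuming the manifest above is saved as `federated-learning-sample.yaml` and the controller runs in the default `open-cluster-management` namespace, you might create it with:
+
[source,terminal]
----
$ kubectl config use-context kind-hub
$ kubectl apply -n open-cluster-management -f federated-learning-sample.yaml
----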

. Schedule the Federated Learning Clients into Managed Clusters
Suggested change (Collaborator): . Schedule the federated learning clients into managed clusters

+
The above configuration schedules only clusters with a `ClusterClaim` having the key `federated-learning-sample.client-data`. You can combine this with other scheduling policies (refer to the Placement API for details).
+
Add the `ClusterClaim` object to the clusters that own the data for the client:

.. **Cluster1:**
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl config use-context kind-cluster1

$ cat <<EOF | kubectl apply -f -
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: ClusterClaim
metadata:
  name: federated-learning-sample.client-data
spec:
  value: /data/private/cluster1
EOF
----

.. **Cluster2:**
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl config use-context kind-cluster2
$ cat <<EOF | kubectl apply -f -
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: ClusterClaim
metadata:
  name: federated-learning-sample.client-data
spec:
  value: /data/private/cluster2
EOF
----
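+
To confirm that the claims are visible to the hub's scheduler, one option is to check that they appear in each `ManagedCluster` status on the hub (claims are synced to the hub automatically); a sketch:
+
[source,terminal]
----
$ kubectl config use-context kind-hub
$ kubectl get managedcluster cluster1 -o jsonpath='{.status.clusterClaims}'
----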

. Check the status of the federated learning instance

.. After creating the instance, the server initially shows a status of `Waiting`
+
*Example - server in hub cluster:*
+
[source,bash]
Suggested change (Collaborator): [source,terminal]

----
$ kubectl get pods
NAME                                     READY   STATUS      RESTARTS   AGE
federated-learning-sample-server-7jnfs   0/1     Completed   0          9m
----

.. Once the required clients are ready, status changes to `Running`
+
*Example - client in managed cluster:*
+
[source,terminal]
----
$ kubectl get pods -n open-cluster-management
NAME                                     READY   STATUS      RESTARTS   AGE
federated-learning-sample-client-75sc8   0/1     Completed   0          8m
----

.. After the training and aggregation rounds complete, the status becomes `Completed`
+
*Example - federated learning instance:*
+
[source,terminal]
----
status:
  listeners:
    - address: 172.18.0.2:31166
      name: listener(service):federated-learning-sample-server
      port: 31166
      type: NodePort
  message: Model training successful. Check storage for details
  phase: Completed
----

.. Download and Verify the Trained Model
Suggested change (Collaborator): .. Download and verify the trained model

+
The trained MNIST model is saved in the `model-pvc` volume.
+
- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/deploy[Deploy a Jupyter notebook server]
- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/1.hub-evaluation.ipynb[Validate the model]
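+
As an alternative to the notebook, one possible way to pull the trained model out of the cluster is to mount `model-pvc` in a temporary helper pod and copy the files out. The pod below is hypothetical and must be created in the same namespace as the PVC:
+
[source,terminal]
----
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: model-reader
spec:
  containers:
    - name: reader
      image: registry.access.redhat.com/ubi9/ubi
      command: ["sleep", "3600"]
      volumeMounts:
        - name: models
          mountPath: /data/models
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: model-pvc
EOF

$ kubectl cp model-reader:/data/models ./models
$ kubectl delete pod model-reader
----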

1 change: 1 addition & 0 deletions modules/comm-attributes.adoc
@@ -26,6 +26,7 @@
:ie-pattern: Industrial Edge pattern
:ie: Industrial Edge
:fe-pattern: Federated Edge Observability pattern
:mcfl-pattern: Multicloud Federated Learning pattern
:mcg-pattern: Multicloud GitOps pattern
:amx-mcg-pattern: Intel AMX accelerated Multicloud GitOps pattern
:mcg-sgx-hello-world-pattern: Intel SGX protected application in Multicloud GitOps pattern