
Federated Learning on Google Cloud

This repository contains a blueprint that creates and secures a Google Kubernetes Engine (GKE) cluster that is ready to host custom apps distributed by a third party.

You can use this blueprint to implement Federated Learning (FL) use cases on Google Cloud.

This blueprint suggests controls that you can use to help configure and secure GKE clusters that host custom apps distributed by third-party tenants. These custom apps are considered untrusted workloads within the cluster. Therefore, the cluster is configured according to security best practices to isolate and constrain the workloads from other workloads and from the cluster control plane.

This blueprint provisions cloud resources on Google Cloud. After the initial provisioning, you can extend the infrastructure to GKE clusters running on premises or on other public clouds.

This blueprint is aimed at cloud platform administrators and data scientists who need to provision and configure a secure environment to run potentially untrusted workloads in their Google Cloud environment.

This blueprint assumes that you are familiar with GKE and Kubernetes.

Get started

To deploy this blueprint you need:

You create the infrastructure using Terraform. The blueprint uses a local Terraform backend, but we recommend configuring a remote backend for anything other than experimentation.

Understand the repository structure

This repository has the following key directories:

  • examples: contains examples that build on top of this blueprint.
  • terraform: contains the Terraform code used to create the project-level infrastructure and resources, for example a GKE cluster, VPC network, and firewall rules. It also installs Anthos components into the cluster.
  • configsync: contains the cluster-level resources and configurations that are applied to your GKE cluster.
  • tenant-config-pkg: a kpt package that you can use as a template to configure new tenants in the GKE cluster.

Architecture

The following diagram describes the architecture that you create with this blueprint:

[Architecture diagram]

As shown in the preceding diagram, the blueprint helps you to create and configure the following infrastructure components:

  • A Virtual Private Cloud (VPC) network and subnet.
  • A private GKE cluster that helps you:
    • Isolate cluster nodes from the internet.
    • Limit exposure of your cluster nodes and control plane to the internet by creating a private GKE cluster with authorized networks.
    • Use shielded cluster nodes that use a hardened node image with the containerd runtime.
    • Enable Dataplane V2 for optimized Kubernetes networking.
    • Encrypt cluster secrets at the application layer.
  • Dedicated GKE node pools.
    • You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes.
    • Other cluster resources are hosted in the main node pool.
  • VPC firewall rules:
    • Baseline rules that apply to all nodes in the cluster.
    • Additional rules that apply only to the nodes in the tenant node pool. These firewall rules limit ingress to and egress from tenant nodes.
  • Cloud NAT to allow egress to the internet.
  • Cloud DNS records to enable Private Google Access so that apps within the cluster can access Google APIs without traversing the internet.
  • Service Accounts:
    • Dedicated service account for the nodes in the tenant node pool.
    • Dedicated service account for tenant apps to use with Workload Identity.
  • Support for using Google Groups for Kubernetes RBAC.
  • A Cloud Source Repository to store configuration descriptors.
  • An Artifact Registry repository to store container images.

Applications

The following diagram shows the cluster-level resources that you create and configure with the blueprint.

[Cluster-level resources diagram]

As shown in the preceding diagram, in the blueprint you use the following to create and configure the cluster-level resources:

  • Anthos Config Management Config Sync, to sync cluster configuration and policies from a Git repository.
    • When you provision the resources using this blueprint, the tooling initializes a Git repository for Config Sync to consume, and automatically renders the relevant templates and commits changes.
    • The tooling automatically commits any modification to templates in the Config Sync repository on each run of the provisioning process.
  • Anthos Config Management Policy Controller, to enforce policies (constraints) on resources in the cluster.
  • Anthos Service Mesh to control and help secure network traffic.
  • A dedicated namespace and node pools for tenant apps and resources. Custom apps are treated as a tenant within the cluster.
  • Policies and controls applied to the tenant namespace:
    • Allow egress only to known hosts.
    • Allow requests that originate from within the same namespace.
    • By default, deny all ingress and egress traffic to and from pods. This acts as a baseline 'deny all' rule.
    • Allow traffic between pods in the namespace.
    • Allow egress to required cluster resources such as: Kubernetes DNS, the service mesh control plane, and the GKE metadata server.
    • Allow egress to Google APIs only using Private Google Access.
    • Allow tenant pods to run only on nodes in the dedicated tenant node pool.
    • Use a dedicated Kubernetes service account that is linked to an Identity and Access Management (IAM) service account using Workload Identity.

Users and teams managing tenant apps should not have permissions to change cluster configuration or modify service mesh resources.

Deploy the blueprint

  1. Open Cloud Shell

  2. Initialize the local repository where the environment configuration will be stored:

    ACM_REPOSITORY_PATH= # Path on the host running Terraform to store environment configuration
    ACM_REPOSITORY_URL= # URL of the repository to store environment configuration
    ACM_BRANCH= # Name of the Git branch in the repository that Config Sync will sync with
    git clone "${ACM_REPOSITORY_URL}" --branch "${ACM_BRANCH}" "${ACM_REPOSITORY_PATH}"
  3. Clone this Git repository.

  4. Change into the directory that contains the Terraform code:

    cd [REPOSITORY]/terraform

    Where [REPOSITORY] is the path to the directory where you cloned this repository.

  5. Initialize Terraform:

    terraform init
  6. Initialize the following Terraform variables:

    project_id                  = # Google Cloud project ID where to provision resources with the blueprint.
    acm_branch                  = # Use the same value that you used for ${ACM_BRANCH}
    acm_repository_path         = # Use the same value that you used for ${ACM_REPOSITORY_PATH}
    acm_repository_url          = # Use the same value that you used for ${ACM_REPOSITORY_URL}
    acm_secret_type             = # Secret type to authenticate with the Config Sync Git repository
    acm_source_repository_fqdns = # FQDNs of source repository for Config Sync to allow in the Network Firewall Policy

    For more information about setting acm_secret_type, see Grant access to Git.

    If you don't provide all the necessary inputs, Terraform exits with an error and reports which inputs are missing. For example, you can create a Terraform variables initialization file (terraform.tfvars) and set the inputs there. For more information about providing these inputs, see Terraform input variables.

  7. Review the proposed changes, and apply them:

    terraform apply

    The provisioning process may take about 15 minutes to complete.

  8. Wait for the Cloud Service Mesh custom resource definitions to be available:

    /bin/sh -c 'while ! kubectl wait crd/controlplanerevisions.mesh.cloud.google.com --for condition=established --timeout=60m --all-namespaces; do echo "crd/controlplanerevisions.mesh.cloud.google.com not yet available, waiting..."; sleep 5; done'
  9. Wait for the Cloud Service Mesh custom resources to be available:

    /bin/sh -c 'while ! kubectl -n istio-system wait ControlPlaneRevision --all --timeout=60m --for condition=Reconciled; do echo "ControlPlaneRevision not yet available, waiting..."; sleep 5; done'
  10. Commit and push generated configuration files to the environment configuration repository:

    git -C "${ACM_REPOSITORY_PATH}" add .
    git -C "${ACM_REPOSITORY_PATH}" commit -m "Config update: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
    git -C "${ACM_REPOSITORY_PATH}" push -u origin "${ACM_BRANCH}"

    Every time you modify the environment configuration, you need to commit and push changes to the environment configuration repository.

  11. Grant the Config Sync agent access to the Git repository where the environment configuration will be stored.

  12. Wait for the GKE cluster to be reported as ready in the GKE Kubernetes clusters dashboard.

Next steps

After the blueprint deployment completes, the GKE cluster is ready to host untrusted workloads. To familiarize yourself with the environment that you provisioned, you can also deploy the following examples in the GKE cluster:

Federated learning is typically split into cross-silo and cross-device federated learning. Cross-silo federated computation is where the participating members are organizations or companies, and the number of members is usually small (for example, within a hundred).

Cross-device computation is a type of federated computation where the participating members are end user devices such as mobile phones and vehicles. The number of members can reach up to a scale of millions or even tens of millions.

You can deploy a cross-device infrastructure by following this README.md.

Add another tenant

This blueprint dynamically provisions a runtime environment for each tenant you configure.

To add another tenant:

  1. Add its name to the list of tenants to configure using the tenant_names variable.
  2. Follow the steps to Deploy the blueprint again.

Connect to cluster nodes

To open an SSH session against a node of the cluster, you use an IAP tunnel because cluster nodes don't have external IP addresses:

gcloud compute ssh --tunnel-through-iap node_name

Where node_name is the Compute Engine instance name to connect to.

Troubleshooting

This section describes common issues and troubleshooting steps.

I/O timeouts when running Terraform plan or apply

If Terraform reports errors when you run plan or apply because it can't get the status of a resource inside a GKE cluster, and it also reports that it needs to update the cidr_block of the master_authorized_networks block of that cluster, it might be that the instance that runs Terraform is not part of any CIDR that is authorized to connect to that GKE cluster control plane.

To solve this issue, you can try updating the cidr_block by targeting the GKE cluster specifically when applying changes:

terraform apply -target module.gke

Then, you can try running terraform apply again, without any resource targeting.

Network address assignment errors when running Terraform

If Terraform reports connect: cannot assign requested address errors, try running the command again.

Errors when adding the GKE cluster to the Fleet

If Terraform reports errors about the format of the fleet membership configuration, it may mean that the Fleet API initialization didn't complete when Terraform tried to add the GKE cluster to the fleet. Example:

Error creating FeatureMembership: googleapi: Error 400: InvalidValueError for
field membership_specs["projects/<project number>/locations/global/memberships/<cluster name>"].feature_spec:
does not match a current membership in this project. Keys should be in the form: projects/<project number>/locations/{l}/memberships/{m}

If this error occurs, try running terraform apply again.

Errors when pulling container images

If istio-ingress or istio-egress Pods fail to run because GKE cannot download their container images and GKE reports ImagePullBackOff errors, see Troubleshoot gateways for details about the potential root cause. You can inspect the status of these Pods in the GKE Workloads Dashboard.

If this happens, wait for the cluster to complete its initialization, and then delete the Deployment that has this issue. Config Sync will deploy it again with the correct container image identifiers.

Errors when deleting and cleaning up the environment

When you run terraform destroy to remove the resources that this reference architecture provisioned and configured, you might get the following errors:

  • Dangling network endpoint groups (NEGs):

    Error waiting for Deleting Network: The network resource 'projects/PROJECT_NAME/global/networks/NETWORK_NAME' is already being used by 'projects/PROJECT_NAME/zones/ZONE_NAME/networkEndpointGroups/NETWORK_ENDPOINT_GROUP_NAME'.

    If this happens:

    1. Open the NEGs dashboard for your project.
    2. Delete all the NEGs that were associated with the GKE cluster that Terraform deleted.
    3. Run terraform destroy again.

Understanding the security controls that you need

This section discusses the controls that you apply with the blueprint to help you secure your GKE cluster.

Enhanced security of GKE clusters

Following security best practices, the blueprint helps you create a GKE cluster that implements the following security settings:

For more information about GKE security settings, refer to Hardening your cluster's security.

VPC firewall rules: Restricting traffic between virtual machines

VPC firewall rules govern which traffic is allowed to or from Compute Engine VMs. The rules let you filter traffic at VM granularity, based on Layer 4 attributes.

You create a GKE cluster with the default GKE cluster firewall rules. These firewall rules enable communication between the cluster nodes and GKE control plane, and between nodes and Pods in the cluster.

You apply additional firewall rules to the nodes in the tenant node pool. These firewall rules restrict egress traffic from the tenant nodes. This approach lets you increase the isolation of the tenant nodes. By default, all egress traffic from the tenant nodes is denied. Any required egress must be explicitly configured. For example, you use the blueprint to create firewall rules to allow egress from the tenant nodes to the GKE control plane, and to Google APIs using Private Google Access. The firewall rules are targeted to the tenant nodes using the tenant node pool service account.

Namespaces: Isolating tenant apps and resources

The blueprint helps you create a dedicated namespace to host the third-party apps. The namespace and its resources are treated as a tenant within your cluster. You apply policies and controls to the namespace to limit the scope of resources in the namespace.
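As a sketch of what such a namespace can look like (the name and label are assumptions for illustration, not what the blueprint generates from the tenant-config-pkg package), a tenant namespace enabled for sidecar injection might resemble the following:

    # Illustrative tenant namespace. The name is hypothetical; the sidecar
    # injection label depends on how the mesh is installed (istio-injection
    # or a revision-specific istio.io/rev label).
    apiVersion: v1
    kind: Namespace
    metadata:
      name: tenant-example
      labels:
        istio-injection: enabled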

Network policies: Enforcing network traffic flow within clusters

Network policies enforce Layer 4 network traffic flows by using Pod-level firewall rules. Network policies are scoped to a namespace.

In the blueprint, you apply network policies to the tenant namespace that hosts the third-party apps. By default, the network policy denies all traffic to and from pods in the namespace. Any required traffic must be explicitly allowlisted. For example, the network policies in the blueprint explicitly allow traffic to required cluster services, such as the cluster internal DNS and the Anthos Service Mesh control plane.
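As a rough sketch of this pattern (the namespace name is hypothetical and these are not the blueprint's actual manifests), a baseline deny-all policy plus a same-namespace allowance might look like the following:

    # Deny all ingress and egress for every Pod in the tenant namespace.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-all
      namespace: tenant-example
    spec:
      podSelector: {}                    # selects all Pods in the namespace
      policyTypes: ["Ingress", "Egress"] # no rules listed, so all traffic is denied
    ---
    # Explicitly allow traffic between Pods in the same namespace.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-same-namespace
      namespace: tenant-example
    spec:
      podSelector: {}
      policyTypes: ["Ingress"]
      ingress:
        - from:
            - podSelector: {}            # any Pod in this namespace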

Config Sync: Applying configurations to your GKE clusters

Config Sync keeps your GKE clusters in sync with configs stored in a Git repository. The Git repository acts as the single source of truth for your cluster configuration and policies. Config Sync is declarative. It continuously checks cluster state and applies the state declared in the configuration file in order to enforce policies, which helps to prevent configuration drift.

You install Config Sync into your GKE cluster. You configure Config Sync to sync cluster configurations and policies from the Git repository associated with the blueprint. The synced resources include the following:

  • Cluster-level Anthos Service Mesh configuration
  • Cluster-level security policies
  • Tenant namespace-level configuration and policy including network policies, service accounts, RBAC rules, and Anthos Service Mesh configuration
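Under the hood, Config Sync reads this configuration through a RootSync resource that points at the environment configuration repository. The following is only a sketch with placeholder values, not the configuration that the blueprint's tooling generates and commits for you:

    apiVersion: configsync.gke.io/v1beta1
    kind: RootSync
    metadata:
      name: root-sync
      namespace: config-management-system
    spec:
      sourceFormat: unstructured
      git:
        repo: https://example.com/your-org/acm-config.git  # placeholder; would correspond to ${ACM_REPOSITORY_URL}
        branch: main                                        # placeholder; would correspond to ${ACM_BRANCH}
        dir: "/"
        auth: token                                         # placeholder; would correspond to acm_secret_type
        secretRef:
          name: git-creds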

Policy Controller: Enforcing compliance with policies

Anthos Policy Controller is a dynamic admission controller for Kubernetes that enforces CustomResourceDefinition-based (CRD-based) policies that are executed by the Open Policy Agent (OPA).

Admission controllers are Kubernetes plugins that intercept requests to the Kubernetes API server before an object is persisted, but after the request is authenticated and authorized. You can use admission controllers to limit how a cluster is used.

You install Policy Controller into your GKE cluster. The blueprint includes example policies to help secure your cluster, and Config Sync automatically applies these policies to your cluster.
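To give a sense of what a Policy Controller policy looks like, the following constraint blocks privileged containers. It uses the K8sPSPPrivilegedContainer template from the open source Gatekeeper constraint template library and is shown only as an illustration; it isn't necessarily one of the policies that the blueprint ships:

    # Requires the corresponding ConstraintTemplate from the Gatekeeper
    # constraint template library to be installed in the cluster.
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sPSPPrivilegedContainer
    metadata:
      name: deny-privileged-containers
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Pod"]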

Anthos Service Mesh: Managing secure communications between services

Anthos Service Mesh helps you monitor and manage an Istio-based service mesh. A service mesh is an infrastructure layer that helps create managed, observable, and secure communication across your services.

Anthos Service Mesh helps simplify the management of secure communications across services in the following ways:

  • Managing authentication and encryption of traffic for supported protocols within the cluster by using mutual Transport Layer Security (mTLS). Anthos Service Mesh manages the provisioning and rotation of mTLS keys and certificates for Anthos workloads without disrupting communications. Regularly rotating mTLS keys is a security best practice that helps reduce exposure in the event of an attack.
  • Letting you configure network security policies based on service identity rather than on the IP address of peers on the network. You use Anthos Service Mesh to configure identity-aware access control (firewall) policies that are independent of the network location of the workload. This approach simplifies the process of setting up service-to-service communication policies.
  • Letting you configure policies that permit access from certain clients.

The blueprint guides you through installing Anthos Service Mesh in your cluster. You configure the tenant namespace for automatic sidecar proxy injection. This approach ensures that apps in the tenant namespace are part of the mesh. You automatically configure Anthos Service Mesh using Config Sync. You configure the mesh to do the following (see the sketch after this list):

  • Enforce mTLS communication between services in the mesh.
  • Limit outbound traffic from the mesh to only known hosts.
  • Limit authorized communication between services in the mesh. For example, apps in the tenant namespace are only allowed to communicate with apps in the same namespace, or with a set of known external hosts.
  • Route all outbound traffic through a mesh gateway where you can apply further traffic controls.
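As an illustration of what this configuration can look like in Istio terms, the following sketch enforces mesh-wide strict mTLS and allows only same-namespace traffic. The resource names and the tenant namespace are assumptions, not the blueprint's actual configuration:

    # Require mTLS for all workloads in the mesh.
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system          # the mesh root namespace
    spec:
      mtls:
        mode: STRICT
    ---
    # Allow requests only from workloads in the same (hypothetical) tenant namespace.
    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: allow-same-namespace
      namespace: tenant-example
    spec:
      action: ALLOW
      rules:
        - from:
            - source:
                namespaces: ["tenant-example"]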

Node taints and affinities: Controlling workload scheduling

Node taints and node affinity are Kubernetes mechanisms that let you influence how pods are scheduled onto cluster nodes.

Tainted nodes repel pods. Kubernetes will not schedule a Pod onto a tainted node unless the Pod has a toleration for the taint. You can use node taints to reserve nodes for use only by certain workloads or tenants. Taints and tolerations are often used in multi-tenant clusters. See the dedicated nodes with taints and tolerations documentation for more information.

Node affinity lets you constrain pods to nodes with particular labels. If a pod has a node affinity requirement, Kubernetes will not schedule the Pod onto a node unless the node has a label that matches the affinity requirement. You can use node affinity to ensure that pods are scheduled onto appropriate nodes.

You can use node taints and node affinity together to ensure tenant workload pods are scheduled exclusively onto nodes reserved for the tenant.

The blueprint helps you control the scheduling of the tenant apps in the following ways:

  • Creating a GKE node pool dedicated to the tenant. Each node in the pool has a taint related to the tenant name.
  • Automatically applying the appropriate toleration and node affinity to any Pod targeting the tenant namespace. You apply the toleration and affinity using Policy Controller mutations (see the sketch after this list).
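For illustration, after the mutation a tenant Pod could carry a toleration and node affinity similar to the following. The taint key, node label, values, and container image are assumptions, not the names that the blueprint generates:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tenant-app
      namespace: tenant-example
    spec:
      tolerations:
        - key: tenant                  # hypothetical taint key on the tenant nodes
          operator: Equal
          value: tenant-example
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: tenant        # hypothetical node label
                    operator: In
                    values: ["tenant-example"]
      containers:
        - name: app
          image: us-docker.pkg.dev/example-project/example-repo/app:latest  # placeholder image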

Least privilege: Limiting access to cluster and project resources

It is a security best practice to adopt a principle of least privilege for your Google Cloud projects and resources like GKE clusters. This way, the apps that run inside your cluster, and the developers and operators that use the cluster, have only the minimum set of permissions required.

The blueprint helps you use least privilege service accounts in the following ways:

  • Each GKE node pool receives its own service account. For example, the nodes in the tenant node pool use a service account dedicated to those nodes. The node service accounts are configured with the minimum required permissions.
  • The cluster uses Workload Identity to associate Kubernetes service accounts with Google service accounts (see the sketch after this list). This way, the tenant apps can be granted limited access to any required Google APIs without downloading and storing a service account key. For example, you can grant the service account permissions to read data from a Cloud Storage bucket.
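The following sketch shows how that association looks on the Kubernetes side; the service account names and project ID are placeholders:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: tenant-app                 # hypothetical Kubernetes service account
      namespace: tenant-example
      annotations:
        # Placeholder Google service account; links this Kubernetes service
        # account to it through Workload Identity.
        iam.gke.io/gcp-service-account: tenant-app@PROJECT_ID.iam.gserviceaccount.com

The Google service account also needs an IAM policy binding that grants the roles/iam.workloadIdentityUser role to the Kubernetes service account.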

The blueprint helps you restrict access to cluster resources in the following ways:

  • You create a sample Kubernetes RBAC role with limited permissions to manage apps (see the sketch after this list). You can grant this role to the users and groups who operate the apps in the tenant namespace. This way, those users only have permissions to modify app resources in the tenant namespace. They do not have permissions to modify cluster-level resources or sensitive security settings like Anthos Service Mesh policies.
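A minimal sketch of such a role and its binding, assuming a hypothetical tenant namespace and Google Group, might look like the following:

    # Namespace-scoped role that only manages common app resources.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: tenant-app-operator
      namespace: tenant-example
    rules:
      - apiGroups: [""]
        resources: ["pods", "services", "configmaps"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["apps"]
        resources: ["deployments", "statefulsets"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    ---
    # Grant the role to a (hypothetical) Google Group through Google Groups for RBAC.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: tenant-app-operators
      namespace: tenant-example
    subjects:
      - kind: Group
        name: tenant-operators@example.com
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: tenant-app-operator
      apiGroup: rbac.authorization.k8s.io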

References