feat: add a federated learning example using nvidia flare #99

Merged
ferrarimarco merged 1 commit into int-federated-learning from example-fl-nvflare
Feb 20, 2025

Conversation

ferrarimarco
Member

@ferrarimarco ferrarimarco commented Jan 28, 2025

  • Federated Learning use case:

    • Implement an example that deploys NVIDIA FLARE in the cluster.
    • Support getting Terraform output in JSON format
    • Implement a service to provision Cloud Storage buckets
    • Enable Kubernetes network policy logging
    • Don't enforce injecting Cloud Service Mesh sidecars in the kube-system namespace
    • Fix Cloud Service Mesh auto-injection namespace labels
    • Add tolerations to istio-egress so it can be deployed in the cluster
    • Fix mutator configuration to add the right tolerations to federated learning workloads
    • Allow DNS queries to GKE control plane
    • Create missing tenant service account
    • Configure Terraform outputs to get Artifact Registry repository FQDN
    • Use the truncated Cloud Build BUILD_ID as the project name suffix to avoid conflicts with projects pending deletion
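
For readers unfamiliar with the JSON form of Terraform outputs mentioned in the list above: `terraform output -json` prints an object that maps each output name to a small record with `sensitive`, `type`, and `value` fields. The following is a minimal sketch of how a script might flatten that into plain name/value pairs. It is not taken from this pull request; the function name, the output name `artifact_registry_repository_fqdn`, and the sample value are hypothetical.

```python
import json


def terraform_output_values(raw_json: str) -> dict:
    """Map each Terraform output name to its value, dropping the metadata wrapper."""
    outputs = json.loads(raw_json)
    return {name: entry["value"] for name, entry in outputs.items()}


# Hypothetical sample shaped like the result of `terraform output -json`.
# The output name and the repository path are illustrative assumptions.
sample = """
{
  "artifact_registry_repository_fqdn": {
    "sensitive": false,
    "type": "string",
    "value": "europe-docker.pkg.dev/example-project/example-repo"
  }
}
"""

values = terraform_output_values(sample)
print(values["artifact_registry_repository_fqdn"])
```

In practice the input would come from running `terraform output -json` in the relevant Terraform root module rather than from an inline string.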

@ferrarimarco ferrarimarco self-assigned this Jan 28, 2025
@ferrarimarco ferrarimarco changed the base branch from main to int-federated-learning January 28, 2025 08:56
@ferrarimarco ferrarimarco force-pushed the example-fl-nvflare branch 7 times, most recently from 626741c to 7fe70e3 Compare January 30, 2025 10:07
@ferrarimarco ferrarimarco force-pushed the int-federated-learning branch from 4a79619 to 8582f43 Compare February 3, 2025 09:58
@ferrarimarco ferrarimarco force-pushed the example-fl-nvflare branch 4 times, most recently from c4ed051 to 5a4ee0a Compare February 7, 2025 18:48
@ferrarimarco ferrarimarco force-pushed the example-fl-nvflare branch 5 times, most recently from e272a6c to 702690b Compare February 11, 2025 08:52
@ferrarimarco ferrarimarco marked this pull request as ready for review February 11, 2025 08:52
@ferrarimarco ferrarimarco requested a review from arueth February 12, 2025 08:04
@arueth
Collaborator

arueth commented Feb 12, 2025

It seems like you are really intertwining this use case with your "core" federated learning platform. Is it possible to keep them more distinct, possibly by using a separate config file or a different folder structure?

To me this seems like an additional use case on top of your "core" platform, similar to Fine Tuning and RAG on the AI/ML platform.

@ferrarimarco
Member Author

Good point, let me see how I can restructure this.

@ferrarimarco ferrarimarco force-pushed the example-fl-nvflare branch 4 times, most recently from 1db2fc8 to 59b4a55 Compare February 14, 2025 15:05
@ferrarimarco
Member Author

@arueth I've reworked things a bit, PTAL, thanks.

@arueth
Collaborator

arueth commented Feb 18, 2025

It looks like we need to find a way to generate a unique project name per build. Right now the job is failing because the project with that SHORT_SHA already exists in a pending delete state. Maybe we can use a truncated version of BUILD_ID.

@ferrarimarco
Member Author

> It looks like we need to find a way to generate a unique project name per build. Right now the job is failing because the project with that SHORT_SHA already exists in a pending delete state. Maybe we can use a truncated version of BUILD_ID.

Done in the latest commit. I used the BUILD_ID as you suggested, truncating it to 7 characters, the same length as a short Git hash.
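
To illustrate the idea discussed above, here is a minimal sketch of deriving a per-build project ID from a truncated BUILD_ID. It is not the code from the commit. The function name, the `fl-test` prefix, and the sample build ID are hypothetical, and the sketch assumes BUILD_ID is a UUID-style string, so its first 7 characters are hexadecimal digits. The validation reflects the Google Cloud project ID rules (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen).

```python
import re

# Google Cloud project ID format: 6-30 chars, starts with a lowercase letter,
# contains only lowercase letters, digits, and hyphens, does not end with a hyphen.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def project_id_for_build(prefix: str, build_id: str, suffix_length: int = 7) -> str:
    """Build a project ID from a prefix and the first characters of a build ID."""
    suffix = build_id[:suffix_length].lower()
    project_id = f"{prefix}-{suffix}"
    if not PROJECT_ID_PATTERN.fullmatch(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    return project_id


# Hypothetical prefix and build ID, for illustration only.
print(project_id_for_build("fl-test", "b1f3c9a2-6d0e-4c55-9a1e-0f2d3c4b5a69"))
# prints "fl-test-b1f3c9a"
```

Because every build gets a new BUILD_ID, re-running a build for the same commit yields a different suffix, unlike SHORT_SHA, which stays the same for a given commit.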

@ferrarimarco ferrarimarco merged commit 3963e42 into int-federated-learning Feb 20, 2025
14 checks passed
@ferrarimarco ferrarimarco deleted the example-fl-nvflare branch February 20, 2025 16:30