Commit
Co-authored-by: Kent Hua <[email protected]>
Co-authored-by: Kavitha Rajendran <[email protected]>
Co-authored-by: Ali Zaidi <[email protected]>
Co-authored-by: Shobhit Gupta <[email protected]>
Co-authored-by: Xiang Shen <[email protected]>
Co-authored-by: Ishmeet Mehta <[email protected]>
Co-authored-by: Laurent Grangeau <[email protected]>
Co-authored-by: Jun Sheng <[email protected]>
1 parent 0cff600, commit 42d97e9
Showing 258 changed files with 182,513 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,13 @@ | ||
# IDEs | ||
*.code-workspace | ||
|
||
# Python | ||
__pycache__/ | ||
.venv/ | ||
venv/ | ||
|
||
# Terraform | ||
.terraform | ||
|
||
# Test | ||
test/log/*.log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,5 @@ | ||
# Google Cloud AI/ML Platforms | ||
# Google Cloud AI/ML Platform Reference Architectures | ||
|
||
This repository is a collection of AI/ML platform reference architectures and use cases for Google Cloud. | ||
|
||
- [GKE ML Platform for enabling ML Ops](/docs/gke-ml-platform.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
# GKE Machine learning platform (MLP) reference architecture for enabling Machine Learning Operations (MLOps) | ||
|
||
## Platform Principles | ||
|
||
This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles: | ||
|
||
- The platform admin creates the GKE platform using an Infrastructure as Code (IaC) tool such as [Terraform][terraform]. The IaC includes reusable modules that can be referenced to create more resources as demand grows. | ||
- The platform is based on [GitOps][gitops]. | ||
- After the GKE platform has been created, admins create cluster-scoped resources on it through [Config Sync][config-sync]. | ||
- Platform admins create a namespace per application and give the application team members full access to it. | ||
- The Application/ML teams create namespace-scoped resources either via [Config Sync][config-sync] or through a deployment tool such as [Cloud Deploy][cloud-deploy]. | ||
|
||
For an outline of products and features used in the platform, see the [Platform Products and Features](/docs/gke-ml-platform/products-and-features.md) document. | ||
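
The namespace-per-application principle listed above can be sketched as a small manifest generator. This is only an illustration, not the repository's implementation: the function name `namespace_manifests`, the application name, and the group address are hypothetical, and "full access" is interpreted here as binding the built-in Kubernetes `admin` ClusterRole inside the namespace, which is an assumption. The output is JSON, which is also valid YAML, so it could be committed to a Git repository that a GitOps tool syncs.

```python
import json


def namespace_manifests(app_name: str, team_group: str) -> list[dict]:
    """Return a Namespace plus a RoleBinding that gives one team admin rights in it."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": app_name},
    }
    # Assumption: "full access" is modeled with the built-in `admin` ClusterRole,
    # bound through a RoleBinding so that it only applies inside this namespace.
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{app_name}-team-admin", "namespace": app_name},
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Group",
                "name": team_group,
            }
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "admin",
        },
    }
    return [namespace, role_binding]


if __name__ == "__main__":
    # "ml-team" and the group address are placeholder values for illustration.
    for manifest in namespace_manifests("ml-team", "ml-team@example.com"):
        print(json.dumps(manifest, indent=2))
```

Using a RoleBinding rather than a ClusterRoleBinding keeps the team's permissions scoped to its own namespace, which fits the least-privilege goal stated for the Platform Admin persona.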
|
||
## Critical User Journeys (CUJs) | ||
|
||
### Persona : Platform Admin | ||
|
||
- Offer a platform that incorporates established best practices. | ||
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads. | ||
- Establish secure channels for end users to interact seamlessly with the platform. | ||
- Enable the enforcement of robust security policies across the platform. | ||
|
||
### Persona : Machine Learning Engineer | ||
|
||
- Deploy the model with ease and make the endpoints available only to the intended audience. | ||
- Continuously monitor the model performance and resource utilization. | ||
- Troubleshoot any performance or integration issues. | ||
- Ability to version, store, and access the models and model artifacts: | ||
  - To debug and troubleshoot in production and trace back to the specific model version and associated training data. | ||
  - To roll back quickly and in a controlled way to a previous, more stable version. | ||
- Implement the feedback loop to adapt to changing data and business needs: | ||
  - Ability to retrain or fine-tune the model. | ||
  - Ability to split the traffic between models (A/B testing). | ||
  - Ability to switch between models without breaking the inference system for end users. | ||
- Ability to scale the infrastructure up or down to accommodate changing needs. | ||
- Ability to share insights and findings with stakeholders to support data-driven decisions. | ||
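
The journeys about splitting traffic between models (A/B testing) and switching models without breaking inference can be illustrated with a weighted, deterministic router. This is a minimal sketch under stated assumptions, not how the platform itself routes traffic: the function `pick_model`, the request IDs, and the model version names are all hypothetical.

```python
import hashlib


def pick_model(request_id: str, weights: dict[str, int]) -> str:
    """Map a request to a model version in proportion to integer weights.

    The choice is derived from a hash of the request ID, so the same request
    always lands on the same model version for a given set of weights.
    """
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Turn the request ID into a stable bucket in the range [0, total).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the versions in a fixed order and pick the one whose range holds the bucket.
    cumulative = 0
    for model, weight in sorted(weights.items()):
        cumulative += weight
        if bucket < cumulative:
            return model
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    # Hypothetical 90/10 split between a stable model and a candidate model.
    split = {"model-v1": 90, "model-v2": 10}
    for request_id in ("req-1", "req-2", "req-3"):
        print(request_id, "->", pick_model(request_id, split))
```

In this sketch, a rollback amounts to setting the candidate's weight to 0, and a full switch amounts to setting the old version's weight to 0; callers keep using the same entry point either way.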
|
||
### Persona : Machine Learning Operator | ||
|
||
- Provide and maintain software required by the end users of the platform. | ||
- Operationalize experimental workloads by providing guidance and best practices for running them on the platform. | ||
- Deploy the workloads on the platform. | ||
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations. | ||
|
||
## Prerequisites | ||
|
||
- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and the other tools required to complete this tutorial. | ||
- Familiarity with the following: | ||
  - [Google Kubernetes Engine][gke] | ||
  - [Terraform][terraform] | ||
  - [git][git] | ||
  - [Google Configuration Management root-sync][root-sync] | ||
  - [Google Configuration Management repo-sync][repo-sync] | ||
  - [GitHub][github] | ||
|
||
## Deploy the platform | ||
|
||
[Playground Reference Architecture](/examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts. | ||
|
||
## Use cases | ||
|
||
- [Distributed Data Processing with Ray](/examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray. | ||
- [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine-tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash. | ||
- [Fine Tuning Gemma2 9B IT model With FSDP](/examples/use-case/fine-tuning/pytorch/README.md): Fine-tune the Gemma2 9B IT model with PyTorch FSDP. | ||
|
||
## Resources | ||
|
||
- [Packaging Jupyter notebooks](/docs/notebook/packaging.md): Patterns and tools to get your `.ipynb` notebooks ready for deployment in a container runtime. | ||
|
||
[gitops]: https://about.gitlab.com/topics/gitops/ | ||
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields | ||
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields | ||
[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview | ||
[cloud-deploy]: https://cloud.google.com/deploy?hl=en | ||
[terraform]: https://www.terraform.io/ | ||
[gke]: https://cloud.google.com/kubernetes-engine?hl=en | ||
[git]: https://git-scm.com/ | ||
[github]: https://github.com/ | ||
[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects | ||
[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens | ||
[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# Playground Machine learning platform (MLP) on GKE: Architecture | ||
|
||
![Playground Architecture](/docs/images/platform/playground/mlp_playground_architecture.svg) | ||
|
||
## Platform | ||
|
||
- [Google Cloud Project](https://console.cloud.google.com/cloud-resource-manager) | ||
- Environment project | ||
- Service APIs | ||
- [Cloud Storage](https://console.cloud.google.com/storage/browser) | ||
- Terraform bucket | ||
- [VPC networks](https://console.cloud.google.com/networking/networks/list) | ||
- VPC network | ||
- Subnet | ||
- [Cloud Router](https://console.cloud.google.com/hybrid/routers/list) | ||
- Cloud NAT gateway | ||
- [Google Kubernetes Engine (GKE)](https://console.cloud.google.com/kubernetes/list/overview) | ||
- Standard Cluster | ||
- CPU on-demand node pool | ||
- CPU system node pool | ||
- GPU on-demand node pool | ||
- GPU spot node pool | ||
- Google Kubernetes Engine (GKE) Enterprise ([docs](https://cloud.google.com/kubernetes-engine/enterprise/docs)) | ||
- Configuration Management | ||
- Config Sync | ||
- Policy Controller | ||
- Connect gateway | ||
- Fleet | ||
- Security posture dashboard | ||
- Threat detection | ||
- Git repository | ||
- Config Sync | ||
|
||
### Each namespace | ||
|
||
- [Load Balancer](https://console.cloud.google.com/net-services/loadbalancing/list/loadBalancers) | ||
- Gateway External Load Balancer | ||
- [Classic SSL Certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates) | ||
- Gateway SSL Certificate | ||
- Ray dashboard | ||
- Identity-Aware Proxy (IAP) ([docs](https://cloud.google.com/iap/docs/concepts-overview)) | ||
- Ray head Backend Service | ||
- [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccount) | ||
- Default | ||
- Ray head | ||
- Ray worker |