Migration from ai-on-gke repository
Co-authored-by: Kent Hua <[email protected]>
Co-authored-by: Kavitha Rajendran <[email protected]>
Co-authored-by: Ali Zaidi <[email protected]>
Co-authored-by: Shobhit Gupta <[email protected]>
Co-authored-by: Xiang Shen <[email protected]>
Co-authored-by: Ishmeet Mehta <[email protected]>
Co-authored-by: Laurent Grangeau <[email protected]>
Co-authored-by: Jun Sheng <[email protected]>
9 people committed Sep 4, 2024
1 parent 0cff600 commit c3305df
Showing 259 changed files with 182,525 additions and 1 deletion.
11 changes: 11 additions & 0 deletions .gitignore
@@ -1,2 +1,13 @@
# IDEs
*.code-workspace

# Python
__pycache__/
.venv/
venv/

# Terraform
.terraform

# Test
test/log/*.log
12 changes: 12 additions & 0 deletions CONTRIBUTING.md
@@ -25,6 +25,18 @@ This project follows

## Contribution process

### Coding style and formatting

#### Python

The repository requires that files use the [Black](https://github.com/psf/black) code formatter and style.

#### Terraform

We follow the guidelines and recommendations in the [Google Cloud Best practices for using Terraform](https://cloud.google.com/docs/terraform/best-practices-for-terraform) document, unless noted otherwise.

The repository requires that files be formatted with the built-in `terraform fmt` command.
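
As an illustration of the two formatting requirements above, the following minimal sketch runs both formatters in check-only mode. It assumes `black` and `terraform` are installed and on the `PATH`; the helper names `format_check_commands` and `run_checks` are hypothetical and are not part of this repository.

```python
import shutil
import subprocess


def format_check_commands(root="."):
    """Return the check-only commands; neither one rewrites files."""
    return [
        # Black: report files that would be reformatted and show the diff.
        ["black", "--check", "--diff", root],
        # Terraform: check formatting in the directory and its subdirectories.
        ["terraform", "fmt", "-check", "-recursive", "-diff", root],
    ]


def run_checks(root="."):
    """Run each installed tool and return the number of failed checks."""
    failures = 0
    for command in format_check_commands(root):
        if shutil.which(command[0]) is None:
            # Hypothetical policy: skip a missing tool instead of failing.
            print(f"skipping {command[0]}: not installed")
            continue
        if subprocess.run(command, check=False).returncode != 0:
            failures += 1
    return failures
```

Calling `run_checks()` from the repository root returns `0` when every installed formatter reports no pending changes.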

### Code reviews

All submissions, including submissions by project members, require review. We
6 changes: 5 additions & 1 deletion README.md
@@ -1 +1,5 @@
# Google Cloud AI/ML Platforms
# Google Cloud AI/ML Platform Reference Architectures

This repository is a collection of AI/ML platform reference architectures and use cases for Google Cloud.

- [GKE ML Platform for enabling ML Ops](/docs/gke-ml-platform.md)
82 changes: 82 additions & 0 deletions docs/gke-ml-platform.md
@@ -0,0 +1,82 @@
# GKE Machine learning platform (MLP) reference architecture for enabling Machine Learning Operations (MLOps)

## Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles:

- The platform admin will create the GKE platform using an IaC tool like [Terraform][terraform]. The IaC will come with reusable modules that can be referenced to create more resources as demand grows.
- The platform will be based on [GitOps][gitops].
- After the GKE platform has been created, the admins will create cluster-scoped resources on it through [Config Sync][config-sync].
- Platform admins will create a namespace per application and give the application team members full access to it.
- The namespace-scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy].
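
The namespace-per-application principle above can be sketched as a pair of Kubernetes manifests. This is a minimal illustration, not the platform's actual configuration: the function name, the `ml-team` namespace, and the `ml-team@example.com` group are hypothetical, and mapping "full access" to the built-in `admin` ClusterRole is an assumption.

```python
import json


def namespace_manifests(app_name, team_group):
    """Build a Namespace and a RoleBinding granting a team admin rights in it."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": app_name},
    }
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{app_name}-admin", "namespace": app_name},
        # Assumption: the built-in `admin` ClusterRole stands in for "full access".
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "admin",
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Group",
                "name": team_group,
            }
        ],
    }
    return [namespace, role_binding]


# Hypothetical application namespace and team group.
for manifest in namespace_manifests("ml-team", "ml-team@example.com"):
    print(json.dumps(manifest, indent=2))
```

Because the binding is a namespaced `RoleBinding`, the team's permissions stop at the namespace boundary, which leaves cluster-scoped resources with the platform admins.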

For an outline of products and features used in the platform, see the [Platform Products and Features](/docs/gke-ml-platform/products-and-features.md) document.

## Critical User Journeys (CUJs)

### Persona : Platform Admin

- Offer a platform that incorporates established best practices.
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
- Establish secure channels for end users to interact seamlessly with the platform.
- Empower the enforcement of robust security policies across the platform.

### Persona : Machine Learning Engineer

- Deploy the model with ease and make the endpoints available only to the intended audience
- Continuously monitor the model performance and resource utilization
- Troubleshoot any performance or integration issues
- Ability to version, store and access the models and model artifacts:
  - To debug & troubleshoot in production and track back to the specific model version & associated training data
  - To roll back quickly, and in a controlled way, to a previous, more stable version
- Implement the feedback loop to adapt to changing data & business needs:
  - Ability to retrain / fine-tune the model.
  - Ability to split the traffic between models (A/B testing)
  - Switching between the models without breaking the inference system for the end-users
- Ability to scale the infrastructure up or down to accommodate changing needs
- Ability to share the insights and findings with stakeholders to make data-driven decisions
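
The traffic-splitting journey in the list above can be illustrated with a small weighted router. This is a sketch under stated assumptions rather than how the platform routes requests: the function name, the model version labels, and the use of a hashed request ID are all hypothetical.

```python
import hashlib


def pick_model(request_id, weights):
    """Map a request ID to a model version in proportion to integer weights."""
    total = sum(weights.values())
    if total <= 0 or any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative and sum to more than zero")
    # Hashing keeps the choice stable: the same ID always gets the same model.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for version, weight in sorted(weights.items()):
        if bucket < weight:
            return version
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below the total weight")


# Hypothetical 90/10 split between two model versions.
weights = {"model-v1": 90, "model-v2": 10}
print(pick_model("request-123", weights))
```

Setting one weight to `0` and the other to `100` moves all traffic to a single version, which corresponds to switching models without changing what callers send.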

### Persona : Machine Learning Operator

- Provide and maintain software required by the end users of the platform.
- Operationalize experimental workload by providing guidance and best practices for running the workload on the platform.
- Deploy the workloads on the platform.
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

## Prerequisites

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with the following:
  - [Google Kubernetes Engine][gke]
  - [Terraform][terraform]
  - [git][git]
  - [Google Configuration Management root-sync][root-sync]
  - [Google Configuration Management repo-sync][repo-sync]
  - [GitHub][github]

## Deploy the platform

[Playground Reference Architecture](/examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

- [Distributed Data Processing with Ray](/examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray.
- [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash.
- [Fine Tuning Gemma2 9B IT model With FSDP](/examples/use-case/fine-tuning/pytorch/README.md): Fine tune the Gemma2 9B IT model with PyTorch FSDP.

## Resources

- [Packaging Jupyter notebooks](/docs/notebook/packaging.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime.

[gitops]: https://about.gitlab.com/topics/gitops/
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview
[cloud-deploy]: https://cloud.google.com/deploy?hl=en
[terraform]: https://www.terraform.io/
[gke]: https://cloud.google.com/kubernetes-engine?hl=en
[git]: https://git-scm.com/
[github]: https://github.com/
[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects
[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts
46 changes: 46 additions & 0 deletions docs/gke-ml-platform/playground/architecture.md
@@ -0,0 +1,46 @@
# Playground Machine learning platform (MLP) on GKE: Architecture

![Playground Architecture](/docs/images/platform/playground/mlp_playground_architecture.svg)

## Platform

- [Google Cloud Project](https://console.cloud.google.com/cloud-resource-manager)
  - Environment project
  - Service APIs
- [Cloud Storage](https://console.cloud.google.com/storage/browser)
  - Terraform bucket
- [VPC networks](https://console.cloud.google.com/networking/networks/list)
  - VPC network
    - Subnet
- [Cloud Router](https://console.cloud.google.com/hybrid/routers/list)
  - Cloud NAT gateway
- [Google Kubernetes Engine (GKE)](https://console.cloud.google.com/kubernetes/list/overview)
  - Standard Cluster
    - CPU on-demand node pool
    - CPU system node pool
    - GPU on-demand node pool
    - GPU spot node pool
- Google Kubernetes Engine (GKE) Enterprise ([docs](https://cloud.google.com/kubernetes-engine/enterprise/docs))
  - Configuration Management
    - Config Sync
    - Policy Controller
  - Connect gateway
  - Fleet
  - Security posture dashboard
  - Threat detection
- Git repository
  - Config Sync

### Each namespace

- [Load Balancer](https://console.cloud.google.com/net-services/loadbalancing/list/loadBalancers)
  - Gateway External Load Balancer
- [Classic SSL Certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates)
  - Gateway SSL Certificate
    - Ray dashboard
- Identity-Aware Proxy (IAP) ([docs](https://cloud.google.com/iap/docs/concepts-overview))
  - Ray head Backend Service
- [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccount)
  - Default
  - Ray head
  - Ray worker