GKE AI/ML Platform reference architecture for enabling Machine Learning Operations (MLOps)

Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles:

  • The platform admin will create the GKE platform using an IaC tool such as Terraform. The IaC will include reusable modules that can be referenced to create more resources as demand grows.
  • The platform will be based on GitOps.
  • After the GKE platform has been created, admins will create cluster-scoped resources on it through Config Sync.
  • Platform admins will create one namespace per application and give the application team members full access to it.
  • The Application/ML teams will create namespace-scoped resources either through Config Sync or through a deployment tool such as Cloud Deploy.
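
As a rough illustration of the namespace-per-application principle, the sketch below builds the two Kubernetes objects a platform admin might commit to the Git repository: a Namespace and a RoleBinding that grants the team Kubernetes' built-in `admin` ClusterRole inside that namespace only. The function name, the application name, and the group address are hypothetical, and the reference architecture may structure these manifests differently.

```python
import json


def namespace_manifests(app_name: str, team_group: str) -> list[dict]:
    """Build a Namespace and a namespace-scoped RoleBinding for one team.

    Hypothetical helper for illustration; not part of the reference architecture.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": app_name},
    }
    # A RoleBinding (not a ClusterRoleBinding) limits the built-in "admin"
    # ClusterRole to this one namespace, so the team gets full access there
    # without any cluster-wide permissions.
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{app_name}-team-admin", "namespace": app_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "admin",
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Group",
                "name": team_group,  # assumed group identity for the team
            }
        ],
    }
    return [namespace, role_binding]


if __name__ == "__main__":
    # Example values are placeholders, not names used by the platform.
    for manifest in namespace_manifests("fraud-detection", "fraud-ml-team@example.com"):
        print(json.dumps(manifest, indent=2))
```

Generating the objects from one function keeps every application namespace consistent; in a GitOps flow the output would be reviewed and committed rather than applied by hand.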

For an outline of products and features used in the platform, see the Platform Products and Features document.

Critical User Journeys (CUJs)

Persona : Platform Admin

  • Offer a platform that incorporates established best practices.
  • Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
  • Establish secure channels for end users to interact seamlessly with the platform.
  • Empower the enforcement of robust security policies across the platform.

Persona : Machine Learning Engineer

  • Deploy the model with ease and make the endpoints available only to the intended audience.
  • Continuously monitor model performance and resource utilization.
  • Troubleshoot any performance or integration issues.
  • Version, store, and access the models and model artifacts:
    • To debug and troubleshoot in production and trace back to the specific model version and its associated training data.
    • To perform a quick, controlled rollback to a previous, more stable version.
  • Implement a feedback loop to adapt to changing data and business needs:
    • Retrain or fine-tune the model.
    • Split traffic between models (A/B testing).
    • Switch between models without breaking the inference system for end users.
  • Scale the infrastructure up or down to accommodate changing needs.
  • Share insights and findings with stakeholders so they can make data-driven decisions.
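
The traffic-splitting and model-switching journeys can be pictured with a small weighted router. This is a minimal sketch under stated assumptions, not how the platform implements A/B testing: the function `pick_model`, the version names, and the weights are all hypothetical, and a real deployment would more likely do this at the load balancer or service mesh layer.

```python
import hashlib


def pick_model(request_id: str, weights: dict[str, int]) -> str:
    """Deterministically route a request to a model version by integer weight."""
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the request ID so the same caller always lands on the same version.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    # Sort by name so the bucket ranges do not depend on dict insertion order.
    for version, weight in sorted(weights.items()):
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    # Hypothetical 90/10 split between a stable and a candidate version.
    split = {"model-v1": 90, "model-v2": 10}
    counts = {name: 0 for name in split}
    for i in range(10_000):
        counts[pick_model(f"user-{i}", split)] += 1
    print(counts)
```

Because routing depends only on the weights, a rollback in this sketch is a configuration change: setting the candidate's weight to 0 sends all traffic back to the previous version without touching the callers.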

Persona : Machine Learning Operator

  • Provide and maintain software required by the end users of the platform.
  • Operationalize experimental workloads by providing guidance and best practices for running them on the platform.
  • Deploy the workloads on the platform.
  • Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

Prerequisites

Deploy the platform

Playground Reference Architecture: Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

Use cases

Resources