Support JAX Runtimes #2442

Open
Electronic-Waste opened this issue Feb 17, 2025 · 9 comments

Comments

@Electronic-Waste
Member

Electronic-Waste commented Feb 17, 2025

What you would like to be added?

Part of: #2170

As planned in the Kubeflow Trainer V2 API, we should support a JAX runtime after we implement the PyTorch runtime.

The work includes:

  • Implement runtime plugins for JAX if necessary
  • Create ClusterTrainingRuntime for JAX (single-node, multi-node)
  • Add some unit tests and e2e tests
  • Add some user guides for JAX runtimes
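
As a rough illustration of what the multi-node item above involves: JAX's multi-process mode is started by calling `jax.distributed.initialize` with a coordinator address, the total number of processes, and each process's ID, so a runtime would need to give every node those three values. The sketch below only computes them in plain Python. The function name, the pod naming scheme (`<job>-node-<i>`), the headless service name, and the port are all hypothetical and not taken from Kubeflow Trainer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JaxProcessConfig:
    """Values one process would pass to jax.distributed.initialize."""
    coordinator_address: str
    num_processes: int
    process_id: int


def plan_jax_processes(job_name: str, num_nodes: int, port: int = 6666) -> list[JaxProcessConfig]:
    """Return one config per node; node 0 hosts the coordinator.

    Assumption (hypothetical naming): pods are called "<job_name>-node-<i>"
    and sit behind a headless service called "<job_name>". A real runtime
    may name and wire things differently.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Every process must agree on the same coordinator address.
    coordinator = f"{job_name}-node-0.{job_name}:{port}"
    return [
        JaxProcessConfig(coordinator, num_nodes, process_id)
        for process_id in range(num_nodes)
    ]


if __name__ == "__main__":
    for cfg in plan_jax_processes("jax-demo", num_nodes=2):
        print(cfg)
```

In a training script, each process would then pass its own values to `jax.distributed.initialize(coordinator_address=..., num_processes=..., process_id=...)` before doing any other JAX work. The single-node case is the same plan with `num_nodes=1`.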

/area runtime
/cc @kubeflow/wg-training-leads @saileshd1402 @astefanutti @juliusvonkohout @franciscojavierarceo @varodrig @rareddy @thesuperzapper @seanlaii @deepanker13 @helenxie-bit @Doris-xm @truc0 @mahdikhashan

Why is this needed?

This is planned in KEP-2170.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

@Electronic-Waste
Member Author

/remove-label lifecycle/needs-triage

@andreyvelich
Member

Thank you for creating this @Electronic-Waste!
I think you can remove KEP-2170 from the title, since we can implement this after the first v2.0.0 release.
Also, should we split this task into two separate issues?

We might require two KEPs for every Runtime, since it requires:

  1. API changes
  2. Extension of our runtime framework to support more plugins (or re-using PlainML plugin).
WDYT @kubeflow/wg-training-leads @astefanutti @Electronic-Waste?

@Electronic-Waste changed the title from "KEP-2170: Support JAX/TensorFlow Runtimes" to "Support JAX Runtimes" on Feb 17, 2025
@Electronic-Waste
Member Author

@andreyvelich SGTM. I'll split it into two separate issues.

@tenzen-y
Member

  • API changes
  • Extension of our runtime framework to support more plugins (or re-using PlainML plugin).

Does "API" mean CRD? SDK API?

@andreyvelich
Member

andreyvelich commented Feb 17, 2025

Does "API" mean CRD? SDK API?

If we need to create a new MLPolicy: JAX, we need to extend the CRD APIs.
But I think we should also ask contributors to include SDK API changes (if needed) in the KEP, since the SDK is the main interface for end users.

@tenzen-y
Member

Does "API" mean CRD? SDK API?

If we need to create a new MLPolicy: JAX, we need to extend the CRD APIs. But I think we should also ask contributors to include SDK API changes (if needed) in the KEP, since the SDK is the main interface for end users.

SGTM

@Electronic-Waste
Member Author

@andreyvelich @tenzen-y Shall we create two GSoC projects for supporting JAX/TF Runtimes? They need two separate KEPs.

@andreyvelich
Member

@andreyvelich @tenzen-y Shall we create two GSoC projects for supporting JAX/TF Runtimes? They need two separate KEPs.

I am not sure we have a sufficient number of slots, since we have already proposed 12 projects.
If you have the bandwidth to mentor more students, maybe we could separate it.

@Electronic-Waste
Member Author

Electronic-Waste commented Feb 19, 2025

@andreyvelich This is my first time mentoring students in GSoC, and I am already serving as primary mentor for 2 projects and backup mentor for some others. I'm not sure whether I could handle 3 projects... Anyway, let's discuss it in the upcoming WG Training/AutoML call.

3 participants