kubeflow install not working #1015

Open
NerdyShawn opened this issue Jan 22, 2025 · 4 comments

Comments

@NerdyShawn
Contributor

It was reported that the kubeflow install was not working, and I was able to reproduce this on k3s 1.30.5 as well.

k logs install-kubeflow-test02-889a-zpw57
{"time":"2025-01-22T13:38:22.859258313Z","level":"INFO","msg":"Cloneing git repo https://github.com/civo/kubernetes-marketplace\n"}
{"time":"2025-01-22T13:38:22.859526351Z","level":"INFO","msg":"Creating temp dir to clone git repo"}
{"time":"2025-01-22T13:38:22.859711861Z","level":"INFO","msg":"Created temp dir: /tmp/prefix858858872"}
{"time":"2025-01-22T13:38:24.585771359Z","level":"INFO","msg":"Validating that app exists: kubeflow\n"}
{"time":"2025-01-22T13:38:24.586054501Z","level":"INFO","msg":"Running App PreInstall"}
{"time":"2025-01-22T13:38:24.586096877Z","level":"INFO","msg":"Cheking the pre_install.sh is preset or not"}
{"time":"2025-01-22T13:38:24.586217012Z","level":"INFO","msg":"No pre_install.sh found fo kubeflow"}
{"time":"2025-01-22T13:38:24.586391162Z","level":"INFO","msg":""}
{"time":"2025-01-22T13:38:24.58642236Z","level":"INFO","msg":"Running App Install"}
Error: missing closing brace
Usage:
  marketplace-installer install [flags]

Examples:
install <app name>

Flags:
  -h, --help   help for install

Global Flags:
  -d, --git-url string   The git repo to clone from (default "https://git.civo.com/civo/marketplace.git")

I believe it is caused by the way the install is currently handled: it breaks from the desired pattern by performing an additional git retrieval to pull the manifest down:

while ! kubectl apply -f https://raw.githubusercontent.com/civo/kubernetes-marketplace/master/kubeflow/kubeflow.yaml; do echo "Retrying to apply resources"; sleep 10; done

The desired fix would be to implement a supported install method.

@NerdyShawn
Contributor Author

Another issue is the ordering here: the current setup relied on brute-force applies instead of installing the needed CRDs before creating the desired CRs. Additionally, the knative dependency relies on a deprecated API that doesn't work with 1.30.5:

error: resource mapping not found for name: "webhook" namespace: "knative-serving" from "kubeflow.yaml": no matches for kind "HorizontalPodAutoscaler" in version "autoscaling/v2beta2"
ensure CRDs are installed first

So this should move to autoscaling/v2.
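
As a rough illustration of the API version change, here is a minimal sketch that rewrites the deprecated HPA `apiVersion` lines in a manifest read from stdin. The function name `migrate_hpa_api` is hypothetical, and the sketch assumes the affected objects only use fields that carry over unchanged to `autoscaling/v2`; the result should still be reviewed before applying.

```shell
#!/usr/bin/env bash
# Rewrite deprecated HPA apiVersion lines read from stdin, write to stdout.
# Only whole "apiVersion: autoscaling/v2beta2" lines are touched, so other
# mentions of the string (comments, annotations) are left alone.
# migrate_hpa_api is a hypothetical helper name, not part of the repo.
migrate_hpa_api() {
  sed -E 's#^([[:space:]]*apiVersion:[[:space:]]*)autoscaling/v2beta2[[:space:]]*$#\1autoscaling/v2#'
}
```

Usage would look something like `migrate_hpa_api < kubeflow.yaml > kubeflow.migrated.yaml`, followed by a diff of the two files.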

@NerdyShawn
Contributor Author

proposed fix

  • splits the CRD install out from the rest of the manifest so the CRDs can be installed before they are referenced

  • creates an uninstall script

  • updates the reference to a deprecated HPA API version

  • kubeflow rework install/uninstall #1016
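
The CRD-first ordering described in the list above could be sketched roughly as follows. The file names `kubeflow-crds.yaml` and `kubeflow-resources.yaml` and the function name are assumptions for illustration, not the names used in #1016, and the sketch assumes the first file contains only CustomResourceDefinition objects.

```shell
#!/usr/bin/env bash
# Hypothetical file names; the real split may differ.
CRD_MANIFEST="${CRD_MANIFEST:-kubeflow-crds.yaml}"
APP_MANIFEST="${APP_MANIFEST:-kubeflow-resources.yaml}"

install_crds_first() {
  # 1. Register the CRDs.
  kubectl apply -f "$CRD_MANIFEST" || return 1
  # 2. Wait until the API server reports every CRD in that file as Established.
  #    The 120s timeout is an arbitrary choice for this sketch.
  kubectl wait --for=condition=Established --timeout=120s -f "$CRD_MANIFEST" || return 1
  # 3. Only now create the objects that reference those kinds.
  kubectl apply -f "$APP_MANIFEST"
}
```

Waiting on the `Established` condition avoids the "no matches for kind" class of errors for custom resources without having to re-apply everything in a loop.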

@davenull
Contributor

Hey Shawn! The issue with using the kubeflow spec as released is that they distribute it as a single YAML file that is 600k lines long, and their official install method is to loop over the install until it doesn't fail. The spec has numerous objects that are out of order for a first pass, and any one of them causes a re-application of the full 600k lines of YAML. The second run and every run after it also carries the overhead of a full compare-and-reconcile of each k8s object in the spec. This repeats until there is no error.

The changes we have made are purely organizational: the spec is broken into stages, the stages are ordered so that they satisfy dependencies, and we apply them one by one with a script that can retry each spec individually. The issue here is that the script is a bit naive: it assumes the file listing always comes back in order, and in the install job pod it does not. This is easy enough to fix and I am working on it at the moment; essentially, some more checking and sorting of the list before application will fix the issue. To be more exact, the file names are all prefixed with a two-digit integer, and two Civo-specific additions are numbered 98 and 99 so that they apply after the kf specs. The OS started returning them before file 00 some time in the last week, and that is what is causing the error on installation of the kf spec.
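
A minimal sketch of the "sort before applying" idea, not the actual install script: it lists the stage files, sorts them explicitly by name, and retries each stage on its own. The function names, the `.yaml` extension, the retry cap, and the sleep interval are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print stage manifests in numeric-prefix order, one path per line.
# Assumes two-digit prefixes (00..99), a .yaml extension, and no newlines in names.
list_stages() {
  find "$1" -maxdepth 1 -type f -name '[0-9][0-9]*.yaml' | LC_ALL=C sort
}

# Apply every stage in order; retry a failing stage instead of restarting from 00.
apply_stages() {
  local dir max_tries file tries
  dir="$1"
  max_tries="${2:-30}"   # hypothetical cap, not taken from the real script
  while IFS= read -r file; do
    [ -n "$file" ] || continue
    tries=1
    until kubectl apply -f "$file" </dev/null; do
      if [ "$tries" -ge "$max_tries" ]; then
        echo "Giving up on $file after $tries attempts" >&2
        return 1
      fi
      tries=$((tries + 1))
      echo "Retrying $file"
      sleep 10
    done
  done <<EOF
$(list_stages "$dir")
EOF
}
```

Sorting with `LC_ALL=C` keeps the order independent of both the directory listing order and the locale, so 98 and 99 always land after 00.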

@NerdyShawn
Contributor Author

Hi Dave! 👋🏼
Thanks for the detail, it is wild to see that's what they recommend for their install. FWIW, I took that large single YAML file and split it into multiple files so the bash scripts could order the CRDs and namespaces before the CRs and other resources. I was testing locally on a fresh Civo k3s cluster, and it seemed to work fine with that method on some large nodes.

The missing closing brace error from marketplace-installer was a new one for me, so I had just assumed the install had to be reworked to avoid the known errors when CRDs aren't available yet.

I did find the documented install loop you talked about, albeit it looks like that sleep 10 is now a sleep 20 in v1.9.0 😆

One additional critique of the install was the extra call back out to the git repo. The repo should already be pulled down and available to the marketplace-installer binary at that point, so the install should be able to use kubeflow.yaml locally. I don't think that was the issue; it just seemed unnecessary when the file should already be local.
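
For illustration, applying the already-cloned copy might look something like the sketch below. It assumes kubeflow.yaml sits next to the install script inside the cloned repo and that the script is executed rather than sourced; how marketplace-installer actually invokes the script is not confirmed here, and `apply_local_manifest` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Apply kubeflow.yaml from a local directory instead of fetching it again.
# Default directory: wherever this script lives (assumes the script is executed,
# not sourced, and that kubeflow.yaml sits beside it in the cloned repo).
apply_local_manifest() {
  local app_dir manifest
  app_dir="${1:-$(cd "$(dirname "$0")" && pwd)}"
  manifest="$app_dir/kubeflow.yaml"
  if [ ! -f "$manifest" ]; then
    echo "manifest not found: $manifest" >&2
    return 1
  fi
  # Same retry loop as the current install, pointed at the local file.
  while ! kubectl apply -f "$manifest"; do
    echo "Retrying to apply resources"
    sleep 10
  done
}
```

This keeps the behavior of the existing loop and only removes the second network fetch.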
