Description
At the moment, every failing job within Spack CI is rebuilt up to 2 times, regardless of the failure reason. Error modes range from OOM kills to typos in package configs to true build failures to network interruptions.
GitLab allows custom rules for when retries are automatically triggered. The CI generate script currently lists every error mode except timeout failures as a valid reason for retrying.
Given that several of these potential failures are deterministic, there is no sense in retrying certain jobs -- this only leads to wasted cycles and inefficiency. GitLab now supports retry rules based on a job's exit code, an approach we could use to avoid retrying jobs that fail for deterministic reasons.
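As a rough illustration, here is a minimal sketch of what a generated job's retry rules could look like, assuming the CI generation step builds job entries as Python dicts that are dumped into `.gitlab-ci.yml`; the failure reasons and exit code listed are illustrative, not a final policy.

```python
import yaml

job = {
    "script": ["spack ci rebuild"],
    "retry": {
        "max": 2,
        # retry only failure reasons that are plausibly transient, instead of
        # everything-but-timeouts as today
        "when": ["runner_system_failure", "api_failure", "scheduler_failure"],
        # GitLab also accepts retry rules keyed on the job's exit code; check
        # the GitLab docs for how `when` and `exit_codes` interact
        "exit_codes": [137],
    },
}

print(yaml.safe_dump({"example-build-job": job}))
```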
Additionally, with the implementation of dynamic resource allocation in CI, we'll need to integrate safety measures such as the ability to automatically retry jobs with more resources if they were OOM killed.
Example workflow
A job fails and Gantry receives the webhook. Gantry checks whether the job was OOM killed and updates the memory limit if necessary.
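A minimal sketch of that handler; all names, payload keys, and the OOM heuristic are illustrative assumptions rather than Gantry's actual internals.

```python
OOM_EXIT_CODE = 137  # 128 + SIGKILL, the usual signature of an OOM kill

# package/job name -> memory limit (MB) that gantry will hand out next time
memory_limits: dict[str, int] = {}


def handle_job_webhook(payload: dict) -> None:
    """React to a GitLab job event webhook."""
    if payload.get("build_status") != "failed":
        return  # only failed jobs are interesting here
    job_name = payload.get("build_name", "")
    if looks_oom_killed(payload):
        # grow the stored limit so a retried job gets more headroom;
        # the 1.5x growth factor is arbitrary
        current = memory_limits.get(job_name, 2048)
        memory_limits[job_name] = int(current * 1.5)


def looks_oom_killed(payload: dict) -> bool:
    # Placeholder heuristic: exit code 137 is the common OOM signature, but the
    # job webhook payload may not expose it directly; a real check might have to
    # inspect the job trace or the runner pod's termination reason instead.
    return payload.get("build_exit_code") == OOM_EXIT_CODE
```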
Challenges
There is no support in GitLab for retrying individual jobs while updating their variables.
Issue 37268 added support for specifying variables when retrying manual jobs via the web UI. A separate issue was filed to push for the same capability in the API. Additionally, the API already supports starting manual jobs with custom variables.
Unfortunately, jobs in Spack CI are not manual; they run automatically after being triggered or scheduled. What we want is to retry an individual job within a pipeline (whether scheduled, triggered, or manual) while being able to pass custom variables. Without this functionality, we won't be able to modify the resource request/limit of a job that was killed due to resource contention.
Issue 387798 would resolve this, but there has been no significant movement toward implementing the feature.
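For reference, the piece that exists today is starting a manual job with custom variables via the jobs API. A sketch, assuming the runner is configured to allow `KUBERNETES_MEMORY_REQUEST` overwrites and using placeholder instance/project values:

```python
import os

import requests

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance
PROJECT_ID = 123                           # placeholder project id
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def play_manual_job(job_id: int, memory_mb: int) -> dict:
    """Start a manual job with custom variables (POST /jobs/:id/play)."""
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs/{job_id}/play",
        headers=HEADERS,
        json={
            "job_variables_attributes": [
                # assumes the runner allows memory request overwrites
                {"key": "KUBERNETES_MEMORY_REQUEST", "value": f"{memory_mb}M"},
            ]
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```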
questions:
- are we able to determine whether a GitLab job was retried and what the original job ID was? it seems like no
alternative solutions
restarting the entire pipeline
- if a pipeline failed due to OOM builds, we follow spackbot's model and recreate the pipeline with this endpoint
- restarting the pipeline would trigger `ci generate` and request new, updated allocations from gantry (see the sketch below)
- cons: every single job in the pipeline will be re-run, which leads to more wasted cycles (though already-built packages are cached?)
- unfortunately, the "retry pipeline" endpoint does not support passing variables
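The trade-off between the two endpoints, sketched with placeholder instance/project values and an illustrative variable name:

```python
import os

import requests

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance
PROJECT_ID = 123                           # placeholder project id
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def retry_pipeline(pipeline_id: int) -> dict:
    """Re-run the failed jobs of an existing pipeline; accepts no variables."""
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/pipelines/{pipeline_id}/retry",
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def recreate_pipeline(ref: str, memory_mb: int) -> dict:
    """Create a brand-new pipeline on `ref`, which re-runs `ci generate`
    and can carry variables, at the cost of re-running every job."""
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/pipeline",
        headers=HEADERS,
        json={
            "ref": ref,
            "variables": [
                # illustrative variable name, not something gantry defines today
                {"key": "GANTRY_MEMORY_HINT", "value": str(memory_mb)},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```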
pre_build script
- we could use the `pre_build` script in spack-infrastructure to check if a job has been retried and update the job variables
- this is not possible because the runner for the restarted job would have already been allocated at this point
k8s middleware
- we could create a k8s controller or operator to update a container/pod with new resource requests
- this would be the most technically complex solution, and it's unclear whether it's feasible or worth it if we move away from Kubernetes in the future
- also: by the time the job reaches k8s, the packing decision has likely already been made by Karpenter
downstream/child pipelines and manual jobs
- as a roundabout way to create new jobs without restarting the whole pipeline, we can create duplicate jobs that will succeed once they are no longer OOM killed (a sketch follows at the end of this section)
- if a build job was OOM killed, create child pipelines containing an identical copy of the job, marked as manual
- gantry will "start" the manual job with updated resource limits and retry up to N times
- questions:
- how exactly can we create a duplicate of the build job?
- will this confuse users who will see even more nesting of jobs in the GitLab UI?
- would the success of a child pipeline be reflected in the pipeline status?
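A sketch of how the duplicated job could be emitted into a child pipeline, assuming the original generated job entry is available as a dict and that the runner honors `KUBERNETES_MEMORY_REQUEST` overwrites; gantry would then start the manual copy via the play API shown earlier.

```python
import yaml


def child_pipeline_yaml(job_name: str, original_job: dict, memory_mb: int) -> str:
    """Emit child-pipeline YAML containing a manual duplicate of a build job.

    How to faithfully duplicate a generated build job is exactly the open
    question above; `original_job` stands in for that job's generated entry.
    """
    duplicate = dict(original_job)
    duplicate["when"] = "manual"  # so gantry can "play" it with new variables
    variables = dict(duplicate.get("variables", {}))
    # assumes the runner allows memory request overwrites
    variables["KUBERNETES_MEMORY_REQUEST"] = f"{memory_mb}M"
    duplicate["variables"] = variables
    return yaml.safe_dump({f"{job_name}-oom-retry": duplicate})
```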
TODO:
- Retry OOM killed jobs #4
- cut down on unnecessary retries based on exit code/reason
- create a follow-up issue to track the GitLab retry-with-variables issue and refactor the OOM handling