v20250310-124810

Released by github-actions on 10 Mar 12:50 · fc07220
Reuse Ephemeral runners (#6315)

# About

With the goal of eventually moving to all instances being ephemeral, we
need to fix the major limitation we have with ephemeral instances:
stockouts.

This is a problem because we currently release the instances back to AWS
as soon as they finish a job.

The goal is to have instances reused before they are returned to AWS, as
follows (a sketch of the flow appears after this list):

* Tagging ephemeral instances that finished a job with
`EphemeralRunnerFinished=finish_timestamp`, hinting to scaleUp that they
can be reused;
* scaleUp finds instances that have the `EphemeralRunnerFinished` tag and
tries to use them to run a new job;
* scaleUp acquires a lock on the instance name to avoid concurrent
reuse;
* scaleUp marks re-deployed instances with an
`EBSVolumeReplacementRequestTm` tag, recording when the instance was
marked for reuse;
* scaleUp removes `EphemeralRunnerFinished` so others won't find the same
instance for reuse;
* scaleUp creates the necessary SSM parameters and returns the instance
to its fresh state by restoring its EBS volume.
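
Below is a minimal TypeScript sketch of that reuse flow, not the actual
test-infra implementation: the AWS SDK v3 calls (`DescribeInstances`,
`CreateTags`, `DeleteTags`, `CreateReplaceRootVolumeTask`) are real EC2 APIs,
while the in-memory lock, the `RunnerType` tag name, and the elided SSM step
are placeholder assumptions.

```typescript
// Sketch only: a simplified view of the reuse path described above.
import {
  EC2Client,
  DescribeInstancesCommand,
  CreateTagsCommand,
  DeleteTagsCommand,
  CreateReplaceRootVolumeTaskCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({});

// Whoever observes the job finishing tags the instance as reusable (simplified here).
export async function markRunnerFinished(instanceId: string): Promise<void> {
  await ec2.send(
    new CreateTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished', Value: `${Math.floor(Date.now() / 1000)}` }],
    }),
  );
}

// Stand-in for the real lock: prevents two scaleUp runs from reusing the same instance.
const heldLocks = new Set<string>();
async function tryAcquireLock(key: string): Promise<boolean> {
  if (heldLocks.has(key)) return false;
  heldLocks.add(key);
  return true;
}

// scaleUp: try to reuse a finished ephemeral instance before launching a new one.
export async function tryReuseInstance(runnerType: string): Promise<string | undefined> {
  const { Reservations } = await ec2.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: 'tag-key', Values: ['EphemeralRunnerFinished'] },
        { Name: 'tag:RunnerType', Values: [runnerType] }, // hypothetical tag name
        { Name: 'instance-state-name', Values: ['running'] },
      ],
    }),
  );
  const candidates = (Reservations ?? []).flatMap((r) => r.Instances ?? []);

  for (const instance of candidates) {
    const instanceId = instance.InstanceId;
    if (!instanceId || !(await tryAcquireLock(instanceId))) continue;

    // Record when the instance was picked for reuse, then drop the "finished"
    // tag so no other scaleUp run grabs the same instance.
    await ec2.send(
      new CreateTagsCommand({
        Resources: [instanceId],
        Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: `${Math.floor(Date.now() / 1000)}` }],
      }),
    );
    await ec2.send(
      new DeleteTagsCommand({ Resources: [instanceId], Tags: [{ Key: 'EphemeralRunnerFinished' }] }),
    );

    // Restore the root EBS volume to its launch state so the runner starts fresh.
    await ec2.send(new CreateReplaceRootVolumeTaskCommand({ InstanceId: instanceId }));

    // ...create the SSM parameters (e.g. the runner registration config) here...
    return instanceId;
  }
  return undefined; // nothing reusable; fall back to launching a new instance
}
```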

ScaleDown then:
* Avoids removing ephemeral instances that are still within
`minRunningTime`, measured from either the creation time,
`EphemeralRunnerFinished`, or `EBSVolumeReplacementRequestTm`, depending
on the instance status (see the sketch below);
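
A similarly hedged sketch of the scaleDown guard follows; the tag names come
from the list above, while the `InstanceInfo` shape, the minutes unit, and the
"most recent marker wins" reading of the status-dependent choice are
assumptions.

```typescript
// Sketch only: how scaleDown can decide whether an ephemeral instance is still
// protected by minRunningTime, assuming Unix-timestamp tag values as in the
// sketch above.
interface InstanceInfo {
  launchTime: Date;               // instance creation time
  tags: Record<string, string>;   // includes the reuse markers when present
}

// Pick the reference timestamp: a reuse marker when present, otherwise creation time.
function referenceTimestamp(instance: InstanceInfo): Date {
  const candidates = [instance.launchTime.getTime()];
  for (const key of ['EphemeralRunnerFinished', 'EBSVolumeReplacementRequestTm']) {
    const raw = instance.tags[key];
    if (raw !== undefined) candidates.push(Number(raw) * 1000);
  }
  return new Date(Math.max(...candidates));
}

// scaleDown skips instances whose reference timestamp is newer than minRunningTime.
export function isProtectedByMinRunningTime(
  instance: InstanceInfo,
  minRunningTimeMinutes: number,
): boolean {
  const ageMinutes = (Date.now() - referenceTimestamp(instance).getTime()) / 60_000;
  return ageMinutes < minRunningTimeMinutes;
}
```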

# Disaster recovery plan

If this PR introduces breakages, they will almost certainly be related
to the capacity to deploy new instances/runners, rather than to any
change in the behaviour of the runner itself.

So, after reverting this change, it will be important to make sure the
runner queue is under control. This should be accomplished by checking
the queue size on [hud metrics](https://hud.pytorch.org/metrics) and
running the
[send_scale_message.py](https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/scale_tools/send_scale_message.py)
script to make sure those instances are properly deployed by the
stable version of the scaler.

## Step by step to revert this change from **META**

1 - Identify whether this PR is causing the observed problem: [look at
the queue size](https://hud.pytorch.org/metrics) and check whether it is
related to the impacted runners (the ephemeral ones). It can also help to
investigate the
[metrics on
unidash](https://www.internalfb.com/intern/unidash/dashboard/aws_infra_monitoring_for_github_actions/lambda_scaleup)
and the
[logs](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/gh-ci-scale-up?tab=monitoring)
related to the scaleUp lambda;

2 - If the problem is confirmed to be triggered by this PR, revert it
from main, so it cannot cause impact again if someone working on other
changes accidentally releases a version of test-infra that includes it.

3 - To restore the infrastructure to its state before this change:

A) Find the commit (or, unlikely, more than one) on pytorch-gha-infra
that points to a release version of test-infra containing this change
(most likely the latest). It will be a change updating the Terrafile to
point to a newer version of test-infra
([example](https://github.com/pytorch-labs/pytorch-gha-infra/commit/c4e888f58441b18a0fd6e19a1b935667750c6ba2)).
By convention, such commits are named `Release vDATE-TIME`, e.g.
`Release v20250204-163312`

B) Revert that commit from
https://github.com/pytorch-labs/pytorch-gha-infra

C) Follow [the
steps](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.vj4fvy46wzwk)
outlined in the PyTorch GHA Infra runbook;

D) That document also has pointers for monitoring, for confirming
recovery in the metrics / queue / logs you identified, and for verifying
that the system has fully recovered;

4 - Restore user experience:

A) If you have access, follow the [instructions on how to recover queued
ephemeral
jobs](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.ba0nyrda8jch)
in the above-mentioned document;

B) Alternatively, cancel the queued jobs and trigger them again;