
Commit dddb74a

updating the resource autodetection logic
Signed-off-by: Raghavendra Dani <[email protected]>
1 parent 0692fce commit dddb74a

1 file changed

reps/2023-01-09-aws-fleet-support.md

Lines changed: 2 additions & 1 deletion
@@ -53,7 +53,7 @@ An EC2 fleet configuration will be encapsulated within a single node type from t
 
 ### How will the demand be converted into resources?
 
-The autoscaling monitor receives several resource demands from metrics such as vCPUs, memory, GPUs, etc. [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources. Hence, the **node_provider** will determine CPU, memory, and GPU from the `InstanceRequirements` and specified `InstanceType` parameters and aggregate them based on the latest-family or highest-spec instance. As a result, the autoscaler will end up spinning up fewer nodes than necessary, which avoids the overscaling problem. However, the [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect any underscaling issues.
+The autoscaling monitor receives several resource demands from metrics such as vCPUs, memory, GPUs, etc. [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources. Hence, the **node_provider** will determine CPU, memory, and GPU from the `InstanceRequirements` and specified `InstanceType` parameters and aggregate them based on the lowest-spec instance (see the sketch below). As a result, the autoscaler may end up spinning up more nodes than necessary, which leads to overscaling; the autoscaler will identify idle nodes and remove them if required in subsequent iterations. However, the [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect any underscaling issues.
 
 ### How will spot interruptions be handled?
 
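To make the new aggregation concrete, here is a minimal sketch of how a node provider could autodetect the resources of a fleet node type by taking the lowest-spec instance across the fleet's candidate instance types. The helper name, the `node_config` layout, and the region argument are assumptions for illustration; `describe_instance_types` is a real EC2 API, but this is not Ray's actual implementation.

```python
# Minimal sketch (hypothetical helper, not Ray's actual code): aggregate
# CPU, memory, and GPU across a fleet's candidate instance types by taking
# the minimum, i.e. the "lowest-spec instance" behavior described above.
import boto3

def autodetect_fleet_resources(node_config: dict, region: str) -> dict:
    """Return the guaranteed-minimum resources for a fleet node type."""
    ec2 = boto3.client("ec2", region_name=region)
    # Assumed node_config layout: launch template overrides listing explicit
    # instance types, as in an EC2 CreateFleet request. Overrides that use
    # InstanceRequirements instead would first need to be expanded, e.g. via
    # the EC2 GetInstanceTypesFromInstanceRequirements API.
    instance_types = [
        o["InstanceType"]
        for o in node_config["LaunchTemplateConfigs"][0]["Overrides"]
        if "InstanceType" in o
    ]
    resp = ec2.describe_instance_types(InstanceTypes=instance_types)
    cpus, mem_mib, gpus = [], [], []
    for info in resp["InstanceTypes"]:
        cpus.append(info["VCpuInfo"]["DefaultVCpus"])
        mem_mib.append(info["MemoryInfo"]["SizeInMiB"])
        gpus.append(sum(g["Count"] for g in info.get("GpuInfo", {}).get("Gpus", [])))
    # Taking the minimum means the scheduler never assumes more resources
    # than the smallest instance the fleet might hand back; the cost is
    # possible overscaling, reclaimed later as idle nodes.
    return {
        "CPU": min(cpus),
        "memory": min(mem_mib) * 1024 * 1024,  # MiB -> bytes
        "GPU": min(gpus),
    }
```

Under this scheme, a fleet node type offering m5.xlarge (4 vCPUs, 16 GiB) and m5.2xlarge (8 vCPUs, 32 GiB) advertises 4 CPUs and 16 GiB to the scheduler, so a demand for 16 CPUs yields 4 workers even if larger instances are actually launched; the surplus shows up as idle capacity that later autoscaler iterations can reclaim.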

@@ -109,6 +109,7 @@ ray.worker.default:
 - EC2 does not delete a fleet when all of its instances are terminated. Hence, implement the [post_process](https://github.com/ray-project/ray/blob/c51b0c9a5664e5c6df3d92f9093b56e61b48f514/python/ray/autoscaler/node_provider.py#L258) method for the AWS node provider to clean up any active fleets that have only terminated instances (a rough sketch follows the diff below).
 - Add example [autoscaler config](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws) documentation to help developers utilize the EC2 fleet functionality.
 - Validate the [launch config check logic](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/autoscaler.py#L541), given that the `node_config` will be different for an EC2 fleet.
+- Update documentation to recommend that developers use instances with a similar resource shape to avoid overscaling, even though an EC2 fleet allows them to acquire instances at lower cost.
 - Update the Ray test suite to cover integration and unit tests.
 
 ## Compatibility, Deprecation, and Migration Plan
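As a rough sketch of the fleet cleanup item above: the standalone function, region handling, and tag filtering are assumptions; `describe_fleets`, `describe_fleet_instances`, and `delete_fleets` are real EC2 APIs, but Ray's actual `post_process` hook may be structured differently.

```python
# Sketch (assumed shape, not Ray's actual implementation): delete EC2 fleets
# whose instances have all terminated, since EC2 keeps such fleets active.
import boto3

def cleanup_terminated_fleets(region: str) -> None:
    """Delete active fleets that no longer have any running instances."""
    ec2 = boto3.client("ec2", region_name=region)
    # A real implementation would filter to fleets launched by this cluster,
    # e.g. via the tags the autoscaler stamps on resources it creates.
    fleets = ec2.describe_fleets()["Fleets"]  # pagination omitted for brevity
    stale = []
    for fleet in fleets:
        if fleet["FleetState"] != "active":
            continue
        resp = ec2.describe_fleet_instances(FleetId=fleet["FleetId"])
        if not resp["ActiveInstances"]:  # only terminated instances remain
            stale.append(fleet["FleetId"])
    if stale:
        # The instances are already gone, so there is nothing to terminate.
        ec2.delete_fleets(FleetIds=stale, TerminateInstances=False)
```

In the real provider this logic would presumably run inside `post_process`, invoked from the autoscaler loop, so stale fleets are swept up shortly after their last instance terminates.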
