
Commit dddb74a

updating the resource autodetection logic
Signed-off-by: Raghavendra Dani <[email protected]>
1 parent 0692fce commit dddb74a

1 file changed

reps/2023-01-09-aws-fleet-support.md

Lines changed: 2 additions & 1 deletion
@@ -53,7 +53,7 @@ An EC2 fleet configuration will be encapsulated within a single node type from t
 
 ### How will the demand be converted into resources?
 
-The autoscaling monitor receives several resource demands from metrics such as vCPUs, memory, GPUs, etc. [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources. Hence, the **node_provider** will determine CPU, memory, and GPU from the `InstanceRequirements` and specified `InstanceType` parameters and aggregate them based on the latest-family or highest-spec instance. As a result, the autoscaler will end up spinning up fewer nodes than necessary, which avoids the overscaling problem. However, the [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect any underscaling issues.
+The autoscaling monitor receives several resource demands from metrics such as vCPUs, memory, GPUs, etc. [resource_demand_scheduler.py](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py) is responsible for converting the demand into a number of workers per node type specified in the autoscaler config. However, it relies on the **node provider** to statically determine the CPU, memory, GPU, and any custom resources. Hence, the **node_provider** will determine CPU, memory, and GPU from the `InstanceRequirements` and specified `InstanceType` parameters and aggregate them based on the lowest-spec instance (see the sketch below). As a result, the autoscaler may end up spinning up more nodes than necessary, which leads to overscaling; the autoscaler will identify idle nodes and remove them if required in subsequent iterations. However, the [current behavior](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L403) also does not guarantee that the target capacity will be reached; the autoscaling monitor loop will detect any underscaling issues.
 
 ### How will spot interruptions be handled?
 
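To make the new aggregation concrete, here is a minimal sketch of how a node provider could autodetect the resources of a fleet node type by taking the lowest-spec instance across the fleet's candidate instance types. The helper name, the `node_config` layout, and the region argument are assumptions for illustration; `describe_instance_types` is a real EC2 API, but this is not Ray's actual implementation.

```python
# Minimal sketch (hypothetical helper, not Ray's actual code): aggregate
# CPU, memory, and GPU across a fleet's candidate instance types by taking
# the minimum, i.e. the "lowest-spec instance" behavior described above.
import boto3

def autodetect_fleet_resources(node_config: dict, region: str) -> dict:
    """Return the guaranteed-minimum resources for a fleet node type."""
    ec2 = boto3.client("ec2", region_name=region)
    # Assumed node_config layout: launch template overrides listing explicit
    # instance types, as in an EC2 CreateFleet request. Overrides that use
    # InstanceRequirements instead would first need to be expanded, e.g. via
    # the EC2 GetInstanceTypesFromInstanceRequirements API.
    instance_types = [
        o["InstanceType"]
        for o in node_config["LaunchTemplateConfigs"][0]["Overrides"]
        if "InstanceType" in o
    ]
    resp = ec2.describe_instance_types(InstanceTypes=instance_types)
    cpus, mem_mib, gpus = [], [], []
    for info in resp["InstanceTypes"]:
        cpus.append(info["VCpuInfo"]["DefaultVCpus"])
        mem_mib.append(info["MemoryInfo"]["SizeInMiB"])
        gpus.append(sum(g["Count"] for g in info.get("GpuInfo", {}).get("Gpus", [])))
    # Taking the minimum means the scheduler never assumes more resources
    # than the smallest instance the fleet might hand back; the cost is
    # possible overscaling, reclaimed later as idle nodes.
    return {
        "CPU": min(cpus),
        "memory": min(mem_mib) * 1024 * 1024,  # MiB -> bytes
        "GPU": min(gpus),
    }
```

Under this scheme, a fleet node type offering m5.xlarge (4 vCPUs, 16 GiB) and m5.2xlarge (8 vCPUs, 32 GiB) advertises 4 CPUs and 16 GiB to the scheduler, so a demand for 16 CPUs yields 4 workers even if larger instances are actually launched; the surplus shows up as idle capacity that later autoscaler iterations can reclaim.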

@@ -109,6 +109,7 @@ ray.worker.default:
 - EC2 does not delete a fleet when all of its instances are terminated. Hence, implement the [post_process](https://github.com/ray-project/ray/blob/c51b0c9a5664e5c6df3d92f9093b56e61b48f514/python/ray/autoscaler/node_provider.py#L258) method for the AWS node provider to clean up any active fleets that have only terminated instances (a rough sketch follows the diff below).
 - Add example [autoscaler config](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws) documentation to help developers utilize the EC2 fleet functionality.
 - Validate the [launch config check logic](https://github.com/ray-project/ray/blob/a03a141c296da065f333ea81445a1b9ad49c3d00/python/ray/autoscaler/_private/autoscaler.py#L541), given that the `node_config` will be different for an EC2 fleet.
+- Update documentation to recommend that developers use instances with a similar resource shape to avoid overscaling, even though an EC2 fleet allows them to acquire instances at lower cost.
 - Update the Ray test suite to cover integration and unit tests.
 
 ## Compatibility, Deprecation, and Migration Plan
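As a rough sketch of the fleet cleanup item above: the standalone function, region handling, and tag filtering are assumptions; `describe_fleets`, `describe_fleet_instances`, and `delete_fleets` are real EC2 APIs, but Ray's actual `post_process` hook may be structured differently.

```python
# Sketch (assumed shape, not Ray's actual implementation): delete EC2 fleets
# whose instances have all terminated, since EC2 keeps such fleets active.
import boto3

def cleanup_terminated_fleets(region: str) -> None:
    """Delete active fleets that no longer have any running instances."""
    ec2 = boto3.client("ec2", region_name=region)
    # A real implementation would filter to fleets launched by this cluster,
    # e.g. via the tags the autoscaler stamps on resources it creates.
    fleets = ec2.describe_fleets()["Fleets"]  # pagination omitted for brevity
    stale = []
    for fleet in fleets:
        if fleet["FleetState"] != "active":
            continue
        resp = ec2.describe_fleet_instances(FleetId=fleet["FleetId"])
        if not resp["ActiveInstances"]:  # only terminated instances remain
            stale.append(fleet["FleetId"])
    if stale:
        # The instances are already gone, so there is nothing to terminate.
        ec2.delete_fleets(FleetIds=stale, TerminateInstances=False)
```

In the real provider this logic would presumably run inside `post_process`, invoked from the autoscaler loop, so stale fleets are swept up shortly after their last instance terminates.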
