[train][fix] Fix concurrency limitations in the new inference codepath#1320
Conversation
…rency inference Under high concurrency (batch_size * n_samples_per_prompt requests), the default connection limits (aiohttp/httpx: 100) and stale keep-alive connections caused ECONNRESET errors. This raises limits via SKYRL_HTTP_CONNECTION_LIMIT (default 50K), sets keepalive_timeout=2s on the client side, adds retry helpers for transient connection errors, and increases uvicorn backlog on the router and vLLM server. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…ence Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Code Review
This pull request effectively addresses concurrency limitations in the new inference codepath by introducing several key improvements. These include adding retry mechanisms with backoff for transient network errors in both the RemoteInferenceClient and InferenceRouter, increasing HTTP connection limits, and properly handling stale connections. A semaphore is also added to RemoteInferenceClient to control the concurrency of generation requests, preventing the server from being overwhelmed. A comprehensive load testing script has been added to validate these changes. The fixes are well-implemented and address the described issues. I have a couple of minor suggestions for improvement.
```python
            logger.warning(f"Proxy retry {attempt + 1}/{_PROXY_RETRIES} for {path}: {e}")
            continue
        except Exception as e:
            logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
```
For unexpected exceptions caught by a broad `except Exception:`, it's better to use `logger.exception` or at least `logger.error`. `logger.info` might not be visible enough for what could be a critical problem. `logger.exception` will also automatically include stack trace information, which is very helpful for debugging.
Suggested change:
```diff
-            logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
+            logger.exception(f"Encountered an unexpected exception while proxying a request to path {path}: {e}")
```
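To illustrate the difference, here is a minimal, self-contained sketch (the `proxy` function, logger name, and in-memory stream are hypothetical, purely for demonstration):

```python
import io
import logging

# Hypothetical logger wired to an in-memory stream so the output is inspectable.
logger = logging.getLogger("proxy_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)

def proxy(path: str) -> str:
    try:
        raise ConnectionResetError("peer closed the connection")
    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback,
        # unlike logger.info, which records only the message string.
        logger.exception(f"Encountered an unexpected exception while proxying a request to path {path}")
    return stream.getvalue()

output = proxy("/v1/completions")
```

With `logger.exception`, the log record carries the `Traceback (most recent call last):` block, so the failing call site is recoverable from the logs alone.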
```python
    return client, router, server_group


def shutdown_servers(client, router, server_group):
```
The `client` parameter in `shutdown_servers` is not used within the function. It can be removed to simplify the function signature. Remember to update the call to this function in `main` as well.
Suggested change:
```diff
-def shutdown_servers(client, router, server_group):
+def shutdown_servers(router, server_group):
```
What does this PR do?
Fixes #1307.
This PR fixes concurrency limitations in the new inference codepath.
Summary
Training with the GSM8K script at `examples/train/gsm8k/run_gsm8k.sh` hangs with the new inference codepath. To investigate the issue, I ran some load tests to concurrently fire a bunch of requests at different scales (100 -> 10k). Note that for the configuration in `examples/train/gsm8k/run_gsm8k.sh`, we actually end up running generate requests at a concurrency of ~5K during training, so this is a reasonable test. At scales of 50K, you'll start hitting OS limits on open ports, so the test sweeps concurrencies from 500 to 10,000.

There are three parts in the new inference stack: the `RemoteInferenceClient`, the router (`InferenceRouter`), and the vLLM server.
The load test probes concurrency limitations by ablating over the different components.
Load test
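As a rough sketch of the load-test shape (hypothetical names; the real script issues HTTP generate requests through the router rather than sleeping):

```python
import asyncio

async def fake_generate(i: int) -> int:
    # Stand-in for one generate request through the router; the real load
    # test would POST to the inference server here instead of sleeping.
    await asyncio.sleep(0.001)
    return i

async def load_test(concurrency: int) -> int:
    # Fire `concurrency` requests at once and count how many succeed.
    results = await asyncio.gather(
        *(fake_generate(i) for i in range(concurrency)), return_exceptions=True
    )
    return sum(1 for r in results if not isinstance(r, Exception))

ok = asyncio.run(load_test(500))
```

Sweeping `concurrency` from 500 to 10,000 against different component combinations isolates which layer introduces the failures.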
Issues with `Router + vLLM Server`

- `httpcore.ReadError`: transient errors, fixed by adding retries.

Issues with `RemoteInferenceClient + Router + vLLM Server`

Firstly, we use a single shared session for connection re-use, but with the default connection limit of 100 for the `aiohttp.TCPConnector`. This makes generation extremely slow: other pending tasks in the generation loop wait for a long time. This was fixed by raising the connection limit to 50K (very generous).

Now the load test script progressed faster, but there were failures even at low concurrency (~500). There were mainly two errors I saw:
1. `Connection reset by peer` error: there were two causes for this, addressed by:
   a. Closing stale connections from a previous event loop
   b. Increasing keep-alive timeouts on the `aiohttp` client
2. `ReadError` for different requests: these are transient errors, fixed by adding retries.

These fixes were enough to get the load test script to run without issues. I also added retries on the proxy to ensure that transient failures do not affect the router layer.
E2E testing
After the load testing script ran successfully, I revisited the GSM8K training script. I noticed that training now progressed but failed with a large number of connection errors (basically hitting 3/3 retries for a number of generation samples) after 4 steps of training. I finally solved the issue by introducing a semaphore to guard against overwhelming the HTTP server with too many concurrent requests. With this, training ran to completion for the GSM8K script.
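The semaphore guard can be sketched like this (simplified; the real client wraps actual HTTP generate calls, and the function and parameter names here are hypothetical):

```python
import asyncio

async def generate_all(prompts, max_concurrency: int = 2):
    # Bound the number of in-flight generate requests so the HTTP server
    # never sees the full batch_size * n_samples_per_prompt load at once.
    sem = asyncio.Semaphore(max_concurrency)
    peak = 0
    in_flight = 0

    async def generate_one(prompt):
        nonlocal peak, in_flight
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the actual HTTP request
            in_flight -= 1
        return prompt

    results = await asyncio.gather(*(generate_one(p) for p in prompts))
    return results, peak

results, peak = asyncio.run(generate_all(list(range(10)), max_concurrency=2))
```

All tasks are still created up front, but only `max_concurrency` of them hold a connection at any moment; the rest wait on the semaphore instead of piling onto the server.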
Summary of Changes
- Adds retries at the `RemoteInferenceClient` layer to handle transient read / connection errors
- Raises the HTTP connection limit (configurable via `SKYRL_HTTP_CONNECTION_LIMIT`, default 50K) used by the `RemoteInferenceClient`
- Adds a semaphore to the `RemoteInferenceClient` to bound the number of concurrent generation requests
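The client-side session settings described above could look roughly like this (a sketch, assuming `aiohttp`; the helper name is hypothetical, while `SKYRL_HTTP_CONNECTION_LIMIT`, its 50K default, and the 2s keep-alive come from this PR's description):

```python
import asyncio
import os

import aiohttp

# Connection limit is configurable via SKYRL_HTTP_CONNECTION_LIMIT (default 50K),
# far above aiohttp's default TCPConnector limit of 100.
LIMIT = int(os.environ.get("SKYRL_HTTP_CONNECTION_LIMIT", "50000"))

async def make_session() -> aiohttp.ClientSession:
    # A short keepalive_timeout (2s) drops idle sockets quickly, so stale
    # keep-alive connections are not reused after the server has closed them.
    connector = aiohttp.TCPConnector(limit=LIMIT, keepalive_timeout=2)
    return aiohttp.ClientSession(connector=connector)
```

The shared session is still created once and reused across requests; only its connector limits change.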