
[train][fix] Fix concurrency limitations in the new inference codepath#1320

Open
SumanthRH wants to merge 11 commits into main from
fix-hanging-new-inference

Conversation

@SumanthRH
Member

@SumanthRH SumanthRH commented Mar 13, 2026

What does this PR do?

Fixes #1307 .

This PR fixes concurrency limitations in the new inference codepath.

Summary

Training with the GSM8K script at examples/train/gsm8k/run_gsm8k.sh hangs with the new inference codepath. To investigate, I ran load tests that concurrently fire batches of requests at different scales (100 -> 10K). Note that with the configuration in examples/train/gsm8k/run_gsm8k.sh, we actually end up running generate requests at a concurrency of ~5K during training, so this is a realistic test. At scales around 50K you start hitting OS limits on open ports, so the test sweeps concurrencies from 500 to 10,000.

There are three parts in the new inference stack:

RemoteInferenceClient  -> InferenceEngineRouter -> vLLM Server

The load test probes concurrency limitations by ablating over the different components.

Load test

Issues with Router + vLLM Server

httpcore.ReadError: transient errors, fixed by adding retries.
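The retry logic can be sketched as a small async helper. This is a self-contained illustration, not the PR's actual code: `with_retries`, `flaky_request`, and the use of `ConnectionResetError`/`TimeoutError` as stand-ins for `httpcore.ReadError` are all hypothetical names for this sketch.

```python
import asyncio

# Stand-ins for the transient network errors seen in the load test
# (httpcore.ReadError, connection resets); hypothetical for this sketch.
TRANSIENT_ERRORS = (ConnectionResetError, TimeoutError)

async def with_retries(coro_fn, retries=3, backoff=0.01):
    """Call coro_fn(), retrying transient errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return await coro_fn()
        except TRANSIENT_ERRORS:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(backoff * (2 ** attempt))

# Usage: a request that fails twice with a transient error, then succeeds.
calls = {"n": 0}

async def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError("peer reset")
    return "ok"

result = asyncio.run(with_retries(flaky_request))
print(result)  # ok
```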

Issues with RemoteInferenceClient + Router + vLLM Server

Firstly, we use a single shared session for connection re-use, but with the default connection limit of 100 on the aiohttp.TCPConnector. This makes generation extremely slow: pending tasks in the generation loop wait a long time for a free connection. This was fixed by raising the connection limit to 50K (very generous).
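A minimal sketch of this session setup, assuming the values from the PR description (50K limit, short keep-alive). `make_session` is a hypothetical helper, not the actual RemoteInferenceClient code:

```python
import asyncio
import aiohttp

async def make_session(limit: int = 50_000) -> aiohttp.ClientSession:
    """Build a shared ClientSession with a generous connection pool."""
    connector = aiohttp.TCPConnector(
        limit=limit,            # default is 100, which throttles ~5K-way generation
        keepalive_timeout=2.0,  # drop idle connections quickly
    )
    return aiohttp.ClientSession(connector=connector)

async def main() -> int:
    session = await make_session()
    limit = session.connector.limit
    await session.close()
    return limit

print(asyncio.run(main()))  # 50000
```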

With that, the load test script progressed faster, but there were still failures even at low concurrency (~500). I mainly saw two errors:

  1. Connection reset by peer error: addressed by two changes:
    a. Closing stale connections from a previous event loop
    b. Increasing keep-alive timeouts on the aiohttp client
  2. ReadError for different requests: transient errors, fixed by adding retries
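The stale-connection fix (1a) can be sketched as a session manager that notices when the running event loop has changed and discards the old session instead of reusing its dead connections. This is a toy model: `DummySession` and `SessionManager` are hypothetical stand-ins (a real aiohttp session cannot be closed from a different loop, which is why the stale one must be dropped).

```python
import asyncio

class DummySession:
    """Stand-in for aiohttp.ClientSession (no real sockets)."""
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class SessionManager:
    """Lazily (re)create the session; if the running event loop changed
    (e.g. a new asyncio.run), drop the stale session rather than reuse
    connections bound to a dead loop."""
    def __init__(self):
        self._loop = None
        self._session = None

    async def get(self):
        loop = asyncio.get_running_loop()
        if self._session is None or self._loop is not loop:
            if self._session is not None:
                await self._session.close()  # real code must guard cross-loop close
            self._loop = loop
            self._session = DummySession()
        return self._session

mgr = SessionManager()
s1 = asyncio.run(mgr.get())
s2 = asyncio.run(mgr.get())  # new event loop -> stale session replaced
print(s1 is s2)  # False
```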

These fixes were enough to get the load test script to run without issues. I also added retries on the proxy to ensure that transient failures do not affect the router layer.

E2E testing

After the load testing script ran successfully, I revisited the GSM8K training script. Training now progressed, but failed with a large number of connection errors (basically hitting 3/3 retries for a number of generation samples) after 4 steps of training. I finally solved the issue by introducing a semaphore to guard against overwhelming the HTTP server with too many concurrent requests. With that, training ran to completion for the GSM8K script.
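The semaphore guard can be sketched as follows. This is an illustrative toy, not the PR's implementation: `generate_with_limit` and the sleep standing in for the HTTP generate call are hypothetical, and the real limit is applied per engine.

```python
import asyncio

async def generate_with_limit(prompts, max_concurrency=512, delay=0.01):
    """Fan out generation requests but cap in-flight requests with a
    semaphore so the HTTP server is never overwhelmed."""
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0  # track the highest observed concurrency

    async def one(prompt):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(delay)  # stand-in for the HTTP generate call
            in_flight -= 1
            return f"completion for {prompt}"

    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, peak

results, peak = asyncio.run(
    generate_with_limit([f"p{i}" for i in range(100)], max_concurrency=8)
)
print(peak)  # never exceeds 8
```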

[Screenshot 2026-03-12 at 11 54 33 PM]

Summary of Changes

  • Introduces a new load testing script to exercise concurrency at different layers of the new HTTP stack
  • Introduces retries at the router and the RemoteInferenceClient layer to handle transient read/connection errors
  • Removes stale connections from a previous event loop in RemoteInferenceClient
  • Raises the connection limit for the aiohttp session in RemoteInferenceClient
  • Adds a per-engine concurrency limit to avoid overwhelming the HTTP server with too many concurrent requests


SumanthRH and others added 11 commits March 11, 2026 05:20
…rency inference

Under high concurrency (batch_size * n_samples_per_prompt requests), the default
connection limits (aiohttp/httpx: 100) and stale keep-alive connections caused
ECONNRESET errors. This raises limits via SKYRL_HTTP_CONNECTION_LIMIT (default 50K),
sets keepalive_timeout=2s on the client side, adds retry helpers for transient
connection errors, and increases uvicorn backlog on the router and vLLM server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…ence

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review March 14, 2026 17:30
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses concurrency limitations in the new inference codepath by introducing several key improvements. These include adding retry mechanisms with backoff for transient network errors in both the RemoteInferenceClient and InferenceRouter, increasing HTTP connection limits, and properly handling stale connections. A semaphore is also added to RemoteInferenceClient to control the concurrency of generation requests, preventing the server from being overwhelmed. A comprehensive load testing script has been added to validate these changes. The fixes are well-implemented and address the described issues. I have a couple of minor suggestions for improvement.

logger.warning(f"Proxy retry {attempt + 1}/{_PROXY_RETRIES} for {path}: {e}")
continue
except Exception as e:
logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
Contributor

medium

For unexpected exceptions caught by a broad except Exception:, it's better to use logger.exception or at least logger.error. logger.info might not be visible enough for what could be a critical problem. logger.exception will also automatically include stack trace information, which is very helpful for debugging.

Suggested change
logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
logger.exception(f"Encountered an unexpected exception while proxying a request to path {path}: {e}")

return client, router, server_group


def shutdown_servers(client, router, server_group):
Contributor

medium

The client parameter in shutdown_servers is not used within the function. It can be removed to simplify the function signature. Remember to update the call to this function in main as well.

Suggested change
def shutdown_servers(client, router, server_group):
def shutdown_servers(router, server_group):

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


@SumanthRH SumanthRH requested a review from kouroshHakha March 14, 2026 18:14


Development

Successfully merging this pull request may close these issues.

[train] Training with the new inference codepath hangs
