[train][fix] Fix concurrency limitations in the new inference codepath#1320
Conversation
…rency inference Under high concurrency (batch_size * n_samples_per_prompt requests), the default connection limits (aiohttp/httpx: 100) and stale keep-alive connections caused ECONNRESET errors. This raises limits via SKYRL_HTTP_CONNECTION_LIMIT (default 50K), sets keepalive_timeout=2s on the client side, adds retry helpers for transient connection errors, and increases uvicorn backlog on the router and vLLM server. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…ence Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Code Review
This pull request effectively addresses concurrency limitations in the new inference codepath by introducing several key improvements. These include adding retry mechanisms with backoff for transient network errors in both the RemoteInferenceClient and InferenceRouter, increasing HTTP connection limits, and properly handling stale connections. A semaphore is also added to RemoteInferenceClient to control the concurrency of generation requests, preventing the server from being overwhelmed. A comprehensive load testing script has been added to validate these changes. The fixes are well-implemented and address the described issues. I have a couple of minor suggestions for improvement.
```python
            logger.warning(f"Proxy retry {attempt + 1}/{_PROXY_RETRIES} for {path}: {e}")
            continue
        except Exception as e:
            logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
```
For unexpected exceptions caught by a broad `except Exception:`, it's better to use `logger.exception` or at least `logger.error`. `logger.info` might not be visible enough for what could be a critical problem. `logger.exception` will also automatically include stack trace information, which is very helpful for debugging.
Suggested change:
```diff
-            logger.info(f"Encountered an exception while proxying a request to path {path}: {e}")
+            logger.exception(f"Encountered an unexpected exception while proxying a request to path {path}: {e}")
```
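To illustrate the difference, here is a minimal, self-contained sketch (the `proxy` function, logger name, and in-memory stream are hypothetical, purely for demonstration):

```python
import io
import logging

# Hypothetical logger wired to an in-memory stream so the output is inspectable.
logger = logging.getLogger("proxy_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)

def proxy(path: str) -> str:
    try:
        raise ConnectionResetError("peer closed the connection")
    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback,
        # unlike logger.info, which records only the message string.
        logger.exception(f"Encountered an unexpected exception while proxying a request to path {path}")
    return stream.getvalue()

output = proxy("/v1/completions")
```

With `logger.exception`, the log record carries the `Traceback (most recent call last):` block, so the failing call site is recoverable from the logs alone.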
```python
    return client, router, server_group


def shutdown_servers(client, router, server_group):
```
The `client` parameter in `shutdown_servers` is not used within the function. It can be removed to simplify the function signature. Remember to update the call to this function in `main` as well.
Suggested change:
```diff
-def shutdown_servers(client, router, server_group):
+def shutdown_servers(router, server_group):
```
What does this PR do?
Fixes #1307.
This PR fixes concurrency limitations in the new inference codepath.
Summary
Training with the GSM8K script at `examples/train/gsm8k/run_gsm8k.sh` hangs with the new inference codepath. To investigate the issue, I ran some load tests to concurrently fire a bunch of requests at different scales (100 -> 10k). Note that for the configuration in `examples/train/gsm8k/run_gsm8k.sh`, we actually end up running generate requests at a concurrency of ~5K during training, so this is a reasonable test. At scales of 50K, you'll start hitting OS limits on open ports, so the test sweeps concurrencies from 500 to 10,000.

There are three parts in the new inference stack: the `RemoteInferenceClient`, the router (`InferenceRouter`), and the vLLM server.
The load test probes concurrency limitations by ablating over the different components.
Load test
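As a rough sketch of the load-test shape (hypothetical names; the real script issues HTTP generate requests through the router rather than sleeping):

```python
import asyncio

async def fake_generate(i: int) -> int:
    # Stand-in for one generate request through the router; the real load
    # test would POST to the inference server here instead of sleeping.
    await asyncio.sleep(0.001)
    return i

async def load_test(concurrency: int) -> int:
    # Fire `concurrency` requests at once and count how many succeed.
    results = await asyncio.gather(
        *(fake_generate(i) for i in range(concurrency)), return_exceptions=True
    )
    return sum(1 for r in results if not isinstance(r, Exception))

ok = asyncio.run(load_test(500))
```

Sweeping `concurrency` from 500 to 10,000 against different component combinations isolates which layer introduces the failures.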
Issues with `Router + vLLM Server`

- `httpcore.ReadError`: transient errors, fixed by adding retries.

Issues with `RemoteInferenceClient + Router + vLLM Server`

Firstly, we use a single shared session for connection re-use, but with the default connection limit of 100 for the `aiohttp.TCPConnector`. This makes generation extremely slow: other pending tasks in the generation loop wait for a long time. This was fixed by raising the connection limit to 50K (very generous).

Now the load test script progressed faster, but there were failures even at low concurrency (~500). There were mainly two errors I saw:
1. `Connection reset by peer` error: there were two causes for this, addressed by:
   a. Closing stale connections from a previous event loop
   b. Increasing keep-alive timeouts on the `aiohttp` client
2. `ReadError` for different requests: these are transient errors, fixed by adding retries.

These fixes were enough to get the load test script to run without issues. I also added retries on the proxy to ensure that transient failures do not affect the router layer.
E2E testing
After the load testing script ran successfully, I revisited the GSM8K training script. I noticed that training now progressed but failed with a large number of connection errors (basically hitting 3/3 retries for a number of generation samples) after 4 steps of training. I finally solved the issue by introducing a semaphore to guard against overwhelming the HTTP server with too many concurrent requests. With this, training ran to completion for the GSM8K script.
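The semaphore guard can be sketched like this (simplified; the real client wraps actual HTTP generate calls, and the function and parameter names here are hypothetical):

```python
import asyncio

async def generate_all(prompts, max_concurrency: int = 2):
    # Bound the number of in-flight generate requests so the HTTP server
    # never sees the full batch_size * n_samples_per_prompt load at once.
    sem = asyncio.Semaphore(max_concurrency)
    peak = 0
    in_flight = 0

    async def generate_one(prompt):
        nonlocal peak, in_flight
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the actual HTTP request
            in_flight -= 1
        return prompt

    results = await asyncio.gather(*(generate_one(p) for p in prompts))
    return results, peak

results, peak = asyncio.run(generate_all(list(range(10)), max_concurrency=2))
```

All tasks are still created up front, but only `max_concurrency` of them hold a connection at any moment; the rest wait on the semaphore instead of piling onto the server.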
Summary of Changes
- Adds retries at the `RemoteInferenceClient` layer to handle transient read / connection errors
- Raises the HTTP connection limit (configurable via `SKYRL_HTTP_CONNECTION_LIMIT`, default 50K) used by the `RemoteInferenceClient`
- Adds a semaphore to the `RemoteInferenceClient` to bound the number of concurrent generation requests
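The client-side session settings described above could look roughly like this (a sketch, assuming `aiohttp`; the helper name is hypothetical, while `SKYRL_HTTP_CONNECTION_LIMIT`, its 50K default, and the 2s keep-alive come from this PR's description):

```python
import asyncio
import os

import aiohttp

# Connection limit is configurable via SKYRL_HTTP_CONNECTION_LIMIT (default 50K),
# far above aiohttp's default TCPConnector limit of 100.
LIMIT = int(os.environ.get("SKYRL_HTTP_CONNECTION_LIMIT", "50000"))

async def make_session() -> aiohttp.ClientSession:
    # A short keepalive_timeout (2s) drops idle sockets quickly, so stale
    # keep-alive connections are not reused after the server has closed them.
    connector = aiohttp.TCPConnector(limit=LIMIT, keepalive_timeout=2)
    return aiohttp.ClientSession(connector=connector)
```

The shared session is still created once and reused across requests; only its connector limits change.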