Ollama HTTP 50x errors, and Ollama timeouts #702

Open
esnible opened this issue Mar 6, 2025 · 3 comments
Labels: bug (Something isn't working)

esnible (Member) commented Mar 6, 2025

Describe the bug
When running examples/gsm8k/gsm8.pdl with the full 1319 iterations, PDL tries to submit all 1319 completions at nearly the same time.

Sometimes Ollama logs 503, which is "Service Unavailable":
[GIN] 2025/03/06 - 11:42:21 | 503 | 42.044125ms | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/06 - 11:42:21 | 503 | 43.173125ms | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/06 - 11:42:21 | 503 | 44.790209ms | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/06 - 11:42:21 | 503 | 45.941833ms | 127.0.0.1 | POST "/api/generate"

Also, PDL logs the following message:

gsm8.pdl:26 - Error during 'ollama/granite3.2:8b' model call: litellm.APIConnectionError: OllamaException - litellm.Timeout: Connection timed out after 600.0 seconds.
Failure generating the trace: Error during 'ollama/granite3.2:8b' model call: litellm.APIConnectionError: OllamaException - litellm.Timeout: Connection timed out after 600.0 seconds.

This suggests that LiteLLM or Ollama limits each response to 10 minutes, even for the 1319th entry, which won't be ready until the other 1318 entries have been processed, and that takes over an hour.

Also, when running with 256 iterations, Ollama logs the following message:

[GIN] 2025/03/06 - 12:10:19 | 500 |         9m59s |       127.0.0.1 | POST     "/api/generate"
time=2025-03-06T12:10:20.053-05:00 level=INFO source=server.go:727 msg="aborting completion request due to client closing the connection"

This suggests that LiteLLM or PDL gives up after 10 minutes and never accepts the response that Ollama eventually generates.
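
For what it's worth, the 600 seconds looks like a client-side request timeout; if it is LiteLLM's default, it can presumably be raised per call. A rough sketch, assuming litellm.completion accepts a timeout keyword (I believe it does):

```python
import litellm

# Raise the per-request timeout so queued Ollama requests are not abandoned
# after the default 600 s; 7200 s here is an arbitrary example value.
response = litellm.completion(
    model="ollama/granite3.2:8b",
    messages=[{"role": "user", "content": "Janet has 3 apples..."}],
    timeout=7200,
)
print(response.choices[0].message.content)
```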

To Reproduce
Edit gsm8.pdl to have MAX_ITERATIONS: 1319 and run gsm8.pdl.

Expected behavior
Perhaps PDL or LiteLLM should retry 503s after some delay?
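For example, a simple client-side retry with exponential backoff could look like the sketch below; the exception type is taken from the log above, and LiteLLM may already offer something like this via a num_retries argument, which would be worth checking before rolling our own loop:

```python
import time
import litellm

def completion_with_retry(max_attempts=5, backoff_s=2.0, **kwargs):
    """Retry transient Ollama failures (503s, dropped connections) with
    exponential backoff. Sketch only; adjust the exception types once we
    confirm what actually surfaces from LiteLLM here."""
    for attempt in range(1, max_attempts + 1):
        try:
            return litellm.completion(**kwargs)
        except litellm.APIConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))
```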

Desktop (please complete the following information):

  • OS: macOS (M3 Mac)
  • Ollama version: 0.5.13
esnible added the bug label on Mar 6, 2025
starpit (Member) commented Mar 6, 2025

fwiw, ollama has the following default limits; we could adjust these...

  • OLLAMA_MAX_LOADED_MODELS - the maximum number of models that can be loaded concurrently, provided they fit in available memory. The default is 3 * the number of GPUs, or 3 for CPU inference.
  • OLLAMA_NUM_PARALLEL - the maximum number of parallel requests each model will process at the same time. The default auto-selects either 4 or 1 based on available memory.
  • OLLAMA_MAX_QUEUE - the maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
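
These are environment variables read by the Ollama server at startup, so they would have to be set before `ollama serve` is launched. A small sketch of launching the server with raised limits (the values are illustrative, not recommendations):

```python
import os
import subprocess

# OLLAMA_NUM_PARALLEL / OLLAMA_MAX_QUEUE must be in the server's environment
# before it starts; they cannot be changed from the client side per request.
env = dict(
    os.environ,
    OLLAMA_NUM_PARALLEL="8",   # more parallel requests per loaded model
    OLLAMA_MAX_QUEUE="2048",   # queue more requests before returning 503
)
subprocess.Popen(["ollama", "serve"], env=env)
```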

esnible changed the title from 'Ollama HTTP 503 error, also known as the "Service Unavailable"' to 'Ollama HTTP 503 errors, and Ollama timeouts' on Mar 6, 2025
esnible changed the title from 'Ollama HTTP 503 errors, and Ollama timeouts' to 'Ollama HTTP 50x errors, and Ollama timeouts' on Mar 6, 2025
esnible (Member, Author) commented Mar 7, 2025

@mandel @vazirim Any thoughts about an approach for this?

It would be straightforward to introduce a global data structure with N "tickets" for using LiteLLM, where the (N+1)th model request blocks until another model invocation completes (rough sketch below).

Perhaps the limit is per-provider, with a default of e.g. 128 for Ollama and 500 for Replicate?

Or perhaps PDL's interpreter doesn't handle this at all; instead we introduce a library that limits concurrency to N callers, and developers explicitly acquire and release permission within PDL loops?
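
A rough sketch of the "tickets" idea with per-provider limits; all names and numbers here are placeholders, not existing PDL internals:

```python
import threading
from contextlib import contextmanager

# Hypothetical per-provider concurrency caps; the numbers are examples only.
_DEFAULT_LIMITS = {"ollama": 128, "replicate": 500}
_semaphores = {
    provider: threading.BoundedSemaphore(limit)
    for provider, limit in _DEFAULT_LIMITS.items()
}

@contextmanager
def model_ticket(model: str):
    """Block until a 'ticket' for the model's provider is free, so at most
    N LiteLLM calls per provider are in flight at once."""
    provider = model.split("/", 1)[0]  # "ollama/granite3.2:8b" -> "ollama"
    sem = _semaphores.get(provider)
    if sem is None:
        yield  # unknown provider: no cap
        return
    with sem:
        yield

# Usage inside the interpreter's model call:
# with model_ticket("ollama/granite3.2:8b"):
#     response = litellm.completion(model=..., messages=...)
```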

starpit (Member) commented Mar 7, 2025

should we be using a thread pool executor?
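
e.g. something like this, where max_workers caps how many Ollama requests are in flight instead of firing all 1319 at once (a sketch, not wired into the interpreter):

```python
from concurrent.futures import ThreadPoolExecutor

import litellm

# Stand-ins for the gsm8k items; in PDL these would come from the dataset.
questions = ["Janet has 3 apples...", "A train leaves at 9am..."]

def ask(question: str):
    return litellm.completion(
        model="ollama/granite3.2:8b",
        messages=[{"role": "user", "content": question}],
    )

# max_workers bounds concurrency, so Ollama never sees more than 8 requests at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    answers = [r.choices[0].message.content for r in pool.map(ask, questions)]
```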
