Long-running llama-server behavior #10076

JoeStech · 2024-10-28T22:06:12Z

I have a llama-server demo that's been running for about a month on an AWS Graviton instance (c8g.24xlarge). Over that time, generation throughput dropped from 60 TPS to about 5 TPS. I restarted the server process and throughput went back to 60 TPS.

I checked the log files and didn't see any single log file growing excessively or anything like that. The only thing I noticed was that idle CPU usage was about 11% before I restarted llama-server, and it dropped back to near zero after the restart.

Has anyone here run llama-server for a significant period of uptime? Are there any known processes/files that snowball?

ggerganov · 2024-10-29T08:29:36Z

Maintainer

Do you use 1 or multiple slots?

JoeStech Oct 31, 2024
Author

We don't explicitly set additional slots, so I believe it defaults to 1. Most settings are at their defaults. Here's the command we're running:

./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_4_8.gguf --port 8080 --threads-http 8

We also run 7 other instances of llama-server simultaneously. The machine has 96 cores.
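
For anyone trying to catch this kind of slowdown before it reaches 5 TPS, here is a minimal sketch of tracking generation throughput over uptime and flagging the first sample that falls below a fraction of the baseline. It is not part of llama-server. The names `ThroughputSample`, `tokens_per_second`, and `find_degradation` are hypothetical, the 50% threshold is an arbitrary assumption, and the sample numbers are illustrative rather than measurements from this deployment. How the token counts and timings are collected is left to the client.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ThroughputSample:
    """One timed generation request (hypothetical record, filled in by the client)."""
    uptime_hours: float    # hours since the server process started
    tokens_generated: int  # tokens produced by the request
    elapsed_ms: float      # generation time for the request, in milliseconds


def tokens_per_second(sample: ThroughputSample) -> float:
    """Convert a token count and elapsed milliseconds into tokens per second."""
    if sample.elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return sample.tokens_generated / (sample.elapsed_ms / 1000.0)


def find_degradation(
    samples: Iterable[ThroughputSample],
    baseline_tps: float,
    threshold: float = 0.5,  # assumed cutoff: flag anything under 50% of baseline
) -> Optional[ThroughputSample]:
    """Return the earliest sample (by uptime) below threshold * baseline_tps, or None."""
    for sample in sorted(samples, key=lambda s: s.uptime_hours):
        if tokens_per_second(sample) < threshold * baseline_tps:
            return sample
    return None


if __name__ == "__main__":
    # Illustrative values only: a 60 TPS baseline that decays to 5 TPS.
    history = [
        ThroughputSample(uptime_hours=1, tokens_generated=600, elapsed_ms=10_000),
        ThroughputSample(uptime_hours=360, tokens_generated=300, elapsed_ms=10_000),
        ThroughputSample(uptime_hours=720, tokens_generated=50, elapsed_ms=10_000),
    ]
    first_bad = find_degradation(history, baseline_tps=60.0)
    if first_bad is not None:
        print(f"throughput degraded by hour {first_bad.uptime_hours}")
```

Logging samples like these at a fixed interval would show whether the decline is gradual or a sudden step, which helps narrow down what is accumulating in the process.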
