Memory leak in ruby process? #368
Hey, thanks for reporting this! Can you provide some more detail about your environment and agent so that we can try to reproduce?
This is a single Compute Engine VM running on GCP's
Just had a failure of the whole cluster because of this. The infrastructure is completely logged with google-fluentd and prometheus: ubuntu/centos servers for docker hosts plus nginx load balancers and some custom apps. The error was noticed because the docker hosts were no longer able to run some containers (OOM); nginx still managed to work with 2% free memory. I don't know when it started; I had a similar issue 2-3 months ago where I thought docker logs were the problem, and at the time a reboot was all I had time for. For most servers there is a log of
On some servers, the log only shows the latest
Commands used:

```
sudo systemctl restart google-fluentd
top                                    # ram was not freed
sudo systemctl restart google-fluentd  # took ages
sudo systemctl stop google-fluentd     # took ages
sudo systemctl kill google-fluentd
sudo systemctl start google-fluentd
top                                    # ram was still not freed
sudo kill -9 <pid-of-ruby>
top                                    # ram was freed
sudo reboot                            # just to be sure everything was/is running as it should
```

What further info would help, and how could I get it easily?
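One low-effort way to capture that information would be to periodically sample the resident memory of the agent's ruby worker. A minimal sketch, assuming the default google-fluentd install where the agent runs an embedded ruby worker; the output path and one-minute interval are arbitrary:

```
# Sample the resident memory (RSS) of the fluentd ruby worker(s) once a minute
# so the growth over time can be attached to a report.
# Assumes the default google-fluentd install; /tmp/fluentd-rss.log is arbitrary.
while true; do
  date
  ps -o pid,rss,etime,args -C ruby | grep -i fluentd
  sleep 60
done >> /tmp/fluentd-rss.log
```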
We are running into the same issue. The agent is installed on a VM instance running COS 89 with the metadata tag. Restarting the stackdriver-logging-agent mitigates the issue, but it comes back after a while. The specific image with the memory leak is:
An older version of that image does not have the same memory issue:
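Until the leak itself is fixed, that restart can be automated. A rough sketch using a systemd timer, assuming the agent is managed by the stackdriver-logging-agent unit as on COS; the daily schedule and the unit file names are made up for illustration:

```
# Restart the logging agent once a day as a temporary mitigation.
# Assumes the agent runs under the stackdriver-logging-agent systemd unit.
cat <<'EOF' | sudo tee /etc/systemd/system/restart-logging-agent.service
[Unit]
Description=Periodic restart of stackdriver-logging-agent (leak workaround)

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart stackdriver-logging-agent.service
EOF

cat <<'EOF' | sudo tee /etc/systemd/system/restart-logging-agent.timer
[Unit]
Description=Daily restart of stackdriver-logging-agent

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now restart-logging-agent.timer
```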
Is this the same as fluent/fluentd#3401?

[EDIT] Actually,

[ANOTHER EDIT] Mystery potentially solved: the
I had one of these killed recently by the Linux OOM killer – after 87 days running on Google Cloud:
There seems to be a memory leak somewhere.
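For anyone trying to confirm they are hitting the same thing, the OOM kill should be visible in the kernel log. A quick check using standard dmesg/journalctl queries, nothing specific to the agent:

```
# Look for OOM-killer events in the kernel log; if this is the same leak,
# the agent's ruby worker should show up as the killed process.
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# On systems with the systemd journal:
sudo journalctl -k | grep -iE 'out of memory|oom-killer|killed process'
```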