Ingesters stopped triggering tsdb compaction #6672

jakirpatel · 2025-03-26T05:30:53Z


Describe the bug
Ingesters stopped triggering TSDB compactions, which led to OOM kills and data loss because blocks were no longer pushed to remote storage (Google Cloud Storage).

To Reproduce

Consul restarted after being OOM-killed.
The ingester ring became unhealthy.

Expected behavior

Ingesters should not stop triggering TSDB compaction.

Environment:

Infrastructure: Kubernetes v1.26.7, Cortex v1.15.3
Deployment tool: Kustomize

Additional Context
Server logs of consul

[Mon Mar 24 09:33:13 2025] Code: Bad RIP value.
[Mon Mar 24 09:33:13 2025] RSP: 002b:000000c00009df18 EFLAGS: 00010202
[Mon Mar 24 09:33:13 2025] RAX: 0000000000000000 RBX: 0000000000004e20 RCX: 00000000004698dd
[Mon Mar 24 09:33:13 2025] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c00009df18
[Mon Mar 24 09:33:13 2025] RBP: 000000c00009df28 R08: 000000007645c2a4 R09: 00007ffea5d690b0
[Mon Mar 24 09:33:13 2025] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000439c60
[Mon Mar 24 09:33:13 2025] R13: 0000000000000000 R14: 00000000036e71dc R15: 0000000000000000
[Mon Mar 24 09:33:13 2025] Task in /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1/1f289fc88a99539f34d90c61b7eade3a341bd8fa0fe870c2f6f0f8001949efc4 killed as a result of limit of /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1
[Mon Mar 24 09:33:13 2025] memory: usage 524288kB, limit 524288kB, failcnt 1913986
[Mon Mar 24 09:33:13 2025] memory+swap: usage 524204kB, limit 9007199254740988kB, failcnt 0
[Mon Mar 24 09:33:13 2025] kmem: usage 21224kB, limit 9007199254740988kB, failcnt 0
[Mon Mar 24 09:33:13 2025] Memory cgroup stats for /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Mar 24 09:33:13 2025] Memory cgroup stats for /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1/17b14f8338505345e097052aa04c04b3a0db60980bda3fe253e4cd58dcccff24: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:36KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Mar 24 09:33:13 2025] Memory cgroup stats for /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1/3d01ee86d45aa5dc52c06cd2144b02dd652c5828c55b4a62c070c1cc766468ed: cache:227528KB rss:0KB rss_huge:0KB shmem:228068KB mapped_file:50688KB dirty:0KB writeback:0KB swap:0KB inactive_anon:3976KB active_anon:223684KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Mar 24 09:33:13 2025] Memory cgroup stats for /kubepods/burstable/pod19144e2d-5344-4ea2-a161-fd1e4e57fab1/1f289fc88a99539f34d90c61b7eade3a341bd8fa0fe870c2f6f0f8001949efc4: cache:2616KB rss:271908KB rss_huge:0KB shmem:2196KB mapped_file:660KB dirty:0KB writeback:0KB swap:0KB inactive_anon:96KB active_anon:273872KB inactive_file:1076KB active_file:152KB unevictable:0KB
[Mon Mar 24 09:33:13 2025] Tasks state (memory values in pages):
[Mon Mar 24 09:33:13 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Mon Mar 24 09:33:13 2025] [  25913]     0 25913      242        1    28672        0          -998 pause
[Mon Mar 24 09:33:13 2025] [   2779]     0  2779       52        2    20480        0           985 docker-entrypoi
[Mon Mar 24 09:33:13 2025] [   2797]   100  2797   331313    79132  1064960        0           985 consul
[Mon Mar 24 09:33:13 2025] [  20375]     0 20375      397       16    45056        0           985 sh
[Mon Mar 24 09:33:13 2025] [  20609]     0 20609     1181       16    40960        0           985 curl
[Mon Mar 24 09:33:13 2025] [  20631]     0 20631      394       13    32768        0           985 grep
[Mon Mar 24 09:33:13 2025] [  20971]     0 20971      394        2    32768        0           985 sh
[Mon Mar 24 09:33:13 2025] Memory cgroup out of memory: Kill process 2797 (consul) score 1590 or sacrifice child
[Mon Mar 24 09:33:13 2025] Killed process 2797 (consul) total-vm:1325252kB, anon-rss:265244kB, file-rss:0kB, shmem-rss:51284kB
[Mon Mar 24 09:33:13 2025] oom_reaper: reaped process 2797 (consul), now anon-rss:0kB, file-rss:0kB, shmem-rss:51284kB
[Mon Mar 24 09:33:17 2025] TCP: request_sock_TCP: Possible SYN flooding on port 8500. Sending cookies.  Check SNMP counters.
[Mon Mar 24 14:58:40 2025] IPv6: ADDRCONF(NETDEV_UP): cali350c831b699: link is not ready
[Mon Mar 24 14:58:40 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali350c831b699: link becomes ready
[Mon Mar 24 16:28:41 2025] IPv6: ADDRCONF(NETDEV_UP): cali28c7cc3caa3: link is not ready
[Mon Mar 24 16:28:41 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali28c7cc3caa3: link becomes ready
[Mon Mar 24 16:58:39 2025] IPv6: ADDRCONF(NETDEV_UP): cali9cc360364f7: link is not ready
[Mon Mar 24 16:58:39 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali9cc360364f7: link becomes ready
[Mon Mar 24 18:28:42 2025] IPv6: ADDRCONF(NETDEV_UP): cali42981617b56: link is not ready
[Mon Mar 24 18:28:42 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali42981617b56: link becomes ready
[Mon Mar 24 19:58:42 2025] IPv6: ADDRCONF(NETDEV_UP): califadadf0982a: link is not ready
[Mon Mar 24 19:58:42 2025] IPv6: ADDRCONF(NETDEV_CHANGE): califadadf0982a: link becomes ready
[Mon Mar 24 20:28:41 2025] IPv6: ADDRCONF(NETDEV_UP): cali21ba95b6eca: link is not ready
[Mon Mar 24 20:28:41 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali21ba95b6eca: link becomes ready
[Mon Mar 24 22:28:43 2025] IPv6: ADDRCONF(NETDEV_UP): caliba27763a131: link is not ready
[Mon Mar 24 22:28:43 2025] IPv6: ADDRCONF(NETDEV_CHANGE): caliba27763a131: link becomes ready
[Mon Mar 24 22:58:39 2025] IPv6: ADDRCONF(NETDEV_UP): cali06cb01d420f: link is not ready
[Mon Mar 24 22:58:39 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali06cb01d420f: link becomes ready
[Mon Mar 24 22:58:40 2025] IPv6: ADDRCONF(NETDEV_UP): cali745f5a04bdc: link is not ready
[Mon Mar 24 22:58:40 2025] IPv6: ADDRCONF(NETDEV_CHANGE): cali745f5a04bdc: link becomes ready
[Tue Mar 25 00:58:41 2025] IPv6: ADDRCONF(NETDEV_UP): calif0e472cf564: link is not ready
[Tue Mar 25 00:58:41 2025] IPv6: ADDRCONF(NETDEV_CHANGE): calif0e472cf564: link becomes ready
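As a reading aid for the log above, the figures for the killed consul process can be cross-checked with a little arithmetic. The sketch below is only an illustration: the 4 KiB page size is an assumption (the usual default on x86-64) and is not stated in the log, and `pagesToKB` is a hypothetical helper.

```go
package main

import "fmt"

// pageSizeKB is an assumption: 4 KiB pages, the usual default on x86-64.
const pageSizeKB = 4

// pagesToKB converts a page count from the "Tasks state" table to kB.
func pagesToKB(pages int) int { return pages * pageSizeKB }

func main() {
	consulRSSPages := 79132 // rss column for pid 2797 (consul)
	anonRSSKB := 265244     // anon-rss from the "Killed process" line
	shmemRSSKB := 51284     // shmem-rss from the "Killed process" line
	limitKB := 524288       // cgroup memory limit from the log

	fmt.Printf("consul rss from task table: %d kB\n", pagesToKB(consulRSSPages))
	fmt.Printf("anon-rss + shmem-rss:       %d kB\n", anonRSSKB+shmemRSSKB)
	fmt.Printf("pod memory limit:           %d MiB\n", limitKB/1024)
}
```

Under that assumption both lines give 316528 kB (about 309 MiB) for consul against a 512 MiB pod limit. The log also shows roughly 222 MiB of shmem-backed cache charged to another cgroup in the same pod, so the consul process alone does not account for the whole limit.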

friedrichg · 2025-03-26T11:08:57Z

Maintainer

If you don't get any samples, it makes sense that there is nothing to compact.
cortex_ingester_ingestion_rate_samples_per_second should tell you whether the ingesters were receiving samples. If the ring was not healthy, it makes sense that there were no samples.

It's been a while since I ran Consul, but there is a metric called "consul_raft_leader", if you are scraping Prometheus metrics from Consul. That metric tells you whether the Consul cluster has a leader. If Consul had no leader after the crash, it makes sense that the ingesters were unhealthy. That's a possible explanation of what happened.
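One way to check the ingestion metric after the fact is to run an instant query against the Prometheus HTTP API (`/api/v1/query`) and look for ingesters reporting a rate of zero. The Go sketch below only parses such a response. The helper name `idleSeries`, the `pod` label, and the sample payload are assumptions made for illustration; they are not data from this incident.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse mirrors the subset of a Prometheus /api/v1/query
// response that is needed for an instant vector.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// idleSeries returns the given label's value for every series whose
// sample value is zero or negative.
func idleSeries(body []byte, label string) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	var idle []string
	for _, s := range resp.Data.Result {
		raw, ok := s.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected sample value type %T", s.Value[1])
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		if v <= 0 {
			idle = append(idle, s.Metric[label])
		}
	}
	return idle, nil
}

func main() {
	// Made-up payload, shaped like the result of querying
	// cortex_ingester_ingestion_rate_samples_per_second grouped by pod.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"pod":"ingester-0"},"value":[1742808793,"1520.5"]},
		{"metric":{"pod":"ingester-1"},"value":[1742808793,"0"]}]}}`)

	idle, err := idleSeries(body, "pod")
	if err != nil {
		panic(err)
	}
	fmt.Println("ingesters with no incoming samples:", idle)
}
```

The same helper could be pointed at whichever leader metric the Consul deployment actually exposes, where a value of zero would suggest the cluster had no leader at that time.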

Consider switching to memberlist. Consul was a pain point for me until I stopped using it. Historically, running Consul as a cluster was never recommended for Cortex, because clustering can have these issues, where it loses the leader.


jakirpatel · 2025-03-27T03:25:42Z

Author

Thank you @friedrichg for the reply.

I can confirm the ingesters were receiving samples.

Yes, I think it is better for us to move to memberlist.


jakirpatel · 2025-04-02T04:32:33Z

Author

Thank you for your reply. After investigating, we found that the issue might have happened in either of these places:

CompactionLoop:

cortex/pkg/ingester/ingester.go

Line 2297 in 21e8366

func (i *Ingester) compactionLoop(ctx context.Context) error {

Or CompactBlock
https://github.com/cortexproject/cortex/blob/21e83660515e7831f4081b6b98f53f9fd43560f3/pkg/ingester/ingester.go#L2318C20-L2318C33

We know execution never reached:

cortex/pkg/ingester/ingester.go

Line 2346 in 21e8366

i.TSDBState.compactionsTriggered.Inc()

Unfortunately, it never got as far as incrementing the compactions-triggered counter.

For now we are making sure Consul is stable, and we plan to migrate to memberlist soon.

Slack Thread:
(https://cloud-native.slack.com/archives/CCYDASBLP/p1743142699562759?thread_ts=1742969498.469779&cid=CCYDASBLP)
CompactBlock specifically interacts with the ring.
