Trino 476 on AWS EKS - Thread exhaustion on coordinator causes crash #26229
Unanswered · jonroquet2 asked this question in Q&A · Replies: 0 comments
We're deploying Trino on an EKS cluster in AWS. Queries run fine at low concurrency, but when we scale the load up to 12 concurrent user-request threads, we see the following error in the coordinator logs:
```
[714.377s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[714.377s][warning][os,thread] Failed to start the native thread for java.lang.Thread "remote-task-callback-240"
[714.377s][error  ][jvmti    ] Posting Resource Exhausted event: unable to create native thread: possibly out of memory or process/resource limits reached
ResourceExhausted: unable to create native thread: possibly out of memory or process/resource limits reached: killing current process!
[714.378s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
```

Our assumption is that the coordinator pod is running out of native threads: this isn't a JVM heap problem, it's the OS refusing to create new native threads. When we shell into the coordinator and check the limits, we see this:
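One quick way to confirm this diagnosis from inside the coordinator pod is to compare the JVM's live thread count against the per-user process limit (`ulimit -u`), which on Linux also caps threads. A minimal sketch; the `pgrep` pattern for the Trino server process is an assumption, so adjust it to your image (here the current shell `$$` stands in so the snippet runs anywhere):

```shell
# Count a process's live threads via /proc/<pid>/task and compare
# to the per-user process/thread limit reported by ulimit -u.
PID=$$  # stand-in; for Trino try: PID=$(pgrep -f io.trino.server | head -n1)  (assumption)
THREADS=$(ls /proc/"$PID"/task | wc -l)
NPROC_LIMIT=$(ulimit -u)
echo "threads=$THREADS limit=$NPROC_LIMIT"
```

If the thread count approaches the limit as queries ramp up, `pthread_create` will start failing with `EAGAIN` exactly as in the log above.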
```
real-time non-blocking time  (microseconds, -R) unlimited
core file size               (blocks, -c)       unlimited
data seg size                (kbytes, -d)       unlimited
scheduling priority          (-e)               0
file size                    (blocks, -f)       unlimited
pending signals              (-i)               30446
max locked memory            (kbytes, -l)       unlimited
max memory size              (kbytes, -m)       unlimited
open files                   (-n)               1024
pipe size                    (512 bytes, -p)    8
POSIX message queues         (bytes, -q)        819200
real-time priority           (-r)               0
stack size                   (kbytes, -s)       10240
cpu time                     (seconds, -t)      unlimited
max user processes           (-u)               1024
virtual memory               (kbytes, -v)       unlimited
file locks                   (-x)               unlimited
```

We suspect the low `-n` and `-u` limits are causing the issue. However, every attempt at overriding these limits has failed: a custom image, a Helm chart override, and an EC2 configuration override. Can anyone help us understand where we're going wrong?
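Since a container inherits its ulimits from the container runtime rather than from the pod spec, a Helm or image-level override alone typically cannot raise them; the change has to happen on the node. One approach (a sketch, assuming an EKS AMI where containerd runs as a systemd unit; the paths and values are assumptions to adapt) is a systemd drop-in applied via node user-data:

```shell
# Runs as root in EC2 user-data / a node bootstrap script (assumption).
# containerd's soft limits become the defaults every container inherits.
mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' > /etc/systemd/system/containerd.service.d/limits.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
EOF
systemctl daemon-reload
systemctl restart containerd
```

Two other caps worth checking: the kubelet's per-pod PID limit (`podPidsLimit` in the kubelet configuration), which bounds threads per pod via the pids cgroup regardless of ulimits, and the JVM thread stack size (`-Xss`), since each new thread also needs stack memory.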