[bitnami/mongodb-sharded] fix(mongos): use exact match in liveness pr…#36488
Open
delthas wants to merge 2 commits into bitnami:main
…obe pgrep

The mongos liveness probe uses `pgrep mongos` to check if the mongos process is running. However, `pgrep` performs substring matching by default, so `pgrep mongos` also matches `mongosh` processes (since "mongos" is a substring of "mongosh").

This causes a critical failure mode during startup: when all MongoDB sharded StatefulSets start simultaneously, mongos may attempt to connect to configsvr before its replica set is fully initialized. The Bitnami entrypoint script runs `mongosh --host configsvr` to verify configsvr availability, but this call blocks indefinitely when configsvr's port is open but the replica set hasn't completed primary election or auth user creation. Since the `mongosh` call has no timeout, it hangs forever, and the actual `mongos` router process is never started.

At this point, the liveness probe should detect that mongos is not running and restart the container. However, `pgrep mongos` matches the hung `mongosh` child process, so the liveness probe passes. The container remains stuck permanently: liveness passes (due to the substring match on mongosh), readiness fails (correctly, since mongos never started), and Kubernetes never restarts it.

Meanwhile, shard data nodes complete their own init, stop mongod, and try to register with mongos. Since mongos never started, shards loop on "timeout reached before the port went into state inuse" and get killed by their own liveness probes, entering CrashLoopBackOff.

The fix adds the `-x` flag to `pgrep`, which requires an exact match on the process name. With this change, `pgrep -x mongos` matches only the real `mongos` process and not `mongosh`, so if the entrypoint hangs, the liveness probe correctly fails and Kubernetes restarts the container. On retry, configsvr is typically ready, and startup succeeds.

Signed-off-by: delthas <delthas@dille.cc>
Force-pushed from 0b2751b to 252dc80
Signed-off-by: Bitnami Bot <bitnami.bot@broadcom.com>
Description of the change
The mongos liveness probe uses `pgrep mongos` to check if the mongos router process is running. However, `pgrep` performs substring matching by default, so `pgrep mongos` also matches `mongosh` processes (since "mongos" is a substring of "mongosh").

This causes a critical failure mode during startup: when all MongoDB sharded StatefulSets start simultaneously, mongos may attempt to connect to configsvr before its replica set is fully initialized. The Bitnami entrypoint script runs `mongosh --host configsvr` to verify configsvr availability, but this call can block indefinitely when configsvr's port is open but the replica set hasn't completed primary election or auth user creation. Since the actual `mongos` process is never started, the liveness probe should fail and trigger a container restart, but `pgrep mongos` matches the hung `mongosh` child process, so the probe passes and the container remains stuck permanently.

The fix adds the `-x` flag to `pgrep`, requiring an exact match on the process name. With this change, `pgrep -x mongos` matches only the real `mongos` process and not `mongosh`.

Benefits

- The liveness probe correctly fails when the `mongos` process has not started, even if a `mongosh` process is running in the container.

Possible drawbacks

None. The `-x` flag restricts `pgrep` to exact process name matching, which is strictly more correct than substring matching. The liveness probe is intended to check for the `mongos` process, not `mongosh`.

Applicable issues
%
Additional information
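The difference between default and exact matching can be reproduced outside the chart with a stand-in process. The following is a minimal sketch, not taken from the chart or its entrypoint: it assumes a Linux host with `pgrep`/`pkill` (procps) available, and the names `workdir`, `fake_pid`, `substring_result`, and `exact_result` are hypothetical. The stand-in is just a script file named `mongosh` that sleeps; no real MongoDB binaries are involved.

```shell
#!/bin/sh
# Hypothetical stand-in for a hung mongosh client: a script whose file name
# is "mongosh". On Linux, a directly executed script reports the script's
# file name as its process name, which is what pgrep matches against.
workdir="$(mktemp -d)"
printf '#!/bin/sh\nsleep 30\n' > "$workdir/mongosh"
chmod +x "$workdir/mongosh"

"$workdir/mongosh" &
fake_pid=$!
sleep 1   # give the child time to exec so its process name is set

# Default matching: the pattern may match anywhere in the process name.
if pgrep mongos | grep -qx "$fake_pid"; then
  substring_result="matched"
else
  substring_result="not matched"
fi

# Exact matching (-x): the whole process name must equal "mongos".
if pgrep -x mongos | grep -qx "$fake_pid"; then
  exact_result="matched"
else
  exact_result="not matched"
fi

echo "pgrep mongos    -> stand-in $substring_result"
echo "pgrep -x mongos -> stand-in $exact_result"

# Clean up the stand-in, its sleep child, and the temporary directory.
pkill -P "$fake_pid" || true
kill "$fake_pid" 2>/dev/null || true
rm -rf "$workdir"
```

Under those assumptions, the first check should find the stand-in and the second should not, which mirrors how the liveness probe behaves before and after adding `-x`. Filtering the `pgrep` output by `fake_pid` keeps the result independent of any other processes on the host.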
The failure sequence in detail:

1. All MongoDB sharded StatefulSets start simultaneously.
2. The mongos entrypoint runs `wait-for-port` on configsvr:27017, which succeeds because the TCP port is open.
3. The entrypoint runs `mongosh --host configsvr -u root -p ... admin` to verify configsvr. This blocks forever because the replica set isn't initialized yet (no primary, no auth users).
4. The `mongos` router is never started. The only processes are the bash entrypoint (PID 1, blocked) and the hung `mongosh` child.
5. `pgrep mongos` matches `mongosh` (substring) → liveness probe passes → Kubernetes never restarts the container.

With `pgrep -x mongos`, step 5 correctly fails, Kubernetes restarts the container, and on retry configsvr is typically ready.

Verified by exec'ing into a running mongos container:
Checklist
- Chart version bumped in `Chart.yaml` according to semver. This is not necessary when the changes only affect README.md files.
- Variables are documented in the values.yaml and added to the `README.md` using readme-generator-for-helm