[bitnami/mongodb-sharded] fix(mongos): use exact match in liveness pr…#36488

Open
delthas wants to merge 2 commits into bitnami:main from delthas:fix-pgrep-mongodb
@delthas delthas commented Mar 24, 2026

Description of the change

The mongos liveness probe uses pgrep mongos to check if the mongos router process is running. However, pgrep performs substring matching by default, so pgrep mongos also matches mongosh processes (since "mongos" is a substring of "mongosh").

This causes a critical failure mode during startup: when all MongoDB sharded StatefulSets start simultaneously, mongos may attempt to connect to configsvr before the configsvr replica set is fully initialized. The Bitnami entrypoint script runs mongosh --host configsvr to verify configsvr availability, but this call can block indefinitely when configsvr's port is open but the replica set hasn't completed primary election or auth user creation. Because the entrypoint is blocked, the actual mongos process is never started, so the liveness probe should fail and trigger a container restart. Instead, pgrep mongos matches the hung mongosh child process, the probe passes, and the container remains stuck permanently.

The fix adds the -x flag to pgrep, requiring an exact match on the process name. With this change, pgrep -x mongos matches only the real mongos process and not mongosh.
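
The difference between the two invocations can be reproduced without MongoDB. The following is a minimal sketch, not part of the chart: it assumes a Linux host with procps or BusyBox pgrep, and it uses a throwaway script named mongosh as a stand-in for the hung shell. The names workdir, fake_pid, substring_match and exact_match are illustrative only.

```shell
#!/bin/sh
# Stand-in for a hung mongosh: an executable script whose process name is
# "mongosh" (assumption: on Linux, a shebang script runs under its file name).
workdir=$(mktemp -d)
printf '#!/bin/sh\nwhile :; do sleep 1; done\n' > "$workdir/mongosh"
chmod +x "$workdir/mongosh"
"$workdir/mongosh" >/dev/null 2>&1 &
fake_pid=$!
sleep 1   # let the child exec so its process name becomes "mongosh"

# Default matching: the pattern may match anywhere in the process name.
if pgrep mongos | grep -qx "$fake_pid"; then substring_match=yes; else substring_match=no; fi

# -x: the pattern must match the whole process name.
if pgrep -x mongos | grep -qx "$fake_pid"; then exact_match=yes; else exact_match=no; fi

# Clean up the stand-in process and its directory.
kill "$fake_pid"
wait "$fake_pid" 2>/dev/null || true
rm -rf "$workdir"
echo "substring_match=$substring_match exact_match=$exact_match"
```

Under those assumptions the stand-in process should be reported by pgrep mongos but not by pgrep -x mongos, which is the behavior the fix relies on.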

Benefits

  • The mongos liveness probe now correctly detects when the mongos process has not started, even if a mongosh process is running in the container.
  • Eliminates a permanent deadlock where mongos hangs forever during startup due to a race condition with configsvr initialization. On restart, configsvr is typically ready and startup succeeds.
  • Also prevents shard data nodes from entering CrashLoopBackOff, since they depend on mongos being available to register themselves.

Possible drawbacks

None. The -x flag restricts pgrep to exact process name matching, which is strictly more correct than substring matching. The liveness probe is intended to check for the mongos process, not mongosh.

Applicable issues

%

Additional information

The failure sequence in detail:

  1. All StatefulSets (configsvr, mongos, shards) start simultaneously
  2. The mongos entrypoint runs wait-for-port on configsvr:27017 — succeeds because the TCP port is open
  3. The entrypoint runs mongosh --host configsvr -u root -p ... admin to verify configsvr — this blocks forever because the replica set isn't initialized yet (no primary, no auth users)
  4. The mongos router is never started. The only processes are the bash entrypoint (PID 1, blocked) and the hung mongosh child
  5. pgrep mongos matches mongosh (substring) → liveness probe passes → Kubernetes never restarts the container
  6. Shard data nodes finish init, try to register with mongos, fail, and enter CrashLoopBackOff

With pgrep -x mongos, step 5 correctly fails, Kubernetes restarts the container, and on retry configsvr is ready.
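
Step 5 comes down to the exit status of the probe command: an exec probe passes on exit status 0, and pgrep exits 0 only when at least one process matches. The sketch below imitates the stuck container from step 4 under stated assumptions: probe_verdict is a hypothetical helper rather than anything from the chart, the mongosh stand-in is a throwaway script, and no process named exactly mongos is running on the machine where it is tried.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command and report the verdict an exec
# probe would reach (exit status 0 = pass, anything else = fail).
probe_verdict() {
  if "$@" >/dev/null 2>&1; then echo pass; else echo fail; fi
}

# Stand-in for the stuck container at step 4: a process named "mongosh"
# exists and (by assumption) no process named exactly "mongos" does.
workdir=$(mktemp -d)
printf '#!/bin/sh\nwhile :; do sleep 1; done\n' > "$workdir/mongosh"
chmod +x "$workdir/mongosh"
"$workdir/mongosh" >/dev/null 2>&1 &
fake_pid=$!
sleep 1   # let the child exec so its process name becomes "mongosh"

old_probe=$(probe_verdict pgrep mongos)      # substring match
new_probe=$(probe_verdict pgrep -x mongos)   # exact match

# Clean up the stand-in process and its directory.
kill "$fake_pid"
wait "$fake_pid" 2>/dev/null || true
rm -rf "$workdir"
echo "old_probe=$old_probe new_probe=$new_probe"
```

If the assumptions hold, the substring probe reports pass even though no mongos router exists, while the exact-match probe reports fail, which is what lets Kubernetes restart the container.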

Verified by exec'ing into a running mongos container:

$ pgrep -xa mongos    # exact: only matches the real mongos process
1 /opt/bitnami/mongodb/bin/mongos --config=...

$ pgrep -a mongos     # substring: also matches mongosh
1 /opt/bitnami/mongodb/bin/mongos --config=...
379 mongosh mongodb://127.0.0.1

Checklist

  • Chart version bumped in Chart.yaml according to semver. This is not necessary when the changes only affect README.md files.
  • Variables are documented in the values.yaml and added to the README.md using readme-generator-for-helm
  • Title of the pull request follows this pattern [bitnami/<name_of_the_chart>] Descriptive title
  • All commits signed off and in agreement of Developer Certificate of Origin (DCO)

…obe pgrep

The mongos liveness probe uses `pgrep mongos` to check if the mongos
process is running. However, `pgrep` performs substring matching by
default, so `pgrep mongos` also matches `mongosh` processes (since
"mongos" is a substring of "mongosh").

This causes a critical failure mode during startup: when all MongoDB
sharded StatefulSets start simultaneously, mongos may attempt to connect
to configsvr before its replica set is fully initialized. The Bitnami
entrypoint script runs `mongosh --host configsvr` to verify configsvr
availability, but this call blocks indefinitely when configsvr's port is
open but the replica set hasn't completed primary election or auth user
creation. Since the `mongosh` call has no timeout, it hangs forever,
and the actual `mongos` router process is never started.

At this point, the liveness probe should detect that mongos is not
running and restart the container. However, `pgrep mongos` matches the
hung `mongosh` child process, so the liveness probe passes. The
container remains stuck permanently: liveness passes (due to the
substring match on mongosh), readiness fails (correctly, since mongos
never started), and Kubernetes never restarts it.

Meanwhile, shard data nodes complete their own init, stop mongod, and
try to register with mongos. Since mongos never started, shards loop on
"timeout reached before the port went into state inuse" and get killed
by their own liveness probes, entering CrashLoopBackOff.

The fix adds the `-x` flag to `pgrep`, which requires an exact match on
the process name. With this change, `pgrep -x mongos` matches only the
real `mongos` process and not `mongosh`, so if the entrypoint hangs, the
liveness probe correctly fails and Kubernetes restarts the container.
On retry, configsvr is typically ready, and startup succeeds.

Signed-off-by: delthas <delthas@dille.cc>
@delthas delthas force-pushed the fix-pgrep-mongodb branch from 0b2751b to 252dc80 Compare March 24, 2026 08:35
Signed-off-by: Bitnami Bot <bitnami.bot@broadcom.com>