Nomad: recommendations for singleton deployments #1473
Conversation
Many users have a requirement to run exactly one instance of a given allocation because it requires exclusive access to some cluster-wide resource; we'll refer to this here as a "singleton allocation". This is challenging to implement, so this document describes an accepted design to publish as a how-to/tutorial.
aimeeu left a comment:
Thanks for the great technical content! I left some style guide and presentation suggestions.
| workload needs exclusive access to a remote resource like a data store. Nomad | ||
| does not support singleton deployments as a built-in feature. Your workloads | ||
| continue to run even when the Nomad client agent has crashed, so ensuring | ||
| there's at most one allocation for a given workload some cooperation from the |
| there's at most one allocation for a given workload some cooperation from the | |
| there's at most one allocation for a given workload requires some cooperation from the |
missing a verb between "workload" and "some" so guessing "requires"??
|
| ## Design Goals | ||
|
| The configuration described here meets two primary design goals: |
| The configuration described here meets two primary design goals: | |
| The configuration described here meets these primary design goals: |
you have more than 2 bullet points...
| * The design will prevent a specific process with a task from running if there | ||
| is another instance of that task running anywhere else on the Nomad cluster. | ||
| * Nomad should be able to recover from failure of the task or the node on which | ||
| the task is running with minimal downtime, where "recovery" means that the | ||
| original task should be stopped and that Nomad should schedule a replacement | ||
| task. | ||
| * Nomad should minimize false positive detection of failures to avoid | ||
| unnecessary downtime during the cutover. |
| * The design will prevent a specific process with a task from running if there | |
| is another instance of that task running anywhere else on the Nomad cluster. | |
| * Nomad should be able to recover from failure of the task or the node on which | |
| the task is running with minimal downtime, where "recovery" means that the | |
| original task should be stopped and that Nomad should schedule a replacement | |
| task. | |
| * Nomad should minimize false positive detection of failures to avoid | |
| unnecessary downtime during the cutover. | |
| - The design prevents a specific process with a task from running if there | |
| is another instance of that task running anywhere else on the Nomad cluster. | |
| - Nomad should be able to recover from failure of the task or the node on which | |
| the task is running with minimal downtime, where "recovery" means that Nomad should stop the | |
| original task and schedule a replacement | |
| task. | |
| - Nomad should minimize false positive detection of failures to avoid | |
| unnecessary downtime during the cutover. |
style nits: - instead of * for unordered lists; present tense, active voice
| faster you make Nomad attempt to recover from failure, the more likely that a | ||
| transient failure causes a replacement to be scheduled and a subsequent |
| faster you make Nomad attempt to recover from failure, the more likely that a | |
| transient failure causes a replacement to be scheduled and a subsequent | |
| faster you make Nomad attempt to recover from failure, the more likely that a | |
| transient failure causes Nomad to schedule a replacement and a subsequent |
active voice nit
| allocation in a distributed system. This design will err on the side of | ||
| correctness: having 0 or 1 allocations running rather than the incorrect 1 or 2 | ||
| allocations running. |
| allocation in a distributed system. This design will err on the side of | |
| correctness: having 0 or 1 allocations running rather than the incorrect 1 or 2 | |
| allocations running. | |
| allocation in a distributed system. This design errs on the side of | |
| correctness: having zero or one allocations running rather than the incorrect one or two | |
| allocations running. |
| ```hcl | ||
| job "example" { | ||
| group "group" { | ||
|
| disconnect { | ||
| stop_on_client_after = "1m" | ||
| } | ||
|
| task "lock" { | ||
| leader = true | ||
| config { | ||
| driver = "raw_exec" | ||
| command = "/opt/lock-script.sh" | ||
| pid_mode = "host" | ||
| } | ||
|
| identity { | ||
| env = true # make NOMAD_TOKEN available to lock command | ||
| } | ||
| } | ||
|
| task "application" { | ||
| lifecycle { | ||
| hook = "poststart" | ||
| sidecar = true | ||
| } | ||
|
| config { | ||
| driver = "docker" | ||
| image = "example/app:1" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` |
| ```hcl | |
| job "example" { | |
| group "group" { | |
| disconnect { | |
| stop_on_client_after = "1m" | |
| } | |
| task "lock" { | |
| leader = true | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-script.sh" | |
| pid_mode = "host" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "application" { | |
| lifecycle { | |
| hook = "poststart" | |
| sidecar = true | |
| } | |
| config { | |
| driver = "docker" | |
| image = "example/app:1" | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| <CodeBlockConfig lineNumbers highlight="9"> | |
| ```hcl | |
| job "example" { | |
| group "group" { | |
| disconnect { | |
| stop_on_client_after = "1m" | |
| } | |
| task "lock" { | |
| leader = true | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-script.sh" | |
| pid_mode = "host" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "application" { | |
| lifecycle { | |
| hook = "poststart" | |
| sidecar = true | |
| } | |
| config { | |
| driver = "docker" | |
| image = "example/app:1" | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| </CodeBlockConfig> |
add code highlight
| The easiest way to implement the locking logic is to use `nomad var lock` as a | ||
| shim in your task. The jobspec below assumes there's a Nomad binary in the | ||
| container image. |
| The easiest way to implement the locking logic is to use `nomad var lock` as a | |
| shim in your task. The jobspec below assumes there's a Nomad binary in the | |
| container image. | |
| We recommend implementing the locking logic with `nomad var lock` as a shim in | |
| your task. This example jobspec assumes there's a Nomad binary in the container | |
| image. |
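For reference, here is a minimal sketch of the shim approach described above, assuming the container image bundles a `nomad` binary. The variable path `nomad/jobs/example/lock` and the entrypoint `/bin/app` are illustrative assumptions, not taken from the diff:

```hcl
# Sketch only: wrap the application entrypoint in `nomad var lock` so the app
# runs only while the lock on the variable path is held. The path and the
# entrypoint are assumptions for illustration.
job "example" {
  group "group" {
    task "application" {
      driver = "docker"

      config {
        image   = "example/app:1"
        command = "nomad"
        args    = ["var", "lock", "nomad/jobs/example/lock", "/bin/app"]
      }

      identity {
        env = true # expose NOMAD_TOKEN so the lock call can authenticate
      }
    }
  }
}
```

With this layout, `nomad var lock` acquires the lock, runs the child command while maintaining the lock, and releases it when the command exits.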
| You set this policy on the job with `nomad acl policy apply -namespace prod -job | ||
| example example-lock ./policy.hcl`. | ||
|
| ### Using `nomad var lock` |
| ### Using `nomad var lock` | |
| ## Implementation | |
| ### Use `nomad var lock` |
Add an H2 so we have Overview and Implementation in the right-hand page TOC
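As an aside, the `./policy.hcl` referenced above is not shown in the diff. A hedged sketch of what it might contain, assuming the lock is stored at the variable path `nomad/jobs/example/lock` (the path and capability list are assumptions):

```hcl
# Sketch of an ACL policy granting the job's workload identity access to the
# Nomad variable used as the lock. Attach it to the job with:
#   nomad acl policy apply -namespace prod -job example example-lock ./policy.hcl
namespace "prod" {
  variables {
    path "nomad/jobs/example/lock" {
      capabilities = ["read", "write", "list"]
    }
  }
}
```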
| ### Sidecar Lock | ||
|
| If cannot implement the lock logic in your application or with a shim such as | ||
| `nomad var lock`, you'rll need to implement it such that the task you are | ||
| locking is running as a sidecar of the locking task, which has | ||
| [`task.leader=true`][] set. |
| ### Sidecar Lock | |
| If cannot implement the lock logic in your application or with a shim such as | |
| `nomad var lock`, you'rll need to implement it such that the task you are | |
| locking is running as a sidecar of the locking task, which has | |
| [`task.leader=true`][] set. | |
| ### Sidecar lock | |
| If you cannot implement the lock logic in your application or with a shim such | |
| as `nomad var lock`, you need to implement it such that the task you are locking | |
| is running as a sidecar of the locking task, which has [`task.leader=true`][] | |
| set. |
| * The locking task must be in the same group as the task being locked. | ||
| * The locking task must be able to terminate the task being locked without the | ||
| Nomad client being up (i.e. they share the same PID namespace, or the locking | ||
| task is privileged). | ||
| * The locking task must have a way of signalling the task being locked that it | ||
| is safe to start. For example, the locking task can write a sentinel file into | ||
| the /alloc directory, which the locked task tries to read on startup and | ||
| blocks until it exists. | ||
|
| If the third requirement cannot be met, then you’ll need to split the lock | ||
| acquisition and lock heartbeat into separate tasks: | ||
|
| ```hcl | ||
| job "example" { | ||
| group "group" { | ||
|
| disconnect { | ||
| stop_on_client_after = "1m" | ||
| } | ||
|
| task "acquire" { | ||
| lifecycle { | ||
| hook = "prestart" | ||
| sidecar = false | ||
| } | ||
| config { | ||
| driver = "raw_exec" | ||
| command = "/opt/lock-acquire-script.sh" | ||
| } | ||
| identity { | ||
| env = true # make NOMAD_TOKEN available to lock command | ||
| } | ||
| } | ||
|
| task "heartbeat" { | ||
| leader = true | ||
| config { | ||
| driver = "raw_exec" | ||
| command = "/opt/lock-heartbeat-script.sh" | ||
| pid_mode = "host" | ||
| } | ||
| identity { | ||
| env = true # make NOMAD_TOKEN available to lock command | ||
| } | ||
| } | ||
|
| task "application" { | ||
| lifecycle { | ||
| hook = "poststart" | ||
| sidecar = true | ||
| } | ||
|
| config { | ||
| driver = "docker" | ||
| image = "example/app:1" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
| If the primary task is configured to [`restart`][], the task should be able to | ||
| restart within the lock TTL in order to minimize flapping on restart. This | ||
| improves availability but isn't required for correctness. |
| * The locking task must be in the same group as the task being locked. | |
| * The locking task must be able to terminate the task being locked without the | |
| Nomad client being up (i.e. they share the same PID namespace, or the locking | |
| task is privileged). | |
| * The locking task must have a way of signalling the task being locked that it | |
| is safe to start. For example, the locking task can write a sentinel file into | |
| the /alloc directory, which the locked task tries to read on startup and | |
| blocks until it exists. | |
| If the third requirement cannot be met, then you’ll need to split the lock | |
| acquisition and lock heartbeat into separate tasks: | |
| ```hcl | |
| job "example" { | |
| group "group" { | |
| disconnect { | |
| stop_on_client_after = "1m" | |
| } | |
| task "acquire" { | |
| lifecycle { | |
| hook = "prestart" | |
| sidecar = false | |
| } | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-acquire-script.sh" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "heartbeat" { | |
| leader = true | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-heartbeat-script.sh" | |
| pid_mode = "host" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "application" { | |
| lifecycle { | |
| hook = "poststart" | |
| sidecar = true | |
| } | |
| config { | |
| driver = "docker" | |
| image = "example/app:1" | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| If the primary task is configured to [`restart`][], the task should be able to | |
| restart within the lock TTL in order to minimize flapping on restart. This | |
| improves availability but isn't required for correctness. | |
| - Must be in the same group as the task being locked. | |
| - Must be able to terminate the task being locked without the Nomad client being | |
| up. For example, they share the same PID namespace, or the locking task is | |
| privileged. | |
| - Must have a way of signalling the task being locked that it is safe to start. | |
| For example, the locking task can write a sentinel file into the `/alloc` | |
| directory, which the locked task tries to read on startup and blocks until it | |
| exists. | |
| If you cannot meet the third requirement, then you need to split the lock | |
| acquisition and lock heartbeat into separate tasks. | |
| <CodeBlockConfig lineNumbers highlight="8-20,22-32"> | |
| ```hcl | |
| job "example" { | |
| group "group" { | |
| disconnect { | |
| stop_on_client_after = "1m" | |
| } | |
| task "acquire" { | |
| lifecycle { | |
| hook = "prestart" | |
| sidecar = false | |
| } | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-acquire-script.sh" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "heartbeat" { | |
| leader = true | |
| config { | |
| driver = "raw_exec" | |
| command = "/opt/lock-heartbeat-script.sh" | |
| pid_mode = "host" | |
| } | |
| identity { | |
| env = true # make NOMAD_TOKEN available to lock command | |
| } | |
| } | |
| task "application" { | |
| lifecycle { | |
| hook = "poststart" | |
| sidecar = true | |
| } | |
| config { | |
| driver = "docker" | |
| image = "example/app:1" | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| </CodeBlockConfig> | |
| If you configured the primary task to [`restart`][], the task should be able to | |
| restart within the lock TTL in order to minimize flapping on restart. This | |
| improves availability but isn't required for correctness. |
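To make the restart-within-TTL point concrete, here is a hedged sketch of a group-level `restart` block; the specific values are assumptions chosen so that local restarts complete well inside a typical lock TTL, not recommendations from the diff:

```hcl
# Sketch only: keep the restart delay (and the total time needed to restart)
# much shorter than the lock TTL so a locally restarted task resumes before
# the lock expires and a replacement is scheduled elsewhere.
group "group" {
  restart {
    attempts = 2
    interval = "5m"
    delay    = "5s"  # well inside the lock TTL
    mode     = "fail"
  }
}
```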
Links
Jira: https://hashicorp.atlassian.net/browse/NMD-1039
Deploy previews: https://unified-docs-frontend-preview-qx8ebllbe-hashicorp.vercel.app/nomad/docs/job-declare/strategy/singleton