Commit 46f2009

Nomad: recommendations for singleton deployments
Many users have a requirement to run exactly one instance of a given allocation because it requires exclusive access to some cluster-wide resource, which we'll refer to here as a "singleton allocation". This is challenging to implement, so this document is intended to describe an accepted design to publish as a how-to/tutorial.
1 parent 0300018 commit 46f2009

1 file changed, +282 -0 lines changed

  • content/nomad/v1.11.x/content/docs/job-declare/strategy

---
layout: docs
page_title: Configure singleton deployments
description: |-
  Declare a job that guarantees only a single instance can run at a time, with
  minimal downtime.
---

# Configure singleton deployments

A singleton deployment is one where there is at most one instance of a given
allocation running on the cluster at one time. You might need this if the
workload needs exclusive access to a remote resource such as a data store. Nomad
does not support singleton deployments as a built-in feature. Your workloads
continue to run even when the Nomad client agent has crashed, so ensuring
there's at most one allocation for a given workload requires some cooperation
from the job. This document describes how to implement singleton deployments.

## Design Goals

The configuration described here meets the following design goals:

* The design prevents a specific process within a task from running if there
  is another instance of that task running anywhere else on the Nomad cluster.
* Nomad should be able to recover from failure of the task or the node on which
  the task is running with minimal downtime, where "recovery" means that the
  original task should be stopped and that Nomad should schedule a replacement
  task.
* Nomad should minimize false positive detection of failures to avoid
  unnecessary downtime during the cutover.

There's a tradeoff between recovery speed and false positives. The faster you
make Nomad attempt to recover from failure, the more likely it is that a
transient failure causes a replacement to be scheduled and a subsequent
downtime.

Note that it's not possible to design a perfectly zero-downtime singleton
allocation in a distributed system. This design errs on the side of
correctness: having 0 or 1 allocations running rather than the incorrect 1 or 2
allocations running.

## Overview

There are several options available for some details of the implementation, but
all of them include the following:

* You must have a distributed lock with a TTL that's refreshed from the
  allocation. The process that sets and refreshes the lock must have its
  lifecycle tied to the main task. It can be in-process, in-task with
  supervision, or run as a sidecar. If the allocation cannot obtain the lock,
  then it must not start whatever process or operation is intended to be a
  singleton. After a configurable window without obtaining the lock, the
  allocation must fail.
* You must set the [`group.disconnect.stop_on_client_after`][] field. This
  forces a Nomad client that's disconnected from the server to stop the
  singleton allocation, which in turn releases the lock or allows its TTL to
  expire.

The values of the three timers (the lock TTL, the time it takes the alloc to
give up, and the `stop_on_client_after` duration) are the values you can tune
to reduce the maximum amount of downtime the application can have.
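
For example, the `disconnect` block below pairs a one-minute
`stop_on_client_after` window with a shorter lock TTL that the locking process
manages itself. The values are illustrative assumptions, not recommendations.

```hcl
group "group" {
  disconnect {
    # A client that loses contact with the servers stops this allocation
    # after one minute, which lets the lock TTL expire so a replacement
    # allocation can acquire the lock.
    stop_on_client_after = "1m"
  }

  # The lock TTL (for example 15s) and the give-up window (for example 30s)
  # are enforced by the locking process inside the task, not by Nomad.
}
```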

The Nomad [Locks API][] can support the operations needed. In pseudo-code these
operations are:

* `PUT /v1/var/:path?lock-acquire`
  * On success: start heartbeating every 1/2 TTL.
  * On conflict or failure: retry with backoff and timeout.
  * Once out of attempts, exit the process with an error code.
* To heartbeat, `PUT /v1/var/:path?lock-renew`
  * On success: continue.
  * On conflict: exit the process with an error code.
  * On failure: retry with backoff up to the TTL.
  * If the TTL expires, attempt to revoke the lock, then exit the process with
    an error code.

The allocation can safely use the Nomad [Task API][] socket to write to the
Locks API, rather than communicating with the server directly. This reduces load
on the server and speeds up detection of failed client nodes because the
disconnected client cannot forward the Task API requests to the leader.
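
If you implement this logic yourself, the sketch below shows the shape of the
acquire and renew loop in Go over the Task API unix socket. The lock path, TTL,
retry policy, and request payload are illustrative assumptions, and the sketch
expects the workload identity token to be available in the `NOMAD_TOKEN`
environment variable; consult the [Locks API][] reference for the exact request
and response schema.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

const (
	lockPath = "nomad/jobs/example/lock" // same path used in the jobspecs below
	ttl      = 15 * time.Second          // illustrative lock TTL
	attempts = 10                        // give up after this many failed acquires
)

// taskAPIClient returns an HTTP client that dials the Task API unix socket in
// the task's secrets directory instead of a TCP address.
func taskAPIClient() *http.Client {
	socket := os.Getenv("NOMAD_SECRETS_DIR") + "/api.sock"
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", socket)
		},
	}}
}

// put issues a PUT against the Locks API through the Task API socket and
// reports the HTTP status code.
func put(c *http.Client, query string, body []byte) (int, error) {
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost/v1/var/"+lockPath+"?"+query, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("X-Nomad-Token", os.Getenv("NOMAD_TOKEN"))
	resp, err := c.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	client := taskAPIClient()

	// Acquire with backoff; exit with an error code once out of attempts.
	acquired := false
	for i := 0; i < attempts && !acquired; i++ {
		code, err := put(client, "lock-acquire",
			[]byte(`{"Lock": {"TTL": "15s"}}`)) // simplified payload
		acquired = err == nil && code == http.StatusOK
		if !acquired {
			time.Sleep(time.Duration(i+1) * time.Second) // linear backoff for brevity
		}
	}
	if !acquired {
		fmt.Fprintln(os.Stderr, "could not acquire lock, exiting")
		os.Exit(1)
	}

	// Heartbeat every 1/2 TTL; exit with an error code on conflict or failure
	// so Nomad can restart or replace the allocation.
	for range time.Tick(ttl / 2) {
		code, err := put(client, "lock-renew", nil)
		if err != nil || code != http.StatusOK {
			fmt.Fprintln(os.Stderr, "lost lock, exiting")
			os.Exit(1)
		}
	}
}
```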

The [`nomad var lock`][] command implements this logic and can be used to shim
the process being locked.

### ACLs

Allocations cannot write to Variables by default. You must configure a
[workload-associated ACL policy][] that allows write access in the
[`namespace.variables`][] block. For example, the following ACL policy allows
access to write a lock on the path `nomad/jobs/myjob/lock` in the `prod`
namespace:

```hcl
namespace "prod" {
  variables {
    path "nomad/jobs/myjob/lock" {
      capabilities = ["write", "read", "list"]
    }
  }
}
```

You set this policy on the job with `nomad acl policy apply -namespace prod -job
myjob myjob-lock ./policy.hcl`.

### Using `nomad var lock`

The easiest way to implement the locking logic is to use `nomad var lock` as a
shim in your task. The jobspec below assumes there's a Nomad binary in the
container image.

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "primary" {
      driver = "docker"

      config {
        image   = "example/app:1"
        command = "nomad"
        args = [
          "var", "lock", "nomad/jobs/example/lock", # lock
          "busybox", "httpd",                       # application
          "-vv", "-f", "-p", "8001", "-h", "/local" # application args
        ]
      }
    }
  }
}
```

If you don't want to ship a Nomad binary in the container image, you can make a
read-only mount of the binary from a host volume. This only works in cases
where the Nomad binary has been statically linked or you have glibc in the
container image.

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    volume "binaries" {
      type      = "host"
      source    = "binaries"
      read_only = true
    }

    task "primary" {
      driver = "docker"

      config {
        image   = "example/app:1"
        command = "/opt/bin/nomad"
        args = [
          "var", "lock", "nomad/jobs/example/lock", # lock
          "busybox", "httpd",                       # application
          "-vv", "-f", "-p", "8001", "-h", "/local" # application args
        ]
      }

      volume_mount {
        volume      = "binaries"
        destination = "/opt/bin"
      }
    }
  }
}
```
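
For the `binaries` host volume to exist, each client node that can run this job
needs a matching `host_volume` block in its agent configuration. A minimal
sketch, assuming the Nomad binary has been staged at an illustrative path on
the host:

```hcl
client {
  host_volume "binaries" {
    # Directory on the client host that contains the statically linked
    # Nomad binary; the path is an example only.
    path      = "/opt/nomad-host-binaries"
    read_only = true
  }
}
```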

### Sidecar Lock

If you cannot implement the lock logic in your application or with a shim such
as `nomad var lock`, you'll need to implement it such that the task you are
locking runs as a sidecar of the locking task, which has [`task.leader=true`][]
set.

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "lock" {
      leader = true
      driver = "raw_exec"

      config {
        command  = "/opt/lock-script.sh"
        pid_mode = "host"
      }
    }

    task "primary" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"

      config {
        image = "example/app:1"
      }
    }
  }
}
```

The locking task has the following requirements:

* The locking task must be in the same group as the task being locked.
* The locking task must be able to terminate the task being locked without the
  Nomad client being up (i.e. they share the same PID namespace, or the locking
  task is privileged).
* The locking task must have a way of signalling the task being locked that it
  is safe to start. For example, the locking task can write a sentinel file into
  the `/alloc` directory, which the locked task tries to read on startup and
  blocks until it exists, as in the sketch after this list.
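
A minimal sketch of that sentinel-file handshake, assuming both sides can run a
small helper; the subcommand names and the `lock-acquired` file name are
illustrative:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// sentinelPath is a file in the shared alloc directory visible to every task
// in the group.
func sentinelPath() string {
	return filepath.Join(os.Getenv("NOMAD_ALLOC_DIR"), "lock-acquired")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: sentinel [signal|wait]")
		os.Exit(2)
	}
	switch os.Args[1] {
	case "signal": // run by the locking task once it holds the lock
		if err := os.WriteFile(sentinelPath(), []byte("ok"), 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	case "wait": // run by the locked task before starting its real work
		for {
			if _, err := os.Stat(sentinelPath()); err == nil {
				return
			}
			time.Sleep(time.Second)
		}
	}
}
```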

If the third requirement cannot be met, then you'll need to split the lock
acquisition and lock heartbeat into separate tasks:

```hcl
job "example" {
  group "group" {

    disconnect {
      stop_on_client_after = "1m"
    }

    task "acquire" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "raw_exec"

      config {
        command = "/opt/lock-acquire-script.sh"
      }
    }

    task "heartbeat" {
      leader = true
      driver = "raw_exec"

      config {
        command  = "/opt/lock-heartbeat-script.sh"
        pid_mode = "host"
      }
    }

    task "primary" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      driver = "docker"

      config {
        image = "example/app:1"
      }
    }
  }
}
```

If the primary task is configured to [`restart`][], the task should be able to
restart within the lock TTL in order to minimize flapping on restart. This
improves availability but isn't required for correctness.
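
For example, with a lock TTL on the order of tens of seconds, a `restart` block
along these lines keeps the restart delay well inside the TTL. The specific
values are assumptions to illustrate the relationship, not recommendations:

```hcl
group "group" {
  restart {
    attempts = 3
    interval = "10m"
    delay    = "5s"  # keep the delay short relative to the lock TTL
    mode     = "fail"
  }
}
```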

[`group.disconnect.stop_on_client_after`]: /nomad/docs/job-specification/disconnect#stop_on_client_after
[Locks API]: /nomad/api-docs/variables/locks
[Task API]: /nomad/api-docs/task-api
[`nomad var lock`]: /nomad/commands/var/lock
[workload-associated ACL policy]: /nomad/docs/concepts/workload-identity#workload-associated-acl-policies
[`namespace.variables`]: /nomad/docs/other-specifications/acl-policy#variables
[`task.leader=true`]: /nomad/docs/job-specification/task#leader
[`restart`]: /nomad/docs/job-specification/restart
