Nomad
Configure reschedule
Tasks can sometimes fail due to network, CPU or memory issues on the node
running the task. In such situations, Nomad can reschedule the task on another
node. The reschedule stanza can be used to configure how Nomad
should try placing failed tasks on another node in the cluster. Reschedule
attempts have a delay between each attempt, and the delay can be configured to
increase between each rescheduling attempt according to a configurable
delay_function. Consult the reschedule stanza documentation for more
information.
Service jobs are configured by default to have unlimited reschedule attempts. You should use the reschedule stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad. The CLI shows the number of previous attempts if there is a limit on the number of reschedule attempts. The CLI also shows when the next reschedule will be attempted.
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0
Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now
Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago
$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now
Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:27646
Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A
Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client