Overview
Autopilot is a set of features that provides automatic, operator-friendly management of Nomad servers. It includes automated upgrades, monitoring of the Raft cluster state, and stable server introduction, which together simplify the Nomad upgrade process.
See the Autopilot tutorial page for a basic tutorial, and see below for more detail.
Prerequisites
- This applies only if you are running Nomad 1.3 or older: set raft_protocol to 3 on all server nodes, because older releases defaulted to a lower Raft protocol version. This may also apply if you upgraded from an older version but never raised raft_protocol to 3. If you are running an older Nomad version with the parameter set to 2 or lower, plan an upgrade of Nomad Enterprise. Refer to the Raft Protocol Version Compatibility documentation for more details.
- Add the autopilot{} block and its parameters to the server node configuration before bootstrapping the cluster. Alternatively, use the nomad operator autopilot set-config command or the /v1/operator/autopilot/configuration API endpoint. Refer to the Autopilot documentation for more information on how to enable Autopilot.
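As a rough illustration of the two prerequisites above, the following sketch shows where the settings might sit in a server's agent configuration file. The bootstrap_expect value of 3 is an assumption chosen for the example, not a recommendation from this guide, and the autopilot block is kept to a single parameter; the individual parameters are covered later on this page.

```hcl
# Minimal sketch of a Nomad server agent configuration (values are illustrative).
server {
    enabled          = true
    bootstrap_expect = 3 # assumed cluster size for this example

    # Only needed explicitly on older Nomad versions, or on clusters
    # that were upgraded without ever raising the Raft protocol version.
    raft_protocol = 3
}

# Autopilot settings live in their own top-level block, not inside server{}.
autopilot {
    cleanup_dead_servers = true
}
```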
We recommend a build pipeline that uses Packer to create golden images, Terraform to deploy the machines, and a configuration management tool such as Ansible to perform any OS-level operations after VM provisioning.
Use the pipeline to deploy Nomad VMs with the new Nomad version or to make any changes to the OS or Nomad configuration. See the Using Terraform to Configure Nomad section for more details and recommendations on the deployment process.
Example pipeline
Below is an example pipeline of what the upgrade process may look like with Autopilot, Packer, and HCP Terraform.
- Packer builds a new Nomad Enterprise machine image and stores it in an image registry. This example uses an AWS EC2 AMI, but the same principles apply to vSphere Content Library and Azure Compute Gallery.
- HCP Terraform uses a new workspace to deploy a new set of Nomad server VMs.
 Figure 1: Terraform uses a new workspace to deploy a new set of Nomad servers.
- The new 1.8 nodes join the 1.7 cluster.
 Figure 2: The new nodes are joining the cluster.
- Autopilot then automatically demotes the old server nodes to non-voting members so that they no longer participate in the quorum.
 Figure 3: Autopilot automatically demotes the old server nodes.
Run nomad operator raft list-peers to confirm that none of the Nomad nodes from Workspace A is the leader and that none of them is a voting member. Once you have confirmed that the Nomad instances deployed from Workspace A are non-voting, run terraform destroy to remove the old instances.
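The per-workspace Terraform configuration in the pipeline above could look roughly like the sketch below. This is not the configuration used in this guide: the variable names, instance type, and tag values are hypothetical, and a real deployment would also need networking, IAM, and Nomad bootstrap configuration. The point is only that each workspace receives the ID of the image Packer built and creates its own set of servers, so that destroying the old workspace removes only the old servers.

```hcl
# Hypothetical Terraform configuration applied once per workspace
# (for example, Workspace A for 1.7 and Workspace B for 1.8).
variable "nomad_ami_id" {
    description = "ID of the Nomad Enterprise AMI built by Packer"
    type        = string
}

variable "subnet_ids" {
    description = "One subnet per availability zone for the Nomad servers"
    type        = list(string)
}

resource "aws_instance" "nomad_server" {
    # One server per subnet, so servers are spread across zones.
    count         = length(var.subnet_ids)
    ami           = var.nomad_ami_id
    instance_type = "m5.large" # illustrative size
    subnet_id     = var.subnet_ids[count.index]

    tags = {
        Name = "nomad-server-${count.index}"
    }
}
```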
Autopilot recommended configuration options
Cleanup dead servers
The cleanup_dead_servers parameter ensures that dead servers are automatically removed from the cluster. This is crucial for maintaining a healthy cluster state.
autopilot {
    cleanup_dead_servers = true
}
Tip
Always set this to `true` to prevent dead servers from causing issues in the cluster.
Last contact threshold
The last_contact_threshold parameter defines the maximum allowed time since the last contact with a server before Nomad Enterprise marks it unhealthy.
autopilot {
    last_contact_threshold = "200ms"
}
Tip
Set this to a low value (for example `200ms`) to detect and handle network partitions or server failures.
Server stabilization time
The server_stabilization_time parameter defines the time a server must be stable before Nomad Enterprise marks it as healthy.
autopilot {
    server_stabilization_time = "10s"
}
Tip
Set this to a reasonable value (for example `10s`).
Max trailing logs
The max_trailing_logs parameter specifies the maximum number of log entries a server can trail behind the leader before it is considered unhealthy.
autopilot {
    max_trailing_logs = 250
}
Tip
Start with 250 and adjust this value based on your cluster's workload and log generation rate. A higher value (for example, 1000) may be necessary for high-throughput environments. Monitor the cluster's performance, and if servers are frequently marked unhealthy because they exceed the max trailing logs, consider increasing the value.
Server stabilization time
The server_stabilization_time parameter specifies the minimum duration a server must be stable before Nomad Enterprise adds it to the cluster. The appropriate value depends on your server hardware and startup time.
autopilot {
    server_stabilization_time = "10s"
}
Tip
Use a lower value such as 10s for servers with fast startup times and stable hardware.
For servers with slower startup times or if you want to provide more time for servers to stabilize before joining the cluster, use a higher value such as 30s.
Observe the cluster's behavior during server additions and adjust if necessary.
Enable redundancy zones
The enable_redundancy_zones parameter allows you to enable redundancy zones, which improves the fault tolerance of your Nomad cluster by distributing voting servers across distinct failure domains. We recommend naming the redundancy_zone in your server configuration the same as the underlying availability zone.
autopilot {
    enable_redundancy_zones = true
}
Tip
Enable this feature to improve the resilience of your cluster by distributing servers across different failure domains. When enabling redundancy zones, ensure that you have sufficient servers distributed across multiple zones to maintain quorum and avoid data loss.
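To make the naming recommendation above concrete, here is a minimal sketch of a server's agent configuration with a redundancy zone assigned. The zone name us-east-1a is an illustrative assumption; each server would use the availability zone it actually runs in.

```hcl
# Sketch of one server's agent configuration (zone name is illustrative).
server {
    enabled = true

    # Named after the underlying availability zone, as recommended above.
    redundancy_zone = "us-east-1a"
}

autopilot {
    enable_redundancy_zones = true
}
```

Servers in other zones would carry the same autopilot block and differ only in their redundancy_zone value.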
Disable upgrade migration
Set the disable_upgrade_migration parameter to false to allow automatic upgrades.
autopilot {
    disable_upgrade_migration = false
}
Disable upgrade migration and enable custom upgrades
The disable_upgrade_migration and enable_custom_upgrades parameters together control how Autopilot handles upgrades.
Tip
Set both to `false` unless you have specific upgrade requirements.
By default, Nomad Enterprise enables Autopilot's upgrade migration strategy and disables custom upgrades.
This allows Autopilot to manage the upgrade process automatically. If you have specific upgrade requirements, or want to control the upgrade process manually (for example, to roll out configuration changes without a newer Nomad version), you can set disable_upgrade_migration to true and enable_custom_upgrades to true.
This allows you to implement your own upgrade logic.
For most cases, we recommend using Autopilot's default upgrade migration strategy for a seamless, automated upgrade experience.
If you choose to enable custom upgrades, set upgrade_version according to your own versioning scheme so that you can increment the version tag on future images.
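As a hedged sketch of the custom upgrade settings described above, the fragment below tags a server image with its own version. The value "1.0.1" is a made-up example of a versioning scheme, and only the two settings related to version tagging are shown; set disable_upgrade_migration as described above for your case.

```hcl
# Sketch of custom upgrade versioning (version value is illustrative).
server {
    enabled = true

    # Increment this tag in each new image, even when the Nomad version
    # itself does not change.
    upgrade_version = "1.0.1"
}

autopilot {
    enable_custom_upgrades = true
}
```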
Summary
Note
Optimal settings may vary based on your specific environment, workload, and requirements. Monitor your Nomad Enterprise cluster's performance, stability, and behavior, and make adjustments as needed. Test changes to Autopilot parameters thoroughly in a staging environment, then roll them out gradually to your production cluster. This allows you to validate the settings and ensure smooth operation.
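For reference, the starting values suggested in the tips on this page can be collected into a single autopilot block, sketched below. Treat it as a baseline to tune rather than a definitive configuration, and enable redundancy zones only if your servers also set redundancy_zone.

```hcl
# Baseline assembled from the tips on this page; adjust per environment.
autopilot {
    cleanup_dead_servers      = true
    last_contact_threshold    = "200ms"
    max_trailing_logs         = 250
    server_stabilization_time = "10s"
    enable_redundancy_zones   = true
    disable_upgrade_migration = false
    enable_custom_upgrades    = false
}
```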