Overview

Autopilot is a set of features for automatic, operator-friendly management of Nomad servers. It includes automated upgrades, monitoring of the Raft cluster state, and stable server introduction, which together simplify the Nomad upgrade process.

See the Autopilot tutorial page for a basic tutorial and see below for more detail.

Prerequisites

Only if you are running Nomad 1.3 or older, set raft_protocol to 3 on all server nodes, because the default Raft protocol version in those releases was lower than 3. This also applies if you upgraded from an older version but never raised raft_protocol to 3. If you are running an older Nomad version with the parameter set to 2 or lower, plan an upgrade of Nomad Enterprise. Refer to the Raft Protocol Version Compatibility documentation for more details.
Add the autopilot block and its parameters to the server node configuration before bootstrapping the cluster. Alternatively, use the nomad operator autopilot set-config command or the /v1/operator/autopilot/configuration API endpoint. Refer to the Autopilot documentation for more information on how to enable Autopilot.
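
As a rough illustration of the two prerequisites, a server agent configuration might contain a fragment like the following. This is a minimal sketch, not a complete configuration: the bootstrap_expect value assumes a three-server cluster, and the Autopilot values are the starting points discussed later on this page.

```hcl
# Sketch of a server agent configuration fragment; values are illustrative.
server {
    enabled          = true
    bootstrap_expect = 3  # assumption: a three-server cluster

    # Raft protocol prerequisite from the list above.
    raft_protocol = 3
}

# Autopilot settings, applied before the cluster is bootstrapped.
autopilot {
    cleanup_dead_servers      = true
    last_contact_threshold    = "200ms"
    max_trailing_logs         = 250
    server_stabilization_time = "10s"
}
```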

We recommend a build pipeline that uses Packer to create golden images, Terraform to deploy the machines, and a configuration management tool such as Ansible to perform any OS-level operations after VM provisioning.

Use the pipeline to deploy Nomad VMs with the new Nomad version or to make any changes to the OS or Nomad configuration. See the Using Terraform to Configure Nomad section for more details and recommendations on the deployment process.

Example pipeline

The following example shows what the upgrade process may look like with Autopilot, Packer, and HCP Terraform.

Packer builds a new Nomad Enterprise machine image and stores it in an image registry. This example uses an AWS EC2 AMI, but the same principles apply to vSphere Content Library and Azure Compute Gallery.
HCP Terraform uses a new workspace to deploy a new set of Nomad server VMs.

Figure 1: Terraform uses a new workspace to deploy a new set of Nomad servers.

The new 1.8 nodes join the existing 1.7 cluster.

Figure 2: The new nodes are joining the cluster.

Autopilot then automatically demotes the old server nodes to non-voting members so that they no longer participate in the quorum.

Figure 3: Autopilot automatically demotes the old server nodes.

Run nomad operator raft list-peers to confirm that none of the Nomad nodes from Workspace A is the leader and that none of them is a voting member. Once you have confirmed that the Nomad instances deployed from Workspace A are non-voting, run terraform destroy to remove the old instances.
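
The deployment step of this pipeline could be expressed in Terraform roughly as follows. This is a minimal sketch assuming AWS EC2: the variable names, instance type, and tag values are hypothetical, and it assumes the Packer image already carries the Nomad configuration that the new servers need to join the existing cluster.

```hcl
# Hypothetical inputs; the names are illustrative and not taken from this guide.
variable "nomad_ami_id" {
    type        = string
    description = "ID of the Packer-built AMI that contains the target Nomad version"
}

variable "server_count" {
    type    = number
    default = 3
}

variable "subnet_ids" {
    type        = list(string)
    description = "Subnets to spread the servers across, ideally one per availability zone"
}

# One workspace owns one generation of servers, so an older generation
# can be removed later by destroying its own workspace.
resource "aws_instance" "nomad_server" {
    count         = var.server_count
    ami           = var.nomad_ami_id
    instance_type = "m5.large"  # assumption: size to your workload
    subnet_id     = var.subnet_ids[count.index % length(var.subnet_ids)]

    tags = {
        Name = "nomad-server-${count.index}"
    }
}
```

Keeping each server generation in its own workspace is what makes the final terraform destroy step safe: it only touches the instances that Autopilot has already demoted.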

Autopilot recommended configuration options

Cleanup dead servers

The cleanup_dead_servers parameter ensures that dead servers are automatically removed from the cluster. This is crucial for maintaining a healthy cluster state.

autopilot {
    cleanup_dead_servers = true
}

Tip

Always set this to `true` to prevent dead servers from causing issues in the cluster.

Last contact threshold

The last_contact_threshold parameter defines the maximum allowed time since the last contact with a server before Nomad Enterprise marks it unhealthy.

autopilot {
    last_contact_threshold = "200ms"
}

Tip

Set this to a low value (for example `200ms`) so that network partitions and server failures are detected quickly.

Server stabilization time

The server_stabilization_time parameter defines the time a server must be stable before Nomad Enterprise marks it as healthy.

autopilot {
    server_stabilization_time = "10s"
}

Tip

Set this to a value that comfortably covers normal server startup (for example `10s`).

Max trailing logs

The max_trailing_logs parameter specifies the maximum number of log entries a server can trail behind the leader before it is considered unhealthy.

autopilot {
    max_trailing_logs = 250
}

Tip

Start with 250 and adjust the value based on your cluster's workload and log generation rate. High-throughput environments may need a higher value, such as 1000. Monitor the cluster's performance, and if servers are frequently marked unhealthy for exceeding max_trailing_logs, consider increasing the value.

Server stabilization time

The server_stabilization_time parameter specifies the minimum duration a server must be stable before Nomad Enterprise adds it to the cluster. The appropriate value depends on your server hardware and startup time.

autopilot {
    server_stabilization_time = "10s"
}

Tip

Use a lower value such as 10s for servers with fast startup times and stable hardware. For servers with slower startup times or if you want to provide more time for servers to stabilize before joining the cluster, use a higher value such as 30s. Observe the cluster's behavior during server additions and adjust if necessary.

Enable redundancy zones

The enable_redundancy_zones parameter allows you to enable redundancy zones, which improves the fault tolerance of your Nomad cluster by distributing voting servers across distinct failure domains. We recommend naming the redundancy_zone in your server configuration the same as the underlying availability zone.

autopilot {
    enable_redundancy_zones = true
}

Tip

Enable this feature to improve the resilience of your cluster. Before you enable it, ensure that you have enough servers distributed across multiple zones to maintain quorum and avoid data loss.
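
To illustrate the naming recommendation, each server's configuration could tag the server with the zone it runs in. This is a sketch under assumptions: the zone name is a hypothetical AWS availability zone, and each server would carry the value for its own zone.

```hcl
# Per-server setting: the failure domain this server lives in.
server {
    enabled         = true
    redundancy_zone = "us-east-1a"  # hypothetical: matches the underlying availability zone
}

# Cluster-wide Autopilot setting that turns on zone-aware behavior.
autopilot {
    enable_redundancy_zones = true
}
```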

Disable upgrade migration

Set the disable_upgrade_migration parameter to false to allow automatic upgrades.

autopilot {
    disable_upgrade_migration = false
}

Disable upgrade migration and enable custom upgrades

Tip

Set both disable_upgrade_migration and enable_custom_upgrades to `false`, unless you have specific upgrade requirements.

By default, Nomad Enterprise enables Autopilot's upgrade migration strategy and disables custom upgrades. This allows Autopilot to manage the upgrade process automatically. If you have specific upgrade requirements or want to control the upgrade process manually, for example when rolling out configuration changes without a newer Nomad version, you can set disable_upgrade_migration to true and enable_custom_upgrades to true. This allows you to implement your own upgrade logic.

For most cases, we recommend using Autopilot's default upgrade migration strategy for a seamless, automated upgrade experience. If you choose to enable custom upgrades, ensure that upgrade_version is set to a value that follows your own versioning scheme, so that you can increment the version tag on future images.
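
If you do enable custom upgrades, the version tag might be set along these lines. This is a sketch only: the version value is a hypothetical tag that you would increment with each new image, and disable_upgrade_migration is left out here so you can set it according to the guidance above.

```hcl
# Per-server setting: the custom version tag baked into this image.
server {
    enabled         = true
    upgrade_version = "1.0.1"  # hypothetical tag; increment it on each new image
}

# Tell Autopilot to use custom upgrade versions.
autopilot {
    enable_custom_upgrades = true
}
```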

Summary

Note

Optimal settings vary with your environment, workload, and requirements. Monitor your Nomad Enterprise cluster's performance, stability, and behavior, and adjust as needed. Test changes to Autopilot parameters thoroughly in a staging environment, then roll them out gradually to your production cluster. This lets you validate the settings and keep operations smooth.
