Vault
Automate Integrated Storage management
Vault 1.2 introduced Integrated Storage, an internal storage backend, as a technical preview; the feature became generally available in Vault 1.4. With Integrated Storage, data is replicated to all nodes in the cluster using the Raft consensus protocol. However, managing the nodes in the cluster was a manual process.
Vault 1.7 introduced Autopilot to simplify and automate cluster management for Integrated Storage. Autopilot includes:
- Cluster node health checks
- Server stabilization: monitors the health of a newly added node for a period of time before promoting it to voter status, preventing disruption to the Raft quorum caused by an unstable new node
- Dead server cleanup: periodic, automatic cleanup of failed servers
Vault enables Autopilot by default upon upgrading to Vault 1.7. Server stabilization works by default, but you must explicitly enable dead server cleanup. You will learn more about dead server cleanup in the Autopilot configuration section.
Prerequisites
This tutorial requires Vault, sudo access, and additional configuration to create the cluster.
Scenario setup
To try the Autopilot feature, you will use a terminal session and a shell script to start six Vault servers. Each server listens on a different port, as described in the following list.
- Initialize and unseal vault_1 (http://127.0.0.1:8100). The script uses the root token to create a transit key that enables auto-unseal for the other Vault servers. This Vault server is not a part of the cluster.
- Initialize and unseal vault_2 (http://127.0.0.1:8200). This Vault starts as the cluster leader. The script creates an example K/V-V2 secret.
- Start vault_3 (http://127.0.0.1:8300). It automatically joins the cluster via retry_join.
- Start vault_4 (http://127.0.0.1:8400). It automatically joins the cluster via retry_join.
- Start vault_5 (http://127.0.0.1:8500). It automatically joins the cluster via retry_join.
- Start vault_6 (http://127.0.0.1:8600). It automatically joins the cluster via retry_join. (A sketch of the retry_join stanza these nodes use appears after the tip below.)
Tip
If this is your first time setting up a Vault cluster with integrated storage, you're encouraged to complete the Vault HA Cluster with Integrated Storage tutorial first, and then return to this tutorial.
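As referenced above, each joining node's generated configuration contains a retry_join stanza along these lines. This is a minimal sketch with illustrative path and node_id values; the script's generated config-vault_N.hcl files also include listener and seal stanzas not shown here.

  storage "raft" {
    path    = "./raft-vault_3"
    node_id = "vault_3"

    # Keep retrying the leader's API address until the join succeeds
    retry_join {
      leader_api_addr = "http://127.0.0.1:8200"
    }
  }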
- Retrieve the configuration by cloning the hashicorp-education/learn-vault-raft repository from GitHub.

  $ git clone https://github.com/hashicorp-education/learn-vault-raft
- Change the working directory to learn-vault-raft/raft-autopilot/local.

  $ cd learn-vault-raft/raft-autopilot/local
- Set the run_all.sh file to executable.

  $ chmod +x run_all.sh
- Execute the run_all.sh script to spin up a Vault cluster with 5 nodes.

  $ ./run_all.sh
  [vault_1] Creating configuration
    - creating /learn-vault-raft/raft-autopilot/local/config-vault_1.hcl
  [vault_2] Creating configuration
    - creating /learn-vault-raft/raft-autopilot/local/config-vault_2.hcl
    - creating /learn-vault-raft/raft-autopilot/local/raft-vault_2
  ...snip...
  [vault_5] starting Vault server @ http://127.0.0.1:8500
  Using [vault_1] root token (s.5rjDMzU5Kj9bImUVaqPpihAo) to retrieve transit key for auto-unseal
  [vault_6] starting Vault server @ http://127.0.0.1:8600
  Using [vault_1] root token (s.5rjDMzU5Kj9bImUVaqPpihAo) to retrieve transit key for auto-unseal

  You can find the server configuration files and the log files in the working directory.
- Export the VAULT_ADDR environment variable to address the vault_2 server:

  $ export VAULT_ADDR=http://127.0.0.1:8200
- Verify the cluster.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    true
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    true
  vault_6    127.0.0.1:8601    follower    true

  The vault_2 server is the leader.
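If you want to consume the peer list in a script, the Vault CLI accepts a -format=json flag. A minimal sketch, assuming jq is installed and that the JSON payload nests the peer list under .data.config.servers (the field paths are an assumption; inspect the raw JSON output first):

  # Print each node ID with its voter status (field paths assumed)
  $ vault operator raft list-peers -format=json | \
      jq -r '.data.config.servers[] | "\(.node_id) voter=\(.voter)"'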
Understand the autopilot behavior
- Print the help message for the vault operator raft autopilot command.

  $ vault operator raft autopilot -help
  This command is accessed by using one of the subcommands below.

  Subcommands:
      get-config    Returns the configuration of the autopilot subsystem under integrated storage
      set-config    Modify the configuration of the autopilot subsystem under integrated storage
      state         Displays the state of the raft cluster under integrated storage as seen by autopilot
- Display the current cluster autopilot state.

  $ vault operator raft autopilot state
  Healthy: true
  Failure Tolerance: 2
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
     vault_5
     vault_6
  Servers:
     vault_2
        Name: vault_2
        Address: 127.0.0.1:8201
        Status: leader
        Node Status: alive
        Healthy: true
        Last Contact: 0s
        Last Term: 3
        Last Index: 118
     vault_3
        Name: vault_3
        Address: 127.0.0.1:8301
        Status: voter
        Node Status: alive
        Healthy: true
        Last Contact: 1.73895338s
        Last Term: 3
        Last Index: 118
     vault_4
        Name: vault_4
        Address: 127.0.0.1:8401
        Status: voter
        Node Status: alive
        Healthy: true
        Last Contact: 4.68575147s
        Last Term: 3
        Last Index: 118
     vault_5
        Name: vault_5
        Address: 127.0.0.1:8501
        Status: voter
        Node Status: alive
        Healthy: true
        Last Contact: 2.630693989s
        Last Term: 3
        Last Index: 118
     vault_6
        Name: vault_6
        Address: 127.0.0.1:8601
        Status: voter
        Node Status: alive
        Healthy: true
        Last Contact: 579.174724ms
        Last Term: 3
        Last Index: 118
This displays the cluster health and its failure tolerance. The current leader node is vault_2, and the Failure Tolerance is 2: with five voters, quorum is (5 / 2) + 1 = 3, so the cluster can lose up to two nodes and still keep quorum. Note that the Healthy parameter value is true for all nodes in the cluster.
Refer to the deployment table for the quorum size and failure tolerance for various cluster sizes.
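The same state is available from the HTTP API (see the sys/storage/raft/autopilot API docs linked under Help and reference). A sketch, assuming you have exported the cluster's root token as VAULT_TOKEN and have jq available for pretty-printing:

  # Read the autopilot state over HTTP; VAULT_TOKEN must hold a token with access
  $ curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
      $VAULT_ADDR/v1/sys/storage/raft/autopilot/state | jq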
Stop one of the nodes
- Set the cluster.sh file to executable.

  $ chmod +x cluster.sh
- Stop vault_6.

  $ ./cluster.sh stop vault_6
  Found 1 Vault service(s) matching that name
  [vault_6] stopping

  Optional: You can verify that vault_6 is not running.

  $ ps | grep vault
  41873 ttys009    0:34.57 vault server -log-level=trace -config <path>/config-vault_1.hcl
  41919 ttys009   11:07.38 vault server -log-level=trace -config <path>/config-vault_2.hcl
  41966 ttys009    1:50.94 vault server -log-level=trace -config <path>/config-vault_3.hcl
  41982 ttys009    1:52.26 vault server -log-level=trace -config <path>/config-vault_4.hcl
  41998 ttys009    1:50.86 vault server -log-level=trace -config <path>/config-vault_5.hcl
  45834 ttys009    0:00.01 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox vault
- Check the cluster health.

  $ vault operator raft autopilot state

  Notice that the Healthy state of the cluster is false, and the Failure Tolerance is now 1.

  Healthy: false
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
     vault_5
     vault_6
  ...snip...

  The Healthy state of vault_6 is false, so you know it has failed.

  ...snip...
     vault_6
        Name: vault_6
        Address: 127.0.0.1:8601
        Status: voter
        Node Status: alive
        Healthy: false
        Last Contact: 55.577082309s
        Last Term: 3
        Last Index: 154

  Although vault_6 is no longer running, it's still a cluster member at this point.
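Because dead server cleanup is not yet enabled, a failed node like this stays in the Raft configuration until it is removed. For context, the manual alternative that dead server cleanup automates is the remove-peer command; don't run it here, since the next section lets Autopilot prune vault_6 automatically.

  # Manually remove a failed peer from the Raft configuration (for reference only)
  $ vault operator raft remove-peer vault_6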
Autopilot configuration
Check the autopilot settings to learn the default behavior.
| Parameter | Description |
|---|---|
| Cleanup Dead Servers (bool) | Specifies automatic removal of dead server nodes periodically. |
| Last Contact Threshold (string) | Limit the amount of time a server can go without leader contact before it's considered unhealthy. |
| Dead Server Last Contact Threshold (string) | Limit the amount of time a server can go without leader contact before it's considered failed. This should typically be a high duration, such as a day; if it is set too low, nodes that aren't actually dead may be removed. |
| Server Stabilization Time (string) | Minimum amount of time a server must be stable in the 'healthy' state before it's added to the cluster. |
| Min Quorum (int) | Minimum number of servers allowed in a cluster before autopilot can prune dead servers. |
| Max Trailing Logs (int) | Maximum number of log entries in the Raft log that a server can be behind its leader before it's considered unhealthy. |
| Disable Upgrade Migration (bool) | Specifies whether to perform automatic upgrade migration. |
- Check the current autopilot configuration.

  $ vault operator raft autopilot get-config
  Key                                   Value
  ---                                   -----
  Cleanup Dead Servers                  false
  Last Contact Threshold                10s
  Dead Server Last Contact Threshold    24h0m0s
  Server Stabilization Time             10s
  Min Quorum                            0
  Max Trailing Logs                     1000
  Disable Upgrade Migration             false

  The Cleanup Dead Servers parameter is false.
- Update the autopilot configuration to enable dead server cleanup. For demonstration purposes in this tutorial, set the Dead Server Last Contact Threshold to 1 minute and the Server Stabilization Time to 30 seconds. In production, the Dead Server Last Contact Threshold especially should be much larger, generally on the time scale of days or hours rather than minutes; if it is set too low, a node that isn't actually dead may be removed.

  Warning: These example values are for use in this tutorial only. Do not use them in a production environment.

  $ vault operator raft autopilot set-config \
      -dead-server-last-contact-threshold=1m \
      -server-stabilization-time=30s \
      -cleanup-dead-servers=true \
      -min-quorum=3
- Verify the updated configuration.

  $ vault operator raft autopilot get-config
  Key                                   Value
  ---                                   -----
  Cleanup Dead Servers                  true
  Last Contact Threshold                10s
  Dead Server Last Contact Threshold    1m0s
  Server Stabilization Time             30s
  Min Quorum                            3
  Max Trailing Logs                     1000
  Disable Upgrade Migration             false
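If you manage this configuration through the HTTP API instead, the equivalent write goes to the sys/storage/raft/autopilot/configuration endpoint (linked under Help and reference). A sketch, assuming VAULT_TOKEN holds the root token; the payload keys mirror the CLI flags:

  # Write the same autopilot settings over HTTP
  $ curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
      --request POST \
      --data '{"cleanup_dead_servers": true, "dead_server_last_contact_threshold": "1m", "server_stabilization_time": "30s", "min_quorum": 3}' \
      $VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration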
- Check the cluster health.

  $ vault operator raft autopilot state
  Healthy: true
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
     vault_5
  Servers:
  ...snip...

  The cluster's Healthy parameter value is back to true. Notice that vault_6 is no longer listed. The Voters parameter lists vault_2 through vault_5.
- List cluster peers to learn about the cluster's health and leader status.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    true
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    true
Add a new node to the cluster
Explore how the autopilot configuration settings influence the cluster when you add a new node.
- Add a new node (vault_7) to the cluster.

  $ ./cluster.sh setup vault_7
  [vault_7] starting Vault server @ http://127.0.0.1:8700
  Using [vault_1] root token (s.wsEIMfqTipb0mZT051TNbcYJ) to retrieve transit key for auto-unseal
- List the cluster members.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    true
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    true
  vault_7    127.0.0.1:8701    follower    false

  Notice that the vault_7 server is a non-voter. (The Voter parameter value is false.)
- Check the cluster health.

  $ vault operator raft autopilot state
  Healthy: true
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
     vault_5
  Servers:
  ...snip...
     vault_7
        Name: vault_7
        Address: 127.0.0.1:8701
        Status: non-voter
        Node Status: alive
        Healthy: true
        Last Contact: 2.580581282s
        Last Term: 3
        Last Index: 78

  The vault_7 server joins the cluster as a non-voter until the Server Stabilization Time of 30 seconds elapses.
- Wait for 30 seconds and check the cluster peers.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    true
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    true
  vault_7    127.0.0.1:8701    follower    true

  Now, the vault_7 server should be a voter. This is part of Autopilot's server stabilization mechanism.
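Rather than guessing when the stabilization window has elapsed, you can poll the peer list until the node is promoted. A minimal sketch, assuming jq and the .data.config.servers JSON layout noted earlier (jq -e exits 0 only once the voter flag is true):

  # Loop until vault_7 shows up as a voter
  $ until vault operator raft list-peers -format=json | \
      jq -e '.data.config.servers[] | select(.node_id == "vault_7") | .voter' \
      > /dev/null; do sleep 5; done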
Vault Enterprise
The explicit non-voter nodes behave the same way as before and remain non-voters as designed. If you enable dead server cleanup, it prunes failed non-voters.
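For context, on Vault Enterprise a node is introduced as an explicit non-voter by passing the -non-voter flag when joining; a sketch, assuming an Enterprise binary and the leader's API address (the flag is Enterprise-only):

  # Join the cluster as a permanent non-voter (Vault Enterprise)
  $ vault operator raft join -non-voter http://127.0.0.1:8200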
Configure the state change interval
By default, Autopilot picks up any state change at an interval of 10 seconds. To change the default, set the autopilot_reconcile_interval parameter inside the storage stanza in the server configuration file.
Example: The following server configuration file sets Autopilot to pick up state changes at an interval of 15 seconds.
storage "raft" {
  path = "/path/to/raft/data"
  node_id = "raft_node_1"
  # overwrite the default interval
  autopilot_reconcile_interval = "15s"
}
listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true
}
cluster_addr = "http://127.0.0.1:8201"
Clean up
The cluster.sh script provides a clean operation that removes all services,
configuration, and modifications to your local system.
Clean up your local workstation.
$ ./cluster.sh clean
Found 1 Vault service(s) matching that name
[vault_1] stopping
...snip...
Removing log file /learn-vault-raft/raft-autopilot/local/vault_5.log
Removing log file /learn-vault-raft/raft-autopilot/local/vault_6.log
Clean complete
Help and reference
- Integrated Storage Internal documentation
- Integrated Storage Concepts documentation
- Automate Upgrades with Vault Enterprise
- Commands (CLI) - operator raft autopilot
- API docs - sys/storage/raft/autopilot
- Vault HA Cluster with Integrated Storage tutorial
- Migration checklist
Vault Enterprise Replication
If you are running Vault Enterprise with replication enabled, read the Replication section in the Autopilot documentation for additional information.
The following tutorials walk through the Enterprise Replication setup: