Set up fault tolerance with Vault redundancy zones
Enterprise Only
The functionality described in this tutorial is available only in Vault Enterprise.
In this tutorial, you will configure fault resiliency for your Vault cluster using redundancy zones. Redundancy zones are a Vault autopilot feature that lets you run one voter and any number of non-voters in each defined zone.
For this tutorial, you will run one voter and one non-voter in each of three zones, for a total of six servers. If an entire availability zone is lost, its voter and non-voter are both lost, but the cluster remains available. If only the voter in a zone is lost, autopilot automatically promotes that zone's non-voter to a voter, putting the hot standby server into service quickly.
Prerequisites
This tutorial requires Vault Enterprise, sudo access, and additional configuration to create the cluster. You will also need a text editor, the curl executable to test the API endpoints, and optionally the jq command to format the curl output.
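If you want to confirm that curl and jq work against a running Vault server before you begin, a quick check such as the one below can help. This is a minimal sketch that assumes VAULT_ADDR points at one of the servers you will start later in this tutorial (for example, http://127.0.0.1:8200); the sys/health endpoint does not require a token.

# Query the health endpoint and pretty-print the JSON response with jq
$ curl --silent $VAULT_ADDR/v1/sys/health | jq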
Configure Vault for redundancy zones
To demonstrate the autopilot redundancy zones feature, you will start a cluster with three nodes, each defined in a separate zone. Then, you will add an additional node to each zone.
You will run a script that starts the cluster with the following servers:

- vault_1 (http://127.0.0.1:8100) is initialized and unsealed. Its root token is used to create a transit key that enables auto-unseal for the other Vault servers. This server is not a part of the cluster.
- vault_2 (http://127.0.0.1:8200) is initialized and unsealed. This server starts as the cluster leader and is defined in zone-a.
- vault_3 (http://127.0.0.1:8300) is started, automatically joins the cluster via retry_join, and is defined in zone-b.
- vault_4 (http://127.0.0.1:8400) is started, automatically joins the cluster via retry_join, and is defined in zone-c.
Disclaimer
For demonstration purposes, the script runs all Vault servers locally. In practice, each redundancy zone would typically map to an availability zone or a similar fault domain.
Set up a cluster
- Retrieve the configuration by cloning the hashicorp/learn-vault-raft repository from GitHub.

  $ git clone https://github.com/hashicorp-education/learn-vault-raft

  This repository contains supporting content for all of the Vault learn tutorials. The content specific to this tutorial can be found within a sub-directory.
- Change the working directory to learn-vault-raft/raft-redundancy-zones/local.

  $ cd learn-vault-raft/raft-redundancy-zones/local
- Set the setup_1.sh file to executable.

  $ chmod +x setup_1.sh
- Execute the setup_1.sh script to spin up a Vault cluster.

  $ ./setup_1.sh
  [vault_1] Creating configuration
    - creating /git/learn-vault-raft/raft-autopilot/local/config-vault_1.hcl
  [vault_2] Creating configuration
    - creating /git/learn-vault-raft/raft-autopilot/local/config-vault_2.hcl
    - creating /git/learn-vault-raft/raft-autopilot/local/raft-vault_2
  ...snip...
  [vault_3] starting Vault server @ http://127.0.0.1:8300
  Using [vault_1] root token (hvs.tqKc9An04pQY5H1uysw02Xn6) to retrieve transit key for auto-unseal
  [vault_4] starting Vault server @ http://127.0.0.1:8400
  Using [vault_1] root token (hvs.tqKc9An04pQY5H1uysw02Xn6) to retrieve transit key for auto-unseal

  You can find the server configuration files and the log files in the working directory.
- Use your preferred text editor and open the configuration files to examine the generated server configuration for vault_2, vault_3 and vault_4.

  Notice that the autopilot_redundancy_zone parameter is set to zone-a inside the storage stanza. This is an optional string that specifies the server's redundancy zone. It is reported to autopilot and is used to enhance scaling and resiliency.

  config-vault_2.hcl

  storage "raft" {
    path = "/learn-vault-raft/raft-redundancy-zones/local/raft-vault_2/"
    node_id = "vault_2"
    autopilot_redundancy_zone = "zone-a"
  }

  listener "tcp" {
    address = "127.0.0.1:8200"
    cluster_address = "127.0.0.1:8201"
    tls_disable = true
  }

  seal "transit" {
    address = "http://127.0.0.1:8100"
    # token is read from VAULT_TOKEN env
    # token = ""
    disable_renewal = "false"
    key_name = "unseal_key"
    mount_path = "transit/"
  }

  disable_mlock = true
  cluster_addr = "http://127.0.0.1:8201"
- Export an environment variable for the vault CLI to address the vault_2 server.

  $ export VAULT_ADDR=http://127.0.0.1:8200
- List the peers.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    false
  vault_4    127.0.0.1:8401    follower    false
- Verify the cluster members. You see one node per redundancy zone.

  $ vault operator members
  Host Name       API Address              ...    Redundancy Zone    Last Echo
  ---------       -----------              ...    ---------------    ---------
  C02DVAMJML85    http://127.0.0.1:8200    ...    zone-a             n/a
  C02DVAMJML85    http://127.0.0.1:8300    ...    zone-b             2022-06-14T15:40:04-07:00
  C02DVAMJML85    http://127.0.0.1:8400    ...    zone-c             2022-06-14T15:40:01-07:00
- View the autopilot's redundancy zones settings.

  $ vault operator raft autopilot state

  Output:

  The overall failure tolerance is 1; however, the zone-level failure tolerance is 0.

  Healthy: true
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
  Optimistic Failure Tolerance: 1
  ...snip...
  Redundancy Zones:
     zone-a
        Servers: vault_2
        Voters: vault_2
        Failure Tolerance: 0
     zone-b
        Servers: vault_3
        Voters: vault_3
        Failure Tolerance: 0
     zone-c
        Servers: vault_4
        Voters: vault_4
        Failure Tolerance: 0
  ...snip...
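If you prefer to read the same autopilot state over the HTTP API, you can query the autopilot state endpoint with curl and filter the response with jq. This is a minimal sketch; it assumes VAULT_TOKEN holds a valid token for the cluster (for example, the root token printed by setup_1.sh) and that the JSON field names mirror the CLI output shown above.

# Read the autopilot state over the API and show only the redundancy zone summary
$ curl --silent --header "X-Vault-Token: $VAULT_TOKEN" \
    $VAULT_ADDR/v1/sys/storage/raft/autopilot/state | jq '.data.redundancy_zones'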
Add an additional node to each zone
- Use your preferred text editor and open the configuration files to examine the generated server configuration for vault_5, vault_6 and vault_7.

  The autopilot_redundancy_zone parameter for vault_5 is set to zone-a inside the storage stanza, the same zone as vault_2.

  config-vault_5.hcl

  storage "raft" {
    path = "/learn-vault-raft/raft-redundancy-zones/local/raft-vault_5/"
    node_id = "vault_5"
    autopilot_redundancy_zone = "zone-a"

    retry_join {
      leader_api_addr = "http://127.0.0.1:8200"
    }
  }
  ...snip...
- Set the setup_2.sh file to executable.

  $ chmod +x setup_2.sh
- Execute the setup_2.sh script to add three additional nodes to the cluster.

  $ ./setup_2.sh
  [vault_5] starting Vault server @ http://127.0.0.1:8500
  Using [vault_1] root token (hvs.Ks9HlRsyL3CbaetmHL6AJEHi) to retrieve transit key for auto-unseal
  [vault_6] starting Vault server @ http://127.0.0.1:8600
  Using [vault_1] root token (hvs.Ks9HlRsyL3CbaetmHL6AJEHi) to retrieve transit key for auto-unseal
  [vault_7] starting Vault server @ http://127.0.0.1:8700
  Using [vault_1] root token (hvs.Ks9HlRsyL3CbaetmHL6AJEHi) to retrieve transit key for auto-unseal
- Check the redundancy zone membership as the script executes (a jq-based variant of this check appears after this list).

  $ vault operator raft autopilot state

  Output:

  Now, each redundancy zone has a failure tolerance of 1, and the cluster-level optimistic failure tolerance is 4 since there are six nodes in the cluster.

  Healthy: true
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_3
     vault_4
  Optimistic Failure Tolerance: 4
  ...snip...
  Redundancy Zones:
     zone-a
        Servers: vault_2, vault_5
        Voters: vault_2
        Failure Tolerance: 1
     zone-b
        Servers: vault_3, vault_6
        Voters: vault_3
        Failure Tolerance: 1
     zone-c
        Servers: vault_4, vault_7
        Voters: vault_4
        Failure Tolerance: 1
  ...snip...
- List the peers.

  $ vault operator raft list-peers
  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    true
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    false
  vault_6    127.0.0.1:8601    follower    false
  vault_7    127.0.0.1:8701    follower    false

  The vault_5, vault_6 and vault_7 nodes have joined the cluster as non-voters.

  Note: There is only one voter node per zone.
- Verify the cluster members.

  $ vault operator members
  Host Name       API Address              ...    Redundancy Zone    Last Echo
  ---------       -----------              ...    ---------------    ---------
  C02DVAMJML85    http://127.0.0.1:8200    ...    zone-a             n/a
  C02DVAMJML85    http://127.0.0.1:8300    ...    zone-b             2022-06-14T15:54:04-07:00
  C02DVAMJML85    http://127.0.0.1:8400    ...    zone-c             2022-06-14T15:54:06-07:00
  C02DVAMJML85    http://127.0.0.1:8500    ...    zone-a             2022-06-14T15:54:06-07:00
  C02DVAMJML85    http://127.0.0.1:8600    ...    zone-b             2022-06-14T15:54:08-07:00
  C02DVAMJML85    http://127.0.0.1:8700    ...    zone-c             2022-06-14T15:54:05-07:00
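If you want a compact, per-zone summary instead of the full text output, most vault commands accept -format=json, which you can post-process with jq. The snippet below is a sketch under the assumption that the JSON output of the autopilot state command nests the per-zone details under a redundancy_zones key, mirroring the text output above.

# Print each redundancy zone with its servers, voters, and failure tolerance
$ vault operator raft autopilot state -format=json | jq '.redundancy_zones'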
Test fault tolerance
Stop vault_3 to simulate a server failure and see how autopilot behaves.
- Stop the vault_3 node.

  $ ./cluster.sh stop vault_3
  Found 1 Vault service(s) matching that name
  [vault_3] stopping
- List the peers to see how autopilot responds to the failure.

  $ vault operator raft list-peers

  Wait until Vault promotes vault_6 to become a voter in the absence of vault_3 (a small polling sketch that waits for this promotion appears after this list).

  Output:

  Since vault_3 is not running, vault_6 became the voter node.

  Node       Address           State       Voter
  ----       -------           -----       -----
  vault_2    127.0.0.1:8201    leader      true
  vault_3    127.0.0.1:8301    follower    false
  vault_4    127.0.0.1:8401    follower    true
  vault_5    127.0.0.1:8501    follower    false
  vault_6    127.0.0.1:8601    follower    true
  vault_7    127.0.0.1:8701    follower    false
- Check the redundancy zone membership again.

  $ vault operator raft autopilot state

  Output:

  The cluster's optimistic failure tolerance is down to 3, and zone-b now has zero failure tolerance.

  Healthy: false
  Failure Tolerance: 1
  Leader: vault_2
  Voters:
     vault_2
     vault_4
     vault_6
  Optimistic Failure Tolerance: 3
  Servers:
     vault_2
        Name: vault_2
        Address: 127.0.0.1:8201
        Status: leader
        Node Status: alive
        Healthy: true
        Last Contact: 0s
        Last Term: 3
        Last Index: 356
        Version: 1.11.0
        Upgrade Version: 1.11.0
        Redundancy Zone: zone-a
        Node Type: zone-voter
     vault_3
        Name: vault_3
        Address: 127.0.0.1:8301
        Status: non-voter
        Node Status: alive
        Healthy: false
        Last Contact: 3m56.800681779s
        Last Term: 3
        Last Index: 256
        Version: 1.11.0
        Upgrade Version: 1.11.0
        Redundancy Zone: zone-b
        Node Type: zone-standby
  ...snip...
  Redundancy Zones:
     zone-a
        Servers: vault_2, vault_5
        Voters: vault_2
        Failure Tolerance: 1
     zone-b
        Servers: vault_3, vault_6
        Voters: vault_6
        Failure Tolerance: 0
     zone-c
        Servers: vault_4, vault_7
        Voters: vault_4
        Failure Tolerance: 1
  ...snip...
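If you would rather wait for the promotion than re-run the commands by hand, a small polling loop works. This is a minimal sketch, assuming VAULT_ADDR still points at vault_2 and that the list-peers output keeps the Node and Voter columns shown above; it simply re-checks the peer list until vault_6 is reported as a voter.

# Poll the peer list every five seconds until vault_6 becomes a voter
$ until vault operator raft list-peers | grep vault_6 | grep -q true; do
    echo "waiting for autopilot to promote vault_6..."
    sleep 5
  done
$ echo "vault_6 is now a voter"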
Note
If you want to see the cluster behavior when vault_3 becomes operational again, run ./cluster.sh start vault_3 to start the node. This should bring the cluster back to a healthy state. Alternatively, you can stop other nodes with ./cluster.sh stop <node_name> and watch how autopilot behaves.
Post-test discussion
Although you stopped the vault_3 node to mimic a server failure, it is still listed as a peer. In reality, a node failure could be temporary, and the node may become operational again. Therefore, a failed node remains a cluster member unless you remove it.
If the node is not recoverable, you can do one of the following:
Option 1: Manually remove nodes
Run the remove-peer command to remove the failed server.
$ vault operator raft remove-peer vault_3
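After removing the peer, you can list the peers again to confirm that vault_3 is no longer a cluster member.

# Confirm that vault_3 no longer appears in the peer list
$ vault operator raft list-peers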
Option 2: Enable dead server cleanup
Configure dead server cleanup to automatically remove nodes deemed unhealthy. By default, this feature is disabled.
Example: The following command enables dead server cleanup and sets the dead server last contact threshold to 300 seconds (the default is 24 hours). When a node remains unhealthy for longer than that threshold, Vault removes it from the cluster.
$ vault operator raft autopilot set-config \
    -dead-server-last-contact-threshold=300 \
    -server-stabilization-time=10 \
    -cleanup-dead-servers=true \
    -min-quorum=3
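To confirm that the new settings took effect, you can read the autopilot configuration back.

# Read the current autopilot configuration
$ vault operator raft autopilot get-config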
See the Integrated Storage Autopilot tutorial to learn more.
Clean up
The cluster.sh script provides a clean operation that removes all services,
configuration, and modifications to your local system.
Clean up your local workstation.
$ ./cluster.sh clean
Found 1 Vault service(s) matching that name
[vault_1] stopping
...snip...
Removing log file /git/learn-vault-raft/raft-autopilot/local/vault_5.log
Removing log file /git/learn-vault-raft/raft-autopilot/local/vault_6.log
Clean complete
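If you want to double-check that the script stopped every server, you can look for leftover Vault processes. This assumes pgrep is available on your system.

# List any remaining Vault processes; prints nothing if the cleanup succeeded
$ pgrep -l vault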
Help and Reference
For additional information, refer to the following tutorials and documentation.
