From a045575c7e5fb181d008e800aeabc51120982c50 Mon Sep 17 00:00:00 2001
From: Krastin Krastev
Date: Tue, 2 Dec 2025 17:36:46 +0100
Subject: [PATCH 1/6] docs/manage/scale/autopilot v1.22.x update

---
 .../content/docs/manage/scale/autopilot.mdx   | 177 ++++--------------
 .../content/docs/manage/scale/index.mdx       |   2 +-
 2 files changed, 34 insertions(+), 145 deletions(-)

diff --git a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx
index da8cb1e113..fc3321b50c 100644
--- a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx
+++ b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx
@@ -1,51 +1,26 @@
 ---
 layout: docs
-page_title: Consul autopilot
+page_title: Consul Autopilot
 description: >-
   Use Autopilot features to monitor the Raft cluster, introduce stable servers, and clean up dead servers.
 ---
 
-# Consul autopilot
+# Consul Autopilot
 
-This page describes Consul autopilot, which supports automatic, operator-friendly management of Consul
-servers. It includes cleanup of dead servers, monitoring the state of the Raft
-cluster, and stable server introduction.
+This page describes Consul Autopilot, which supports automatic, operator-friendly management of Consul servers. Autopilot helps maintain the health and stability of the Consul server cluster by monitoring server health, introducing stable servers, and cleaning up dead servers. Furthermore, Consul Enterprise customers can leverage two additional Autopilot features, namely redundancy zones and automated upgrades to enhance datacenter resiliency and simplify operations.
 
-To use autopilot features (with the exception of dead server cleanup), the
-[`raft_protocol`](/consul/docs/reference/agent/configuration-file/raft#raft_protocol)
-setting in the Consul agent configuration must be set to 3 or higher on all
-servers. In Consul `0.8` this setting defaults to 2; in Consul `1.0` it will
-default to 3. For more information, check the [Version Upgrade
-section](/consul/docs/upgrade/version-specific) on Raft protocol
-versions in Consul `1.0`.
+## Overview
 
-In this tutorial you will learn how Consul tracks the stability of servers, how
-to tune those conditions, and get some details on the other autopilot's features.
+Autopilot includes the following features:
+- [Server health checking](#server-health-checking)
+- [Server stabilization time](#server-stabilization-time)
+- [Dead server cleanup](#dead-server-cleanup)
+- [Redundancy zones (only available in Consul Enterprise)](#redundancy-zones)
+- [Automated upgrades (only available in Consul Enterprise)](#automated-upgrades)
 
-- Server Stabilization
-- Dead server cleanup
-- Redundancy zones (only available in Consul Enterprise)
-- Automated upgrades (only available in Consul Enterprise)
+### Default configuration
 
-Note, in this tutorial we are using examples from a Consul `1.7` datacenter, we
-are starting with Autopilot enabled by default.
-
-## Default configuration
-
-The configuration of Autopilot is loaded by the leader from the agent's
-[autopilot settings](/consul/docs/reference/agent/configuration-file/general#autopilot)
-when initially bootstrapping the datacenter. Since autopilot and its features
-are already enabled, you only need to update the configuration to disable them.
-
-All Consul servers should have Autopilot and its features either enabled or
-disabled to ensure consistency across servers in case of a failure.
-Additionally, Autopilot must be enabled to use any of the features, but the
-features themselves can be configured independently. Meaning you can enable or
-disable any of the features separately, at any time.
-
-You can check the default values using the `consul operator` CLI command or
-using the [`/v1/operator/autopilot`
-endpoint](/consul/api-docs/operator/autopilot)
+You can check the default Autopilot values using the `consul operator` CLI command or using the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot)
@@ -90,36 +65,20 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration
-### Autopilot and Consul snapshots
+Changes to the Autopilot configuration are persisted in the Raft database maintained by the Consul servers. This means that Autopilott configuration will be included in the Consul snapshot data.
 
-Changes to the autopilot configuration are persisted in the Raft database
-maintained by the Consul servers. This means that autopilot configuration will
-be included in the Consul snapshot data. Any snapshot taken prior to autopilot
-configuration changes will contain the old configuration, and should be
-considered unsafe to restore since they will remove the change and cause
-unpredictable behaviors for the automations that might rely on the new
-configuration.
+## Workflow
 
-We recommend that you take a snapshot after any changes to the autopilot
-configuration, and consider that as the last safe point in time to roll-back in
-case a restore is needed.
+### Server health checking
 
-## Server health checking
-
-An internal health check runs on the leader to track the stability of servers.
-
-A server is considered healthy if all of the following conditions are true.
+An internal health check runs on the leader to track the stability of servers. A server is considered healthy if all of the following conditions are true.
 
 - It has a SerfHealth status of 'Alive'.
-- The time since its last contact with the current leader is below
-  `LastContactThreshold` (that by default is `200ms`).
+- The time since its last contact with the current leader is below `LastContactThreshold` (that by default is `200ms`).
 - Its latest Raft term matches the leader's term.
-- The number of Raft log entries it trails the leader by does not exceed
-  `MaxTrailingLogs` (that by default is `250`).
+- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs` (that by default is `250`).
 
-The status of these health checks can be viewed through the
-`/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field
-indicating the overall status of the datacenter:
+The status of these health checks can be viewed through the `/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field indicating the overall status of the datacenter:
 
 ```shell-session
 $ curl localhost:8500/v1/operator/autopilot/health | jq .
@@ -172,21 +131,12 @@ $ curl localhost:8500/v1/operator/autopilot/health | jq .
 
 ## Server stabilization time
 
-When a new server is added to the datacenter, there is a waiting period where it
-must be healthy and stable for a certain amount of time before being promoted to
-a full, voting member. This is defined by the `ServerStabilizationTime`
-autopilot's parameter and by default is 10 seconds.
+When a new server is added to the datacenter, there is a waiting period where it must be healthy and stable for a certain amount of time before being promoted to a full, voting member. This is defined by the `ServerStabilizationTime`, Autopilot's parameter, and by default is 10 seconds.
 
-In case your configuration require a different amount of time for the node to
-get ready, for example in case you have some extra VM checks at startup that
-might affect node resource availability, you can tune the parameter and assign
-it a different duration.
+In case your configuration requires a different amount of time, you can tune the parameter and assign it a different duration.
 
 ```shell-session
 $ consul operator autopilot set-config -server-stabilization-time=15s
-```
-
-```plaintext hideClipboard
 Configuration updated!
 ```
 
@@ -194,9 +144,6 @@ Use the `get-config` command to check the configuration.
 
 ```shell-session
 $ consul operator autopilot get-config
-```
-
-```plaintext hideClipboard
 CleanupDeadServers = true
 LastContactThreshold = 200ms
 MaxTrailingLogs = 250
@@ -207,92 +154,34 @@ DisableUpgradeMigration = false
 UpgradeVersionTag = ""
 ```
 
-## Dead server cleanup
-
-If autopilot is disabled, it will take 72 hours for dead servers to be
-automatically reaped or an operator must write a script to `consul force-leave`.
-If another server failure occurred it could jeopardize the quorum, even if the
-failed Consul server had been automatically replaced. Autopilot helps prevent
-these kinds of outages by quickly removing failed servers as soon as a
-replacement Consul server comes online. When servers are removed by the cleanup
-process they will enter the "left" state.
-
-With Autopilot's dead server cleanup enabled, dead servers will periodically be
-cleaned up and removed from the Raft peer set to prevent them from interfering
-with the quorum size and leader elections. The cleanup process will also be
-automatically triggered whenever a new server is successfully added to the
-datacenter.
-
-We suggest leaving the feature enabled to avoid introducing manual steps in
-the Consul management to make sure the faulty nodes are not remaining in the
-Raft pool for too long without the need for manual pruning. In test scenarios or
-in environments where you want to delegate the faulty node pruning to an
-external tool or system you can disable the dead server cleanup feature using
-the `consul operator` command.
-
-```shell-session
-$ consul operator autopilot set-config -cleanup-dead-servers=false
-```
-
-```plaintext hideClipboard
-Configuration updated!
-```
+### Dead server cleanup
 
-Use the `get-config` command to check the configuration.
+If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped, or an operator must manually issue the `consul force-leave ` command. If another server failure occurred it could jeopardize the quorum, even if the failed Consul server had been automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process they will enter the "left" state.
 
-```shell-session
-$ consul operator autopilot get-config
-```
+With Autopilot's dead server cleanup enabled, dead servers will periodically be cleaned up and removed from the Raft peer set to prevent them from interfering with the quorum size and leader elections. The cleanup process will also be automatically triggered whenever a new server is successfully added to the datacenter.
 
-```plaintext hideClipboard
-CleanupDeadServers = false
-LastContactThreshold = 200ms
-MaxTrailingLogs = 250
-MinQuorum = 0
-ServerStabilizationTime = 10s
-RedundancyZoneTag = ""
-DisableUpgradeMigration = false
-UpgradeVersionTag = ""
-```
+We suggest leaving the feature enabled to avoid faulty nodes remaining in the Raft pool for too long without the need for manual pruning. In test scenarios or in environments you can disable the faulty node pruning by using the `consul operator autopilot set-config -cleanup-dead-servers=false` command.
 
 ## Enterprise features
 
-Consul Enterprise customer can take advantage of two more features of autopilot
-to further strengthen and automate Consul operations.
+To further strengthen and automate Consul operations, there are two more Autopilot features in Consul Enterprise that customers can take advantage of to improve their datacenter resiliency and to simplify operations.
 
 ### Redundancy zones
 
-Consul’s redundancy zones provide high availability in the case of server
-failure through the Enterprise feature of autopilot. Autopilot allows you to add
-read replicas to your datacenter that will be promoted to the "voting" status in
-case of voting server failure.
+Consul’s redundancy zones provide high availability in the case of server failure through the Enterprise feature of Autopilot. Autopilot allows you to add read replicas to your datacenter that will be promoted to the "voting" status in case of voting server failure.
 
-You can use this tutorial to implement isolated failure domains such as AWS
-Availability Zones (AZ) to obtain redundancy within an AZ without having to
-sustain the overhead of a large quorum.
+You can utilise redundancy zones to implement isolated failure domains such as AWS Availability Zones (AZ), and therefore to obtain redundancy within an AZ without having to sustain the overhead of a large quorum.
 
-Check [provide fault tolerance with redundancy zones](/consul/tutorials/operate-consul/redundancy-zones)
-to learn more on the functionality.
+Check [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone) to learn more on the functionality.
 
 ### Automated upgrades
 
-Consul’s automatic upgrades provide a simplified way to upgrade existing Consul
-datacenter. This functionally is provided through the Enterprise feature of
-autopilot. Autopilot allows you to add new servers directly to the datacenter
-and waits until you have enough servers running the new version to perform a
-leadership change and demote the old servers as "non-voters".
+Consul’s automatic upgrades provide a simplified way to upgrade existing Consul datacenter. This functionally is provided through the Enterprise feature of Autopilot. Autopilot allows you to add new servers directly to the datacenter and waits until you have enough servers running the new version to perform a leadership change and demote the old servers as "non-voters".
 
-Check [automate upgrades with Consul Enterprise](/consul/tutorials/datacenter-operations/upgrade-automation)
-to learn more on the functionality.
+Check [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated) to learn more on the functionality.
 
 ## Next steps
 
-In this tutorial you got an overview of the autopilot features and got examples
-on how and when tune the default values.
+To read further about [Autopilot](/consul/docs/manage/scale/autopilot) functionality, check the [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone) pages.
 
-To learn more about the Autopilot settings you did not configure in this tutorial,
-[last_contact_threshold](/consul/docs/reference/agent/configuration-file/general#last_contact_threshold)
-and
-[max_trailing_logs](/consul/docs/reference/agent/configuration-file/general#max_trailing_logs),
-either read the agent configuration documentation or use the help flag with the
-operator autopilot `consul operator autopilot set-config -h`.
+To learn more about operational Autopilot settings regarding stability, check the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation.
\ No newline at end of file
diff --git a/content/consul/v1.22.x/content/docs/manage/scale/index.mdx b/content/consul/v1.22.x/content/docs/manage/scale/index.mdx
index 82ecc5aa59..47d66a8680 100644
--- a/content/consul/v1.22.x/content/docs/manage/scale/index.mdx
+++ b/content/consul/v1.22.x/content/docs/manage/scale/index.mdx
@@ -324,7 +324,7 @@ Enterprise customers might also rely on [automated backups](/consul/docs/manage/
 
 We do not recommend automated scaling of Consul server nodes based on load or usage unless it is coupled by some logic that prevents the cluster from losing quorum.
 
-One way to improve your datacenter resiliency and to leverage automatic scaling is to use [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone).
+One way to improve your datacenter resiliency and to leverage automatic scaling is to use [autopilot](/consul/docs/manage/scale/autopilot), [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone).
 
 These features provide support for read-heavy workload periods without risking the stability of the overall cluster.
 
From f12b1c024e39702c3d118258a28261ad75cfb830 Mon Sep 17 00:00:00 2001
From: Krastin Krastev
Date: Tue, 2 Dec 2025 17:40:00 +0100
Subject: [PATCH 2/6] docs/manage/scale/autopilot v1.21.x update

---
 .../content/docs/manage/scale/autopilot.mdx   | 177 ++++--------------
 .../content/docs/manage/scale/index.mdx       |   5 +-
 2 files changed, 34 insertions(+), 148 deletions(-)

diff --git a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx
index da8cb1e113..fc3321b50c 100644
--- a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx
+++ b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx
@@ -1,51 +1,26 @@
 ---
 layout: docs
-page_title: Consul autopilot
+page_title: Consul Autopilot
 description: >-
   Use Autopilot features to monitor the Raft cluster, introduce stable servers, and clean up dead servers.
 ---
 
-# Consul autopilot
+# Consul Autopilot
 
-This page describes Consul autopilot, which supports automatic, operator-friendly management of Consul
-servers. It includes cleanup of dead servers, monitoring the state of the Raft
-cluster, and stable server introduction.
+This page describes Consul Autopilot, which supports automatic, operator-friendly management of Consul servers. Autopilot helps maintain the health and stability of the Consul server cluster by monitoring server health, introducing stable servers, and cleaning up dead servers. Furthermore, Consul Enterprise customers can leverage two additional Autopilot features, namely redundancy zones and automated upgrades to enhance datacenter resiliency and simplify operations.
 
-To use autopilot features (with the exception of dead server cleanup), the
-[`raft_protocol`](/consul/docs/reference/agent/configuration-file/raft#raft_protocol)
-setting in the Consul agent configuration must be set to 3 or higher on all
-servers. In Consul `0.8` this setting defaults to 2; in Consul `1.0` it will
-default to 3. For more information, check the [Version Upgrade
-section](/consul/docs/upgrade/version-specific) on Raft protocol
-versions in Consul `1.0`.
+## Overview
 
-In this tutorial you will learn how Consul tracks the stability of servers, how
-to tune those conditions, and get some details on the other autopilot's features.
+Autopilot includes the following features:
+- [Server health checking](#server-health-checking)
+- [Server stabilization time](#server-stabilization-time)
+- [Dead server cleanup](#dead-server-cleanup)
+- [Redundancy zones (only available in Consul Enterprise)](#redundancy-zones)
+- [Automated upgrades (only available in Consul Enterprise)](#automated-upgrades)
 
-- Server Stabilization
-- Dead server cleanup
-- Redundancy zones (only available in Consul Enterprise)
-- Automated upgrades (only available in Consul Enterprise)
+### Default configuration
 
-Note, in this tutorial we are using examples from a Consul `1.7` datacenter, we
-are starting with Autopilot enabled by default.
-
-## Default configuration
-
-The configuration of Autopilot is loaded by the leader from the agent's
-[autopilot settings](/consul/docs/reference/agent/configuration-file/general#autopilot)
-when initially bootstrapping the datacenter. Since autopilot and its features
-are already enabled, you only need to update the configuration to disable them.
-
-All Consul servers should have Autopilot and its features either enabled or
-disabled to ensure consistency across servers in case of a failure.
-Additionally, Autopilot must be enabled to use any of the features, but the
-features themselves can be configured independently. Meaning you can enable or
-disable any of the features separately, at any time.
-
-You can check the default values using the `consul operator` CLI command or
-using the [`/v1/operator/autopilot`
-endpoint](/consul/api-docs/operator/autopilot)
+You can check the default Autopilot values using the `consul operator` CLI command or using the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot)
@@ -90,36 +65,20 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration
-### Autopilot and Consul snapshots
+Changes to the Autopilot configuration are persisted in the Raft database maintained by the Consul servers. This means that Autopilott configuration will be included in the Consul snapshot data.
 
-Changes to the autopilot configuration are persisted in the Raft database
-maintained by the Consul servers. This means that autopilot configuration will
-be included in the Consul snapshot data. Any snapshot taken prior to autopilot
-configuration changes will contain the old configuration, and should be
-considered unsafe to restore since they will remove the change and cause
-unpredictable behaviors for the automations that might rely on the new
-configuration.
+## Workflow
 
-We recommend that you take a snapshot after any changes to the autopilot
-configuration, and consider that as the last safe point in time to roll-back in
-case a restore is needed.
+### Server health checking
 
-## Server health checking
-
-An internal health check runs on the leader to track the stability of servers.
-
-A server is considered healthy if all of the following conditions are true.
+An internal health check runs on the leader to track the stability of servers. A server is considered healthy if all of the following conditions are true.
 
 - It has a SerfHealth status of 'Alive'.
-- The time since its last contact with the current leader is below
-  `LastContactThreshold` (that by default is `200ms`).
+- The time since its last contact with the current leader is below `LastContactThreshold` (that by default is `200ms`).
 - Its latest Raft term matches the leader's term.
-- The number of Raft log entries it trails the leader by does not exceed
-  `MaxTrailingLogs` (that by default is `250`).
+- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs` (that by default is `250`).
 
-The status of these health checks can be viewed through the
-`/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field
-indicating the overall status of the datacenter:
+The status of these health checks can be viewed through the `/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field indicating the overall status of the datacenter:
 
 ```shell-session
 $ curl localhost:8500/v1/operator/autopilot/health | jq .
@@ -172,21 +131,12 @@ $ curl localhost:8500/v1/operator/autopilot/health | jq .
 
 ## Server stabilization time
 
-When a new server is added to the datacenter, there is a waiting period where it
-must be healthy and stable for a certain amount of time before being promoted to
-a full, voting member. This is defined by the `ServerStabilizationTime`
-autopilot's parameter and by default is 10 seconds.
+When a new server is added to the datacenter, there is a waiting period where it must be healthy and stable for a certain amount of time before being promoted to a full, voting member. This is defined by the `ServerStabilizationTime`, Autopilot's parameter, and by default is 10 seconds.
 
-In case your configuration require a different amount of time for the node to
-get ready, for example in case you have some extra VM checks at startup that
-might affect node resource availability, you can tune the parameter and assign
-it a different duration.
+In case your configuration requires a different amount of time, you can tune the parameter and assign it a different duration.
 
 ```shell-session
 $ consul operator autopilot set-config -server-stabilization-time=15s
-```
-
-```plaintext hideClipboard
 Configuration updated!
 ```
 
@@ -194,9 +144,6 @@ Use the `get-config` command to check the configuration.
 
 ```shell-session
 $ consul operator autopilot get-config
-```
-
-```plaintext hideClipboard
 CleanupDeadServers = true
 LastContactThreshold = 200ms
 MaxTrailingLogs = 250
@@ -207,92 +154,34 @@ DisableUpgradeMigration = false
 UpgradeVersionTag = ""
 ```
 
-## Dead server cleanup
-
-If autopilot is disabled, it will take 72 hours for dead servers to be
-automatically reaped or an operator must write a script to `consul force-leave`.
-If another server failure occurred it could jeopardize the quorum, even if the
-failed Consul server had been automatically replaced. Autopilot helps prevent
-these kinds of outages by quickly removing failed servers as soon as a
-replacement Consul server comes online. When servers are removed by the cleanup
-process they will enter the "left" state.
-
-With Autopilot's dead server cleanup enabled, dead servers will periodically be
-cleaned up and removed from the Raft peer set to prevent them from interfering
-with the quorum size and leader elections. The cleanup process will also be
-automatically triggered whenever a new server is successfully added to the
-datacenter.
-
-We suggest leaving the feature enabled to avoid introducing manual steps in
-the Consul management to make sure the faulty nodes are not remaining in the
-Raft pool for too long without the need for manual pruning. In test scenarios or
-in environments where you want to delegate the faulty node pruning to an
-external tool or system you can disable the dead server cleanup feature using
-the `consul operator` command.
-
-```shell-session
-$ consul operator autopilot set-config -cleanup-dead-servers=false
-```
-
-```plaintext hideClipboard
-Configuration updated!
-```
+### Dead server cleanup
 
-Use the `get-config` command to check the configuration.
+If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped, or an operator must manually issue the `consul force-leave ` command. If another server failure occurred it could jeopardize the quorum, even if the failed Consul server had been automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process they will enter the "left" state.
 
-```shell-session
-$ consul operator autopilot get-config
-```
+With Autopilot's dead server cleanup enabled, dead servers will periodically be cleaned up and removed from the Raft peer set to prevent them from interfering with the quorum size and leader elections. The cleanup process will also be automatically triggered whenever a new server is successfully added to the datacenter.
 
-```plaintext hideClipboard
-CleanupDeadServers = false
-LastContactThreshold = 200ms
-MaxTrailingLogs = 250
-MinQuorum = 0
-ServerStabilizationTime = 10s
-RedundancyZoneTag = ""
-DisableUpgradeMigration = false
-UpgradeVersionTag = ""
-```
+We suggest leaving the feature enabled to avoid faulty nodes remaining in the Raft pool for too long without the need for manual pruning. In test scenarios or in environments you can disable the faulty node pruning by using the `consul operator autopilot set-config -cleanup-dead-servers=false` command.
 
 ## Enterprise features
 
-Consul Enterprise customer can take advantage of two more features of autopilot
-to further strengthen and automate Consul operations.
+To further strengthen and automate Consul operations, there are two more Autopilot features in Consul Enterprise that customers can take advantage of to improve their datacenter resiliency and to simplify operations.
 
 ### Redundancy zones
 
-Consul’s redundancy zones provide high availability in the case of server
-failure through the Enterprise feature of autopilot. Autopilot allows you to add
-read replicas to your datacenter that will be promoted to the "voting" status in
-case of voting server failure.
+Consul’s redundancy zones provide high availability in the case of server failure through the Enterprise feature of Autopilot. Autopilot allows you to add read replicas to your datacenter that will be promoted to the "voting" status in case of voting server failure.
 
-You can use this tutorial to implement isolated failure domains such as AWS
-Availability Zones (AZ) to obtain redundancy within an AZ without having to
-sustain the overhead of a large quorum.
+You can utilise redundancy zones to implement isolated failure domains such as AWS Availability Zones (AZ), and therefore to obtain redundancy within an AZ without having to sustain the overhead of a large quorum.
 
-Check [provide fault tolerance with redundancy zones](/consul/tutorials/operate-consul/redundancy-zones)
-to learn more on the functionality.
+Check [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone) to learn more on the functionality.
 
 ### Automated upgrades
 
-Consul’s automatic upgrades provide a simplified way to upgrade existing Consul
-datacenter. This functionally is provided through the Enterprise feature of
-autopilot. Autopilot allows you to add new servers directly to the datacenter
-and waits until you have enough servers running the new version to perform a
-leadership change and demote the old servers as "non-voters".
+Consul’s automatic upgrades provide a simplified way to upgrade existing Consul datacenter. This functionally is provided through the Enterprise feature of Autopilot. Autopilot allows you to add new servers directly to the datacenter and waits until you have enough servers running the new version to perform a leadership change and demote the old servers as "non-voters".
 
-Check [automate upgrades with Consul Enterprise](/consul/tutorials/datacenter-operations/upgrade-automation)
-to learn more on the functionality.
+Check [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated) to learn more on the functionality.
 
 ## Next steps
 
-In this tutorial you got an overview of the autopilot features and got examples
-on how and when tune the default values.
+To read further about [Autopilot](/consul/docs/manage/scale/autopilot) functionality, check the [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone) pages.
 
-To learn more about the Autopilot settings you did not configure in this tutorial,
-[last_contact_threshold](/consul/docs/reference/agent/configuration-file/general#last_contact_threshold)
-and
-[max_trailing_logs](/consul/docs/reference/agent/configuration-file/general#max_trailing_logs),
-either read the agent configuration documentation or use the help flag with the
-operator autopilot `consul operator autopilot set-config -h`.
+To learn more about operational Autopilot settings regarding stability, check the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation.
\ No newline at end of file
diff --git a/content/consul/v1.21.x/content/docs/manage/scale/index.mdx b/content/consul/v1.21.x/content/docs/manage/scale/index.mdx
index 11879cf026..47d66a8680 100644
--- a/content/consul/v1.21.x/content/docs/manage/scale/index.mdx
+++ b/content/consul/v1.21.x/content/docs/manage/scale/index.mdx
@@ -77,13 +77,10 @@ Consul server agents are an important part of Consul’s architecture. This sect
 
 Consul servers can be deployed on a few different runtimes:
 
-- **HashiCorp Cloud Platform (HCP) Consul (Managed)**. These Consul servers are deployed in a hosted environment managed by HCP. To get started with HCP Consul servers in Kubernetes or VM deployments, refer to the [Deploy HCP Consul tutorial](/consul/tutorials/get-started-hcp/hcp-gs-deploy).
 - **VMs or bare metal servers (Self-managed)**. To get started with Consul on VMs or bare metal servers, refer to the [Deploy Consul server tutorial](/consul/tutorials/get-started-vms/virtual-machine-gs-deploy). For a full list of configuration options, refer to [Agents Overview](/consul/docs/fundamentals/agent).
 - **Kubernetes (Self-managed)**. To get started with Consul on Kubernetes, refer to the [Deploy Consul on Kubernetes tutorial](/consul/tutorials/get-started-kubernetes/kubernetes-gs-deploy).
 - **Other container environments, including Docker, Rancher, and Mesos (Self-managed)**.
 
-@include 'alerts/hcp-dedicated-eol.mdx'
-
 When operating Consul at scale, self-managed VM or bare metal server deployments offer the most flexibility. Some Consul Enterprise features that can enhance fault tolerance and read scalability, such as [redundancy zones](/consul/docs/manage/scale/redundancy-zone) and [read replicas](/consul/docs/manage/scale/read-replica), are not available to server agents on Kubernetes runtimes. To learn more, refer to [Consul Enterprise feature availability by runtime](/consul/docs/enterprise#feature-availability-by-runtime).
 
 ### Number of Consul servers
@@ -327,7 +324,7 @@ Enterprise customers might also rely on [automated backups](/consul/docs/manage/
 
 We do not recommend automated scaling of Consul server nodes based on load or usage unless it is coupled by some logic that prevents the cluster from losing quorum.
 
-One way to improve your datacenter resiliency and to leverage automatic scaling is to use [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone).
+One way to improve your datacenter resiliency and to leverage automatic scaling is to use [autopilot](/consul/docs/manage/scale/autopilot), [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone). These features provide support for read-heavy workload periods without risking the stability of the overall cluster. From 71fb0b3c6febbac06dc5b2ea6d204602784b974c Mon Sep 17 00:00:00 2001 From: Krastin Krastev Date: Thu, 11 Dec 2025 15:38:59 +0100 Subject: [PATCH 3/6] Apply suggestions from code review Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com> --- .../content/docs/manage/scale/autopilot.mdx | 48 ++++++++++--------- .../content/docs/manage/scale/index.mdx | 2 +- 2 files changed, 27 insertions(+), 23 deletions(-) diff --git a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx index fc3321b50c..a5cb7d71fd 100644 --- a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx +++ b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx @@ -7,20 +7,20 @@ description: >- # Consul Autopilot -This page describes Consul Autopilot, which supports automatic, operator-friendly management of Consul servers. Autopilot helps maintain the health and stability of the Consul server cluster by monitoring server health, introducing stable servers, and cleaning up dead servers. Furthermore, Consul Enterprise customers can leverage two additional Autopilot features, namely redundancy zones and automated upgrades to enhance datacenter resiliency and simplify operations. +This page describes Consul Autopilot, a set of features that provide operator-friendly management automations for Consul servers. ## Overview -Autopilot includes the following features: +Consul autopilot helps you maintain the health and stability of the Consul server cluster. 
It includes the following features: - [Server health checking](#server-health-checking) - [Server stabilization time](#server-stabilization-time) - [Dead server cleanup](#dead-server-cleanup) - [Redundancy zones (only available in Consul Enterprise)](#redundancy-zones) - [Automated upgrades (only available in Consul Enterprise)](#automated-upgrades) -### Default configuration +## Default configuration -You can check the default Autopilot values using the `consul operator` CLI command or using the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot) +To check the default autopilot values, use the `consul operator` CLI command or the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot). @@ -65,20 +65,20 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration -Changes to the Autopilot configuration are persisted in the Raft database maintained by the Consul servers. This means that Autopilott configuration will be included in the Consul snapshot data. +Consul servers maintain changes to the autopilot configuration in the Raft database. As a result, autopilot configurations are included in the Consul snapshot data. ## Workflow -### Server health checking +## Server health checking An internal health check runs on the leader to track the stability of servers. A server is considered healthy if all of the following conditions are true. - It has a SerfHealth status of 'Alive'. -- The time since its last contact with the current leader is below `LastContactThreshold` (that by default is `200ms`). +- The time since its last contact with the current leader is below `LastContactThreshold`. The default value is `200ms`. - Its latest Raft term matches the leader's term. -- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs` (that by default is `250`). +- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs`. The default value is `250`. 
-The status of these health checks can be viewed through the `/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field indicating the overall status of the datacenter: +To return the status of these health checks, use the `/v1/operator/autopilot/health` HTTP endpoint. The `Healthy` field at the top indicates the overall status of the datacenter: ```shell-session $ curl localhost:8500/v1/operator/autopilot/health | jq . @@ -131,9 +131,9 @@ $ curl localhost:8500/v1/operator/autopilot/health | jq . ## Server stabilization time -When a new server is added to the datacenter, there is a waiting period where it must be healthy and stable for a certain amount of time before being promoted to a full, voting member. This is defined by the `ServerStabilizationTime`, Autopilot's parameter, and by default is 10 seconds. +When a new server joins the datacenter, there is an initial waiting period where it must stay healthy and stable before it can become a voting member. This duration is configured by the `ServerStabilizationTime` parameter. By default it is 10 seconds. -In case your configuration requires a different amount of time, you can tune the parameter and assign it a different duration. +If you need a different amount of time, you can tune the parameter to set a different duration. The following example extends the waiting period to 15 seconds: ```shell-session $ consul operator autopilot set-config -server-stabilization-time=15s @@ -156,32 +156,36 @@ UpgradeVersionTag = "" ### Dead server cleanup -If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped, or an operator must manually issue the `consul force-leave ` command. If another server failure occurred it could jeopardize the quorum, even if the failed Consul server had been automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. 
When servers are removed by the cleanup process they will enter the "left" state. +When autopilot is disabled, it takes 72 hours for Consul to automatically reap dead servers. The alternative would be for an operator to manually issue the `consul force-leave ` command for each dead server. -With Autopilot's dead server cleanup enabled, dead servers will periodically be cleaned up and removed from the Raft peer set to prevent them from interfering with the quorum size and leader elections. The cleanup process will also be automatically triggered whenever a new server is successfully added to the datacenter. +In this situation, another server failure could jeopardize the cluster's quorum. The Consul cluster still considers the missing server a member of the datacenter, even if the failed Consul server was automatically replaced. -We suggest leaving the feature enabled to avoid faulty nodes remaining in the Raft pool for too long without the need for manual pruning. In test scenarios or in environments you can disable the faulty node pruning by using the `consul operator autopilot set-config -cleanup-dead-servers=false` command. +Autopilot helps prevent these kinds of failures from becoming outages. It quickly removes failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process, they enter the "left" state and are not considered for the datacenter's quorum. + +Autopilot also triggers the cleanup process automatically whenever a new server successfully joins the datacenter. + +We recommend leaving autopilot enabled to avoid issues with faulty nodes that require manual pruning. In test scenarios and dev environments you can disable the faulty node pruning with the `consul operator autopilot set-config -cleanup-dead-servers=false` command.
## Enterprise features To further strengthen and automate Consul operations, there are two more Autopilot features in Consul Enterprise that customers can take advantage of to improve their datacenter resiliency and to simplify operations. -### Redundancy zones +## Redundancy zones (Enterprise) -Consul’s redundancy zones provide high availability in the case of server failure through the Enterprise feature of Autopilot. Autopilot allows you to add read replicas to your datacenter that will be promoted to the "voting" status in case of voting server failure. +Redundancy zones provide high availability in case of server failure. With Consul Enterprise, autopilot helps you create redundancy zones by adding read replicas to your datacenter that will be promoted to the "voting" status if a voting server fails. -You can utilise redundancy zones to implement isolated failure domains such as AWS Availability Zones (AZ), and therefore to obtain redundancy within an AZ without having to sustain the overhead of a large quorum. +You can set up redundancy zones to implement isolated failure domains. For example, deploying a server and a read replica in each AWS Availability Zone (AZ) provides additional protection against failure within a region. -Check [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone) to learn more on the functionality. +To learn more, refer to [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone). ### Automated upgrades -Consul’s automatic upgrades provide a simplified way to upgrade existing Consul datacenter. This functionally is provided through the Enterprise feature of Autopilot. Autopilot allows you to add new servers directly to the datacenter and waits until you have enough servers running the new version to perform a leadership change and demote the old servers as "non-voters". +Automatic upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter.
With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". -Check [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated) to learn more on the functionality. +To learn more, refer to [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated). ## Next steps -To read further about [Autopilot](/consul/docs/manage/scale/autopilot) functionality, check the [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone) pages. +To learn more about the autopilot features described on this page, refer to [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone). -To learn more about operational Autopilot settings regarding stability, check the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation. \ No newline at end of file +For agent specifications related to autopilot settings for stability, refer to the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation. 
\ No newline at end of file diff --git a/content/consul/v1.21.x/content/docs/manage/scale/index.mdx b/content/consul/v1.21.x/content/docs/manage/scale/index.mdx index 47d66a8680..6af498fc2b 100644 --- a/content/consul/v1.21.x/content/docs/manage/scale/index.mdx +++ b/content/consul/v1.21.x/content/docs/manage/scale/index.mdx @@ -324,7 +324,7 @@ Enterprise customers might also rely on [automated backups](/consul/docs/manage/ We do not recommend automated scaling of Consul server nodes based on load or usage unless it is coupled by some logic that prevents the cluster from losing quorum. -One way to improve your datacenter resiliency and to leverage automatic scaling is to use [autopilot](/consul/docs/manage/scale/autopilot), [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone). +One way to improve your datacenter resiliency and to leverage automatic scaling is to use [autopilot](/consul/docs/manage/scale/autopilot), as well as Enterprise features such as [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone). These features provide support for read-heavy workload periods without risking the stability of the overall cluster. 
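The quorum concern behind the dead server cleanup text and the scaling guidance in this patch can be made concrete with a little arithmetic. The following Python sketch is illustrative only: it assumes the usual Raft majority rule (more than half of the peer set must be reachable), and the function names `quorum_size` and `has_quorum` are hypothetical helpers, not part of Consul.

```python
def quorum_size(peer_count: int) -> int:
    """Majority of the Raft peer set (assumed standard Raft rule)."""
    return peer_count // 2 + 1


def has_quorum(peer_count: int, alive_count: int) -> bool:
    """True when the alive servers can still form a majority."""
    return alive_count >= quorum_size(peer_count)


# Scenario: a 3-server cluster loses one server and a replacement joins.

# Without cleanup, the dead server stays in the peer set: 4 peers, 3 alive.
stale_peer_set_ok = has_quorum(peer_count=4, alive_count=3)

# A second failure leaves only 2 alive servers out of 4 peers.
stale_after_second_failure = has_quorum(peer_count=4, alive_count=2)

# With cleanup, the dead server is removed first: 3 peers, and a second
# failure still leaves 2 alive servers.
clean_after_second_failure = has_quorum(peer_count=3, alive_count=2)
```

In this sketch the stale peer set survives the first failure but loses quorum on the second, while the cleaned-up peer set tolerates it. That is the situation the dead server cleanup paragraphs describe.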
From ac27c88440e1c27878b8cad4160e1518a9cfb4e4 Mon Sep 17 00:00:00 2001 From: Krastin Krastev Date: Thu, 11 Dec 2025 16:52:51 +0100 Subject: [PATCH 4/6] update manage/scale/autopilot for v1.21 and v1.22 --- .../content/docs/manage/scale/autopilot.mdx | 20 +++--- .../content/docs/manage/scale/autopilot.mdx | 62 +++++++++++-------- 2 files changed, 49 insertions(+), 33 deletions(-) diff --git a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx index a5cb7d71fd..30cd72b78e 100644 --- a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx +++ b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx @@ -12,6 +12,7 @@ This page describes Consul Autopilot, a set of features that provide operator-fr ## Overview Consul autopilot helps you maintain the health and stability of the Consul server cluster. It includes the following features: + - [Server health checking](#server-health-checking) - [Server stabilization time](#server-stabilization-time) - [Dead server cleanup](#dead-server-cleanup) @@ -65,9 +66,18 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration -Consul servers maintain changes to the autopilot configuration in the Raft database. As a result, autopilot configurations are included in the Consul snapshot data. +| Autopilot setting | Type | Default value | Description | +| :------------------------ | :------- | :------------ | :-------------------------------------------------------------------------------------------------------------------------------- | +| `CleanupDeadServers` | Boolean | `true` | Flag to enable periodic dead servers removal from the Raft peer set. | +| `LastContactThreshold` | Duration | `200ms` | Time duration that defines the maximum time since the last contact with the current leader for a server to be considered healthy. 
| +| `MaxTrailingLogs` | Integer | `250` | Maximum number of Raft log entries that a server can trail the leader by and still be considered healthy. | +| `MinQuorum` | Integer | `0` | Minimum number of healthy voting servers required to maintain quorum in the datacenter. | +| `ServerStabilizationTime` | Duration | `10s` | Time duration that a new server must remain healthy before it can become a voting member. | +| `RedundancyZoneTag` | String | `""` | Tag name used to identify redundancy zones for servers in Consul Enterprise. | +| `DisableUpgradeMigration` | Boolean | `false` | Flag to disable automatic upgrade migrations in Consul Enterprise. | +| `UpgradeVersionTag` | String | `""` | Tag name used to identify server versions for automated upgrades in Consul Enterprise. | -## Workflow +Consul servers maintain changes to the autopilot configuration in the Raft database. As a result, autopilot configurations are included in the Consul snapshot data. ## Server health checking @@ -166,10 +176,6 @@ Autopilot also triggers the cleanup process automatically whenever a new server We recommend leaving autopilot enabled to avoid issues with faulty nodes that require manual pruning. In test scenarios and dev environments you can disable the faulty node pruning with the `consul operator autopilot set-config -cleanup-dead-servers=false` command. -## Enterprise features - -To further strengthen and automate Consul operations, there are two more Autopilot features in Consul Enterprise that customers can take advantage of to improve their datacenter resiliency and to simplify operations. - ## Redundancy zones (Enterprise) Redundancy zones provide high availability in case of server failure. With Consul Enterprise, autopilot helps you create redundancy zones by adding read replicas to your datacenter that will be promoted to the "voting" status if a voting server fails. @@ -178,7 +184,7 @@ You can set up redundancy zones to implement isolated failure domains. 
For examp To learn more, refer to [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone). -### Automated upgrades +## Automated upgrades (Enterprise) Automatic upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". diff --git a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx index fc3321b50c..30cd72b78e 100644 --- a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx +++ b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx @@ -7,20 +7,21 @@ description: >- # Consul Autopilot -This page describes Consul Autopilot, which supports automatic, operator-friendly management of Consul servers. Autopilot helps maintain the health and stability of the Consul server cluster by monitoring server health, introducing stable servers, and cleaning up dead servers. Furthermore, Consul Enterprise customers can leverage two additional Autopilot features, namely redundancy zones and automated upgrades to enhance datacenter resiliency and simplify operations. +This page describes Consul Autopilot, a set of features that provide operator-friendly management automations for Consul servers. ## Overview -Autopilot includes the following features: +Consul autopilot helps you maintain the health and stability of the Consul server cluster. 
It includes the following features: + - [Server health checking](#server-health-checking) - [Server stabilization time](#server-stabilization-time) - [Dead server cleanup](#dead-server-cleanup) - [Redundancy zones (only available in Consul Enterprise)](#redundancy-zones) - [Automated upgrades (only available in Consul Enterprise)](#automated-upgrades) -### Default configuration +## Default configuration -You can check the default Autopilot values using the `consul operator` CLI command or using the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot) +To check the default autopilot values, use the `consul operator` CLI command or the [`/v1/operator/autopilot` endpoint](/consul/api-docs/operator/autopilot). @@ -65,20 +66,29 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration -Changes to the Autopilot configuration are persisted in the Raft database maintained by the Consul servers. This means that Autopilott configuration will be included in the Consul snapshot data. +| Autopilot setting | Type | Default value | Description | +| :------------------------ | :------- | :------------ | :-------------------------------------------------------------------------------------------------------------------------------- | +| `CleanupDeadServers` | Boolean | `true` | Flag to enable periodic dead servers removal from the Raft peer set. | +| `LastContactThreshold` | Duration | `200ms` | Time duration that defines the maximum time since the last contact with the current leader for a server to be considered healthy. | +| `MaxTrailingLogs` | Integer | `250` | Maximum number of Raft log entries that a server can trail the leader by and still be considered healthy. | +| `MinQuorum` | Integer | `0` | Minimum number of healthy voting servers required to maintain quorum in the datacenter. | +| `ServerStabilizationTime` | Duration | `10s` | Time duration that a new server must remain healthy before it can become a voting member. 
| +| `RedundancyZoneTag` | String | `""` | Tag name used to identify redundancy zones for servers in Consul Enterprise. | +| `DisableUpgradeMigration` | Boolean | `false` | Flag to disable automatic upgrade migrations in Consul Enterprise. | +| `UpgradeVersionTag` | String | `""` | Tag name used to identify server versions for automated upgrades in Consul Enterprise. | -## Workflow +Consul servers maintain changes to the autopilot configuration in the Raft database. As a result, autopilot configurations are included in the Consul snapshot data. -### Server health checking +## Server health checking An internal health check runs on the leader to track the stability of servers. A server is considered healthy if all of the following conditions are true. - It has a SerfHealth status of 'Alive'. -- The time since its last contact with the current leader is below `LastContactThreshold` (that by default is `200ms`). +- The time since its last contact with the current leader is below `LastContactThreshold`. The default value is `200ms`. - Its latest Raft term matches the leader's term. -- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs` (that by default is `250`). +- The number of Raft log entries it trails the leader by does not exceed `MaxTrailingLogs`. The default value is `250`. -The status of these health checks can be viewed through the `/v1/operator/autopilot/health` HTTP endpoint, with a top level `Healthy` field indicating the overall status of the datacenter: +To return the status of these health checks, use the `/v1/operator/autopilot/health` HTTP endpoint. The `Healthy` field at the top indicates the overall status of the datacenter: ```shell-session $ curl localhost:8500/v1/operator/autopilot/health | jq . @@ -131,9 +141,9 @@ $ curl localhost:8500/v1/operator/autopilot/health | jq . 
## Server stabilization time -When a new server is added to the datacenter, there is a waiting period where it must be healthy and stable for a certain amount of time before being promoted to a full, voting member. This is defined by the `ServerStabilizationTime`, Autopilot's parameter, and by default is 10 seconds. +When a new server joins the datacenter, there is an initial waiting period where it must stay healthy and stable before it can become a voting member. This duration is configured by the `ServerStabilizationTime` parameter. By default it is 10 seconds. -In case your configuration requires a different amount of time, you can tune the parameter and assign it a different duration. +If you need a different amount of time, you can tune the parameter to set a different duration. The following example extends the waiting period to 15 seconds: ```shell-session $ consul operator autopilot set-config -server-stabilization-time=15s @@ -156,32 +166,32 @@ UpgradeVersionTag = "" ### Dead server cleanup -If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped, or an operator must manually issue the `consul force-leave ` command. If another server failure occurred it could jeopardize the quorum, even if the failed Consul server had been automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process they will enter the "left" state. +When autopilot is disabled, it takes 72 hours for Consul to automatically reap dead servers. The alternative would be for an operator to manually issue the `consul force-leave ` command for each dead server. -With Autopilot's dead server cleanup enabled, dead servers will periodically be cleaned up and removed from the Raft peer set to prevent them from interfering with the quorum size and leader elections. 
The cleanup process will also be automatically triggered whenever a new server is successfully added to the datacenter. +In this situation, another server failure could jeopardize the cluster's quorum. The Consul cluster still considers the missing server a member of the datacenter, even if the failed Consul server was automatically replaced. -We suggest leaving the feature enabled to avoid faulty nodes remaining in the Raft pool for too long without the need for manual pruning. In test scenarios or in environments you can disable the faulty node pruning by using the `consul operator autopilot set-config -cleanup-dead-servers=false` command. +Autopilot helps prevent these kinds of failures from becoming outages. It quickly removes failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process, they enter the "left" state and are not considered for the datacenter's quorum. -## Enterprise features +Autopilot also triggers the cleanup process automatically whenever a new server successfully joins the datacenter. -To further strengthen and automate Consul operations, there are two more Autopilot features in Consul Enterprise that customers can take advantage of to improve their datacenter resiliency and to simplify operations. +We recommend leaving autopilot enabled to avoid issues with faulty nodes that require manual pruning. In test scenarios and dev environments you can disable the faulty node pruning with the `consul operator autopilot set-config -cleanup-dead-servers=false` command. -### Redundancy zones +## Redundancy zones (Enterprise) -Consul’s redundancy zones provide high availability in the case of server failure through the Enterprise feature of Autopilot. Autopilot allows you to add read replicas to your datacenter that will be promoted to the "voting" status in case of voting server failure. +Redundancy zones provide high availability in case of server failure.
With Consul Enterprise, autopilot helps you create redundancy zones by adding read replicas to your datacenter that will be promoted to the "voting" status if a voting server fails. -You can utilise redundancy zones to implement isolated failure domains such as AWS Availability Zones (AZ), and therefore to obtain redundancy within an AZ without having to sustain the overhead of a large quorum. +You can set up redundancy zones to implement isolated failure domains. For example, deploying a server and a read replica in each AWS Availability Zone (AZ) provides additional protection against failure within a region. -Check [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone) to learn more on the functionality. +To learn more, refer to [provide fault tolerance with redundancy zones](/consul/docs/manage/scale/redundancy-zone). -### Automated upgrades +## Automated upgrades (Enterprise) -Consul’s automatic upgrades provide a simplified way to upgrade existing Consul datacenter. This functionally is provided through the Enterprise feature of Autopilot. Autopilot allows you to add new servers directly to the datacenter and waits until you have enough servers running the new version to perform a leadership change and demote the old servers as "non-voters". +Automatic upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". -Check [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated) to learn more on the functionality. +To learn more, refer to [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated).
## Next steps -To read further about [Autopilot](/consul/docs/manage/scale/autopilot) functionality, check the [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone) pages. +To learn more about the autopilot features described on this page, refer to [read replicas](/consul/docs/manage/scale/read-replica) and [redundancy zones](/consul/docs/manage/scale/redundancy-zone). -To learn more about operational Autopilot settings regarding stability, check the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation. \ No newline at end of file +For agent specifications related to autopilot settings for stability, refer to the [last_contact_threshold](/consul/docs/reference/agent/configuration-file/bootstrap#last_contact_threshold) and [max_trailing_logs](/consul/docs/reference/agent/configuration-file/bootstrap#max_trailing_logs) parameters in the Consul agent configuration documentation. 
\ No newline at end of file From 72f56272992a4ae544b2c575a07566e439096289 Mon Sep 17 00:00:00 2001 From: Krastin Krastev Date: Fri, 12 Dec 2025 19:03:22 +0100 Subject: [PATCH 5/6] Apply suggestions from code review Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com> --- .../v1.21.x/content/docs/manage/scale/autopilot.mdx | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx index 30cd72b78e..3e26460241 100644 --- a/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx +++ b/content/consul/v1.21.x/content/docs/manage/scale/autopilot.mdx @@ -66,10 +66,12 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration +The following table lists autopilot configuration parameters, their descriptions, and their default values: + | Autopilot setting | Type | Default value | Description | | :------------------------ | :------- | :------------ | :-------------------------------------------------------------------------------------------------------------------------------- | -| `CleanupDeadServers` | Boolean | `true` | Flag to enable periodic dead servers removal from the Raft peer set. | -| `LastContactThreshold` | Duration | `200ms` | Time duration that defines the maximum time since the last contact with the current leader for a server to be considered healthy. | +| `CleanupDeadServers` | Boolean | `true` | Enables periodic dead server removal from the Raft peer set. | +| `LastContactThreshold` | Duration | `200ms` | The interval that can elapse between a server's last contact with the current leader before Consul considers it unhealthy. | | `MaxTrailingLogs` | Integer | `250` | Maximum number of Raft log entries that a server can trail the leader by and still be considered healthy. 
| | `MinQuorum` | Integer | `0` | Minimum number of healthy voting servers required to maintain quorum in the datacenter. | | `ServerStabilizationTime` | Duration | `10s` | Time duration that a new server must remain healthy before it can become a voting member. | @@ -186,7 +188,7 @@ To learn more, refer to [provide fault tolerance with redundancy zones](/consul/ ## Automated upgrades (Enterprise) -Automatic upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". +Automated upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". To learn more, refer to [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated). 
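The `LastContactThreshold` and `MaxTrailingLogs` rows reworded in this patch feed the four server health conditions listed on the page. As a rough illustration of how those conditions combine, here is a minimal Python sketch. It follows the page's wording literally ("below" the threshold, "does not exceed" the log limit). The `ServerStats` type and `is_healthy` function are hypothetical names invented for this example and do not mirror Consul's internal implementation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServerStats:
    """Hypothetical snapshot of one server as seen by the leader."""
    serf_status: str          # for example "alive" or "failed"
    last_contact: timedelta   # time since last contact with the leader
    last_term: int            # latest Raft term known to the server
    trailing_logs: int        # Raft log entries behind the leader


def is_healthy(
    server: ServerStats,
    leader_term: int,
    last_contact_threshold: timedelta = timedelta(milliseconds=200),
    max_trailing_logs: int = 250,
) -> bool:
    """Apply the four documented conditions; all must hold."""
    return (
        server.serf_status == "alive"
        and server.last_contact < last_contact_threshold
        and server.last_term == leader_term
        and server.trailing_logs <= max_trailing_logs
    )


# A server that is alive and current, but slow to answer the leader.
slow = ServerStats("alive", timedelta(milliseconds=350), last_term=7, trailing_logs=10)
# A server that satisfies every condition with the default thresholds.
good = ServerStats("alive", timedelta(milliseconds=20), last_term=7, trailing_logs=10)

slow_is_healthy = is_healthy(slow, leader_term=7)
good_is_healthy = is_healthy(good, leader_term=7)
```

The defaults in the signature are the `200ms` and `250` values from the table, so raising either threshold in this sketch corresponds to loosening the matching autopilot setting.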
From 160544c660ef2042da0bfdb097c04a7cf2a24665 Mon Sep 17 00:00:00 2001 From: Krastin Krastev Date: Fri, 12 Dec 2025 19:05:02 +0100 Subject: [PATCH 6/6] propagate changes from 1.21.x file to 1.22.x --- .../v1.22.x/content/docs/manage/scale/autopilot.mdx | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx index 30cd72b78e..f051c79bcf 100644 --- a/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx +++ b/content/consul/v1.22.x/content/docs/manage/scale/autopilot.mdx @@ -66,10 +66,12 @@ $ curl http://127.0.0.1:8500/v1/operator/autopilot/configuration +The following table lists autopilot configuration parameters, their descriptions, and their default values: + | Autopilot setting | Type | Default value | Description | | :------------------------ | :------- | :------------ | :-------------------------------------------------------------------------------------------------------------------------------- | -| `CleanupDeadServers` | Boolean | `true` | Flag to enable periodic dead servers removal from the Raft peer set. | -| `LastContactThreshold` | Duration | `200ms` | Time duration that defines the maximum time since the last contact with the current leader for a server to be considered healthy. | +| `CleanupDeadServers` | Boolean | `true` | Enables periodic dead server removal from the Raft peer set. | +| `LastContactThreshold` | Duration | `200ms` | The interval that can elapse between a server's last contact with the current leader before Consul considers it unhealthy. | | `MaxTrailingLogs` | Integer | `250` | Maximum number of Raft log entries that a server can trail the leader by and still be considered healthy. | | `MinQuorum` | Integer | `0` | Minimum number of healthy voting servers required to maintain quorum in the datacenter. 
| | `ServerStabilizationTime` | Duration | `10s` | Time duration that a new server must remain healthy before it can become a voting member. | @@ -106,7 +108,7 @@ $ curl localhost:8500/v1/operator/autopilot/health | jq . "SerfStatus": "alive", "Version": "1.7.2", "Leader": false, - # ... + # ... "Healthy": true, "Voter": true, # ... @@ -186,7 +188,7 @@ To learn more, refer to [provide fault tolerance with redundancy zones](/consul/ ## Automated upgrades (Enterprise) -Automatic upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". +Automated upgrades are an Enterprise feature that helps you upgrade existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. Then when you have enough servers running the new version, you can perform a leadership change and demote the old servers to "non-voters". To learn more, refer to [automate upgrades with Consul Enterprise](/consul/docs/upgrade/automated).
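The `/v1/operator/autopilot/health` output excerpted in this patch can also be summarized programmatically. The sketch below is a hedged example, not an official client: it assumes the response is a JSON object with a top-level `Healthy` field and a `Servers` list whose entries carry `Name`, `Healthy`, and `Voter`. The excerpt elides parts of the payload, so `Servers` and `Name` are assumptions here, the sample data is invented, and `summarize_health` is a hypothetical helper.

```python
import json


def summarize_health(payload: dict) -> dict:
    """Reduce an autopilot health payload to a few operator-facing facts."""
    servers = payload.get("Servers", [])
    unhealthy = [
        s.get("Name", "<unknown>") for s in servers if not s.get("Healthy", False)
    ]
    healthy_voters = sum(
        1 for s in servers if s.get("Healthy", False) and s.get("Voter", False)
    )
    return {
        "datacenter_healthy": bool(payload.get("Healthy", False)),
        "healthy_voters": healthy_voters,
        "unhealthy_servers": unhealthy,
    }


# Invented sample payload, shaped like the excerpt in the page.
sample = json.loads(
    """
    {
      "Healthy": false,
      "Servers": [
        {"Name": "server-1", "SerfStatus": "alive", "Leader": true,
         "Healthy": true, "Voter": true},
        {"Name": "server-2", "SerfStatus": "alive", "Leader": false,
         "Healthy": true, "Voter": true},
        {"Name": "server-3", "SerfStatus": "failed", "Leader": false,
         "Healthy": false, "Voter": true}
      ]
    }
    """
)

summary = summarize_health(sample)
```

In practice the payload would come from the HTTP endpoint rather than an inline string. The function only reads fields defensively with `.get`, so missing keys degrade to conservative defaults instead of raising.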