4 changes: 4 additions & 0 deletions content/well-architected-framework/data/docs-nav-data.json
@@ -592,6 +592,10 @@
{
"title": "Scale and tune performance",
"path": "design-resilient-systems/scale-and-tune-performance"
},
{
"title": "Design control, management, and data planes",
"path": "design-resilient-systems/design-control-data-management-plane"
}
]
},
@@ -0,0 +1,254 @@
---
page_title: Design control, management, and data planes for resilient infrastructure
description: Implement control, management, and data planes that enhance the resilience of your infrastructure by isolating failures and ensuring continuous operation.
---

# Design control, management, and data planes for resilient infrastructure

Infrastructure planes represent the foundation for your infrastructure
architecture. Infrastructure planes define how:

- Your systems make decisions.
- Operators interact with the infrastructure and services.
- Workloads execute and data flows.

Understanding and properly designing each plane is critical for building
resilient, scalable, and secure infrastructure that can meet operational and
compliance requirements.

## What are infrastructure planes

Modern infrastructure operates across three distinct architectural layers, each
serving a specific purpose in your overall system design:

- **Control plane:** Makes decisions about workload placement, routing, service
health, and system state. Examples include Kubernetes schedulers, Consul
servers, and Nomad server clusters.
- **Management plane:** Provides interfaces for operators and automation to
configure, monitor, and administer infrastructure. Examples include
infrastructure-as-code platforms such as Terraform, configuration management
tools like Ansible, and observability tools like Prometheus, Loki, and Grafana.
- **Data plane:** Executes decisions made by the control plane and moves actual
application data and traffic. Examples include container runtimes, service
mesh proxies, and application workloads.

![Diagram showing the relationship between control, management, and data
planes](/img/well-architected-framework/diagram-infrastructure-planes-intro.png#light-theme-only)

_placeholder-diagram...replace with better one_

## Conceptual design considerations for infrastructure planes

Designing infrastructure planes requires careful consideration of architecture
patterns, scalability requirements, and organizational constraints. Poor design
in any single plane can cascade failures across your infrastructure,
resulting in downtime, security vulnerabilities, or operational inefficiencies.

The following considerations help you make informed decisions about your
infrastructure design. When starting your design, focus on conceptual requirements.
Conceptual design decisions focus on your needs and how your infrastructure
should work, rather than on specific tools or vendors.

- **Identify team responsibilities:** Understand which teams are responsible for
managing each plane, and the services within each plane. Define clear
ownership boundaries to avoid confusion during incidents and clearly document
team experience.
- **Define scaling and reliability requirements:** Consider and test for the
baseline, average, and peak loads for each service.
- **Establish geographical distribution requirements:** Determine if your
application requires fault domains within a specific region, multi-region
scaling, or dedicated local instances in each region, and assess the impact of
data residency requirements such as GDPR or CCPA.
- **Plan for separation of duties:** Define roles and responsibilities for teams
managing each plane and each application or service within the plane.
- **Design for high availability:** Ensure each plane and each service operates
independently and that you can perform a failover without impacting availability.
- **Identify network segmentation needs:** Logically isolate traffic between
planes, and services within each plane. Open ports between planes and services
only as necessary. Ensure services can connect to only the required resources
to operate.
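
The network segmentation consideration above amounts to a default-deny rule: a connection between planes is allowed only if you have explicitly documented it. The following Python snippet is a minimal sketch of that idea, not a real firewall configuration. The plane names come from this guide, but the `Flow` type, the `ALLOWED_FLOWS` set, and the port numbers are hypothetical placeholders.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """A network connection from one plane to another on a given port."""
    source_plane: str
    dest_plane: str
    port: int


# Hypothetical allow-list. Every entry is an explicit, documented exception.
# The port numbers are placeholders, not recommendations.
ALLOWED_FLOWS = {
    Flow("management", "control", 8443),
    Flow("control", "data", 9000),
}


def is_allowed(flow: Flow) -> bool:
    """Default deny: permit a flow only if it is explicitly listed."""
    return flow in ALLOWED_FLOWS


if __name__ == "__main__":
    print(is_allowed(Flow("management", "control", 8443)))  # listed flow
    print(is_allowed(Flow("data", "control", 8443)))        # unlisted flow
```

Keeping the allow-list as data makes it easy to review alongside your conceptual design document before you choose a specific enforcement mechanism.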

Document each of the considerations for your infrastructure planes. Having a
well-documented conceptual design helps you make informed decisions during
the logical design phase. Here is an example of how you might document
conceptual requirements and constraints.

![Example requirements and constraints for a conceptual design](/img/well-architected-framework/example-conceptual-design.png#light-theme-only)

## Logical design considerations for infrastructure planes

Once you have created a conceptual design and documented requirements for each
plane, you can consider logical requirements such as the type of services
needed. This stage goes slightly deeper than the conceptual design and focuses on
the capabilities needed to meet your requirements. Do not focus on specific
vendors, but rather on the type of tool or service needed, such as whether you
need a service mesh, load balancer, infrastructure-as-code, configuration
management, or a specific type of storage.

Examples of logical design decisions include:

- **Service deployment models:** Consider whether to use managed services,
  self-managed services, or a hybrid approach. For example, do you want to use a
  hyper-scale public cloud provider, a specialized cloud provider,
  software-as-a-service (SaaS), or a self-managed platform?

  Managed services can improve resilience by reducing operational overhead, but
  require careful consideration to ensure each service meets your availability,
  data locality, security, and disaster recovery requirements.

  - **Data plane:** Hyper-scale public cloud provider with multiple availability
    zones, and self-hosted infrastructure.
  - **Control plane:** Managed container services and virtual machine services.
  - **Management plane:** Software-as-a-Service (SaaS) by default, fallback to
    self-managed services on a hyper-scale public cloud provider as needed.

- **Redundancy and failover:** Determine if services running within each plane
need to be deployed in an active-active or active-passive configuration, how
many instances of each service you need, and whether the services are
stateful or stateless.

- **Distribution strategy:** Do you need to deploy services in a single region,
or multiple regions? If you require services spread across multiple regions,
consider how services synchronize data, the effect on latency, and data
locality considerations such as GDPR or CCPA.

- **Service integration:** How will you run and manage individual services? How
will you ensure services can communicate securely and reliably while building
a network segmentation strategy? How will you deploy, update, and manage each
service and its configuration?

- **Observability:** Define the type of monitoring, logging, and tracing needed to
ensure visibility into each plane and service.

Each logical design decision should map back to a conceptual requirement or
constraint. Documenting your logical design helps you when making vendor or
feature-based design decisions during the physical design phase.
Here is an example of how you might document logical requirements and constraints.

![Example requirements and constraints for a logical design](/img/well-architected-framework/example-logical-design.png#light-theme-only)
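
If you keep your design documents in a structured form, you can check this mapping automatically. The following Python snippet is a minimal sketch. The requirement IDs, the `CONCEPTUAL` and `LOGICAL` structures, and the example entries are hypothetical; adapt them to however you record your own design.

```python
# Hypothetical conceptual requirements, keyed by an ID you assign.
CONCEPTUAL = {
    "C1": "Keep EU customer data in-region",
    "C2": "Fail over without impacting availability",
    "C3": "Isolate traffic between planes",
}

# Hypothetical logical decisions, each listing the requirement IDs it supports.
LOGICAL = [
    {"decision": "Deploy services in multiple regions", "satisfies": ["C1", "C2"]},
    {"decision": "Run stateful services active-passive", "satisfies": ["C2"]},
    {"decision": "Adopt a message queue", "satisfies": []},
]


def unmapped_decisions(logical, conceptual):
    """Return decisions that do not cite any known conceptual requirement."""
    return [
        item["decision"]
        for item in logical
        if not any(req in conceptual for req in item["satisfies"])
    ]


def uncovered_requirements(logical, conceptual):
    """Return requirement IDs that no logical decision supports yet."""
    covered = {req for item in logical for req in item["satisfies"]}
    return sorted(set(conceptual) - covered)


if __name__ == "__main__":
    print(unmapped_decisions(LOGICAL, CONCEPTUAL))      # ['Adopt a message queue']
    print(uncovered_requirements(LOGICAL, CONCEPTUAL))  # ['C3']
```

A decision with no matching requirement is a prompt to either drop the decision or record the requirement that motivates it. An uncovered requirement points to a gap in the logical design.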

## Physical design considerations for infrastructure planes

Physical design builds off the conceptual and logical design requirements and
constraints. When writing the physical design, you select specific services,
tools, and vendors to implement your infrastructure planes. Ensure that each
selected service meets your documented requirements and constraints. For example:

- **Service deployment model implementation:**

  - **Data plane:** Amazon Web Services, Azure, Google Cloud, or IBM Cloud for
    compute and storage services, and KVM-based virtualization for
    specialized self-hosted infrastructure.
  - **Control plane:** Managed Kubernetes services (EKS, GKE, AKS) for hyper-scale
    platforms and OpenShift for self-hosted container orchestration.
  - **Management plane:** GitHub for version control and CI/CD, HCP Terraform for
    infrastructure-as-code automation, and Datadog for observability.

- **Redundancy and failover architecture:** Determine the number of nodes for
each service, which features you will enable, and which roles have access to
each service. For example, deploy a 5-node Vault cluster in the management
plane, managed by Nomad, with each node in a unique availability zone and
auto-unseal enabled using a KMS. Deploy Vault Agent in the data plane as a
sidecar for each containerized application.

- **Geographic distribution strategy:** Designate the us-east region
as the primary region for US-based customers with an active-active deployment
pattern, while eu-west serves as a dedicated region for GDPR-compliant
workloads requiring local data storage and processing.

- **Service integration and communication patterns:** Deploy Consul service mesh
with Envoy proxies to handle all container-to-container communication,
enforcing mutual TLS (mTLS) for all inter-service traffic. Enforce network
segmentation through VPCs with dedicated public, private, and data
subnets, with security groups allowing only specific ports, such as 8500 for
the Consul HTTP API, 8200 for Vault access, and application-specific ports.

- **Observability and monitoring capabilities:** Implement Datadog APM for
application performance monitoring while running self-hosted Prometheus for
infrastructure metrics with 15-day retention. Handle centralized logging
with Datadog Log Management, using structured JSON format and 30-day
retention, and enable log-based alerting for error conditions. Integrate
PagerDuty with team-specific on-call schedules and escalation policies, while
using Slack for non-critical alert notifications. Set Service Level Objectives
(SLOs) that target 99.9% uptime for production services and 99% for staging
environments, with automated SLO tracking configured in Datadog dashboards.
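
To make the SLO targets above concrete, it helps to convert an uptime percentage into an allowed downtime budget. The following Python snippet is a minimal sketch of that arithmetic. The 30-day measurement window and the function name are assumptions for illustration; use the window your team actually reports against.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an uptime SLO allows in the window.

    The 30-day default window is an assumption for illustration.
    """
    total_minutes = window_days * 24 * 60
    return round(total_minutes * (1 - slo_percent / 100), 1)


if __name__ == "__main__":
    print(downtime_budget_minutes(99.9))  # 43.2 minutes for production
    print(downtime_budget_minutes(99.0))  # 432.0 minutes (7.2 hours) for staging
```

A 99.9% target leaves roughly 43 minutes per 30 days for all incidents and maintenance combined, which is a useful constraint when you decide how many nodes and availability zones each service needs.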

Each physical design decision directly supports your logical design
requirements and constraints, providing specific vendor selections,
configuration details, and deployment parameters that your teams can implement.
Here is an example of how you might document physical requirements and constraints.

![Example requirements and constraints for a physical
design](/img/well-architected-framework/example-physical-design.png#light-theme-only)

Consul forms the control plane for service networking, providing service
discovery, health checking, and service mesh control:

- Run Consul servers in clusters of 3, 5, or 7 nodes for high availability.
- Maintain the service catalog, key-value store, and connection intentions on the servers.
- Use Raft consensus for strong consistency across the cluster.
- Support multi-datacenter federation for global service discovery.
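
The odd cluster sizes follow from how Raft quorum works: a cluster needs a majority of servers available to accept writes. The following Python snippet is a small worked sketch of that arithmetic. The function names are illustrative and are not part of any Consul API.

```python
def quorum_size(servers: int) -> int:
    """Return the majority of servers a Raft cluster needs to commit writes."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Return how many servers can fail while the cluster keeps quorum."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, quorum_size(n), failure_tolerance(n))
    # 3 servers: quorum 2, tolerates 1 failure
    # 4 servers: quorum 3, tolerates 1 failure
    # 5 servers: quorum 3, tolerates 2 failures
    # 7 servers: quorum 4, tolerates 3 failures
```

Moving from 3 to 4 servers raises the quorum size without tolerating any additional failures, which is why even-sized clusters are usually avoided.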

Consul agents and service mesh handle data plane traffic. You can deploy Consul
agents on each node to register services with the control plane.
Mutual TLS provides encryption and authentication without application changes.
Intentions, which you manage through the control plane, enforce which services
can communicate.

Nomad provides orchestration and scheduling decisions:

- Determine which clients should run which workloads based on constraints and
resources.
- Maintain job specifications and allocation state.
- Handle failure detection and automatic rescheduling.
- Support multi-region deployments with federation.
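
To illustrate what a placement decision based on constraints and resources involves, here is a deliberately simplified Python sketch. It is not Nomad's scheduler and does not use any Nomad API. The `Client` and `Task` types, the attribute names, and the memory-only bin-packing score are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Client:
    """A hypothetical node that can run workloads."""
    name: str
    free_cpu_mhz: int
    free_memory_mb: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Task:
    """A hypothetical workload with resource needs and placement constraints."""
    cpu_mhz: int
    memory_mb: int
    constraints: dict = field(default_factory=dict)


def feasible(client: Client, task: Task) -> bool:
    """A client is feasible if it meets every constraint and has capacity."""
    meets_constraints = all(
        client.attributes.get(key) == value
        for key, value in task.constraints.items()
    )
    has_capacity = (
        client.free_cpu_mhz >= task.cpu_mhz
        and client.free_memory_mb >= task.memory_mb
    )
    return meets_constraints and has_capacity


def place(task: Task, clients: list) -> Optional[Client]:
    """Pick the feasible client with the least memory left after placement.

    This is a simplified bin-packing score that considers memory only.
    """
    candidates = [c for c in clients if feasible(c, task)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.free_memory_mb - task.memory_mb)


if __name__ == "__main__":
    clients = [
        Client("a", 4000, 8192, {"os": "linux"}),
        Client("b", 2000, 2048, {"os": "linux"}),
        Client("c", 4000, 1536, {"os": "windows"}),
    ]
    chosen = place(Task(500, 1024, {"os": "linux"}), clients)
    print(chosen.name if chosen else None)  # b
```

The two-step shape, filtering out infeasible clients and then ranking the rest, is the part worth carrying into your design discussions. The scoring function is where real schedulers differ.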

Vault centralizes secrets management and encryption operations:

- Store and encrypt secrets at rest.
- Generate dynamic credentials for databases and cloud providers.
- Provide encryption as a service through the Transit secrets engine.
- Enforce access policies and audit all secret access.

Vault Agent runs in the data plane to retrieve and cache secrets for
services. You can also configure Vault Agent to handle authentication,
eliminating the need for each service to authenticate with Vault. Using Vault
Agent helps you reduce latency and traffic to the control plane.
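
The effect of caching on upstream traffic can be shown with a small, generic time-to-live (TTL) cache. The following Python snippet is a conceptual sketch only. It is not how Vault Agent is implemented or configured, and the `CachingSecretClient` class, the `fetch` callable, and the secret path are hypothetical.

```python
import time
from typing import Callable


class CachingSecretClient:
    """Cache secret lookups locally so repeated reads skip the upstream call."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical upstream lookup function
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # path -> (value, expiry time)

    def get(self, path: str) -> str:
        now = self._clock()
        cached = self._cache.get(path)
        if cached is not None and cached[1] > now:
            return cached[0]       # still fresh: no upstream request
        value = self._fetch(path)  # expired or missing: one upstream request
        self._cache[path] = (value, now + self._ttl)
        return value


if __name__ == "__main__":
    upstream_calls = []

    def fake_fetch(path: str) -> str:
        upstream_calls.append(path)
        return f"value-for-{path}"

    client = CachingSecretClient(fake_fetch, ttl_seconds=60)
    for _ in range(100):
        client.get("secret/app/db")
    print(len(upstream_calls))  # 1
```

In this sketch, one hundred reads inside the TTL produce a single upstream request. The trade-off is staleness: a shorter TTL means fresher secrets but more upstream traffic.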

Terraform serves as the primary management interface for infrastructure:

- Provides declarative infrastructure-as-code for all major cloud providers and services.
- HCP Terraform offers team collaboration, policy enforcement (Sentinel), and a private module registry.
- CLI and API enable both human and automated workflows.
- VCS integration provides change review and approval processes.

Terraform runs in the management plane, interacting with control and data planes
to provision and manage infrastructure resources. You can use Terraform to define
infrastructure components, configurations, and dependencies in code, enabling
version control, collaboration, and automated deployments.

Boundary bridges management and data plane access:

- Provides identity-based access to infrastructure without exposing networks.
- Session management and recording for compliance.
- Credential brokering from Vault for just-in-time access.
- Support for SSH, RDP, Kubernetes, databases, and custom protocols.

Boundary runs in the management plane, providing secure access to resources. You
can install workers in the data plane to facilitate connections, while the
controller manages access policies and sessions.

HashiCorp resources:

- ...

External resources:

- ...

## Next steps

In this guide, you learned why it is important to properly design your
control, management, and data planes. Following the conceptual, logical, and
physical design process helps ensure that your infrastructure meets your
organization's requirements, and helps you focus on requirements rather than
vendor tools or features. Designing control, management, and data planes is part of
the [Design resilient
systems](/well-architected-framework/design-resilient-systems) pillar.

After you have completed your design, review the Secure control, management, and
data planes guide to ensure that your design meets security best practices.