Detailed design
This section provides more architectural detail on each Boundary Enterprise component. Review this section to identify all technical and personnel requirements before moving to implementation.
Sizing
Every hosting environment is different, and every customer's Boundary Enterprise usage profile is different. Refer to the tables below for sizing recommendations for controller nodes and worker nodes, as well as small and large use cases, based on expected usage.
Small deployments would be appropriate for most initial production deployments or for development and testing environments. Large deployments are production environments with a consistently high workload, such as a large number of sessions.
Controller nodes
Size | CPU | Memory | Disk capacity | Network throughput |
---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 50+ GB | Minimum 5 GB/s |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | Minimum 10 GB/s |
Worker nodes
Size | CPU | Memory | Disk capacity | Network throughput |
---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 50+ GB | Minimum 10 GB/s |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | Minimum 10 GB/s |
Refer to Hardware sizing for Boundary servers for more details. These recommendations should only serve as a starting point for operations staff to observe and adjust to meet the unique needs of each deployment. To match your requirements and maximize the stability of your Boundary Enterprise controller and worker instances, perform load tests and continue to monitor resource usage and all reported metrics from Boundary's telemetry.
Hardware considerations
CPU, memory, and storage performance requirements depend on your exact usage profile, for example the types of requests, the average request rate, and the peak request rate. Boundary Enterprise controllers and worker nodes have distinct resource needs because they handle different tasks. Refer to Hardware Considerations for more information.
Enable audit logging in Boundary Enterprise. It is best to write audit logs to a separate disk for optimal performance. We recommend monitoring both the file descriptor usage and the memory consumption for each Boundary Enterprise worker node. These resources can become constrained depending on the number of clients connecting to Boundary Enterprise targets at any given time. If you have enabled session recording on a target, the worker stores the session recordings locally during the recording phase. Refer to Storage Considerations to determine how much storage to allocate for recordings on the worker nodes.
Networking
Network bandwidth requirements for Boundary Enterprise controllers and workers depend on your specific usage patterns. It is also essential to consider bandwidth requirements for other external systems, such as monitoring and logging collectors. Refer to Network Considerations for more information. Monitor the networking metrics of Boundary Enterprise workers to prevent situations where they are unable to initiate session connections. Review your provider-specific virtual machine networking limitations; you may need to increase the VM size to achieve higher network throughput.
Network connectivity
Refer to Network Connectivity for the minimum requirements for Boundary Enterprise cluster nodes. You may also need to grant the Boundary Enterprise nodes outbound access to additional services that live elsewhere, either within your internal network or via the Internet. Examples may include:
- Authentication provider backends, such as Okta, Auth0, or Microsoft Entra ID
- Remote log handlers, such as a Splunk or ELK environment
- Metrics collection, such as Prometheus
Storage
Enable audit logging in Boundary Enterprise. For optimal performance, write audit logs to a separate disk. If session recording is enabled, the worker stores session recordings locally during the recording process. When estimating worker storage needs, consider the number of concurrent sessions recorded on that worker. Refer to the storage guidelines to determine the appropriate amount of storage to allocate for recordings on the worker nodes.
KMS
Boundary Enterprise controllers and workers require different types of cryptographic keys. The KMS provider provides the root of trust for keys used for various purposes, such as protecting secrets, authenticating workers, recovering data, and encrypting values in Boundary Enterprise's configuration. Refer to the KMS section for more information.
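As a minimal sketch, the controller configuration below shows how the different key purposes might be declared using an AWS KMS provider. The region and key aliases are hypothetical examples, and any supported KMS provider can be substituted.
# Root KMS: protects Boundary Enterprise's internally managed keys
kms "awskms" {
  purpose    = "root"
  region     = "us-east-1"            # example region
  kms_key_id = "alias/boundary-root"  # hypothetical key alias
}

# Worker-auth KMS: shared between controllers and KMS-authenticated workers
kms "awskms" {
  purpose    = "worker-auth"
  region     = "us-east-1"
  kms_key_id = "alias/boundary-worker-auth"
}

# Recovery KMS: used for recovery operations
kms "awskms" {
  purpose    = "recovery"
  region     = "us-east-1"
  kms_key_id = "alias/boundary-recovery"
}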
Traffic encryption
Boundary Enterprise is secure by default and uses TLS for all network communication. Boundary Enterprise has three types of connections, as described in the previous TLS section:
- Client-to-controller TLS
- Client-to-worker TLS
- Worker-to-upstream TLS
Refer to the TLS documentation for detailed information on each connection type.
When load balancing Boundary Enterprise traffic, always configure TLS passthrough. The load balancing section provides more information on this.
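For illustration, the sketch below shows a controller API listener terminating TLS. With TLS passthrough on the load balancer, the certificate presented to clients is the one configured here; the certificate paths mirror the ops listener example later in this section.
# API listener for client-to-controller traffic
listener "tcp" {
  address       = "0.0.0.0:9200"
  purpose       = "api"
  tls_disable   = false
  tls_cert_file = "/etc/boundary.d/tls/boundary-cert.pem"
  tls_key_file  = "/etc/boundary.d/tls/boundary-key.pem"
}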
Load balancing
A layer 4 load balancer meets Boundary Enterprise's requirements. However, organizations may implement layer 7-capable load balancers for additional controls. Regardless of the load balancer type, follow these requirements:
- HTTPS listener with valid TLS certificate for the domain it is serving or TLS passthrough
- Health checks should use TCP 9203
Each major cloud provider offers one or more managed load-balancing services suitable for Boundary Enterprise. Follow the guidance provided in the load balancer recommendations.
Client-to-controller
Place Boundary Enterprise controller nodes in a private network and do not expose them directly to the public Internet. Expose services such as the API and administrative console via a load balancer. This design uses a layer 4 load balancer with additional network security controls, such as security groups or firewall access control lists, to restrict network flow to the load balancer interface.
Health check
Use a load balancer to monitor the health of the Boundary Enterprise controller nodes by checking the status of the /health endpoint.
This endpoint does not support any input and returns an empty response body.
Status | Description |
---|---|
200 | GET /health returns HTTP status 200 if the controller's API gRPC service is up. |
5xx | GET /health returns HTTP status 5xx or a request timeout if the controller is unhealthy. |
503 | GET /health returns HTTP status 503 Service Unavailable if the controller is shutting down. |
Use the listener stanza to configure the controller's operational endpoints. By default, it listens on TCP 9203. The operational endpoint exposes both health and metrics endpoints.
Operational endpoint stanza configuration
# Ops listener for operations like health checks for load balancers
listener "tcp" {
  # The address of the interface that your external systems (for example,
  # load balancers and metrics collectors) will connect to.
  address = "0.0.0.0:9203"

  # The purpose of this listener block
  purpose = "ops"

  tls_disable   = false
  tls_cert_file = "/etc/boundary.d/tls/boundary-cert.pem"
  tls_key_file  = "/etc/boundary.d/tls/boundary-key.pem"
}
Worker-to-controller
Similar to client-to-controller traffic, ingress workers require access to the Boundary Enterprise controller nodes placed in a private network. For this design, where the deployment consists of a single cloud, an internal load balancer is sufficient to allow the ingress workers to establish connectivity to the controllers via TCP 9201.
For multi-cloud deployments that operate a single control plane, for example a control plane in AWS with targets and ingress workers in other clouds or on-premises, it may be necessary to expose TCP 9201 externally so that it is reachable. One option is to add another listener for TCP 9201 to the load balancer used for client-to-controller communication.
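For reference, a minimal sketch of the controller's cluster listener that workers connect to on TCP 9201. Worker-to-controller traffic is encrypted using keys derived from the worker-auth KMS, so no certificate files are configured on this listener.
# Cluster listener for worker-to-controller connections
listener "tcp" {
  address = "0.0.0.0:9201"
  purpose = "cluster"
}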
Worker-to-worker (multi-hop sessions)
With multi-hop sessions, workers operate as intermediaries or egress workers. If more than one worker provides identical capabilities (typically for increased availability, resilience, and scale), those workers should be part of a load-balanced set.
For example, configure the upstream configuration value initial_upstreams as the FQDN or virtual IP (VIP) address of the load-balanced pool of workers. The initial_upstreams configuration accepts a list of hosts or IP addresses. However, because workers can be dynamic, for example part of an auto scaling group, using a load balancer helps with future scale-out and scale-in scenarios and ensures a robust architecture.
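A minimal worker stanza sketch of this pattern follows. The FQDNs and paths are hypothetical examples, with initial_upstreams pointing at the load-balanced pool of upstream workers on their proxy port.
worker {
  public_addr       = "worker-1.example.internal"
  # Hypothetical FQDN of the load-balanced pool of upstream workers
  initial_upstreams = ["upstream-workers.example.internal:9202"]
  auth_storage_path = "/etc/boundary.d/worker-auth"
}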
Monitoring
Gaining visibility into Boundary Enterprise's controllers and workers is essential for production environments. It enables operators to manage, scale, and troubleshoot Boundary Enterprise efficiently and assists in detecting and mitigating anomalies in a deployment.
Logs
If event logging is not configured, for example via the events stanza, Boundary Enterprise outputs logs to stdout and stderr by default. Linux distributions typically capture Boundary Enterprise's log output in the system journal.
In production environments, use the events stanza to increase control over event logging. Event logging configured via the events stanza overrides the default behavior. For example, if you configure Boundary Enterprise to send events to a file, logs are no longer emitted to stdout or stderr.
Aggregate logs using a centralized platform for analysis, audit, and compliance, and to aid in troubleshooting.
Minimum configuration
events {
  audit_enabled        = true
  observations_enabled = true
  sysevents_enabled    = true
  telemetry_enabled    = false

  sink "stderr" {
    name        = "all-events"
    description = "All events sent to stderr"
    event_types = ["*"]
    format      = "hclog-text"
  }
}
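As a sketch of the file-based approach mentioned above, the following configuration writes audit events to a file instead of stderr. The sink name, path, and file name are hypothetical examples.
events {
  audit_enabled        = true
  observations_enabled = true
  sysevents_enabled    = true

  sink {
    name        = "audit-file"
    description = "Audit events written to a local file"
    event_types = ["audit"]
    format      = "cloudevents-json"
    file {
      path      = "/var/log/boundary"
      file_name = "audit.log"
    }
  }
}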
Metrics
Metrics for controllers and workers are available for ingestion by third-party telemetry platforms, such as Prometheus and Grafana. The metrics use the OpenMetrics exposition format. Refer to the Boundary Enterprise metrics documentation for a list of all available metrics.
Boundary Enterprise provides metrics through the /metrics path using a listener with the "ops" purpose.
Configure the controller's operational endpoints using the listener stanza. By default, it listens on TCP 9203. The operational ops listener exposes both health and metrics endpoints.
Operational endpoint stanza configuration
# Ops listener for operations like health checks for load balancers
listener "tcp" {
  # The address of the interface that your external systems (for example,
  # load balancers and metrics collectors) will connect to.
  address = "0.0.0.0:9203"

  # The purpose of this listener block
  purpose = "ops"

  tls_disable   = false
  tls_cert_file = "/etc/boundary.d/tls/boundary-cert.pem"
  tls_key_file  = "/etc/boundary.d/tls/boundary-key.pem"
}
Failure considerations
Organizations should rely on the experience of their architecture, cloud, and platform teams to provide the appropriate levels of availability for Boundary Enterprise that meet their requirements. This architecture design leverages several principles and standard infrastructure services from major cloud providers to provide high availability and fault tolerance while balancing costs.
- Auto scaling to enhance fault tolerance. For example, if a Boundary Enterprise instance is unhealthy, the auto scaling service can replace it. You can also configure the service to use multiple availability zones and launch instances in another availability zone to compensate if one becomes unavailable.
- Templating images to decrease the time required to deploy controllers and workers, both during the initial deployment and, notably, in failure and scaling scenarios.
- Infrastructure-as-code to produce consistent, known deployments that reduce configuration errors.
- Availability zones to protect against data center failures. Spread Boundary Enterprise components across at least three availability zones in production environments. If deploying Boundary Enterprise to three availability zones is not possible, you can use the same architecture across one or two availability zones at the expense of a reliability risk in case of an availability zone outage.
- Load balancing to provide traffic redirection and health checking.
Controllers
Boundary Enterprise controllers are stateless. They store all state and configuration within PostgreSQL and can withstand failure scenarios where only one node is accessible. When a controller node fails, users are still able to interact with other Boundary Enterprise controllers, assuming the presence of additional nodes behind a load balancer. Boundary Enterprise controllers depend on a PostgreSQL database. Ensure the database is reachable by all Boundary Enterprise controller nodes and that it inherits the same levels of availability as the controllers. Do not deploy PostgreSQL on the controller nodes.
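For reference, a minimal sketch of the controller's database configuration. The controller name and connection URL are hypothetical examples; point the URL at your highly available PostgreSQL endpoint rather than a local instance.
controller {
  name = "boundary-controller-1"

  database {
    # Hypothetical connection URL; use your managed PostgreSQL endpoint
    url = "postgresql://boundary:<password>@postgres.example.internal:5432/boundary"
  }
}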
Workers
Boundary Enterprise uses workers as either proxies or reverse proxies. Workers routinely communicate with the controllers to report their health. To withstand worker node failures, it is best practice in production environments to have at least three workers per network boundary and per type (ingress and egress). If a worker fails, the controller assigns a user's proxy session to another active Boundary Enterprise worker node.
Availability zone failures
The following section provides recommendations for controllers and workers to overcome availability zone outages.
Controllers
By deploying Boundary Enterprise controllers in the recommended architecture across three availability zones with load balancing in front of them, the Boundary Enterprise control plane can survive outages in up to 2 availability zones.
Workers
The best practice for deploying Boundary Enterprise workers is to have at least one worker deployed per availability zone. Provided the correct security rules allow cross-subnet/AZ communication, and networking is still up in an otherwise-failed AZ, Boundary Enterprise proxies a user's session connection through a worker in another AZ.
Region failures
Controllers
To continue to serve Boundary Enterprise controller requests during a regional outage, a deployment like the one outlined in this guide must also exist in a different region. Use a multi-regional database technology so that the nodes in the secondary region can communicate with the PostgreSQL database. For example, promote secondary-region AWS RDS read replicas to read-write in the event the primary region fails.
Use services such as AWS Global Accelerator, Azure cross-region Load Balancer, and GCP Cloud Load Balancer to route requests to healthy Boundary Enterprise controllers across regions.
Workers
If a Boundary Enterprise worker cannot reach its upstream worker or a controller, the user cannot establish a proxied session to the target.