# Scaling
This section explains the recommended configuration settings for large-scale self-hosted deployments of Teleport.
Teleport Enterprise Cloud takes care of this setup for you so you can provide secure access to your infrastructure right away.
Get started with a free trial of Teleport Enterprise Cloud.
## Prerequisites
- Teleport v14.3.33 Open Source or Enterprise.
## Hardware recommendations
Set up Teleport with a High Availability configuration.
Scenario | Max Recommended Count | Proxy Service | Auth Service | AWS Instance Types |
---|---|---|---|---|
Teleport SSH Nodes connected to Auth Service | 10,000 | 2x 4 vCPUs, 8GB RAM | 2x 8 vCPUs, 16GB RAM | m4.2xlarge |
Teleport SSH Nodes connected to Auth Service | 50,000 | 2x 4 vCPUs, 16GB RAM | 2x 8 vCPUs, 16GB RAM | m4.2xlarge |
Teleport SSH Nodes connected to Proxy Service through reverse tunnels | 10,000 | 2x 4 vCPUs, 8GB RAM | 2x 8 vCPUs, 16+GB RAM | m4.2xlarge |
## Auth Service and Proxy Service configuration
Raise Teleport's connection limit from the default of `15000` to `65000`:
```yaml
# Teleport Auth Service and Proxy Service
teleport:
  connection_limits:
    max_connections: 65000
    max_users: 1000
```
## Agent configuration
Agents cache roles and other configuration locally in order to make access-control decisions quickly.
By default, agents are fairly aggressive in trying to re-initialize their caches if they lose connectivity
to the Auth Service. In very large clusters, this can contribute to a "thundering herd" effect,
where control plane elements experience excess load immediately after a restart. Setting the `max_backoff`
parameter to a value in the 8-16 minute range can help mitigate this effect:
```yaml
teleport:
  cache:
    enabled: yes
    max_backoff: 12m
```
## Kernel parameters
Tweak Teleport's systemd unit parameters to allow a higher number of open files:
```ini
[Service]
LimitNOFILE=65536
```
Verify that Teleport's process has high enough file limits:
```sh
$ cat /proc/$(pidof teleport)/limits
# Limit                     Soft Limit           Hard Limit           Units
# Max open files            65536                65536                files
```
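One way to apply this change without editing the packaged unit file is a systemd drop-in override. The following is a minimal sketch, assuming the service unit is named `teleport`:

```sh
# Create or edit a drop-in override for the teleport unit
$ sudo systemctl edit teleport
# Paste the [Service] snippet above into the override, then restart
$ sudo systemctl daemon-reload
$ sudo systemctl restart teleport
```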
## DynamoDB configuration
When using Teleport with DynamoDB, we recommend on-demand provisioning. This allows DynamoDB to scale with cluster load.
For customers that cannot use on-demand provisioning, we recommend at least 250 WCU and 100 RCU for clusters of 10,000 nodes.
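If your backend table was originally created with provisioned capacity, you can switch it to on-demand billing with the AWS CLI. This is a sketch only; `teleport-backend` is a placeholder for your actual table name:

```sh
# Switch an existing DynamoDB table to on-demand (pay-per-request) billing
$ aws dynamodb update-table \
  --table-name teleport-backend \
  --billing-mode PAY_PER_REQUEST
```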
## etcd
When using Teleport with etcd, we recommend you do the following.
- For performance, use the fastest SSDs available and ensure low-latency network connectivity between etcd peers. See the etcd Hardware recommendations guide for more details.
- For debugging, ingest etcd's Prometheus metrics and visualize them over time using a dashboard. See the etcd Metrics guide for more details; a minimal scrape configuration is sketched after this list.
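As a starting point, the following is a minimal sketch of a Prometheus scrape configuration for etcd; the `etcd-1`, `etcd-2`, and `etcd-3` targets are placeholders for your etcd peers, which serve metrics on their client port by default:

```yaml
# Hypothetical Prometheus scrape configuration for an etcd cluster
scrape_configs:
  - job_name: etcd
    # Add a tls_config block here if the etcd client port serves TLS
    static_configs:
      - targets: ["etcd-1:2379", "etcd-2:2379", "etcd-3:2379"]
```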
During an incident, we may ask you to run `etcdctl`. Test that you can run the following command successfully:
```sh
$ etcdctl \
  --write-out=table \
  --cacert=/path/to/ca.cert \
  --cert=/path/to/cert \
  --key=/path/to/key.pem \
  --endpoints=127.0.0.1:2379 \
  endpoint status
```
## Supported Load
The tests below were performed against a Teleport Cloud tenant running on instances with 8 vCPUs and 32 GiB of memory, with default limits of 4 vCPUs and 4 GiB of memory.
### Concurrent Logins
Resource Type | Login Command | Logins | Failure |
---|---|---|---|
SSH | `tsh login` | 2000 | Auth CPU Limits exceeded |
Application | `tsh app login` | 2000 | Auth CPU Limits exceeded |
Database | `tsh db login` | 2000 | Auth CPU Limits exceeded |
Kubernetes | `tsh kube login && tsh kube credentials` | 2000 | Auth CPU Limits exceeded |
### Sessions Per Second
Resource Type | Sessions | Failure |
---|---|---|
SSH | 1000 | Auth CPU Limits exceeded |
Application | 2500 | Proxy CPU Limits exceeded |
Database | 40 | Proxy CPU Limits exceeded |
Kubernetes | 50 | Proxy CPU Limits exceeded |
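To generate comparable load against your own cluster, `tsh bench` can open SSH sessions at a fixed rate. This is a sketch rather than a definitive invocation; `ubuntu@node-name` is a placeholder, and the exact flags may differ between Teleport versions:

```sh
# Open roughly 10 SSH sessions per second for 30 seconds
$ tsh bench ssh --rate=10 --duration=30s ubuntu@node-name ls
```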