
Teleport

Scaling

This section explains the recommended configuration settings for large-scale self-hosted deployments of Teleport.

Teleport Enterprise Cloud takes care of this setup for you so you can provide secure access to your infrastructure right away.

Get started with a free trial of Teleport Enterprise Cloud.

Prerequisites

  • Teleport v16.2.1 Open Source or Enterprise.

Hardware recommendations

Set up Teleport with a High Availability configuration.

| Scenario | Max Recommended Count | Proxy Service | Auth Service | AWS Instance Types |
|---|---|---|---|---|
| Teleport SSH Nodes connected to Auth Service | 10,000 | 2x 4 vCPUs, 8GB RAM | 2x 8 vCPUs, 16GB RAM | m4.2xlarge |
| Teleport SSH Nodes connected to Auth Service | 50,000 | 2x 4 vCPUs, 16GB RAM | 2x 8 vCPUs, 16GB RAM | m4.2xlarge |
| Teleport SSH Nodes connected to Proxy Service through reverse tunnels | 10,000 | 2x 4 vCPUs, 8GB RAM | 2x 8 vCPUs, 16+GB RAM | m4.2xlarge |

Auth Service and Proxy Service Configuration

Raise Teleport's connection limit from the default of 15000 to 65000.

# Teleport Auth Service and Proxy Service
teleport:
  connection_limits:
    max_connections: 65000
    max_users: 1000

Agent configuration

Agents cache roles and other configuration locally in order to make access-control decisions quickly. By default, agents are fairly aggressive in trying to re-initialize their caches if they lose connectivity to the Auth Service. In very large clusters, this can contribute to a "thundering herd" effect, where control plane elements experience excess load immediately after a restart. Setting the max_backoff parameter to a value in the 8-16 minute range can help mitigate this effect:

teleport:
  cache:
    enabled: yes
    max_backoff: 12m
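
To illustrate why a larger backoff ceiling eases a thundering herd, here is a minimal Python sketch of capped exponential backoff with jitter. It is not Teleport's actual retry algorithm: the function name `retry_delays`, the 1-second base, the doubling factor and the "full jitter" strategy are all assumptions chosen for illustration.

```python
import random


def retry_delays(max_backoff_s, attempts, base_s=1.0, rng=None):
    """Return a list of capped, jittered retry delays in seconds.

    Illustrative only: base_s, the doubling factor, and full jitter are
    assumptions, not Teleport's real cache re-initialization logic.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential growth, capped at the configured ceiling.
        ceiling = min(max_backoff_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so agents spread out.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # 12 minutes matches the max_backoff value in the example above.
    for delay in retry_delays(max_backoff_s=12 * 60, attempts=12):
        print(f"{delay:.1f}s")
```

With a higher ceiling, agents that keep failing end up retrying at times spread over a wider window, so fewer of them hit the control plane at the same moment.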

Kernel parameters

Adjust Teleport's systemd unit to allow a larger number of open files:

[Service]
LimitNOFILE=65536

Verify that the Teleport process has a high enough open-file limit:

cat /proc/$(pidof teleport)/limits

Limit                     Soft Limit           Hard Limit           Units
Max open files            65536                65536                files
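
If you want to automate this check, the following Python sketch extracts the "Max open files" row from the text of a limits file. The helper name `max_open_files` and the `SAMPLE` text are hypothetical; on a Linux host you would pass in the contents of /proc/<pid>/limits for the Teleport process instead.

```python
def max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of a limits file.

    Values are returned as strings because the kernel may report
    'unlimited' instead of a number.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: "Max open files  <soft>  <hard>  files"
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


# Sample text mirroring the output shown above (hypothetical input).
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            65536                65536                files\n"
)

soft, hard = max_open_files(SAMPLE)
print(f"soft={soft} hard={hard}")
```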

DynamoDB configuration

When using Teleport with DynamoDB, we recommend using on-demand provisioning. This allows DynamoDB to scale with cluster load.

For customers that cannot use on-demand provisioning, we recommend at least 250 WCU and 100 RCU for 10k clusters.

etcd

When using Teleport with etcd, we recommend you do the following.

  • For performance, use the fastest SSDs available and ensure low-latency network connectivity between etcd peers. See the etcd Hardware recommendations guide for more details.
  • For debugging, ingest etcd's Prometheus metrics and visualize them over time using a dashboard. See the etcd Metrics guide for more details.

During an incident, we may ask you to run etcdctl. Test that you can run the following command successfully.

etcdctl \
  --write-out=table \
  --cacert=/path/to/ca.cert \
  --cert=/path/to/cert \
  --key=/path/to/key.pem \
  --endpoints=127.0.0.1:2379 \
  endpoint status

Supported Load

The tests below were performed against a Teleport Cloud tenant that runs on instances with 8 vCPUs and 32 GiB of memory and has default limits of 4 CPUs and 4 GiB of memory.

Concurrent Logins

| Resource Type | Login Command | Logins | Failure |
|---|---|---|---|
| SSH | tsh login | 2000 | Auth CPU Limits exceeded |
| Application | tsh app login | 2000 | Auth CPU Limits exceeded |
| Database | tsh db login | 2000 | Auth CPU Limits exceeded |
| Kubernetes | tsh kube login && tsh kube credentials | 2000 | Auth CPU Limits exceeded |

Sessions Per Second

| Resource Type | Sessions | Failure |
|---|---|---|
| SSH | 1000 | Auth CPU Limits exceeded |
| Application | 2500 | Proxy CPU Limits exceeded |
| Database | 40 | Proxy CPU Limits exceeded |
| Kubernetes | 50 | Proxy CPU Limits exceeded |

Windows Desktop Service

Windows Desktop sessions can vary greatly in resource usage depending on the applications being used. The primary factor affecting resource usage per session is how often the screen is updated. For example, a session playing a video in full screen mode will consume significantly more resources than a session where the user is typing in a text editor.

We measured the resource usage of sessions playing fullscreen videos to get the worst-case estimate for resource requirements. We then inferred resource requirements for more standard use cases on the basis of those measurements.

Worst Case:

  • 1/12 vCPU per concurrent session
  • 42 MB RAM per concurrent session

Typical Case:

  • 1/240 vCPU per concurrent session
  • 2 MB RAM per concurrent session

From these estimates we calculated the following table of recommendations based on the expected maximum number of concurrent sessions:

| Concurrent users | CPU (vCPU, low to high) | Memory (GB, low to high) |
|---|---|---|
| 1 | 1 | 0.5 |
| 10 | 1 | 0.5 to 1 |
| 100 | 1 to 8 | 1 to 8 |
| 1000 | 4 to 96 | 4 to 64 |
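
As a rough cross-check, the per-session estimates above can be multiplied out directly. The Python sketch below does only that arithmetic; the names `TYPICAL`, `WORST` and `raw_estimate` are hypothetical, and converting MB to GB by dividing by 1000 is an assumption.

```python
from fractions import Fraction

# Per-session estimates quoted above: (vCPU, MB of RAM).
TYPICAL = (Fraction(1, 240), 2)
WORST = (Fraction(1, 12), 42)


def raw_estimate(sessions, per_session):
    """Return unrounded (vCPU, RAM in GB) for a number of concurrent sessions."""
    vcpu_each, mb_each = per_session
    # Assumption: 1 GB = 1000 MB for this back-of-the-envelope conversion.
    return float(sessions * vcpu_each), sessions * mb_each / 1000


for n in (1, 10, 100, 1000):
    low_cpu, low_gb = raw_estimate(n, TYPICAL)
    high_cpu, high_gb = raw_estimate(n, WORST)
    print(f"{n:>5} sessions: {low_cpu:.2f} to {high_cpu:.2f} vCPU, "
          f"{low_gb:.3f} to {high_gb:.3f} GB")
```

The raw products come out lower than the figures in the table (for example, roughly 83 vCPU and 42 GB in the worst case for 1000 sessions), which suggests the recommendations are rounded up to practical machine sizes with some headroom.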

To avoid service interruptions, we recommend leaning towards the higher end of the recommendations to start while monitoring your resource usage, and then scaling resources based on measured outcomes.

Note that you are not limited to a single windows_desktop_service. You can connect several to your cluster to spread resource usage across multiple logical machines.