
Teleport

Monitoring your Cluster

Teleport provides health check endpoints that let you verify an instance is healthy and ready to serve traffic. Metrics, tracing, and profiling provide more detailed data for tracking cluster performance and responsiveness.

Enable health monitoring

How to monitor the health of a Teleport instance.

Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them in either of two ways.

Either start a teleport instance with the --diag-addr flag set to the local address where the diagnostic endpoint will listen:

sudo teleport start --diag-addr=127.0.0.1:3000

Or edit the teleport instance's configuration file (/etc/teleport.yaml by default) to include the following:

teleport:
    diag_addr: 127.0.0.1:3000

To enable debug logs, also set the log severity under the same teleport section:

teleport:
    log:
        severity: DEBUG

Verify that Teleport is now serving the diagnostics endpoint:

curl http://127.0.0.1:3000/healthz

You can now collect monitoring information from several endpoints. Tools such as Kubernetes probes can use these endpoints to monitor the health of a Teleport process.

/healthz

The http://127.0.0.1:3000/healthz endpoint responds with a body of {"status":"ok"} and an HTTP 200 OK status code if the process is running.

This is a check to determine if the Teleport process is still running.
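As a rough illustration, a probe script could interpret a /healthz response as shown below. This is a minimal Python sketch, not something Teleport ships: the function names is_healthy and check_healthz are hypothetical, and the default address assumes the diag_addr value used in this guide.

```python
import json
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True when a /healthz response matches the documented healthy reply."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_healthz(base_url: str = "http://127.0.0.1:3000", timeout: float = 2.0) -> bool:
    """Query /healthz on a running instance; any connection or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False
```

Keeping the response check separate from the network call makes the logic easy to exercise without a running Teleport instance.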

/readyz

The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its response includes information about the state of the process.

The response body is a JSON object of the form:

{ "status": "a status message here"}

/readyz and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter a degraded state. Teleport will begin recovering from this state when a heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state. A second consecutive successful heartbeat will cause Teleport to transition to the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed heartbeats are retried approximately every 5 seconds. This means that depending on the timing of heartbeats, it can take 60-70 seconds after connectivity is restored for /readyz to start reporting healthy again.
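The transitions described above can be sketched as a small state machine. This is an illustrative Python model based only on the description in this section, not Teleport's implementation; the names State, next_state, and replay are hypothetical, and the initial startup state is left out for brevity.

```python
from enum import Enum


class State(Enum):
    OK = "ok"
    RECOVERING = "recovering"
    DEGRADED = "degraded"


def next_state(current: State, heartbeat_ok: bool) -> State:
    """Advance the readiness state after one heartbeat attempt."""
    if not heartbeat_ok:
        return State.DEGRADED      # any failed heartbeat degrades the component
    if current is State.DEGRADED:
        return State.RECOVERING    # first successful heartbeat after a failure
    return State.OK                # second consecutive success, or already healthy


def replay(heartbeats, start: State = State.OK):
    """Return the state reached after each heartbeat result, in order."""
    states, current = [], start
    for ok in heartbeats:
        current = next_state(current, ok)
        states.append(current)
    return states
```

For example, replay([False, True, True]) walks through degraded, recovering, and ok, which is the sequence /readyz would report after a single failed heartbeat followed by two successes.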

Status codes

The status code of the response can be one of:

  • HTTP 200 OK: Teleport is operating normally
  • HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
  • HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.

The same state information is also available via the process_state metric under the /metrics endpoint.
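A monitoring script might translate the /readyz status codes listed above into a readable state. The following Python sketch simply encodes that list; the names READYZ_MEANING, describe_readyz, and is_ready are hypothetical.

```python
# Mapping taken from the status code list in this section.
READYZ_MEANING = {
    200: "ok",
    503: "degraded",
    400: "starting or recovering",
}


def describe_readyz(status_code: int) -> str:
    """Translate a /readyz HTTP status code into the state it represents."""
    return READYZ_MEANING.get(status_code, "unknown")


def is_ready(status_code: int) -> bool:
    """Only HTTP 200 means the instance is operating normally."""
    return status_code == 200
```

Kubernetes HTTP probes treat any status from 200 through 399 as success, so both 400 and 503 cause a readiness probe pointed at /readyz to fail.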

Metrics

Teleport exposes metrics for all of its components, helping you get insight into the state of your cluster. This guide explains the metrics that you can collect from your Teleport cluster.

Enabling metrics

Metrics are served from the same diagnostic HTTP endpoint as the health checks. Enable it as described under "Enable health monitoring" above, either with the --diag-addr=127.0.0.1:3000 flag or with the diag_addr setting in the configuration file.

This will enable the http://127.0.0.1:3000/metrics endpoint, which serves the metrics that Teleport tracks. It is compatible with Prometheus collectors.
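If you want to inspect the /metrics output without a full Prometheus setup, a few lines of Python are enough to read simple samples. The sketch below is a simplified parser, not a complete implementation of the Prometheus text format: it assumes samples carry no trailing timestamp, and the names parse_metrics and process_state are hypothetical. The numeric values of process_state come from the "All Teleport instances" table in this guide.

```python
PROCESS_STATES = {0: "ok", 1: "recovering", 2: "degraded", 3: "starting"}


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text-format samples into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes no timestamp, so the value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def process_state(text: str) -> str:
    """Return the readable process state found in a /metrics payload."""
    value = parse_metrics(text).get("process_state")
    if value is None:
        return "unknown"
    return PROCESS_STATES.get(int(value), "unknown")
```

Series that carry labels keep them as part of the dictionary key, so the sketch does not need to understand label syntax.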

The following metrics are available:

Teleport Enterprise (cloud-hosted) does not expose monitoring endpoints for the Auth Service and Proxy Service.

Auth Service and backends

Name | Type | Component | Description
audit_failed_disk_monitoring | counter | Teleport Audit Log | Number of times disk monitoring failed.
audit_failed_emit_events | counter | Teleport Audit Log | Number of times emitting audit events failed.
audit_percentage_disk_space_used | gauge | Teleport Audit Log | Percentage of disk space used.
audit_server_open_files | gauge | Teleport Audit Log | Number of open audit files.
auth_generate_requests_throttled_total | counter | Teleport Auth | Number of throttled requests to generate new server keys.
auth_generate_requests_total | counter | Teleport Auth | Number of requests to generate new server keys.
auth_generate_requests | gauge | Teleport Auth | Number of current generate requests.
auth_generate_seconds | histogram | Teleport Auth | Latency for generate requests.
backend_batch_read_requests_total | counter | cache | Number of batch read requests to the backend.
backend_batch_read_seconds | histogram | cache | Latency for batch read operations.
backend_batch_write_requests_total | counter | cache | Number of batch write requests to the backend.
backend_batch_write_seconds | histogram | cache | Latency for backend batch write operations.
backend_read_requests_total | counter | cache | Number of read requests to the backend.
backend_read_seconds | histogram | cache | Latency for read operations.
backend_requests | counter | cache | Number of requests to the backend (reads, writes, and keepalives).
backend_write_requests_total | counter | cache | Number of write requests to the backend.
backend_write_seconds | histogram | cache | Latency for backend write operations.
cluster_name_not_found_total | counter | Teleport Auth | Number of times a cluster was not found.
dynamo_requests_total | counter | DynamoDB | Total number of requests to the DynamoDB API.
dynamo_requests | counter | DynamoDB | Total number of requests to the DynamoDB API grouped by result.
dynamo_requests_seconds | histogram | DynamoDB | Latency of DynamoDB API requests.
etcd_backend_batch_read_requests | counter | etcd | Number of batch read requests to the etcd database.
etcd_backend_batch_read_seconds | histogram | etcd | Latency for etcd batch read operations.
etcd_backend_read_requests | counter | etcd | Number of read requests to the etcd database.
etcd_backend_read_seconds | histogram | etcd | Latency for etcd read operations.
etcd_backend_tx_requests | counter | etcd | Number of transaction requests to the database.
etcd_backend_tx_seconds | histogram | etcd | Latency for etcd transaction operations.
etcd_backend_write_requests | counter | etcd | Number of write requests to the database.
etcd_backend_write_seconds | histogram | etcd | Latency for etcd write operations.
teleport_etcd_events | counter | etcd | Total number of etcd events processed.
teleport_etcd_event_backpressure | counter | etcd | Total number of times event processing encountered backpressure.
firestore_events_backend_batch_read_requests | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events.
firestore_events_backend_batch_read_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations.
firestore_events_backend_batch_write_requests | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events.
firestore_events_backend_batch_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations.
firestore_events_backend_write_requests | counter | GCP Cloud Firestore | Number of write requests to Cloud Firestore events.
firestore_events_backend_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events write operations.
gcs_event_storage_downloads_seconds | histogram | GCP GCS | Latency for GCS download operations.
gcs_event_storage_downloads | counter | GCP GCS | Number of downloads from the GCS backend.
gcs_event_storage_uploads_seconds | histogram | GCP GCS | Latency for GCS upload operations.
gcs_event_storage_uploads | counter | GCP GCS | Number of uploads to the GCS backend.
grpc_server_started_total | counter | Teleport Auth | Total number of RPCs started on the server.
grpc_server_handled_total | counter | Teleport Auth | Total number of RPCs completed on the server, regardless of success or failure.
grpc_server_msg_received_total | counter | Teleport Auth | Total number of RPC stream messages received on the server.
grpc_server_msg_sent_total | counter | Teleport Auth | Total number of gRPC stream messages sent by the server.
heartbeat_connections_received_total | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection, representing the total number of heartbeating Agents.
s3_requests_total | counter | Amazon S3 | Total number of requests to the S3 API.
s3_requests | counter | Amazon S3 | Total number of requests to the S3 API grouped by result.
s3_requests_seconds | histogram | Amazon S3 | Request latency for the S3 API.
teleport_audit_emit_events | counter | Teleport Audit Log | Number of audit events emitted.
teleport_audit_parquetlog_batch_processing_seconds | histogram | Teleport Audit Log | Duration of processing a single batch of events in the Parquet-format audit log.
teleport_audit_parquetlog_s3_flush_seconds | histogram | Teleport Audit Log | Duration of flushing Parquet files to S3 in the Parquet-format audit log.
teleport_audit_parquetlog_delete_events_seconds | histogram | Teleport Audit Log | Duration of deleting events from SQS in the Parquet-format audit log.
teleport_audit_parquetlog_batch_size | histogram | Teleport Audit Log | Overall size of events in a single batch in the Parquet-format audit log.
teleport_audit_parquetlog_batch_count | counter | Teleport Audit Log | Total number of events in a single batch in the Parquet-format audit log.
teleport_audit_parquetlog_last_processed_timestamp | gauge | Teleport Audit Log | Timestamp of the last processing run in the Parquet-format audit log.
teleport_audit_parquetlog_age_oldest_processed_message | gauge | Teleport Audit Log | Age of the oldest processed event in the Parquet-format audit log.
teleport_audit_parquetlog_errors_from_collect_count | counter | Teleport Audit Log | Number of collect failures in the Parquet-format audit log.
teleport_connected_resources | gauge | Teleport Auth | Number and type of resources connected via keepalives.
teleport_postgres_events_backend_write_requests | counter | Postgres (Events) | Number of write requests to postgres events, labeled with the request status (success or failure).
teleport_postgres_events_backend_batch_read_requests | counter | Postgres (Events) | Number of batch read requests to postgres events, labeled with the request status (success or failure).
teleport_postgres_events_backend_batch_delete_requests | counter | Postgres (Events) | Number of batch delete requests to postgres events, labeled with the request status (success or failure).
teleport_postgres_events_backend_write_seconds | histogram | Postgres (Events) | Latency for postgres events write operations, in seconds.
teleport_postgres_events_backend_batch_read_seconds | histogram | Postgres (Events) | Latency for postgres events batch read operations, in seconds.
teleport_postgres_events_backend_batch_delete_seconds | histogram | Postgres (Events) | Latency for postgres events batch delete operations, in seconds.
teleport_registered_servers | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by version.
teleport_registered_servers_by_install_methods | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by install methods.
teleport_roles_total | gauge | Teleport Auth | The number of roles that exist in the cluster.
teleport_migrations | gauge | Teleport Auth | Tracks for each migration if it is active (1) or not (0).
user_login_total | counter | Teleport Auth | Number of user logins.
watcher_event_sizes | histogram | cache | Overall size of events emitted.
watcher_events | histogram | cache | Per resource size of events emitted.

Enhanced Session Recording / BPF

Name | Type | Component | Description
bpf_lost_command_events | counter | BPF | Number of lost command events.
bpf_lost_disk_events | counter | BPF | Number of lost disk events.
bpf_lost_network_events | counter | BPF | Number of lost network events.

Proxy Service

Name | Type | Component | Description
failed_connect_to_node_attempts_total | counter | Teleport Proxy | Number of failed SSH connection attempts to the SSH Service. Use with teleport_connect_to_node_attempts_total to get the failure rate.
failed_login_attempts_total | counter | Teleport Proxy | Number of failed tsh login or tsh ssh attempts.
grpc_client_started_total | counter | Teleport Proxy | Total number of RPCs started on the client.
grpc_client_handled_total | counter | Teleport Proxy | Total number of RPCs completed on the client, regardless of success or failure.
grpc_client_msg_received_total | counter | Teleport Proxy | Total number of RPC stream messages received on the client.
grpc_client_msg_sent_total | counter | Teleport Proxy | Total number of gRPC stream messages sent by the client.
proxy_connection_limit_exceeded_total | counter | Teleport Proxy | Number of connections that exceeded the Proxy Service connection limit.
proxy_peer_client_dial_error_total | counter | Teleport Proxy | Total number of errors encountered dialing peer Proxy Service instances.
proxy_peer_server_connections | gauge | Teleport Proxy | Number of currently opened connections to peer Proxy Service instances.
proxy_peer_client_rpc | gauge | Teleport Proxy | Number of current client RPC requests.
proxy_peer_client_rpc_total | counter | Teleport Proxy | Total number of client RPC requests.
proxy_peer_client_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the client.
proxy_peer_client_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the client.
proxy_peer_client_message_received_size | histogram | Teleport Proxy | Size of messages received by the client.
proxy_peer_server_connections | gauge | Teleport Proxy | Number of currently opened connections to peer Proxy Service clients.
proxy_peer_server_rpc | gauge | Teleport Proxy | Number of current server RPC requests.
proxy_peer_server_rpc_total | counter | Teleport Proxy | Total number of server RPC requests.
proxy_peer_server_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the server.
proxy_peer_server_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the server.
proxy_peer_server_message_received_size | histogram | Teleport Proxy | Size of messages received by the server.
proxy_ssh_sessions_total | gauge | Teleport Proxy | Number of active sessions through this Proxy Service instance.
proxy_missing_ssh_tunnels | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if Teleport instances have discovered all Proxy Service instances.
remote_clusters | gauge | Teleport Proxy | Number of inbound connections from leaf clusters.
teleport_connect_to_node_attempts_total | counter | Teleport Proxy | Number of SSH connection attempts to an SSH Service. Use with failed_connect_to_node_attempts_total to get the failure rate.
teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances.
teleport_proxy_db_connection_setup_time_seconds | histogram | Teleport Proxy | Time to establish connection to DB service from Proxy service.
teleport_proxy_db_connection_dial_attempts_total | counter | Teleport Proxy | Number of dial attempts from Proxy to DB service made.
teleport_proxy_db_connection_dial_failures_total | counter | Teleport Proxy | Number of failed dial attempts from Proxy to DB service made.
teleport_proxy_db_attempted_servers_total | histogram | Teleport Proxy | Number of servers processed during connection attempt to the DB service from Proxy service.
teleport_proxy_db_connection_tls_config_time_seconds | histogram | Teleport Proxy | Time to fetch TLS configuration for the connection to DB service from Proxy service.
teleport_proxy_db_active_connections_total | gauge | Teleport Proxy | Number of currently active connections to DB service from Proxy service.
trusted_clusters | gauge | Teleport Proxy | Number of outbound connections to leaf clusters.
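The descriptions of failed_connect_to_node_attempts_total and teleport_connect_to_node_attempts_total suggest combining them into a failure rate. Because both are counters, the rate over an interval comes from the difference between two scrapes. The Python sketch below shows one way to compute it; the function name failure_rate and the dictionary-of-samples input are assumptions made for illustration.

```python
from typing import Mapping, Optional

FAILED = "failed_connect_to_node_attempts_total"
TOTAL = "teleport_connect_to_node_attempts_total"


def failure_rate(prev: Mapping[str, float], curr: Mapping[str, float]) -> Optional[float]:
    """Fraction of SSH connection attempts that failed between two scrapes."""
    attempts = curr[TOTAL] - prev[TOTAL]
    failures = curr[FAILED] - prev[FAILED]
    if attempts <= 0 or failures < 0:
        # No new attempts, or a counter went backwards (for example after a restart).
        return None
    return failures / attempts
```

Returning None instead of 0 keeps "no traffic" distinguishable from "no failures" when the result feeds an alert.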

Database Service

Name | Type | Component | Description
teleport_db_messages_from_client_total | counter | Teleport Database Service | Number of messages (packets) received from the DB client.
teleport_db_messages_from_server_total | counter | Teleport Database Service | Number of messages (packets) received from the DB server.
teleport_db_method_call_count_total | counter | Teleport Database Service | Number of times a DB method was called.
teleport_db_method_call_latency_seconds | histogram | Teleport Database Service | Call latency for DB method calls.
teleport_db_initialized_connections_total | counter | Teleport Database Service | Number of initialized DB connections.
teleport_db_active_connections_total | gauge | Teleport Database Service | Number of active DB connections.
teleport_db_connection_durations_seconds | histogram | Teleport Database Service | Duration of DB connection.
teleport_db_connection_setup_time_seconds | histogram | Teleport Database Service | Initial time to set up a DB connection, before any requests are handled.
teleport_db_errors_total | counter | Teleport Database Service | Number of synthetic DB errors sent to the client.

Kubernetes access

The following tables identify all metrics available in the Teleport Proxy Service if at least one Kubernetes cluster is enrolled in your Teleport cluster.

Client

The following table identifies all metrics available when the service connects to upstream servers. In the case of proxy, the upstream server can be a kubernetes_service or Kubernetes Cluster if it's running in legacy mode.

Name | Type | Component | Description
teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests waiting for the upstream response.
teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers.
teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of TLS handshakes.
teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer.
teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to receive the first response byte from the upstream server.
teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the upstream request time.

Server

The following table identifies all metrics available for incoming connections.

Name | Type | Component | Description
teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests currently handled by the server.
teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests handled by the server.
teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the total request time.
teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Proxy | Distribution of the response size.
teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl exec sessions.
teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl exec sessions.
teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl portforward sessions.
teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl portforward sessions.
teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active joining sessions.
teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Proxy | Total number of joining sessions.

Teleport SSH Service

Name | Type | Component | Description
user_max_concurrent_sessions_hit_total | counter | Teleport SSH | Number of times a user exceeded their concurrent session limit.

Teleport Kubernetes Service

The following table identifies all metrics available when the service connects to upstream servers. In the case of kubernetes_service, the upstream server is always a Kubernetes cluster.

Name | Type | Component | Description
teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests waiting for the upstream response.
teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Service | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers.
teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of TLS handshakes.
teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer.
teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to receive the first response byte from the upstream server.
teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the upstream request time.

The following table identifies all metrics available for incoming connections.

Name | Type | Component | Description
teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests currently handled by the server.
teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Service | Total number of requests handled by the server.
teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the total request time.
teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Service | Distribution of the response size.
teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl exec sessions.
teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl exec sessions.
teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl portforward sessions.
teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl portforward sessions.
teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active joining sessions.
teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Service | Total number of joining sessions.

All Teleport instances

Name | Type | Component | Description
process_state | gauge | Teleport | State of the teleport process: 0 - ok, 1 - recovering, 2 - degraded, 3 - starting.
certificate_mismatch_total | counter | Teleport | Number of SSH server login failures due to a certificate mismatch.
rx | counter | Teleport | Number of bytes received during an SSH connection.
server_interactive_sessions_total | gauge | Teleport | Number of active sessions.
teleport_build_info | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1.
teleport_breaker_connector_executions_total | counter | Teleport | Number of requests to the Teleport Auth Service API that go through a circuit breaker done by Teleport services, labeled by role of the connector (almost always Instance), state of the associated circuit breaker and success as interpreted by the breaker.
teleport_cache_events | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service.
teleport_cache_stale_events | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend.
tx | counter | Teleport | Number of bytes transmitted during an SSH connection.

Go runtime metrics

These metrics are surfaced by the Go runtime and are not specific to Teleport.

Name | Type | Component | Description
go_gc_duration_seconds | summary | Internal Go | A summary of GC invocation durations.
go_goroutines | gauge | Internal Go | Number of goroutines that currently exist.
go_info | gauge | Internal Go | Information about the Go environment.
go_memstats_alloc_bytes_total | counter | Internal Go | Total number of bytes allocated, even if freed.
go_memstats_alloc_bytes | gauge | Internal Go | Number of bytes allocated and still in use.
go_memstats_buck_hash_sys_bytes | gauge | Internal Go | Number of bytes used by the profiling bucket hash table.
go_memstats_frees_total | counter | Internal Go | Total number of frees.
go_memstats_gc_cpu_fraction | gauge | Internal Go | The fraction of this program's available CPU time used by the GC since the program started.
go_memstats_gc_sys_bytes | gauge | Internal Go | Number of bytes used for garbage collection system metadata.
go_memstats_heap_alloc_bytes | gauge | Internal Go | Number of heap bytes allocated and still in use.
go_memstats_heap_idle_bytes | gauge | Internal Go | Number of heap bytes waiting to be used.
go_memstats_heap_inuse_bytes | gauge | Internal Go | Number of heap bytes that are in use.
go_memstats_heap_objects | gauge | Internal Go | Number of allocated objects.
go_memstats_heap_released_bytes | gauge | Internal Go | Number of heap bytes released to the OS.
go_memstats_heap_sys_bytes | gauge | Internal Go | Number of heap bytes obtained from the system.
go_memstats_last_gc_time_seconds | gauge | Internal Go | Number of seconds since the Unix epoch of the last garbage collection.
go_memstats_lookups_total | counter | Internal Go | Total number of pointer lookups.
go_memstats_mallocs_total | counter | Internal Go | Total number of mallocs.
go_memstats_mcache_inuse_bytes | gauge | Internal Go | Number of bytes in use by mcache structures.
go_memstats_mcache_sys_bytes | gauge | Internal Go | Number of bytes used for mcache structures obtained from system.
go_memstats_mspan_inuse_bytes | gauge | Internal Go | Number of bytes in use by mspan structures.
go_memstats_mspan_sys_bytes | gauge | Internal Go | Number of bytes used for mspan structures obtained from system.
go_memstats_next_gc_bytes | gauge | Internal Go | Number of heap bytes when the next garbage collection will take place.
go_memstats_other_sys_bytes | gauge | Internal Go | Number of bytes used for other system allocations.
go_memstats_stack_inuse_bytes | gauge | Internal Go | Number of bytes in use by the stack allocator.
go_memstats_stack_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system for stack allocator.
go_memstats_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system.
go_threads | gauge | Internal Go | Number of OS threads created.
process_cpu_seconds_total | counter | Internal Go | Total user and system CPU time spent in seconds.
process_max_fds | gauge | Internal Go | Maximum number of open file descriptors.
process_open_fds | gauge | Internal Go | Number of open file descriptors.
process_resident_memory_bytes | gauge | Internal Go | Resident memory size in bytes.
process_start_time_seconds | gauge | Internal Go | Start time of the process since the Unix epoch in seconds.
process_virtual_memory_bytes | gauge | Internal Go | Virtual memory size in bytes.
process_virtual_memory_max_bytes | gauge | Internal Go | Maximum amount of virtual memory available in bytes.

Prometheus

Name | Type | Component | Description
promhttp_metric_handler_requests_in_flight | gauge | prometheus | Current number of scrapes being served.
promhttp_metric_handler_requests_total | counter | prometheus | Total number of scrapes by HTTP status code.

Distributed tracing

How to enable distributed tracing for a Teleport instance.

Teleport leverages OpenTelemetry to generate traces and export them to any OpenTelemetry Protocol (OTLP) capable exporter. In the event that your telemetry backend doesn't support receiving OTLP traces, you may be able to leverage the OpenTelemetry Collector to proxy traces from OTLP to a format that your telemetry backend accepts.

Configure Teleport

In order to enable tracing for a teleport instance, add the following section to that instance's configuration file (/etc/teleport.yaml). For a detailed description of these configuration fields, see the configuration reference page.

tracing_service:
   enabled: yes
   exporter_url: grpc://collector.example.com:4317
   sampling_rate_per_million: 1000000

Sampling rate

Choose the sampling rate carefully. Sampling at a rate of 100% (a sampling_rate_per_million of 1000000) could have a negative impact on the performance of your cluster. Teleport honors the sampling rate included in any incoming requests. This means that even when the tracing_service is enabled and the sampling rate is 0, if Teleport receives a request that has a sampled span, it will sample and export all spans that are generated in response to that request.
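The interaction between the configured rate and incoming requests can be sketched in a few lines of Python. This models only the behavior described in this section and is not Teleport's sampler; the names sampling_fraction and should_sample are hypothetical, and how Teleport treats a request whose span is explicitly unsampled is not covered here, so the sketch simply falls back to the local rate in that case.

```python
import random
from typing import Optional

PER_MILLION = 1_000_000


def sampling_fraction(rate_per_million: int) -> float:
    """Convert sampling_rate_per_million into a fraction between 0 and 1."""
    if not 0 <= rate_per_million <= PER_MILLION:
        raise ValueError("rate must be between 0 and 1000000")
    return rate_per_million / PER_MILLION


def should_sample(parent_sampled: bool, rate_per_million: int,
                  draw: Optional[float] = None) -> bool:
    """Decide whether spans for a request are recorded and exported."""
    if parent_sampled:
        return True  # an incoming sampled span is honored, even at rate 0
    if draw is None:
        draw = random.random()  # uniform in [0, 1)
    return draw < sampling_fraction(rate_per_million)
```

With a rate of 1000, roughly one request in a thousand starts a new trace locally, while requests that arrive already sampled are always traced.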

Exporter URL

The exporter_url setting indicates where Teleport should send spans to. Supported schemes are grpc://, http://, https://, and file:// (if no scheme is provided, then grpc:// is used).

When using file://, the URL must be a path to a directory that Teleport has write permissions for. Spans will be saved to files within the provided directory, each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB. To override the default limit, add ?limit=<desired_file_size_in_bytes> to the exporter_url (e.g. file:///var/lib/teleport/traces?limit=100).
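To make the exporter_url rules concrete, the following Python sketch applies them to a URL string: it defaults to grpc:// when no scheme is given, rejects unsupported schemes, and reads the optional limit parameter for file:// URLs. It is an illustration of the rules above rather than Teleport's own parser; Exporter and parse_exporter_url are hypothetical names.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit

SUPPORTED_SCHEMES = {"grpc", "http", "https", "file"}


class Exporter(NamedTuple):
    scheme: str
    target: str                 # host:port, or a directory path for file://
    limit_bytes: Optional[int]  # None means the default rotation size applies


def parse_exporter_url(url: str) -> Exporter:
    """Split an exporter_url into scheme, target, and optional file size limit."""
    if "://" not in url:
        url = "grpc://" + url   # no scheme provided, so grpc:// is used
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported exporter scheme: {parts.scheme!r}")
    if parts.scheme == "file":
        limit = parse_qs(parts.query).get("limit")
        return Exporter("file", parts.path, int(limit[0]) if limit else None)
    return Exporter(parts.scheme, parts.netloc, None)
```

For instance, parse_exporter_url("collector.example.com:4317") yields the grpc scheme with collector.example.com:4317 as the target.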

By default, the connection to the exporter is insecure. To support TLS, add the following to the tracing_service configuration:

  # Optional paths to CA certificates used to validate the exporter.
  ca_certs:
    - /var/lib/teleport/exporter_ca.pem
  # Optional paths to TLS certificates used to enable mTLS for the exporter.
  https_keypairs:
    - key_file: /var/lib/teleport/exporter_key.pem
      cert_file: /var/lib/teleport/exporter_cert.pem

After updating teleport.yaml, start your teleport instance to apply the new configuration.

tsh

To capture traces from tsh add the --trace flag to your command. All traces generated by tsh --trace will be proxied to the exporter_url defined for the Auth Service of the cluster the command is being run on.

tsh --trace ssh root@myserver
tsh --trace ls

Exporting traces from tsh to a different exporter than the one defined in the Auth Service config is also possible via the --trace-exporter flag. A URL must be provided that adheres to the same format as the exporter_url of the tracing_service.

tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls

Collecting profiles

How to collect runtime profiling data from a Teleport instance.

Teleport leverages Go's diagnostic capabilities to collect and export profiling data. Profiles can help identify the cause of spikes in CPU, the source of memory leaks, or the reason for a deadlock.

Using the Debug Service

The Teleport Debug Service enables administrators to collect diagnostic profiles without enabling pprof endpoints at startup. The service is enabled by default, allows local-only access, and must be used from inside the same instance.

The teleport debug profile command collects a list of pprof profiles and writes a compressed tarball (.tar.gz) to STDOUT. You can pipe the output to tar to decompress it or redirect it to a file.

By default, it collects goroutine, heap and profile profiles.

Each collected profile has a corresponding file inside the tarball. For example, collecting goroutine,trace,heap will result in goroutine.pprof, trace.pprof, and heap.pprof files.
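If you post-process the tarball in a script, Python's standard tarfile module can confirm which profiles it contains. The sketch below is illustrative only: list_profiles and expected_files are hypothetical helper names, and the expected file names follow the naming pattern described above.

```python
import io
import tarfile


def list_profiles(tarball: bytes) -> list:
    """Return the sorted file names stored in a gzip-compressed profile tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def expected_files(profiles: str) -> list:
    """Map a comma-separated profile list to the file names described above."""
    return sorted(f"{name.strip()}.pprof" for name in profiles.split(",") if name.strip())
```

A script could read a saved pprof.tar.gz in binary mode, pass the bytes to list_profiles, and compare the result with expected_files("goroutine,trace,heap") before handing the files to go tool.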

Collect the default profiles and save them to a file:

teleport debug profile > pprof.tar.gz
tar xvf pprof.tar.gz

Collect the default profiles and decompress them:

teleport debug profile | tar xzv -C ./

Collect "trace" and "mutex" profiles and save them to a file:

teleport debug profile trace,mutex > pprof.tar.gz

Collect profiles, setting the profiling time in seconds:

teleport debug profile -s 20 trace > pprof.tar.gz

Specify your Teleport configuration path

If your Teleport configuration is not at the default path (/etc/teleport.yaml), you must pass its location to the CLI command using the -c/--config flag.

If you're running Teleport on a Kubernetes cluster you can directly collect profiles to a local directory without an interactive session:

kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz

After extracting the contents, you can use go tool commands to explore and visualize them:

Open the interactive terminal explorer:

go tool pprof heap.pprof

Open the web visualizer:

go tool pprof -http : heap.pprof

Visualize trace profiles:

go tool trace trace.pprof

Using diagnostics endpoints

The profiling endpoint is only enabled if the --debug flag is supplied.

Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:

Start a teleport instance with the --diag-addr flag set to the local address where the diagnostic endpoint will listen:

sudo teleport start --debug --diag-addr=127.0.0.1:3000

Edit a teleport instance's configuration file (/etc/teleport.yaml by default) to include the following:

teleport:
    diag_addr: 127.0.0.1:3000

To enable debug logs:

log:
    severity: DEBUG

Verify that Teleport is now serving the diagnostics endpoint:

curl http://127.0.0.1:3000/healthz

Collecting profiles

Go's standard profiling endpoints are served at http://127.0.0.1:3000/debug/pprof/. Retrieving a profile requires sending a request to the endpoint corresponding to the desired profile type. When debugging an issue it is helpful to collect a series of profiles over a period of time.

CPU

CPU profiles show execution statistics gathered over a user-specified period:

Download the profile into a file:

curl -o cpu.profile "http://127.0.0.1:3000/debug/pprof/profile?seconds=30"

Visualize the profile

go tool pprof -http : cpu.profile

Goroutine

Goroutine profiles show the stack traces for all running goroutines in the system:

Download the profile into a file:

curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine

Visualize the profile

go tool pprof -http : goroutine.profile

Heap

Heap profiles show allocated objects in the system:

Download the profile into a file:

curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap

Visualize the profile

go tool pprof -http : heap.profile

Trace

Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events that are collected by the Go runtime over a user-specified period of time:

Download the profile into a file:

curl -o trace.out "http://127.0.0.1:3000/debug/pprof/trace?seconds=5"

Visualize the profile

go tool trace trace.out
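Since a series of profiles collected over time is often more useful than a single snapshot, a small script can automate the downloads shown above. The Python sketch below builds the same URLs as the curl commands and fetches them in a loop. The helper names, the numbered file naming scheme, and the 60-second pause are assumptions made for illustration.

```python
import time
import urllib.request
from typing import Optional
from urllib.parse import urlencode

PPROF_BASE = "http://127.0.0.1:3000/debug/pprof"


def profile_url(kind: str, seconds: Optional[int] = None, base: str = PPROF_BASE) -> str:
    """Build the endpoint URL for a profile type such as profile, heap, or trace."""
    url = f"{base}/{kind}"
    if seconds is not None:
        url += "?" + urlencode({"seconds": seconds})
    return url


def series_plan(kind: str, count: int, seconds: Optional[int] = None) -> list:
    """Return (url, filename) pairs for a numbered series of profiles."""
    return [(profile_url(kind, seconds), f"{kind}-{i:02d}.profile")
            for i in range(1, count + 1)]


def collect(plan, pause_seconds: float = 60.0) -> None:
    """Download each planned profile in turn, pausing between requests."""
    for url, filename in plan:
        urllib.request.urlretrieve(url, filename)
        time.sleep(pause_seconds)
```

Calling collect(series_plan("heap", 5)) would write heap-01.profile through heap-05.profile, which can then be compared with go tool pprof.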

Further Reading