Teleport

Monitoring your Cluster

Teleport provides health check endpoints that let you verify that an instance is healthy and ready to serve traffic. Metrics, tracing, and profiling provide in-depth data for tracking cluster performance and responsiveness.

Enable health monitoring

How to monitor the health of a Teleport instance.

Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them in either of two ways.

From the command line, start a Teleport instance with the --diag-addr flag set to the local address where the diagnostic endpoint will listen:

sudo teleport start --diag-addr=127.0.0.1:3000

Alternatively, edit the instance's configuration file (/etc/teleport.yaml by default) to include the following:

teleport:
    diag_addr: 127.0.0.1:3000

To also enable debug logs, set the log severity in the same teleport section:

teleport:
    log:
        severity: DEBUG

Verify that Teleport is now serving the diagnostics endpoint:

curl http://127.0.0.1:3000/healthz

You can now collect monitoring information from several endpoints. Tools such as Kubernetes probes can use these endpoints to monitor the health of a Teleport process.

`/healthz`

The http://127.0.0.1:3000/healthz endpoint responds with a body of {"status":"ok"} and an HTTP 200 OK status code if the process is running.

This is a check to determine if the Teleport process is still running.
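
A liveness check against this endpoint could be sketched in Python as follows. The helper names is_alive and probe_healthz are illustrative and not part of Teleport, and the sketch assumes the diagnostic endpoint listens on 127.0.0.1:3000 as configured above.

```python
import json
import urllib.error
import urllib.request


def is_alive(status_code: int, body: str) -> bool:
    """Return True only for the documented /healthz success response."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe_healthz(base_url: str = "http://127.0.0.1:3000", timeout: float = 2.0) -> bool:
    """Fetch /healthz; treat any HTTP or connection error as 'not alive'."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_alive(resp.status, body)
    except (urllib.error.URLError, OSError):
        return False
```

Calling probe_healthz() returns True while the process answers with {"status":"ok"} and False when the request fails or the response has any other shape.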

`/readyz`

The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its response includes information about the state of the process.

The response body is a JSON object of the form:

{ "status": "a status message here"}

`/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter a degraded state. Teleport will begin recovering from this state when a heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state. A second consecutive successful heartbeat will cause Teleport to transition to the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed heartbeats are retried approximately every 5 seconds. This means that depending on the timing of heartbeats, it can take 60-70 seconds after connectivity is restored for /readyz to start reporting healthy again.
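
The transitions described above can be modeled as a small state machine. This is a simplified sketch for reasoning about /readyz behavior, not Teleport's implementation: the ReadyState, next_state, and replay names are hypothetical, and the initial startup phase is left out.

```python
from enum import Enum
from typing import List


class ReadyState(Enum):
    OK = "ok"
    RECOVERING = "recovering"
    DEGRADED = "degraded"


def next_state(current: ReadyState, heartbeat_ok: bool) -> ReadyState:
    """Apply one heartbeat result to the current state."""
    if not heartbeat_ok:
        # Any failed heartbeat puts the component into the degraded state.
        return ReadyState.DEGRADED
    if current is ReadyState.DEGRADED:
        # The first successful heartbeat only starts the recovery.
        return ReadyState.RECOVERING
    # A success while recovering (or already OK) yields the OK state.
    return ReadyState.OK


def replay(start: ReadyState, heartbeats: List[bool]) -> List[ReadyState]:
    """Return the state after each heartbeat in the sequence."""
    states = []
    state = start
    for ok in heartbeats:
        state = next_state(state, ok)
        states.append(state)
    return states
```

For example, replay(ReadyState.OK, [False, True, True]) walks through degraded, recovering, and ok, which is why two consecutive successful heartbeats are needed before /readyz reports healthy again.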

Status codes

The status code of the response can be one of:

HTTP 200 OK: Teleport is operating normally.
HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.

The same state information is also available via the process_state metric under the /metrics endpoint.
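
A monitoring script might translate these signals into readable states along the following lines. The function names are illustrative; the process_state values are the ones listed for that metric in the "All Teleport instances" table of this guide.

```python
# /readyz HTTP status codes and what they indicate, as listed above.
READYZ_STATES = {
    200: "ok",
    503: "degraded",
    400: "starting or recovering",
}

# Values of the process_state gauge exposed under /metrics.
PROCESS_STATES = {0: "ok", 1: "recovering", 2: "degraded", 3: "starting"}


def describe_readyz(status_code: int) -> str:
    """Map a /readyz status code to a state description."""
    return READYZ_STATES.get(status_code, "unknown")


def is_ready(status_code: int) -> bool:
    """Only HTTP 200 means Teleport is operating normally."""
    return status_code == 200


def describe_process_state(value: float) -> str:
    """Map a process_state gauge value to its state name."""
    return PROCESS_STATES.get(int(value), "unknown")
```

Note that the status code alone cannot distinguish startup from recovery, since both return HTTP 400; the process_state metric reports them as separate values.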

Metrics

Teleport exposes metrics for all of its components, helping you get insight into the state of your cluster. This guide explains the metrics that you can collect from your Teleport cluster.

Enabling metrics

Metrics are served from the same diagnostic HTTP endpoints as the health checks. These endpoints are disabled by default; enable them as described in "Enable health monitoring", either by starting Teleport with --diag-addr=127.0.0.1:3000 or by setting diag_addr in the configuration file.

Once the diagnostic endpoints are enabled, the http://127.0.0.1:3000/metrics endpoint serves the metrics that Teleport tracks. It is compatible with Prometheus collectors.
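
Because the endpoint is Prometheus-compatible, its output is plain text that can also be inspected without a full collector. The following is a minimal parsing sketch, not a complete implementation of the exposition format; parse_metrics is a hypothetical helper, and the SAMPLE payload is made up for illustration rather than captured from a real instance.

```python
from typing import Dict, List, Tuple

# Illustrative payload; the values are invented for this example.
SAMPLE = """\
# HELP process_state State of the teleport process.
# TYPE process_state gauge
process_state 0
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 7
"""


def parse_metrics(text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Parse metrics text into {metric name: [(label string, value), ...]}."""
    samples: Dict[str, List[Tuple[str, float]]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        fields = tail.split()
        if not fields:
            continue  # malformed line without a value
        samples.setdefault(name.strip(), []).append((labels, float(fields[0])))
    return samples
```

With the sample above, parse_metrics(SAMPLE)["process_state"] yields a single unlabeled sample with the value 0.0. For production use, a Prometheus server or an established client library is the more robust choice.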

Teleport Enterprise (cloud-hosted) does not expose monitoring endpoints for the Auth Service and Proxy Service.

The following metrics are available:

Auth Service and backends

Name	Type	Component	Description
`audit_failed_disk_monitoring`	counter	Teleport Audit Log	Number of times disk monitoring failed.
`audit_failed_emit_events`	counter	Teleport Audit Log	Number of times emitting audit events failed.
`audit_percentage_disk_space_used`	gauge	Teleport Audit Log	Percentage of disk space used.
`audit_server_open_files`	gauge	Teleport Audit Log	Number of open audit files.
`auth_generate_requests_throttled_total`	counter	Teleport Auth	Number of throttled requests to generate new server keys.
`auth_generate_requests_total`	counter	Teleport Auth	Number of requests to generate new server keys.
`auth_generate_requests`	gauge	Teleport Auth	Number of current generate requests.
`auth_generate_seconds`	histogram	Teleport Auth	Latency for generate requests.
`backend_batch_read_requests_total`	counter	cache	Number of batch read requests to the backend.
`backend_batch_read_seconds`	histogram	cache	Latency for batch read operations.
`backend_batch_write_requests_total`	counter	cache	Number of batch write requests to the backend.
`backend_batch_write_seconds`	histogram	cache	Latency for backend batch write operations.
`backend_read_requests_total`	counter	cache	Number of read requests to the backend.
`backend_read_seconds`	histogram	cache	Latency for read operations.
`backend_requests`	counter	cache	Number of requests to the backend (reads, writes, and keepalives).
`backend_write_requests_total`	counter	cache	Number of write requests to the backend.
`backend_write_seconds`	histogram	cache	Latency for backend write operations.
`cluster_name_not_found_total`	counter	Teleport Auth	Number of times a cluster was not found.
`dynamo_requests_total`	counter	DynamoDB	Total number of requests to the DynamoDB API.
`dynamo_requests`	counter	DynamoDB	Total number of requests to the DynamoDB API grouped by result.
`dynamo_requests_seconds`	histogram	DynamoDB	Latency of DynamoDB API requests.
`etcd_backend_batch_read_requests`	counter	etcd	Number of batch read requests to the etcd database.
`etcd_backend_batch_read_seconds`	histogram	etcd	Latency for etcd batch read operations.
`etcd_backend_read_requests`	counter	etcd	Number of read requests to the etcd database.
`etcd_backend_read_seconds`	histogram	etcd	Latency for etcd read operations.
`etcd_backend_tx_requests`	counter	etcd	Number of transaction requests to the database.
`etcd_backend_tx_seconds`	histogram	etcd	Latency for etcd transaction operations.
`etcd_backend_write_requests`	counter	etcd	Number of write requests to the database.
`etcd_backend_write_seconds`	histogram	etcd	Latency for etcd write operations.
`teleport_etcd_events`	counter	etcd	Total number of etcd events processed.
`teleport_etcd_event_backpressure`	counter	etcd	Total number of times event processing encountered backpressure.
`firestore_events_backend_batch_read_requests`	counter	GCP Cloud Firestore	Number of batch read requests to Cloud Firestore events.
`firestore_events_backend_batch_read_seconds`	histogram	GCP Cloud Firestore	Latency for Cloud Firestore events batch read operations.
`firestore_events_backend_batch_write_requests`	counter	GCP Cloud Firestore	Number of batch write requests to Cloud Firestore events.
`firestore_events_backend_batch_write_seconds`	histogram	GCP Cloud Firestore	Latency for Cloud Firestore events batch write operations.
`firestore_events_backend_write_requests`	counter	GCP Cloud Firestore	Number of write requests to Cloud Firestore events.
`firestore_events_backend_write_seconds`	histogram	GCP Cloud Firestore	Latency for Cloud Firestore events write operations.
`gcs_event_storage_downloads_seconds`	histogram	GCP GCS	Latency for GCS download operations.
`gcs_event_storage_downloads`	counter	GCP GCS	Number of downloads from the GCS backend.
`gcs_event_storage_uploads_seconds`	histogram	GCP GCS	Latency for GCS upload operations.
`gcs_event_storage_uploads`	counter	GCP GCS	Number of uploads to the GCS backend.
`grpc_server_started_total`	counter	Teleport Auth	Total number of RPCs started on the server.
`grpc_server_handled_total`	counter	Teleport Auth	Total number of RPCs completed on the server, regardless of success or failure.
`grpc_server_msg_received_total`	counter	Teleport Auth	Total number of RPC stream messages received on the server.
`grpc_server_msg_sent_total`	counter	Teleport Auth	Total number of gRPC stream messages sent by the server.
`heartbeat_connections_received_total`	counter	Teleport Auth	Number of times the Auth Service received a heartbeat connection, representing the total number of heartbeating Agents.
`s3_requests_total`	counter	Amazon S3	Total number of requests to the S3 API.
`s3_requests`	counter	Amazon S3	Total number of requests to the S3 API grouped by result.
`s3_requests_seconds`	histogram	Amazon S3	Request latency for the S3 API.
`teleport_audit_emit_events`	counter	Teleport Audit Log	Number of audit events emitted.
`teleport_audit_parquetlog_batch_processing_seconds`	histogram	Teleport Audit Log	Duration of processing a single batch of events in the Parquet-format audit log.
`teleport_audit_parquetlog_s3_flush_seconds`	histogram	Teleport Audit Log	Duration of flushing Parquet files to S3 in the Parquet-format audit log.
`teleport_audit_parquetlog_delete_events_seconds`	histogram	Teleport Audit Log	Duration of deleting events from SQS in the Parquet-format audit log.
`teleport_audit_parquetlog_batch_size`	histogram	Teleport Audit Log	Overall size of events in a single batch in the Parquet-format audit log.
`teleport_audit_parquetlog_batch_count`	counter	Teleport Audit Log	Total number of events in a single batch in the Parquet-format audit log.
`teleport_audit_parquetlog_last_processed_timestamp`	gauge	Teleport Audit Log	Time of the last processing in the Parquet-format audit log.
`teleport_audit_parquetlog_age_oldest_processed_message`	gauge	Teleport Audit Log	Age of the oldest processed event in the Parquet-format audit log.
`teleport_audit_parquetlog_errors_from_collect_count`	counter	Teleport Audit Log	Number of collect failures in the Parquet-format audit log.
`teleport_connected_resources`	gauge	Teleport Auth	Number and type of resources connected via keepalives.
`teleport_postgres_events_backend_write_requests`	counter	Postgres (Events)	Number of write requests to postgres events, labeled with the request `status` (success or failure).
`teleport_postgres_events_backend_batch_read_requests`	counter	Postgres (Events)	Number of batch read requests to postgres events, labeled with the request `status` (success or failure).
`teleport_postgres_events_backend_batch_delete_requests`	counter	Postgres (Events)	Number of batch delete requests to postgres events, labeled with the request `status` (success or failure).
`teleport_postgres_events_backend_write_seconds`	histogram	Postgres (Events)	Latency for postgres events write operations, in seconds.
`teleport_postgres_events_backend_batch_read_seconds`	histogram	Postgres (Events)	Latency for postgres events batch read operations, in seconds.
`teleport_postgres_events_backend_batch_delete_seconds`	histogram	Postgres (Events)	Latency for postgres events batch delete operations, in seconds.
`teleport_registered_servers`	gauge	Teleport Auth	The number of Teleport services that are connected to an Auth Service instance grouped by version.
`teleport_registered_servers_by_install_methods`	gauge	Teleport Auth	The number of Teleport services that are connected to an Auth Service instance grouped by install methods.
`teleport_roles_total`	gauge	Teleport Auth	The number of roles that exist in the cluster.
`teleport_migrations`	gauge	Teleport Auth	Tracks for each migration if it is active (1) or not (0).
`user_login_total`	counter	Teleport Auth	Number of user logins.
`watcher_event_sizes`	histogram	cache	Overall size of events emitted.
`watcher_events`	histogram	cache	Per resource size of events emitted.

Enhanced Session Recording / BPF

Name	Type	Component	Description
`bpf_lost_command_events`	counter	BPF	Number of lost command events.
`bpf_lost_disk_events`	counter	BPF	Number of lost disk events.
`bpf_lost_network_events`	counter	BPF	Number of lost network events.

Proxy Service

Name	Type	Component	Description
`failed_connect_to_node_attempts_total`	counter	Teleport Proxy	Number of failed SSH connection attempts to the SSH Service. Use with `teleport_connect_to_node_attempts_total` to get the failure rate.
`failed_login_attempts_total`	counter	Teleport Proxy	Number of failed `tsh login` or `tsh ssh` logins.
`grpc_client_started_total`	counter	Teleport Proxy	Total number of RPCs started on the client.
`grpc_client_handled_total`	counter	Teleport Proxy	Total number of RPCs completed on the client, regardless of success or failure.
`grpc_client_msg_received_total`	counter	Teleport Proxy	Total number of RPC stream messages received on the client.
`grpc_client_msg_sent_total`	counter	Teleport Proxy	Total number of gRPC stream messages sent by the client.
`proxy_connection_limit_exceeded_total`	counter	Teleport Proxy	Number of connections that exceeded the Proxy Service connection limit.
`proxy_peer_client_dial_error_total`	counter	Teleport Proxy	Total number of errors encountered dialing peer Proxy Service instances.
`proxy_peer_server_connections`	gauge	Teleport Proxy	Number of currently open connections to peer Proxy Service instances.
`proxy_peer_client_rpc`	gauge	Teleport Proxy	Number of current client RPC requests.
`proxy_peer_client_rpc_total`	counter	Teleport Proxy	Total number of client RPC requests.
`proxy_peer_client_rpc_duration_seconds`	histogram	Teleport Proxy	Duration in seconds of RPCs sent by the client.
`proxy_peer_client_message_sent_size`	histogram	Teleport Proxy	Size of messages sent by the client.
`proxy_peer_client_message_received_size`	histogram	Teleport Proxy	Size of messages received by the client.
`proxy_peer_server_connections`	gauge	Teleport Proxy	Number of currently open connections to peer Proxy Service clients.
`proxy_peer_server_rpc`	gauge	Teleport Proxy	Number of current server RPC requests.
`proxy_peer_server_rpc_total`	counter	Teleport Proxy	Total number of server RPC requests.
`proxy_peer_server_rpc_duration_seconds`	histogram	Teleport Proxy	Duration in seconds of RPCs sent by the server.
`proxy_peer_server_message_sent_size`	histogram	Teleport Proxy	Size of messages sent by the server.
`proxy_peer_server_message_received_size`	histogram	Teleport Proxy	Size of messages received by the server.
`proxy_ssh_sessions_total`	gauge	Teleport Proxy	Number of active sessions through this Proxy Service instance.
`proxy_missing_ssh_tunnels`	gauge	Teleport Proxy	Number of missing SSH tunnels. Used to debug if Teleport instances have discovered all Proxy Service instances.
`remote_clusters`	gauge	Teleport Proxy	Number of inbound connections from leaf clusters.
`teleport_connect_to_node_attempts_total`	counter	Teleport Proxy	Number of SSH connection attempts to an SSH Service. Use with `failed_connect_to_node_attempts_total` to get the failure rate.
`teleport_reverse_tunnels_connected`	gauge	Teleport Proxy	Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances.
`teleport_proxy_db_connection_setup_time_seconds`	histogram	Teleport Proxy	Time to establish connection to DB service from Proxy service.
`teleport_proxy_db_connection_dial_attempts_total`	counter	Teleport Proxy	Number of dial attempts from Proxy to DB service made.
`teleport_proxy_db_connection_dial_failures_total`	counter	Teleport Proxy	Number of failed dial attempts from Proxy to DB service made.
`teleport_proxy_db_attempted_servers_total`	histogram	Teleport Proxy	Number of servers processed during connection attempt to the DB service from Proxy service.
`teleport_proxy_db_connection_tls_config_time_seconds`	histogram	Teleport Proxy	Time to fetch TLS configuration for the connection to DB service from Proxy service.
`teleport_proxy_db_active_connections_total`	gauge	Teleport Proxy	Number of currently active connections to DB service from Proxy service.
`trusted_clusters`	gauge	Teleport Proxy	Number of outbound connections to leaf clusters.
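
As the descriptions of `failed_connect_to_node_attempts_total` and `teleport_connect_to_node_attempts_total` suggest, the two counters can be combined into a failure rate. Counters are cumulative, so the sketch below compares two scrapes; the function names and the restart handling are assumptions made for illustration.

```python
from typing import Tuple

# One scrape: (failed_connect_to_node_attempts_total, teleport_connect_to_node_attempts_total)
Sample = Tuple[float, float]


def counter_delta(previous: float, current: float) -> float:
    """Increase of a counter between scrapes; a drop is treated as a process restart."""
    return current if current < previous else current - previous


def failure_rate(previous: Sample, current: Sample) -> float:
    """Fraction of connection attempts that failed between two scrapes."""
    failed = counter_delta(previous[0], current[0])
    attempts = counter_delta(previous[1], current[1])
    return failed / attempts if attempts > 0 else 0.0
```

For example, if the counters move from (10, 100) to (15, 200) between scrapes, 5 of the 100 new attempts failed, giving a failure rate of 0.05.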

Database Service

Name	Type	Component	Description
`teleport_db_messages_from_client_total`	counter	Teleport Database Service	Number of messages (packets) received from the DB client.
`teleport_db_messages_from_server_total`	counter	Teleport Database Service	Number of messages (packets) received from the DB server.
`teleport_db_method_call_count_total`	counter	Teleport Database Service	Number of times a DB method was called.
`teleport_db_method_call_latency_seconds`	histogram	Teleport Database Service	Call latency for DB method calls.
`teleport_db_initialized_connections_total`	counter	Teleport Database Service	Number of initialized DB connections.
`teleport_db_active_connections_total`	gauge	Teleport Database Service	Number of active DB connections.
`teleport_db_connection_durations_seconds`	histogram	Teleport Database Service	Duration of DB connection.
`teleport_db_connection_setup_time_seconds`	histogram	Teleport Database Service	Initial time to set up a DB connection, before any requests are handled.
`teleport_db_errors_total`	counter	Teleport Database Service	Number of synthetic DB errors sent to the client.

Kubernetes access

The following tables identify all metrics available in the Teleport Proxy Service if at least one Kubernetes cluster is enrolled in your Teleport cluster.

Client

The following table identifies all metrics available when the service connects to upstream servers. In the case of proxy, the upstream server can be a kubernetes_service or Kubernetes Cluster if it's running in legacy mode.

Name	Type	Component	Description
`teleport_kubernetes_client_in_flight_requests`	gauge	Teleport Kubernetes Proxy	In-flight requests waiting for the upstream response.
`teleport_kubernetes_client_requests_total`	counter	Teleport Kubernetes Proxy	Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers.
`teleport_kubernetes_client_tls_duration_seconds`	histogram	Teleport Kubernetes Proxy	Latency distribution of TLS handshakes.
`teleport_kubernetes_client_got_conn_duration_seconds`	histogram	Teleport Kubernetes Proxy	Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer.
`teleport_kubernetes_client_first_byte_response_duration_seconds`	histogram	Teleport Kubernetes Proxy	Latency distribution of time to receive the first response byte from the upstream server.
`teleport_kubernetes_client_request_duration_seconds`	histogram	Teleport Kubernetes Proxy	Latency distribution of the upstream request time.

Server

The following table identifies all metrics available for incoming connections.

Name	Type	Component	Description
`teleport_kubernetes_server_in_flight_requests`	gauge	Teleport Kubernetes Proxy	In-flight requests currently handled by the server.
`teleport_kubernetes_server_api_requests_total`	counter	Teleport Kubernetes Proxy	Total number of requests handled by the server.
`teleport_kubernetes_server_request_duration_seconds`	histogram	Teleport Kubernetes Proxy	Latency distribution of the total request time.
`teleport_kubernetes_server_response_size_bytes`	histogram	Teleport Kubernetes Proxy	Distribution of the response size.
`teleport_kubernetes_server_exec_in_flight_sessions`	gauge	Teleport Kubernetes Proxy	Number of active kubectl exec sessions.
`teleport_kubernetes_server_exec_sessions_total`	counter	Teleport Kubernetes Proxy	Total number of kubectl exec sessions.
`teleport_kubernetes_server_portforward_in_flight_sessions`	gauge	Teleport Kubernetes Proxy	Number of active kubectl portforward sessions.
`teleport_kubernetes_server_portforward_sessions_total`	counter	Teleport Kubernetes Proxy	Total number of kubectl portforward sessions.
`teleport_kubernetes_server_join_in_flight_sessions`	gauge	Teleport Kubernetes Proxy	Number of active joining sessions.
`teleport_kubernetes_server_join_sessions_total`	counter	Teleport Kubernetes Proxy	Total number of joining sessions.

Teleport SSH Service

Name	Type	Component	Description
`user_max_concurrent_sessions_hit_total`	counter	Teleport SSH	Number of times a user exceeded their concurrent session limit.

Teleport Kubernetes Service

The following table identifies all metrics available when the service connects to upstream servers. In the case of kubernetes_service, the upstream server is always a Kubernetes cluster.

Name	Type	Component	Description
`teleport_kubernetes_client_in_flight_requests`	gauge	Teleport Kubernetes Service	In-flight requests waiting for the upstream response.
`teleport_kubernetes_client_requests_total`	counter	Teleport Kubernetes Service	Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers.
`teleport_kubernetes_client_tls_duration_seconds`	histogram	Teleport Kubernetes Service	Latency distribution of TLS handshakes.
`teleport_kubernetes_client_got_conn_duration_seconds`	histogram	Teleport Kubernetes Service	Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer.
`teleport_kubernetes_client_first_byte_response_duration_seconds`	histogram	Teleport Kubernetes Service	Latency distribution of time to receive the first response byte from the upstream server.
`teleport_kubernetes_client_request_duration_seconds`	histogram	Teleport Kubernetes Service	Latency distribution of the upstream request time.

The following table identifies all metrics available for incoming connections.

Name	Type	Component	Description
`teleport_kubernetes_server_in_flight_requests`	gauge	Teleport Kubernetes Service	In-flight requests currently handled by the server.
`teleport_kubernetes_server_api_requests_total`	counter	Teleport Kubernetes Service	Total number of requests handled by the server.
`teleport_kubernetes_server_request_duration_seconds`	histogram	Teleport Kubernetes Service	Latency distribution of the total request time.
`teleport_kubernetes_server_response_size_bytes`	histogram	Teleport Kubernetes Service	Distribution of the response size.
`teleport_kubernetes_server_exec_in_flight_sessions`	gauge	Teleport Kubernetes Service	Number of active kubectl exec sessions.
`teleport_kubernetes_server_exec_sessions_total`	counter	Teleport Kubernetes Service	Total number of kubectl exec sessions.
`teleport_kubernetes_server_portforward_in_flight_sessions`	gauge	Teleport Kubernetes Service	Number of active kubectl portforward sessions.
`teleport_kubernetes_server_portforward_sessions_total`	counter	Teleport Kubernetes Service	Total number of kubectl portforward sessions.
`teleport_kubernetes_server_join_in_flight_sessions`	gauge	Teleport Kubernetes Service	Number of active joining sessions.
`teleport_kubernetes_server_join_sessions_total`	counter	Teleport Kubernetes Service	Total number of joining sessions.

All Teleport instances

Name	Type	Component	Description
`process_state`	gauge	Teleport	State of the teleport process: 0 - ok, 1 - recovering, 2 - degraded, 3 - starting.
`certificate_mismatch_total`	counter	Teleport	Number of SSH server login failures due to a certificate mismatch.
`rx`	counter	Teleport	Number of bytes received during an SSH connection.
`server_interactive_sessions_total`	gauge	Teleport	Number of active sessions.
`teleport_build_info`	gauge	Teleport	Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1.
`teleport_breaker_connector_executions_total`	counter	Teleport	Number of requests to the Teleport Auth Service API that go through a circuit breaker done by Teleport services, labeled by `role` of the connector (almost always `Instance`), `state` of the associated circuit breaker and `success` as interpreted by the breaker.
`teleport_cache_events`	counter	Teleport	Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service.
`teleport_cache_stale_events`	counter	Teleport	Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend.
`tx`	counter	Teleport	Number of bytes transmitted during an SSH connection.

Go runtime metrics

These metrics are surfaced by the Go runtime and are not specific to Teleport.

Name	Type	Component	Description
`go_gc_duration_seconds`	summary	Internal Go	A summary of GC invocation durations.
`go_goroutines`	gauge	Internal Go	Number of goroutines that currently exist.
`go_info`	gauge	Internal Go	Information about the Go environment.
`go_memstats_alloc_bytes_total`	counter	Internal Go	Total number of bytes allocated, even if freed.
`go_memstats_alloc_bytes`	gauge	Internal Go	Number of bytes allocated and still in use.
`go_memstats_buck_hash_sys_bytes`	gauge	Internal Go	Number of bytes used by the profiling bucket hash table.
`go_memstats_frees_total`	counter	Internal Go	Total number of frees.
`go_memstats_gc_cpu_fraction`	gauge	Internal Go	The fraction of this program's available CPU time used by the GC since the program started.
`go_memstats_gc_sys_bytes`	gauge	Internal Go	Number of bytes used for garbage collection system metadata.
`go_memstats_heap_alloc_bytes`	gauge	Internal Go	Number of heap bytes allocated and still in use.
`go_memstats_heap_idle_bytes`	gauge	Internal Go	Number of heap bytes waiting to be used.
`go_memstats_heap_inuse_bytes`	gauge	Internal Go	Number of heap bytes that are in use.
`go_memstats_heap_objects`	gauge	Internal Go	Number of allocated objects.
`go_memstats_heap_released_bytes`	gauge	Internal Go	Number of heap bytes released to the OS.
`go_memstats_heap_sys_bytes`	gauge	Internal Go	Number of heap bytes obtained from the system.
`go_memstats_last_gc_time_seconds`	gauge	Internal Go	Number of seconds since the Unix epoch of the last garbage collection.
`go_memstats_lookups_total`	counter	Internal Go	Total number of pointer lookups.
`go_memstats_mallocs_total`	counter	Internal Go	Total number of mallocs.
`go_memstats_mcache_inuse_bytes`	gauge	Internal Go	Number of bytes in use by mcache structures.
`go_memstats_mcache_sys_bytes`	gauge	Internal Go	Number of bytes used for mcache structures obtained from system.
`go_memstats_mspan_inuse_bytes`	gauge	Internal Go	Number of bytes in use by mspan structures.
`go_memstats_mspan_sys_bytes`	gauge	Internal Go	Number of bytes used for mspan structures obtained from system.
`go_memstats_next_gc_bytes`	gauge	Internal Go	Number of heap bytes when the next garbage collection will take place.
`go_memstats_other_sys_bytes`	gauge	Internal Go	Number of bytes used for other system allocations.
`go_memstats_stack_inuse_bytes`	gauge	Internal Go	Number of bytes in use by the stack allocator.
`go_memstats_stack_sys_bytes`	gauge	Internal Go	Number of bytes obtained from the system for stack allocator.
`go_memstats_sys_bytes`	gauge	Internal Go	Number of bytes obtained from the system.
`go_threads`	gauge	Internal Go	Number of OS threads created.
`process_cpu_seconds_total`	counter	Internal Go	Total user and system CPU time spent in seconds.
`process_max_fds`	gauge	Internal Go	Maximum number of open file descriptors.
`process_open_fds`	gauge	Internal Go	Number of open file descriptors.
`process_resident_memory_bytes`	gauge	Internal Go	Resident memory size in bytes.
`process_start_time_seconds`	gauge	Internal Go	Start time of the process since the Unix epoch in seconds.
`process_virtual_memory_bytes`	gauge	Internal Go	Virtual memory size in bytes.
`process_virtual_memory_max_bytes`	gauge	Internal Go	Maximum amount of virtual memory available in bytes.

Prometheus

Name	Type	Component	Description
`promhttp_metric_handler_requests_in_flight`	gauge	prometheus	Current number of scrapes being served.
`promhttp_metric_handler_requests_total`	counter	prometheus	Total number of scrapes by HTTP status code.

Distributed tracing

How to enable distributed tracing for a Teleport instance.

Teleport leverages OpenTelemetry to generate traces and export them to any OpenTelemetry Protocol (OTLP) capable exporter. In the event that your telemetry backend doesn't support receiving OTLP traces, you may be able to leverage the OpenTelemetry Collector to proxy traces from OTLP to a format that your telemetry backend accepts.

Configure Teleport

To enable tracing for a Teleport instance, add the following section to that instance's configuration file (/etc/teleport.yaml by default). For a detailed description of these configuration fields, see the configuration reference page.

tracing_service:
   enabled: yes
   exporter_url: grpc://collector.example.com:4317
   sampling_rate_per_million: 1000000

Sampling rate

Choose the sampling rate carefully. The sampling_rate_per_million setting is the number of spans sampled per million, so a value of 1000000 samples 100% of spans, which could have a negative impact on the performance of your cluster. Teleport also honors the sampling decision included in incoming requests: even when the tracing_service is enabled with a sampling rate of 0, if Teleport receives a request that has a sampled span, Teleport will sample and export all spans that are generated in response to that request.
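
To make the relationship between percentages, sampling_rate_per_million, and incoming requests concrete, here is a small sketch. It is a simplified model rather than Teleport's sampler: the function names are hypothetical, and a request whose parent span is not sampled is treated the same as a request with no parent, which is an assumption the text above does not confirm.

```python
PER_MILLION = 1_000_000


def percent_to_per_million(percent: float) -> int:
    """Convert a sampling percentage into a sampling_rate_per_million value."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(percent / 100 * PER_MILLION)


def should_sample(parent_sampled: bool, rate_per_million: int, draw: int) -> bool:
    """Decide whether to sample a span.

    draw is a number in [0, 1_000_000), for example random.randrange(PER_MILLION).
    """
    if parent_sampled:
        # A sampled incoming request is honored even when the local rate is 0.
        return True
    return draw < rate_per_million
```

For instance, percent_to_per_million(0.1) gives 1000, and should_sample(True, 0, draw) is true for any draw, which mirrors the behavior described for requests that arrive with a sampled span.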

Exporter URL

The exporter_url setting indicates where Teleport should send spans to. Supported schemes are grpc://, http://, https://, and file:// (if no scheme is provided, then grpc:// is used).

When using file://, the URL must be a path to a directory that Teleport has write permissions for. Spans are saved to files within the provided directory, with each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB. To override the default limit, add ?limit=<desired_file_size_in_bytes> to the exporter_url (e.g. file:///var/lib/teleport/traces?limit=100).
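
The URL rules above can be checked before a value is written to the configuration. The following sketch shows one way a reader might validate an exporter_url; it is not how Teleport parses the setting, and describe_exporter_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit

SUPPORTED_SCHEMES = {"grpc", "http", "https", "file"}


def describe_exporter_url(url: str) -> dict:
    """Break an exporter_url into the parts described above."""
    if "://" not in url:
        url = "grpc://" + url  # no scheme provided, so grpc:// is assumed
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme}")
    info = {"scheme": parts.scheme}
    if parts.scheme == "file":
        info["directory"] = parts.path
        limit = parse_qs(parts.query).get("limit")
        # None means no ?limit= override was given in the URL.
        info["limit_bytes"] = int(limit[0]) if limit else None
    else:
        info["endpoint"] = parts.netloc
    return info
```

For example, describe_exporter_url("file:///var/lib/teleport/traces?limit=100") reports the directory /var/lib/teleport/traces with a limit of 100 bytes, while a bare collector.example.com:4317 is treated as a grpc endpoint.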

By default, the connection to the exporter is insecure. To support TLS, add the following to the tracing_service configuration:

   # Optional paths to CA certificates that are used to validate the exporter.
   ca_certs:
     - /var/lib/teleport/exporter_ca.pem
   # Optional paths to TLS certificates that are used to enable mTLS for the exporter.
   https_keypairs:
     - key_file: /var/lib/teleport/exporter_key.pem
       cert_file: /var/lib/teleport/exporter_cert.pem

After updating teleport.yaml, start (or restart) your Teleport instance to apply the new configuration.

tsh

To capture traces from tsh, add the --trace flag to your command. All traces generated by tsh --trace are proxied to the exporter_url defined for the Auth Service of the cluster the command is run against.

tsh --trace ssh root@myserver
tsh --trace ls

Exporting traces from tsh to a different exporter than the one defined in the Auth Service config is also possible via the --trace-exporter flag. A URL must be provided that adheres to the same format as the exporter_url of the tracing_service.

tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls

Collecting profiles

How to collect runtime profiling data from a Teleport instance.

Teleport leverages Go's diagnostic capabilities to collect and export profiling data. Profiles can help identify the cause of spikes in CPU, the source of memory leaks, or the reason for a deadlock.

Using the Debug Service

The Teleport Debug Service enables administrators to collect diagnostic profiles without enabling pprof endpoints at startup. The service, enabled by default, ensures local-only access and must be consumed from inside the same instance.

teleport debug profile collects a list of pprof profiles and writes a compressed tarball (.tar.gz) to STDOUT. You can decompress it with tar or redirect the output to a file.

By default, it collects the goroutine, heap, and profile (CPU) profiles.

Each collected profile has a corresponding file inside the tarball. For example, collecting goroutine,trace,heap results in goroutine.pprof, trace.pprof, and heap.pprof files.

# Collect default profiles and save them to a file.
teleport debug profile > pprof.tar.gz
tar xvf pprof.tar.gz

# Collect default profiles and decompress them into the current directory.
teleport debug profile | tar xzv -C ./

# Collect "trace" and "mutex" profiles and save them to a file.
teleport debug profile trace,mutex > pprof.tar.gz

# Collect profiles, setting the profiling time in seconds.
teleport debug profile -s 20 trace > pprof.tar.gz
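
If you script profile collection, the tarball can also be handled programmatically. The sketch below assembles the command line and lists the .pprof files in the resulting archive; build_command and list_profiles are hypothetical helper names, and only the arguments shown in the commands above are used.

```python
import io
import tarfile
from typing import List, Optional


def build_command(profiles: List[str], seconds: Optional[int] = None) -> List[str]:
    """Assemble a `teleport debug profile` invocation as an argument list."""
    command = ["teleport", "debug", "profile"]
    if seconds is not None:
        command += ["-s", str(seconds)]
    if profiles:
        command.append(",".join(profiles))
    return command


def list_profiles(tarball: bytes) -> List[str]:
    """Return the sorted names of .pprof files in a gzip-compressed tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".pprof")
        )
```

The output of build_command can be passed to subprocess.run with capture_output=True, and the captured standard output can then be handed to list_profiles to confirm that every requested profile is present.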

Specify your Teleport configuration path

If your Teleport configuration is not placed on the default path (/etc/teleport.yaml), you must specify its location to the CLI command using the -c/--config flag.

If you're running Teleport on a Kubernetes cluster you can directly collect profiles to a local directory without an interactive session:

kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz

After extracting the contents, you can use go tool commands to explore and visualize them:

# Open the interactive terminal explorer
go tool pprof heap.pprof

# Open the web visualizer
go tool pprof -http : heap.pprof

# Visualize trace profiles
go tool trace trace.pprof

Using diagnostics endpoints

The profiling endpoint is only enabled if the --debug flag is supplied.

Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them in either of two ways.

From the command line, start a Teleport instance with the --diag-addr flag set to the local address where the diagnostic endpoint will listen:

sudo teleport start --debug --diag-addr=127.0.0.1:3000

Alternatively, edit the instance's configuration file (/etc/teleport.yaml by default) to include the following:

teleport:
    diag_addr: 127.0.0.1:3000

To also enable debug logs, set the log severity in the same teleport section:

teleport:
    log:
        severity: DEBUG

Verify that Teleport is now serving the diagnostics endpoint:

curl http://127.0.0.1:3000/healthz

Collecting profiles

Go's standard profiling endpoints are served at http://127.0.0.1:3000/debug/pprof/. Retrieving a profile requires sending a request to the endpoint corresponding to the desired profile type. When debugging an issue it is helpful to collect a series of profiles over a period of time.

CPU

CPU profiles show execution statistics gathered over a user-specified period:

# Download the profile into a file
curl -o cpu.profile "http://127.0.0.1:3000/debug/pprof/profile?seconds=30"

# Visualize the profile
go tool pprof -http : cpu.profile

Goroutine

Goroutine profiles show the stack traces for all running goroutines in the system:

# Download the profile into a file
curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine

# Visualize the profile
go tool pprof -http : goroutine.profile

Heap

Heap profiles show allocated objects in the system:

# Download the profile into a file
curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap

# Visualize the profile
go tool pprof -http : heap.profile

Trace

Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events that are collected by the Go runtime over a user-specified period of time:

# Download the profile into a file
curl -o trace.out "http://127.0.0.1:3000/debug/pprof/trace?seconds=5"

# Visualize the profile
go tool trace trace.out
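
When you need a series of profiles over a period of time, as recommended at the start of this section, the downloads can be scripted. This is a sketch under the assumption that the diagnostic endpoint listens on 127.0.0.1:3000; the function names and the numbered file naming scheme are illustrative.

```python
import time
import urllib.request
from typing import List, Optional
from urllib.parse import urlencode


def profile_url(base: str, kind: str, seconds: Optional[int] = None) -> str:
    """Build the pprof endpoint URL for a profile kind such as heap or profile."""
    url = f"{base}/debug/pprof/{kind}"
    if seconds is not None:
        url += "?" + urlencode({"seconds": seconds})
    return url


def output_name(kind: str, index: int) -> str:
    """Name each download so that a series sorts in collection order."""
    return f"{kind}-{index:03d}.profile"


def collect_series(kind: str, count: int, interval: float,
                   base: str = "http://127.0.0.1:3000",
                   seconds: Optional[int] = None) -> List[str]:
    """Download `count` profiles of one kind, waiting `interval` seconds between them."""
    paths = []
    for index in range(count):
        path = output_name(kind, index)
        urllib.request.urlretrieve(profile_url(base, kind, seconds), path)
        paths.append(path)
        if index + 1 < count:
            time.sleep(interval)
    return paths
```

For example, collect_series("heap", count=5, interval=60) would save heap-000.profile through heap-004.profile one minute apart, and each file can then be opened with go tool pprof.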
