Monitoring your Cluster
Teleport provides health checks that let you verify an instance is healthy and ready to serve traffic. Metrics, tracing, and profiling provide more detailed data on cluster performance and responsiveness.
Enable health monitoring
How to monitor the health of a Teleport instance.
Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via the command line or the config file.
Command line: start a teleport instance with the --diag-addr flag set to the local address where the diagnostic endpoint will listen:
$ sudo teleport start --diag-addr=127.0.0.1:3000
Config file: edit a teleport instance's configuration file (/etc/teleport.yaml by default) to include the following:
teleport:
  diag_addr: 127.0.0.1:3000
To enable debug logs:
log:
  severity: DEBUG
Ensure you can connect to the diagnostic endpoint
Verify that Teleport is now serving the diagnostics endpoint:
$ curl http://127.0.0.1:3000/healthz
Now you can collect monitoring information from several endpoints. Tools such as Kubernetes probes can use these endpoints to monitor the health of a Teleport process.
/healthz
The http://127.0.0.1:3000/healthz endpoint responds with a body of {"status":"ok"} and an HTTP 200 OK status code if the process is running.
This is a check to determine if the Teleport process is still running.
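As a minimal sketch, the check described above could be scripted in Python as follows. The helper names `is_healthy` and `check_healthz` are illustrative assumptions rather than part of Teleport, and the default address assumes the `--diag-addr=127.0.0.1:3000` setting used in this guide.

```python
import json
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True if a /healthz response matches the documented healthy reply."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_healthz(base_url: str = "http://127.0.0.1:3000") -> bool:
    """Query /healthz on the diagnostic address (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=5) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except OSError:
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Keeping the response check separate from the network call makes the logic easy to exercise without a running Teleport process.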
/readyz
The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its response includes information about the state of the process.
The response body is a JSON object of the form:
{ "status": "a status message here"}
/readyz and heartbeats
If a Teleport component fails to execute its heartbeat procedure, it will enter a degraded state. Teleport will begin recovering from this state when a heartbeat completes successfully.
The first successful heartbeat will transition Teleport into a recovering state. A second consecutive successful heartbeat will cause Teleport to transition to the OK state.
Teleport heartbeats run approximately every 60 seconds when healthy, and failed heartbeats are retried approximately every 5 seconds. This means that depending on the timing of heartbeats, it can take 60-70 seconds after connectivity is restored for /readyz to start reporting healthy again.
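The transitions described above can be sketched as a small state machine. This is an illustrative model only, not Teleport's implementation: the names `State`, `next_state`, and `replay` are hypothetical, the initial startup state is left out, and the sketch assumes that a failed heartbeat during recovery returns the process to the degraded state.

```python
from enum import Enum
from typing import List


class State(Enum):
    OK = "ok"
    RECOVERING = "recovering"
    DEGRADED = "degraded"


def next_state(current: State, heartbeat_ok: bool) -> State:
    """Apply the result of one heartbeat to the current state."""
    if not heartbeat_ok:
        # Any failed heartbeat puts the process into the degraded state.
        return State.DEGRADED
    if current is State.DEGRADED:
        # First successful heartbeat after a failure: start recovering.
        return State.RECOVERING
    # A second consecutive success (or a success while already OK).
    return State.OK


def replay(start: State, heartbeats: List[bool]) -> State:
    """Fold a sequence of heartbeat results into a final state."""
    state = start
    for ok in heartbeats:
        state = next_state(state, ok)
    return state
```

For example, `replay(State.OK, [False, True])` ends in `State.RECOVERING`, and one more successful heartbeat brings the model back to `State.OK`.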
Status codes
The status code of the response can be one of:
- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.
The same state information is also available via the process_state metric under the /metrics endpoint.
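A monitoring script might translate the /readyz status codes listed above into a readable state. The following is a minimal sketch; the names `READYZ_MEANING`, `describe_readyz`, and `should_route_traffic` are hypothetical, and treating only HTTP 200 as ready for traffic is an assumed policy, not something Teleport prescribes.

```python
# Meaning of each documented /readyz status code.
READYZ_MEANING = {
    200: "ok",
    503: "degraded",
    400: "starting or recovering",
}


def describe_readyz(status_code: int) -> str:
    """Map a /readyz HTTP status code to the state it represents."""
    return READYZ_MEANING.get(status_code, "unknown")


def should_route_traffic(status_code: int) -> bool:
    """Assumed policy: only send traffic to instances reporting 200 OK."""
    return status_code == 200
```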
Metrics
Teleport exposes metrics for all of its components, helping you get insight into the state of your cluster. This guide explains the metrics that you can collect from your Teleport cluster.
Enabling metrics
Teleport's diagnostic HTTP endpoints are disabled by default. Enable them with the --diag-addr flag or the diag_addr configuration field, as described in "Enable health monitoring" above.
This will enable the http://127.0.0.1:3000/metrics endpoint, which serves the metrics that Teleport tracks. It is compatible with Prometheus collectors.
The following metrics are available:
Teleport Enterprise (cloud-hosted) does not expose monitoring endpoints for the Auth Service and Proxy Service.
Auth Service and backends
Name | Type | Component | Description |
---|---|---|---|
audit_failed_disk_monitoring | counter | Teleport Audit Log | Number of times disk monitoring failed. |
audit_failed_emit_events | counter | Teleport Audit Log | Number of times emitting audit events failed. |
audit_percentage_disk_space_used | gauge | Teleport Audit Log | Percentage of disk space used. |
audit_server_open_files | gauge | Teleport Audit Log | Number of open audit files. |
auth_generate_requests_throttled_total | counter | Teleport Auth | Number of throttled requests to generate new server keys. |
auth_generate_requests_total | counter | Teleport Auth | Number of requests to generate new server keys. |
auth_generate_requests | gauge | Teleport Auth | Number of current generate requests. |
auth_generate_seconds | histogram | Teleport Auth | Latency for generate requests. |
backend_batch_read_requests_total | counter | cache | Number of batch read requests to the backend. |
backend_batch_read_seconds | histogram | cache | Latency for batch read operations. |
backend_batch_write_requests_total | counter | cache | Number of batch write requests to the backend. |
backend_batch_write_seconds | histogram | cache | Latency for backend batch write operations. |
backend_read_requests_total | counter | cache | Number of read requests to the backend. |
backend_read_seconds | histogram | cache | Latency for read operations. |
backend_requests | counter | cache | Number of requests to the backend (reads, writes, and keepalives). |
backend_write_requests_total | counter | cache | Number of write requests to the backend. |
backend_write_seconds | histogram | cache | Latency for backend write operations. |
cluster_name_not_found_total | counter | Teleport Auth | Number of times a cluster was not found. |
dynamo_requests_total | counter | DynamoDB | Total number of requests to the DynamoDB API. |
dynamo_requests | counter | DynamoDB | Total number of requests to the DynamoDB API grouped by result. |
dynamo_requests_seconds | histogram | DynamoDB | Latency of DynamoDB API requests. |
etcd_backend_batch_read_requests | counter | etcd | Number of batch read requests to the etcd database. |
etcd_backend_batch_read_seconds | histogram | etcd | Latency for etcd batch read operations. |
etcd_backend_read_requests | counter | etcd | Number of read requests to the etcd database. |
etcd_backend_read_seconds | histogram | etcd | Latency for etcd read operations. |
etcd_backend_tx_requests | counter | etcd | Number of transaction requests to the database. |
etcd_backend_tx_seconds | histogram | etcd | Latency for etcd transaction operations. |
etcd_backend_write_requests | counter | etcd | Number of write requests to the database. |
etcd_backend_write_seconds | histogram | etcd | Latency for etcd write operations. |
teleport_etcd_events | counter | etcd | Total number of etcd events processed. |
teleport_etcd_event_backpressure | counter | etcd | Total number of times event processing encountered backpressure. |
firestore_events_backend_batch_read_requests | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events. |
firestore_events_backend_batch_read_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations. |
firestore_events_backend_batch_write_requests | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events. |
firestore_events_backend_batch_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations. |
firestore_events_backend_write_requests | counter | GCP Cloud Firestore | Number of write requests to Cloud Firestore events. |
firestore_events_backend_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events write operations. |
gcs_event_storage_downloads_seconds | histogram | GCP GCS | Latency for GCS download operations. |
gcs_event_storage_downloads | counter | GCP GCS | Number of downloads from the GCS backend. |
gcs_event_storage_uploads_seconds | histogram | GCP GCS | Latency for GCS upload operations. |
gcs_event_storage_uploads | counter | GCP GCS | Number of uploads to the GCS backend. |
grpc_server_started_total | counter | Teleport Auth | Total number of RPCs started on the server. |
grpc_server_handled_total | counter | Teleport Auth | Total number of RPCs completed on the server, regardless of success or failure. |
grpc_server_msg_received_total | counter | Teleport Auth | Total number of RPC stream messages received on the server. |
grpc_server_msg_sent_total | counter | Teleport Auth | Total number of gRPC stream messages sent by the server. |
heartbeat_connections_received_total | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection, representing the total number of heartbeating Agents. |
s3_requests_total | counter | Amazon S3 | Total number of requests to the S3 API. |
s3_requests | counter | Amazon S3 | Total number of requests to the S3 API grouped by result. |
s3_requests_seconds | histogram | Amazon S3 | Request latency for the S3 API. |
teleport_audit_emit_events | counter | Teleport Audit Log | Number of audit events emitted. |
teleport_audit_parquetlog_batch_processing_seconds | histogram | Teleport Audit Log | Duration of processing a single batch of events in the Parquet-format audit log. |
teleport_audit_parquetlog_s3_flush_seconds | histogram | Teleport Audit Log | Duration of flushing Parquet files to S3 in the Parquet-format audit log. |
teleport_audit_parquetlog_delete_events_seconds | histogram | Teleport Audit Log | Duration of deleting events from SQS in the Parquet-format audit log. |
teleport_audit_parquetlog_batch_size | histogram | Teleport Audit Log | Overall size of events in a single batch in the Parquet-format audit log. |
teleport_audit_parquetlog_batch_count | counter | Teleport Audit Log | Total number of events in a single batch in the Parquet-format audit log. |
teleport_audit_parquetlog_last_processed_timestamp | gauge | Teleport Audit Log | Time of the last processing run in the Parquet-format audit log. |
teleport_audit_parquetlog_age_oldest_processed_message | gauge | Teleport Audit Log | Age of the oldest processed event in the Parquet-format audit log. |
teleport_audit_parquetlog_errors_from_collect_count | counter | Teleport Audit Log | Number of collect failures in the Parquet-format audit log. |
teleport_connected_resources | gauge | Teleport Auth | Number and type of resources connected via keepalives. |
teleport_postgres_events_backend_write_requests | counter | Postgres (Events) | Number of write requests to postgres events, labeled with the request status (success or failure). |
teleport_postgres_events_backend_batch_read_requests | counter | Postgres (Events) | Number of batch read requests to postgres events, labeled with the request status (success or failure). |
teleport_postgres_events_backend_batch_delete_requests | counter | Postgres (Events) | Number of batch delete requests to postgres events, labeled with the request status (success or failure). |
teleport_postgres_events_backend_write_seconds | histogram | Postgres (Events) | Latency for postgres events write operations, in seconds. |
teleport_postgres_events_backend_batch_read_seconds | histogram | Postgres (Events) | Latency for postgres events batch read operations, in seconds. |
teleport_postgres_events_backend_batch_delete_seconds | histogram | Postgres (Events) | Latency for postgres events batch delete operations, in seconds. |
teleport_registered_servers | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by version. |
teleport_registered_servers_by_install_methods | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by install methods. |
teleport_roles_total | gauge | Teleport Auth | The number of roles that exist in the cluster. |
teleport_migrations | gauge | Teleport Auth | Tracks for each migration if it is active (1) or not (0). |
user_login_total | counter | Teleport Auth | Number of user logins. |
watcher_event_sizes | histogram | cache | Overall size of events emitted. |
watcher_events | histogram | cache | Per resource size of events emitted. |
Enhanced Session Recording / BPF
Name | Type | Component | Description |
---|---|---|---|
bpf_lost_command_events | counter | BPF | Number of lost command events. |
bpf_lost_disk_events | counter | BPF | Number of lost disk events. |
bpf_lost_network_events | counter | BPF | Number of lost network events. |
Proxy Service
Name | Type | Component | Description |
---|---|---|---|
failed_connect_to_node_attempts_total | counter | Teleport Proxy | Number of failed SSH connection attempts to the SSH Service. Use with teleport_connect_to_node_attempts_total to get the failure rate. |
failed_login_attempts_total | counter | Teleport Proxy | Number of failed tsh login or tsh ssh login attempts. |
grpc_client_started_total | counter | Teleport Proxy | Total number of RPCs started on the client. |
grpc_client_handled_total | counter | Teleport Proxy | Total number of RPCs completed on the client, regardless of success or failure. |
grpc_client_msg_received_total | counter | Teleport Proxy | Total number of RPC stream messages received on the client. |
grpc_client_msg_sent_total | counter | Teleport Proxy | Total number of gRPC stream messages sent by the client. |
proxy_connection_limit_exceeded_total | counter | Teleport Proxy | Number of connections that exceeded the Proxy Service connection limit. |
proxy_peer_client_dial_error_total | counter | Teleport Proxy | Total number of errors encountered dialing peer Proxy Service instances. |
proxy_peer_server_connections | gauge | Teleport Proxy | Number of currently opened connections to peer Proxy Service instances. |
proxy_peer_client_rpc | gauge | Teleport Proxy | Number of current client RPC requests. |
proxy_peer_client_rpc_total | counter | Teleport Proxy | Total number of client RPC requests. |
proxy_peer_client_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the client. |
proxy_peer_client_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the client. |
proxy_peer_client_message_received_size | histogram | Teleport Proxy | Size of messages received by the client. |
proxy_peer_server_connections | gauge | Teleport Proxy | Number of currently opened connections to peer Proxy Service clients. |
proxy_peer_server_rpc | gauge | Teleport Proxy | Number of current server RPC requests. |
proxy_peer_server_rpc_total | counter | Teleport Proxy | Total number of server RPC requests. |
proxy_peer_server_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the server. |
proxy_peer_server_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the server. |
proxy_peer_server_message_received_size | histogram | Teleport Proxy | Size of messages received by the server. |
proxy_ssh_sessions_total | gauge | Teleport Proxy | Number of active sessions through this Proxy Service instance. |
proxy_missing_ssh_tunnels | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if Teleport instances have discovered all Proxy Service instances. |
remote_clusters | gauge | Teleport Proxy | Number of inbound connections from leaf clusters. |
teleport_connect_to_node_attempts_total | counter | Teleport Proxy | Number of SSH connection attempts to an SSH Service. Use with failed_connect_to_node_attempts_total to get the failure rate. |
teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
teleport_proxy_db_connection_setup_time_seconds | histogram | Teleport Proxy | Time to establish connection to DB service from Proxy service. |
teleport_proxy_db_connection_dial_attempts_total | counter | Teleport Proxy | Number of dial attempts made from Proxy to DB service. |
teleport_proxy_db_connection_dial_failures_total | counter | Teleport Proxy | Number of failed dial attempts made from Proxy to DB service. |
teleport_proxy_db_attempted_servers_total | histogram | Teleport Proxy | Number of servers processed during connection attempt to the DB service from Proxy service. |
teleport_proxy_db_connection_tls_config_time_seconds | histogram | Teleport Proxy | Time to fetch TLS configuration for the connection to DB service from Proxy service. |
teleport_proxy_db_active_connections_total | gauge | Teleport Proxy | Number of currently active connections to DB service from Proxy service. |
trusted_clusters | gauge | Teleport Proxy | Number of outbound connections to leaf clusters. |
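The table above suggests combining failed_connect_to_node_attempts_total with teleport_connect_to_node_attempts_total to get a failure rate. Because both are cumulative counters, one reasonable approach is to compare two scrapes and divide the increases. The sketch below assumes you already have the two counter values from each scrape; the `Sample` alias and the `failure_rate` helper are hypothetical names.

```python
from typing import Optional, Tuple

# (failed_connect_to_node_attempts_total, teleport_connect_to_node_attempts_total)
Sample = Tuple[float, float]


def failure_rate(prev: Sample, curr: Sample) -> Optional[float]:
    """Fraction of SSH connection attempts that failed between two scrapes.

    Returns None when there were no new attempts or when a counter went
    backwards, which usually means the process restarted.
    """
    failed_delta = curr[0] - prev[0]
    attempts_delta = curr[1] - prev[1]
    if failed_delta < 0 or attempts_delta <= 0:
        return None
    return failed_delta / attempts_delta
```

With a Prometheus server in place, the same ratio is usually computed in a query over a time window rather than in a script.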
Database Service
Name | Type | Component | Description |
---|---|---|---|
teleport_db_messages_from_client_total | counter | Teleport Database Service | Number of messages (packets) received from the DB client. |
teleport_db_messages_from_server_total | counter | Teleport Database Service | Number of messages (packets) received from the DB server. |
teleport_db_method_call_count_total | counter | Teleport Database Service | Number of times a DB method was called. |
teleport_db_method_call_latency_seconds | histogram | Teleport Database Service | Call latency for DB method calls. |
teleport_db_initialized_connections_total | counter | Teleport Database Service | Number of initialized DB connections. |
teleport_db_active_connections_total | gauge | Teleport Database Service | Number of active DB connections. |
teleport_db_connection_durations_seconds | histogram | Teleport Database Service | Duration of DB connection. |
teleport_db_connection_setup_time_seconds | histogram | Teleport Database Service | Initial time to set up a DB connection, before any requests are handled. |
teleport_db_errors_total | counter | Teleport Database Service | Number of synthetic DB errors sent to the client. |
Kubernetes Access
The following tables identify all metrics available in the proxy service if Kubernetes access is enabled.
Client
The following table identifies all metrics available when the service connects to upstream servers. In the case of proxy, the upstream server can be a kubernetes_service or a Kubernetes cluster if it's running in legacy mode.
Name | Type | Component | Description |
---|---|---|---|
teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests waiting for the upstream response. |
teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers. |
teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of TLS handshakes. |
teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer. |
teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to receive the first response byte from the upstream server. |
teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the upstream request time. |
Server
The following table identifies all metrics available for incoming connections.
Name | Type | Component | Description |
---|---|---|---|
teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests currently handled by the server. |
teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests handled by the server. |
teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the total request time. |
teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Proxy | Distribution of the response size. |
teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl exec sessions. |
teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl exec sessions. |
teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl portforward sessions. |
teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl portforward sessions. |
teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active joining sessions. |
teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Proxy | Total number of joining sessions. |
Teleport SSH Service
Name | Type | Component | Description |
---|---|---|---|
user_max_concurrent_sessions_hit_total | counter | Teleport SSH | Number of times a user exceeded their concurrent session limit. |
Teleport Kubernetes Service
The following table identifies all metrics available when the service connects to upstream servers. In the case of kubernetes_service, the upstream server is always a Kubernetes cluster.
Name | Type | Component | Description |
---|---|---|---|
teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests waiting for the upstream response. |
teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Service | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers. |
teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of TLS handshakes. |
teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer. |
teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to receive the first response byte from the upstream server. |
teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the upstream request time. |
The following table identifies all metrics available for incoming connections.
Name | Type | Component | Description |
---|---|---|---|
teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests currently handled by the server. |
teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Service | Total number of requests handled by the server. |
teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the total request time. |
teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Service | Distribution of the response size. |
teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl exec sessions. |
teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl exec sessions. |
teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl portforward sessions. |
teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl portforward sessions. |
teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active joining sessions. |
teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Service | Total number of joining sessions. |
All Teleport instances
Name | Type | Component | Description |
---|---|---|---|
process_state | gauge | Teleport | State of the teleport process: 0 - ok, 1 - recovering, 2 - degraded, 3 - starting. |
certificate_mismatch_total | counter | Teleport | Number of SSH server login failures due to a certificate mismatch. |
rx | counter | Teleport | Number of bytes received during an SSH connection. |
server_interactive_sessions_total | gauge | Teleport | Number of active sessions. |
teleport_build_info | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1. |
teleport_breaker_connector_executions_total | counter | Teleport | Number of requests that Teleport services send to the Teleport Auth Service API through a circuit breaker, labeled by the role of the connector (almost always Instance), the state of the associated circuit breaker, and success as interpreted by the breaker. |
teleport_cache_events | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service. |
teleport_cache_stale_events | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend. |
tx | counter | Teleport | Number of bytes transmitted during an SSH connection. |
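Since the /metrics endpoint is compatible with Prometheus collectors, the process_state gauge can also be read from the scraped text directly. The sketch below assumes the standard Prometheus text exposition layout (comment lines starting with `#`, then `name value` pairs); the `parse_process_state` helper is a hypothetical name, and the simple whitespace split would not handle label values that contain spaces.

```python
from typing import Optional

# Values documented for the process_state gauge.
PROCESS_STATES = {0: "ok", 1: "recovering", 2: "degraded", 3: "starting"}


def parse_process_state(metrics_text: str) -> Optional[str]:
    """Return the named state from scraped metrics text, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        # Accept the bare metric name or one followed by a label set.
        name = parts[0].split("{", 1)[0]
        if name == "process_state" and len(parts) >= 2:
            return PROCESS_STATES.get(int(float(parts[1])), "unknown")
    return None
```

A caller would pass in the body returned by a request to http://127.0.0.1:3000/metrics, assuming the diagnostic address used in this guide.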
Go runtime metrics
These metrics are surfaced by the Go runtime and are not specific to Teleport.
Name | Type | Component | Description |
---|---|---|---|
go_gc_duration_seconds | summary | Internal Go | A summary of GC invocation durations. |
go_goroutines | gauge | Internal Go | Number of goroutines that currently exist. |
go_info | gauge | Internal Go | Information about the Go environment. |
go_memstats_alloc_bytes_total | counter | Internal Go | Total number of bytes allocated, even if freed. |
go_memstats_alloc_bytes | gauge | Internal Go | Number of bytes allocated and still in use. |
go_memstats_buck_hash_sys_bytes | gauge | Internal Go | Number of bytes used by the profiling bucket hash table. |
go_memstats_frees_total | counter | Internal Go | Total number of frees. |
go_memstats_gc_cpu_fraction | gauge | Internal Go | The fraction of this program's available CPU time used by the GC since the program started. |
go_memstats_gc_sys_bytes | gauge | Internal Go | Number of bytes used for garbage collection system metadata. |
go_memstats_heap_alloc_bytes | gauge | Internal Go | Number of heap bytes allocated and still in use. |
go_memstats_heap_idle_bytes | gauge | Internal Go | Number of heap bytes waiting to be used. |
go_memstats_heap_inuse_bytes | gauge | Internal Go | Number of heap bytes that are in use. |
go_memstats_heap_objects | gauge | Internal Go | Number of allocated objects. |
go_memstats_heap_released_bytes | gauge | Internal Go | Number of heap bytes released to the OS. |
go_memstats_heap_sys_bytes | gauge | Internal Go | Number of heap bytes obtained from the system. |
go_memstats_last_gc_time_seconds | gauge | Internal Go | Number of seconds since the Unix epoch of the last garbage collection. |
go_memstats_lookups_total | counter | Internal Go | Total number of pointer lookups. |
go_memstats_mallocs_total | counter | Internal Go | Total number of mallocs. |
go_memstats_mcache_inuse_bytes | gauge | Internal Go | Number of bytes in use by mcache structures. |
go_memstats_mcache_sys_bytes | gauge | Internal Go | Number of bytes used for mcache structures obtained from system. |
go_memstats_mspan_inuse_bytes | gauge | Internal Go | Number of bytes in use by mspan structures. |
go_memstats_mspan_sys_bytes | gauge | Internal Go | Number of bytes used for mspan structures obtained from system. |
go_memstats_next_gc_bytes | gauge | Internal Go | Number of heap bytes when next the garbage collection will take place. |
go_memstats_other_sys_bytes | gauge | Internal Go | Number of bytes used for other system allocations. |
go_memstats_stack_inuse_bytes | gauge | Internal Go | Number of bytes in use by the stack allocator. |
go_memstats_stack_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system for stack allocator. |
go_memstats_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system. |
go_threads | gauge | Internal Go | Number of OS threads created. |
process_cpu_seconds_total | counter | Internal Go | Total user and system CPU time spent in seconds. |
process_max_fds | gauge | Internal Go | Maximum number of open file descriptors. |
process_open_fds | gauge | Internal Go | Number of open file descriptors. |
process_resident_memory_bytes | gauge | Internal Go | Resident memory size in bytes. |
process_start_time_seconds | gauge | Internal Go | Start time of the process since the Unix epoch in seconds. |
process_virtual_memory_bytes | gauge | Internal Go | Virtual memory size in bytes. |
process_virtual_memory_max_bytes | gauge | Internal Go | Maximum amount of virtual memory available in bytes. |
Prometheus
Name | Type | Component | Description |
---|---|---|---|
promhttp_metric_handler_requests_in_flight | gauge | prometheus | Current number of scrapes being served. |
promhttp_metric_handler_requests_total | counter | prometheus | Total number of scrapes by HTTP status code. |
Distributed tracing
How to enable distributed tracing for a Teleport instance.
Teleport leverages OpenTelemetry to generate traces and export them to any OpenTelemetry Protocol (OTLP) capable exporter. In the event that your telemetry backend doesn't support receiving OTLP traces, you may be able to leverage the OpenTelemetry Collector to proxy traces from OTLP to a format that your telemetry backend accepts.
Configure Teleport
To enable tracing for a teleport instance, add the following section to that instance's configuration file (/etc/teleport.yaml).
For a detailed description of these configuration fields, see the configuration reference page.
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
Sampling rate
It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the performance of your cluster. Teleport honors the sampling rate included in any incoming requests, which means that even when the tracing_service is enabled and the sampling rate is 0, if Teleport receives a request that has a span which is sampled, then Teleport will sample and export all spans that are generated in response to that request.
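To make the interaction between sampling_rate_per_million and incoming sampled requests concrete, here is a simplified model of the decision. It is a sketch under stated assumptions, not Teleport's sampler: the function names are hypothetical, and the behavior for a request whose parent span is not sampled (falling back to the configured rate) is an assumption.

```python
import random
from typing import Callable


def sampling_fraction(rate_per_million: int) -> float:
    """Convert sampling_rate_per_million into a fraction between 0 and 1."""
    if not 0 <= rate_per_million <= 1_000_000:
        raise ValueError("rate must be between 0 and 1000000")
    return rate_per_million / 1_000_000


def should_sample(
    rate_per_million: int,
    parent_sampled: bool = False,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether spans for a request are sampled in this model."""
    if parent_sampled:
        # An incoming sampled span wins, even when the local rate is 0.
        return True
    return rng() < sampling_fraction(rate_per_million)
```

In this model, `sampling_rate_per_million: 1000000` samples every request, while a rate of 0 samples only requests that arrive already marked as sampled.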
Exporter URL
The exporter_url setting indicates where Teleport should send spans. Supported schemes are grpc://, http://, https://, and file:// (if no scheme is provided, grpc:// is used).
When using file://, the URL must be a path to a directory that Teleport has write permissions for. Spans are saved to files within the provided directory, with each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB. To override the default limit, add ?limit=<desired_file_size_in_bytes> to the exporter_url (e.g. file:///var/lib/teleport/traces?limit=100).
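The exporter_url rules above (supported schemes, the grpc:// default, and the optional limit parameter for file://) can be illustrated with a small parser. This is only a sketch of the documented rules using Python's standard library; the `Exporter` type and `parse_exporter_url` function are hypothetical and do not reflect how Teleport parses the setting internally.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit

SUPPORTED_SCHEMES = {"grpc", "http", "https", "file"}


class Exporter(NamedTuple):
    scheme: str
    target: str  # host:port, or the directory path for file://
    limit_bytes: Optional[int]  # rotation limit; only set for file://


def parse_exporter_url(url: str) -> Exporter:
    """Split an exporter_url into scheme, target, and optional file limit."""
    if "://" not in url:
        # No scheme provided: grpc:// is used.
        url = "grpc://" + url
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme}")
    if parts.scheme == "file":
        limit = parse_qs(parts.query).get("limit")
        return Exporter("file", parts.path, int(limit[0]) if limit else None)
    return Exporter(parts.scheme, parts.netloc, None)
```

For instance, `parse_exporter_url("collector.example.com:4317")` yields the grpc scheme, and a file:// URL with `?limit=100` yields a `limit_bytes` of 100.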
By default, the connection to the exporter is insecure. To support TLS, add the following to the tracing_service configuration:
  # Optional paths to CA certificates that are used to validate the exporter.
  ca_certs:
    - /var/lib/teleport/exporter_ca.pem
  # Optional paths to TLS certificates that are used to enable mTLS for the exporter.
  https_keypairs:
    - key_file: /var/lib/teleport/exporter_key.pem
      cert_file: /var/lib/teleport/exporter_cert.pem
After updating teleport.yaml, start your teleport instance to apply the new configuration.
tsh
To capture traces from tsh, add the --trace flag to your command. All traces generated by tsh --trace will be proxied to the exporter_url defined for the Auth Service of the cluster the command is being run on.
$ tsh --trace ssh root@myserver
$ tsh --trace ls
Exporting traces from tsh to a different exporter than the one defined in the Auth Service config is also possible via the --trace-exporter flag. A URL must be provided that adheres to the same format as the exporter_url of the tracing_service.
$ tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
$ tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls
Collecting profiles
How to collect runtime profiling data from a Teleport instance.
Teleport leverages Go's diagnostic capabilities to collect and export profiling data. Profiles can help identify the cause of spikes in CPU, the source of memory leaks, or the reason for a deadlock.
Using the Debug Service
The Teleport Debug Service enables administrators to collect diagnostic profiles without enabling pprof endpoints at startup. The service, enabled by default, ensures local-only access and must be consumed from inside the same instance.
teleport debug profile
collects a list of pprof profiles. It outputs a
compressed tarball (.tar.gz
) to STDOUT. You can decompress it using tar
or
redirect the output to a file.
By default, it collects goroutine
, heap
and profile
profiles.
Each collected profile has a corresponding file inside the tarball. For
example, collecting goroutine,trace,heap
will result in goroutine.pprof
,
trace.pprof
, and heap.pprof
files.
# Collect default profiles and save to a file.
$ teleport debug profile > pprof.tar.gz
$ tar xvf pprof.tar.gz
# Collect default profiles and decompress them.
$ teleport debug profile | tar xzv -C ./
# Collect "trace" and "mutex" profiles and save to a file.
$ teleport debug profile trace,mutex > pprof.tar.gz
# Collect profiles setting the profiling time in seconds
$ teleport debug profile -s 20 trace > pprof.tar.gz
If your Teleport configuration file is not at the default path
(/etc/teleport.yaml
), you must pass its location to the CLI command
using the -c/--config
flag.
If you're running Teleport on a Kubernetes cluster, you can collect profiles directly to a local file without an interactive session:
$ kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz
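When several pods need to be inspected, the same command can be wrapped in a loop. The following is a minimal sketch: the function name `collect_pod_profiles` and the output file naming are assumptions made for illustration, and the only command it runs is the `kubectl exec` invocation shown above.

```shell
# Hypothetical helper: collect_pod_profiles NAMESPACE POD [POD...]
# Runs the documented kubectl command once per pod and saves each tarball
# in the current directory as <pod>-pprof.tar.gz.
collect_pod_profiles() {
  namespace=$1
  shift
  for pod in "$@"; do
    # Stop at the first failure; a partial tarball for that pod may remain.
    kubectl -n "$namespace" exec "$pod" -- teleport debug profile \
      > "$pod-pprof.tar.gz" || return 1
  done
}
```

A call such as `collect_pod_profiles teleport my-pod my-other-pod` takes plain pod names (without a `pod/` prefix), since each name is reused in the output file name.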
After extracting the contents, you can use go tool
commands to explore and
visualize them:
# Opens the terminal interactive explorer
$ go tool pprof heap.pprof
# Opens the web visualizer
$ go tool pprof -http : heap.pprof
# Visualize trace profiles
$ go tool trace trace.pprof
Using diagnostics endpoints
The profiling endpoint is only enabled if the --debug
flag is supplied.
Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:
- Command line
- Config file
Start a teleport
instance with the --diag-addr
flag set to the local
address where the diagnostic endpoint will listen:
$ sudo teleport start --debug --diag-addr=127.0.0.1:3000
Edit a teleport
instance's configuration file (/etc/teleport.yaml
by
default) to include the following:
teleport:
  diag_addr: 127.0.0.1:3000
To enable debug logs:
log:
  severity: DEBUG
Ensure you can connect to the diagnostic endpoint
Verify that Teleport is now serving the diagnostics endpoint:
$ curl http://127.0.0.1:3000/healthz
Collecting profiles
Go's standard profiling endpoints are served at http://127.0.0.1:3000/debug/pprof/
.
Retrieving a profile requires sending a request to the endpoint corresponding
to the desired profile type. When debugging an issue it is helpful to collect
a series of profiles over a period of time.
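One way to collect such a series is a small loop around curl. This is a sketch under stated assumptions: the function name `collect_series`, its arguments, and the timestamped file naming are hypothetical, and it targets only profile endpoints that need no query parameters, such as goroutine and heap.

```shell
# Hypothetical helper:
#   collect_series BASE_URL PROFILE COUNT INTERVAL_SECONDS [OUT_DIR]
# Downloads PROFILE from BASE_URL/debug/pprof/ COUNT times, waiting
# INTERVAL_SECONDS between requests, and prints each file it writes.
collect_series() {
  base_url=$1
  profile=$2
  count=$3
  interval=$4
  out_dir=${5:-.}
  mkdir -p "$out_dir" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    # Timestamp plus index keeps names unique and sortable.
    out="$out_dir/$profile-$(date +%Y%m%dT%H%M%S)-$i.profile"
    curl -fsS -o "$out" "$base_url/debug/pprof/$profile" || return 1
    echo "$out"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `collect_series http://127.0.0.1:3000 heap 5 60 ./profiles` would take five heap snapshots one minute apart. Two snapshots from the series can then be compared by passing the earlier one to `go tool pprof` with its `-base` flag.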
CPU
CPU profiles show execution statistics gathered over a user-specified period:
# Download the profile into a file:
$ curl -o cpu.profile "http://127.0.0.1:3000/debug/pprof/profile?seconds=30"
# Visualize the profile
$ go tool pprof -http : cpu.profile
Goroutine
Goroutine profiles show the stack traces for all running goroutines in the system:
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
# Visualize the profile
$ go tool pprof -http : goroutine.profile
Heap
Heap profiles show allocated objects in the system:
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
# Visualize the profile
$ go tool pprof -http : heap.profile
Trace
Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events collected by the Go runtime over a user-specified period of time:
# Download the profile into a file:
$ curl -o trace.out "http://127.0.0.1:3000/debug/pprof/trace?seconds=5"
# Visualize the profile
$ go tool trace trace.out
Further Reading
- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
- Go's profiling endpoints: https://golang.org/pkg/net/http/pprof/
- A deep dive on profiling Go programs: https://go.dev/blog/pprof