Teleport Diagnostics

Teleport provides HTTP endpoints for monitoring purposes. They are disabled by default, but you can enable them using the --diag-addr flag when running teleport start:

sudo teleport start --diag-addr=127.0.0.1:3000

Now you can collect monitoring information from several endpoints.

/healthz

The http://127.0.0.1:3000/healthz endpoint responds with a body of {"status":"ok"} and an HTTP 200 OK status code if the process is running.

This is a simple check, suitable for determining if the Teleport process is still running.
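
As an illustration, a liveness probe against this endpoint could look like the sketch below. It assumes the diagnostics address from the example above; the function names `is_healthy` and `probe_healthz` are hypothetical and not part of Teleport.

```python
import json
import urllib.error
import urllib.request

# Address from the example above; adjust to match your --diag-addr value.
HEALTHZ_URL = "http://127.0.0.1:3000/healthz"


def is_healthy(status_code: int, body: bytes) -> bool:
    """Return True for the documented healthy response: HTTP 200 and {"status":"ok"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Body was not valid JSON (or not valid UTF-8).
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe_healthz(url: str = HEALTHZ_URL, timeout: float = 2.0) -> bool:
    """Fetch /healthz and report whether the process answered as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status: treat as not running.
        return False
```

Keeping the response check separate from the network call makes the rule easy to test without a running Teleport process.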

/readyz

The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its response includes information about the state of the process.

The response body is a JSON object of the form:

{"status": "a status message here"}

/readyz and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter a degraded state. Teleport will begin recovering from this state when a heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state.

A second consecutive successful heartbeat will cause Teleport to transition to the OK state, so long as at least 10 seconds have elapsed since the first successful heartbeat.

Teleport heartbeats run every 5 seconds. This means that depending on the timing of heartbeats, it can take 10-20 seconds after connectivity is restored for /readyz to start reporting healthy again.
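
The transitions described above can be modeled as a small state machine. This is a sketch of the prose only, not Teleport's implementation: the class name `ReadinessState`, the state strings, and the choice of "ok" as the starting state are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Minimum time between the first successful heartbeat and the OK state,
# as stated in the text above.
RECOVERY_TIME = 10.0  # seconds


@dataclass
class ReadinessState:
    # Assumed starting point for illustration; real startup behavior differs.
    state: str = "ok"
    recovering_since: Optional[float] = None

    def on_heartbeat(self, now: float, success: bool) -> str:
        """Apply one heartbeat result at time `now` (seconds) and return the new state."""
        if not success:
            # Any failed heartbeat puts the component in the degraded state.
            self.state = "degraded"
            self.recovering_since = None
        elif self.state == "degraded":
            # First successful heartbeat: begin recovering.
            self.state = "recovering"
            self.recovering_since = now
        elif self.state == "recovering":
            # A later success returns to OK once enough time has elapsed.
            if now - self.recovering_since >= RECOVERY_TIME:
                self.state = "ok"
                self.recovering_since = None
        return self.state
```

With heartbeats every 5 seconds, a failure at t=0 followed by successes at t=5, t=10, and t=15 moves this model through degraded, recovering, recovering, and ok.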

Status codes

The status code of the response can be one of:

  • HTTP 200 OK: Teleport is operating normally.
  • HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
  • HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.

The same state information is also available via the process_state metric under the /metrics endpoint.
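
A monitoring script might translate the /readyz status code into a state label, as in the sketch below. The labels and the function names `classify_readyz` and `fetch_readyz_status` are hypothetical; only the three status codes and their meanings come from the list above.

```python
import urllib.error
import urllib.request

# Status codes and meanings as listed above; the label strings are illustrative.
READYZ_STATES = {
    200: "ok",
    400: "starting or recovering",
    503: "degraded",
}


def classify_readyz(status_code: int) -> str:
    """Map a /readyz HTTP status code to a human-readable state label."""
    return READYZ_STATES.get(status_code, "unknown")


def fetch_readyz_status(url: str = "http://127.0.0.1:3000/readyz",
                        timeout: float = 2.0) -> int:
    """Return the HTTP status code of /readyz.

    Connection failures (urllib.error.URLError) are left to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code itself is the answer here.
        return err.code
```

For example, `classify_readyz(fetch_readyz_status())` yields "degraded" while Teleport is returning HTTP 503.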

/debug/pprof

The http://127.0.0.1:3000/debug/pprof/ endpoint exposes Go's standard pprof profiler. It is only available if the --debug (or -d) flag is supplied in addition to --diag-addr.
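
As a sketch of how a profile could be collected from this endpoint, the helper below builds URLs for the profiles served by Go's standard net/http/pprof handler (such as heap, goroutine, and profile) and saves the response to a file. The helper names `pprof_url` and `save_profile` are hypothetical, and the base address is taken from the example above.

```python
import urllib.parse
import urllib.request
from typing import Optional

# Address from the example above; adjust to match your --diag-addr value.
DIAG_BASE = "http://127.0.0.1:3000"


def pprof_url(profile: str, seconds: Optional[int] = None, base: str = DIAG_BASE) -> str:
    """Build the URL for a named pprof profile, e.g. "heap" or "profile"."""
    url = f"{base}/debug/pprof/{profile}"
    if seconds is not None:
        # Timed profiles such as "profile" and "trace" accept a duration.
        url += "?" + urllib.parse.urlencode({"seconds": seconds})
    return url


def save_profile(profile: str, path: str, seconds: Optional[int] = None) -> None:
    """Download a profile to `path`; requires Teleport to run with --debug."""
    # Allow the request to outlast the sampling window.
    timeout = (seconds or 0) + 10
    with urllib.request.urlopen(pprof_url(profile, seconds), timeout=timeout) as resp:
        with open(path, "wb") as out:
            out.write(resp.read())
```

A file saved this way, for example with `save_profile("heap", "heap.pprof")`, can then be inspected with `go tool pprof heap.pprof`.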

/metrics

The http://127.0.0.1:3000/metrics endpoint serves the internal metrics Teleport is tracking. It is compatible with Prometheus collectors.
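
To show what a collector receives, the sketch below parses a response in the Prometheus text exposition format into a dictionary. It is a deliberately minimal reader, not a replacement for a Prometheus client library: it assumes samples carry no trailing timestamp, and every value in `SAMPLE` (as well as the `example_metric` series) is made up for illustration.

```python
from typing import Dict

# Made-up sample in the Prometheus text format; values are illustrative only.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# TYPE rx counter
rx 1024
# TYPE tx counter
tx 2048
example_metric{label="a b"} 3
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus any label set) to its sample value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamps assumed),
        # so label values containing spaces stay part of the series key.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

Calling `parse_metrics(SAMPLE)["go_goroutines"]` returns `42.0`. The same function can be applied to the decoded body of a request to the /metrics endpoint.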

Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.

The following metrics are available:

Auth Service and backends

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| audit_failed_disk_monitoring | counter | Teleport Audit Log | Number of times disk monitoring failed. |
| audit_failed_emit_events | counter | Teleport Audit Log | Number of times emitting audit events failed. |
| audit_percentage_disk_space_used | gauge | Teleport Audit Log | Percentage of disk space used. |
| audit_server_open_files | gauge | Teleport Audit Log | Number of open audit files. |
| auth_generate_requests_throttled_total | counter | Teleport Auth | Number of throttled requests to generate new server keys. |
| auth_generate_requests_total | counter | Teleport Auth | Number of requests to generate new server keys. |
| auth_generate_requests | gauge | Teleport Auth | Number of current generate requests. |
| auth_generate_seconds | histogram | Teleport Auth | Latency for generate requests. |
| backend_batch_read_requests_total | counter | cache | Number of read requests to the backend. |
| backend_batch_read_seconds | histogram | cache | Latency for batch read operations. |
| backend_batch_write_requests_total | counter | cache | Number of batch write requests to the backend. |
| backend_batch_write_seconds | histogram | cache | Latency for backend batch write operations. |
| backend_read_requests_total | counter | cache | Number of read requests to the backend. |
| backend_read_seconds | histogram | cache | Latency for read operations. |
| backend_write_requests_total | counter | cache | Number of write requests to the backend. |
| backend_write_seconds | histogram | cache | Latency for backend write operations. |
| cluster_name_not_found_total | counter | Teleport Auth | Number of times a cluster was not found. |
| etcd_backend_batch_read_requests | counter | etcd | Number of read requests to the etcd database. |
| etcd_backend_batch_read_seconds | histogram | etcd | Latency for etcd read operations. |
| etcd_backend_read_requests | counter | etcd | Number of read requests to the etcd database. |
| etcd_backend_read_seconds | histogram | etcd | Latency for etcd read operations. |
| etcd_backend_tx_requests | counter | etcd | Number of transaction requests to the database. |
| etcd_backend_tx_seconds | histogram | etcd | Latency for etcd transaction operations. |
| etcd_backend_write_requests | counter | etcd | Number of write requests to the database. |
| etcd_backend_write_seconds | histogram | etcd | Latency for etcd write operations. |
| firestore_events_backend_batch_read_requests | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events. |
| firestore_events_backend_batch_read_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations. |
| firestore_events_backend_batch_write_requests | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events. |
| firestore_events_backend_batch_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations. |
| gcs_event_storage_downloads_seconds | histogram | GCP GCS | Latency for GCS download operations. |
| gcs_event_storage_downloads | counter | GCP GCS | Number of downloads from the GCS backend. |
| gcs_event_storage_uploads_seconds | histogram | GCP GCS | Latency for GCS upload operations. |
| gcs_event_storage_uploads | counter | GCP GCS | Number of uploads to the GCS backend. |
| heartbeat_connections_missed_total | counter | Teleport Auth | Number of times the Auth Service did not receive a heartbeat from a Node. |
| heartbeat_connections_received_total | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection. |
| teleport_audit_emit_events | counter | Teleport Audit Log | Number of audit events emitted. |
| teleport_connected_resources | gauge | Teleport Auth Service | Tracks the number and type of resources connected via keepalives. |
| teleport_registered_servers | gauge | Teleport Auth Service | The number of Teleport servers (a server consists of one or more Teleport services) that have connected to the Teleport cluster, including the Teleport version. After disconnecting, a Teleport server has a TTL of 10 minutes, so this value will include servers that have recently disconnected but have not reached their TTL. |
| user_login_total | counter | Teleport Auth Service | Number of user logins. |
| watcher_event_sizes | histogram | cache | Overall size of events emitted. |
| watcher_events | histogram | cache | Per resource size of events emitted. |

Proxy Service

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| failed_connect_to_node_attempts_total | counter | Teleport Proxy | Number of times a user failed connecting to a Node. |
| failed_login_attempts_total | counter | Teleport Proxy | Number of failed tsh login or tsh ssh logins. |
| proxy_connection_limit_exceeded_total | counter | Teleport Proxy | Number of connections that exceeded the proxy connection limit. |
| proxy_missing_ssh_tunnels | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if nodes have discovered all proxies. |
| teleport_connect_to_node_attempts_total | counter | Teleport Proxy | Number of SSH connection attempts to a node. Use with failed_connect_to_node_attempts_total to get the failure rate. |
| teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
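
The failure rate mentioned for teleport_connect_to_node_attempts_total can be derived from two scrapes of the counters. The sketch below is one way to do that, assuming each scrape has already been reduced to a mapping from metric name to value and that the two series carry no labels; the function name `connect_failure_rate` is hypothetical.

```python
from typing import Mapping, Optional

ATTEMPTS = "teleport_connect_to_node_attempts_total"
FAILURES = "failed_connect_to_node_attempts_total"


def connect_failure_rate(earlier: Mapping[str, float],
                         later: Mapping[str, float]) -> Optional[float]:
    """Fraction of node connection attempts that failed between two scrapes."""
    attempts = later[ATTEMPTS] - earlier[ATTEMPTS]
    failures = later[FAILURES] - earlier[FAILURES]
    if attempts <= 0 or failures < 0:
        # No new attempts, or a counter reset (e.g. after a restart):
        # there is no meaningful rate for this interval.
        return None
    return failures / attempts
```

Because both metrics are counters, the rate is computed from the increase over an interval rather than from the raw totals, which only ever grow until the process restarts.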

Teleport Nodes

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| user_max_concurrent_sessions_hit_total | counter | Teleport Node | Number of times a user exceeded their concurrent session limit. |

All Teleport instances

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| certificate_mismatch_total | counter | Teleport | Number of SSH server login failures due to a certificate mismatch. |
| reversetunnel_connected_proxies | gauge | Teleport | Number of known proxies being sought. |
| rx | counter | Teleport | Number of bytes received during an SSH connection. |
| server_interactive_sessions_total | gauge | Teleport | Number of active sessions. |
| teleport_build_info | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1. |
| teleport_cache_events | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service. |
| teleport_cache_stale_events | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend. |
| trusted_clusters | gauge | Teleport | Number of tunnels per state. |
| tx | counter | Teleport | Number of bytes transmitted during an SSH connection. |

Golang runtime metrics

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| go_gc_duration_seconds | summary | Internal Golang | A summary of GC invocation durations. |
| go_goroutines | gauge | Internal Golang | Number of goroutines that currently exist. |
| go_info | gauge | Internal Golang | Information about the Go environment. |
| go_memstats_alloc_bytes_total | counter | Internal Golang | Total number of bytes allocated, even if freed. |
| go_memstats_alloc_bytes | gauge | Internal Golang | Number of bytes allocated and still in use. |
| go_memstats_buck_hash_sys_bytes | gauge | Internal Golang | Number of bytes used by the profiling bucket hash table. |
| go_memstats_frees_total | counter | Internal Golang | Total number of frees. |
| go_memstats_gc_cpu_fraction | gauge | Internal Golang | The fraction of this program's available CPU time used by the GC since the program started. |
| go_memstats_gc_sys_bytes | gauge | Internal Golang | Number of bytes used for garbage collection system metadata. |
| go_memstats_heap_alloc_bytes | gauge | Internal Golang | Number of heap bytes allocated and still in use. |
| go_memstats_heap_idle_bytes | gauge | Internal Golang | Number of heap bytes waiting to be used. |
| go_memstats_heap_inuse_bytes | gauge | Internal Golang | Number of heap bytes that are in use. |
| go_memstats_heap_objects | gauge | Internal Golang | Number of allocated objects. |
| go_memstats_heap_released_bytes | gauge | Internal Golang | Number of heap bytes released to the OS. |
| go_memstats_heap_sys_bytes | gauge | Internal Golang | Number of heap bytes obtained from the system. |
| go_memstats_last_gc_time_seconds | gauge | Internal Golang | Number of seconds since the Unix epoch of the last garbage collection. |
| go_memstats_lookups_total | counter | Internal Golang | Total number of pointer lookups. |
| go_memstats_mallocs_total | counter | Internal Golang | Total number of mallocs. |
| go_memstats_mcache_inuse_bytes | gauge | Internal Golang | Number of bytes in use by mcache structures. |
| go_memstats_mcache_sys_bytes | gauge | Internal Golang | Number of bytes used for mcache structures obtained from system. |
| go_memstats_mspan_inuse_bytes | gauge | Internal Golang | Number of bytes in use by mspan structures. |
| go_memstats_mspan_sys_bytes | gauge | Internal Golang | Number of bytes used for mspan structures obtained from system. |
| go_memstats_next_gc_bytes | gauge | Internal Golang | Number of heap bytes when next the garbage collection will take place. |
| go_memstats_other_sys_bytes | gauge | Internal Golang | Number of bytes used for other system allocations. |
| go_memstats_stack_inuse_bytes | gauge | Internal Golang | Number of bytes in use by the stack allocator. |
| go_memstats_stack_sys_bytes | gauge | Internal Golang | Number of bytes obtained from the system for stack allocator. |
| go_memstats_sys_bytes | gauge | Internal Golang | Number of bytes obtained from the system. |
| go_threads | gauge | Internal Golang | Number of OS threads created. |
| process_cpu_seconds_total | counter | Internal Golang | Total user and system CPU time spent in seconds. |
| process_max_fds | gauge | Internal Golang | Maximum number of open file descriptors. |
| process_open_fds | gauge | Internal Golang | Number of open file descriptors. |
| process_resident_memory_bytes | gauge | Internal Golang | Resident memory size in bytes. |
| process_start_time_seconds | gauge | Internal Golang | Start time of the process since the Unix epoch in seconds. |
| process_virtual_memory_bytes | gauge | Internal Golang | Virtual memory size in bytes. |
| process_virtual_memory_max_bytes | gauge | Internal Golang | Maximum amount of virtual memory available in bytes. |

Prometheus

| Name | Type | Component | Description |
| --- | --- | --- | --- |
| promhttp_metric_handler_requests_in_flight | gauge | prometheus | Current number of scrapes being served. |
| promhttp_metric_handler_requests_total | counter | prometheus | Total number of scrapes by HTTP status code. |