Teleport provides HTTP endpoints for monitoring purposes. They are disabled by
default, but you can enable them using the --diag-addr
flag when running
teleport start
:
sudo teleport start --diag-addr=127.0.0.1:3000
Now you can collect monitoring information from several endpoints.
/healthz
The http://127.0.0.1:3000/healthz
endpoint responds with a body of
{"status":"ok"}
and an HTTP 200 OK status code if the process is running.
This is a simple check, suitable for determining if the Teleport process is still running.
/readyz
The http://127.0.0.1:3000/readyz
endpoint is similar to /healthz
, but its
response includes information about the state of the process.
The response body is a JSON object of the form:
{ "status": "a status message here"}
/readyz
and heartbeats
If a Teleport component fails to execute its heartbeat procedure, it will enter a degraded state. Teleport will begin recovering from this state when a heartbeat completes successfully.
The first successful heartbeat will transition Teleport into a recovering state.
A second consecutive successful heartbeat will cause Teleport to transition to the OK state, so long as at least 10 seconds have elapsed since the first successful heartbeat.
Teleport heartbeats run every 5 seconds. This means that depending on the timing
of heartbeats, it can take 10-20 seconds after connectivity is restored for
/readyz
to start reporting healthy again.
Status codes
The status code of the response can be one of:
- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.
The same state information is also available via the process_state
metric
under the /metrics
endpoint.
/debug/pprof
The http://127.0.0.1:3000/debug/pprof/
endpoint is Go's standard pprof
profiler. This endpoint is only available if the --debug
(or -d
) flag is
supplied (in addition to --diag-addr
).
/metrics
The http://127.0.0.1:3000/metrics
endpoint serves the internal metrics
Teleport is tracking. It is compatible with
Prometheus collectors.
The following metrics are available:
Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
Auth Service and backends
Name | Type | Component | Description |
---|---|---|---|
audit_failed_disk_monitoring | counter | Teleport Audit Log | Number of times disk monitoring failed. |
audit_failed_emit_events | counter | Teleport Audit Log | Number of times emitting audit events failed. |
audit_percentage_disk_space_used | gauge | Teleport Audit Log | Percentage of disk space used. |
audit_server_open_files | gauge | Teleport Audit Log | Number of open audit files. |
auth_generate_requests_throttled_total | counter | Teleport Auth | Number of throttled requests to generate new server keys. |
auth_generate_requests_total | counter | Teleport Auth | Number of requests to generate new server keys. |
auth_generate_requests | gauge | Teleport Auth | Number of current generate requests. |
auth_generate_seconds | histogram | Teleport Auth | Latency for generate requests. |
backend_batch_read_requests_total | counter | cache | Number of read requests to the backend. |
backend_batch_read_seconds | histogram | cache | Latency for batch read operations. |
backend_batch_write_requests_total | counter | cache | Number of batch write requests to the backend. |
backend_batch_write_seconds | histogram | cache | Latency for backend batch write operations. |
backend_read_requests_total | counter | cache | Number of read requests to the backend. |
backend_read_seconds | histogram | cache | Latency for read operations. |
backend_write_requests_total | counter | cache | Number of write requests to the backend. |
backend_write_seconds | histogram | cache | Latency for backend write operations. |
cluster_name_not_found_total | counter | Teleport Auth | Number of times a cluster was not found. |
etcd_backend_batch_read_requests | counter | etcd | Number of read requests to the etcd database. |
etcd_backend_batch_read_seconds | histogram | etcd | Latency for etcd read operations. |
etcd_backend_read_requests | counter | etcd | Number of read requests to the etcd database. |
etcd_backend_read_seconds | histogram | etcd | Latency for etcd read operations. |
etcd_backend_tx_requests | counter | etcd | Number of transaction requests to the database. |
etcd_backend_tx_seconds | histogram | etcd | Latency for etcd transaction operations. |
etcd_backend_write_requests | counter | etcd | Number of write requests to the database. |
etcd_backend_write_seconds | histogram | etcd | Latency for etcd write operations. |
firestore_events_backend_batch_read_requests | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events. |
firestore_events_backend_batch_read_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations. |
firestore_events_backend_batch_write_requests | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events. |
firestore_events_backend_batch_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations. |
gcs_event_storage_downloads_seconds | histogram | GCP GCS | Latency for GCS download operations. |
gcs_event_storage_downloads | counter | GCP GCS | Number of downloads from the GCS backend. |
gcs_event_storage_uploads_seconds | histogram | GCP GCS | Latency for GCS upload operations. |
gcs_event_storage_uploads | counter | GCP GCS | Number of uploads to the GCS backend. |
heartbeat_connections_missed_total | counter | Teleport Auth | Number of times the Auth Service did not receive a heartbeat from a Node. |
heartbeat_connections_received_total | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection. |
teleport_audit_emit_events | counter | Teleport Audit Log | Number of audit events emitted. |
teleport_connected_resources | gauge | Teleport Auth Service | Tracks the number and type of resources connected via keepalives. |
teleport_registered_servers | gauge | Teleport Auth Service | The number of Teleport servers (a server consists of one or more Teleport services) that have connected to the Teleport cluster, including the Teleport version. After disconnecting, a Teleport server has a TTL of 10 minutes, so this value will include servers that have recently disconnected but have not reached their TTL. |
user_login_total | counter | Teleport Auth Service | Number of user logins. |
watcher_event_sizes | histogram | cache | Overall size of events emitted. |
watcher_events | histogram | cache | Per resource size of events emitted. |
Proxy Service
Name | Type | Component | Description |
---|---|---|---|
failed_connect_to_node_attempts_total | counter | Teleport Proxy | Number of times a user failed connecting to a Node. |
failed_login_attempts_total | counter | Teleport Proxy | Number of failed tsh login or tsh ssh logins. |
proxy_connection_limit_exceeded_total | counter | Teleport Proxy | Number of connections that exceeded the proxy connection limit. |
proxy_missing_ssh_tunnels | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if nodes have discovered all proxies. |
teleport_connect_to_node_attempts_total | counter | Teleport Proxy | Number of SSH connection attempts to a node. Use with failed_connect_to_node_attempts_total to get the failure rate. |
teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
Teleport Nodes
Name | Type | Component | Description |
---|---|---|---|
user_max_concurrent_sessions_hit_total | counter | Teleport Node | Number of times a user exceeded their concurrent session limit. |
All Teleport instances
Name | Type | Component | Description |
---|---|---|---|
certificate_mismatch_total | counter | Teleport | Number of SSH server login failures due to a certificate mismatch. |
reversetunnel_connected_proxies | gauge | Teleport | Number of known proxies being sought. |
rx | counter | Teleport | Number of bytes received during an SSH connection. |
server_interactive_sessions_total | gauge | Teleport | Number of active sessions. |
teleport_build_info | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1. |
teleport_cache_events | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service. |
teleport_cache_stale_events | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend. |
trusted_clusters | gauge | Teleport | Number of tunnels per state. |
tx | counter | Teleport | Number of bytes transmitted during an SSH connection. |
Golang runtime metrics
Name | Type | Component | Description |
---|---|---|---|
go_gc_duration_seconds | summary | Internal Golang | A summary of GC invocation durations. |
go_goroutines | gauge | Internal Golang | Number of goroutines that currently exist. |
go_info | gauge | Internal Golang | Information about the Go environment. |
go_memstats_alloc_bytes_total | counter | Internal Golang | Total number of bytes allocated, even if freed. |
go_memstats_alloc_bytes | gauge | Internal Golang | Number of bytes allocated and still in use. |
go_memstats_buck_hash_sys_bytes | gauge | Internal Golang | Number of bytes used by the profiling bucket hash table. |
go_memstats_frees_total | counter | Internal Golang | Total number of frees. |
go_memstats_gc_cpu_fraction | gauge | Internal Golang | The fraction of this program's available CPU time used by the GC since the program started. |
go_memstats_gc_sys_bytes | gauge | Internal Golang | Number of bytes used for garbage collection system metadata. |
go_memstats_heap_alloc_bytes | gauge | Internal Golang | Number of heap bytes allocated and still in use. |
go_memstats_heap_idle_bytes | gauge | Internal Golang | Number of heap bytes waiting to be used. |
go_memstats_heap_inuse_bytes | gauge | Internal Golang | Number of heap bytes that are in use. |
go_memstats_heap_objects | gauge | Internal Golang | Number of allocated objects. |
go_memstats_heap_released_bytes | gauge | Internal Golang | Number of heap bytes released to the OS. |
go_memstats_heap_sys_bytes | gauge | Internal Golang | Number of heap bytes obtained from the system. |
go_memstats_last_gc_time_seconds | gauge | Internal Golang | Number of seconds since the Unix epoch of the last garbage collection. |
go_memstats_lookups_total | counter | Internal Golang | Total number of pointer lookups. |
go_memstats_mallocs_total | counter | Internal Golang | Total number of mallocs. |
go_memstats_mcache_inuse_bytes | gauge | Internal Golang | Number of bytes in use by mcache structures. |
go_memstats_mcache_sys_bytes | gauge | Internal Golang | Number of bytes used for mcache structures obtained from system. |
go_memstats_mspan_inuse_bytes | gauge | Internal Golang | Number of bytes in use by mspan structures. |
go_memstats_mspan_sys_bytes | gauge | Internal Golang | Number of bytes used for mspan structures obtained from system. |
go_memstats_next_gc_bytes | gauge | Internal Golang | Number of heap bytes when next the garbage collection will take place. |
go_memstats_other_sys_bytes | gauge | Internal Golang | Number of bytes used for other system allocations. |
go_memstats_stack_inuse_bytes | gauge | Internal Golang | Number of bytes in use by the stack allocator. |
go_memstats_stack_sys_bytes | gauge | Internal Golang | Number of bytes obtained from the system for stack allocator. |
go_memstats_sys_bytes | gauge | Internal Golang | Number of bytes obtained from the system. |
go_threads | gauge | Internal Golang | Number of OS threads created. |
process_cpu_seconds_total | counter | Internal Golang | Total user and system CPU time spent in seconds. |
process_max_fds | gauge | Internal Golang | Maximum number of open file descriptors. |
process_open_fds | gauge | Internal Golang | Number of open file descriptors. |
process_resident_memory_bytes | gauge | Internal Golang | Resident memory size in bytes. |
process_start_time_seconds | gauge | Internal Golang | Start time of the process since the Unix epoch in seconds. |
process_virtual_memory_bytes | gauge | Internal Golang | Virtual memory size in bytes. |
process_virtual_memory_max_bytes | gauge | Internal Golang | Maximum amount of virtual memory available in bytes. |
Prometheus
Name | Type | Component | Description |
---|---|---|---|
promhttp_metric_handler_requests_in_flight | gauge | prometheus | Current number of scrapes being served. |
promhttp_metric_handler_requests_total | counter | prometheus | Total number of scrapes by HTTP status code. |