Health Monitoring
Teleport provides health checking mechanisms to verify that a Teleport process is healthy and ready to serve traffic. Monitoring systems such as Kubernetes liveness and readiness probes can use these endpoints to track the health of a Teleport process.
Enable health monitoring
Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them via:
- Command line
- Config file
Start a teleport instance with the --diag-addr flag set to the local
address where the diagnostic endpoint will listen:
sudo teleport start --diag-addr=127.0.0.1:3000
Edit a teleport instance's configuration file (/etc/teleport.yaml by
default) to include the following:
teleport:
  diag_addr: 127.0.0.1:3000
To enable debug logs, set log.severity under the teleport section:
teleport:
  log:
    severity: DEBUG
Restart the service for the change to take effect:
sudo systemctl restart teleport
Ensure you can connect to the diagnostic endpoint
Verify that Teleport is now serving the diagnostics endpoint:
curl http://127.0.0.1:3000/healthz
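If the diagnostics endpoint is enabled and reachable, the command prints:
{"status":"ok"}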
Now you can collect monitoring information from several endpoints.
/healthz
The http://127.0.0.1:3000/healthz endpoint responds with a body of
{"status":"ok"} and an HTTP 200 OK status code if the process is running.
This is a simple check, suitable for determining if the Teleport process is still running.
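For scripted host-level checks, curl's exit status can stand in for the HTTP result. A minimal sketch; the alert message is a placeholder:
# --fail makes curl exit non-zero on an HTTP error status, and curl already
# exits non-zero if the connection itself fails.
curl --fail --silent --output /dev/null http://127.0.0.1:3000/healthz || echo "teleport process is down"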
/readyz
The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its
response includes information about the state of the process.
The response body is a JSON object of the form:
{"status": "a status message here"}
Example:
curl http://127.0.0.1:3000/readyz
{"status":"ok","pid":47092}
Agent lifecycle states
/readyz reports one of the following lifecycle states. Each state corresponds
to a specific HTTP status code, response body, and value of the process_state
metric, which makes the state machine the basis for both probe-style and
metric-based monitoring.
| State | HTTP code | /readyz body status | process_state value | When it applies |
|---|---|---|---|---|
| Starting / not yet joined | 400 | "teleport is starting and hasn't joined the cluster yet" | 3 (starting) | The process has launched but has not yet completed an initial heartbeat with the Auth Service. New agents sit in this state until they have successfully joined. |
| Degraded | 503 | indicates the failing component | 2 (degraded) | A component has failed its heartbeat. The most common cause is loss of connectivity to the Auth Service. |
| Recovering | 400 | indicates the recovering component | 1 (recovering) | A previously degraded component completed one successful heartbeat. A second consecutive successful heartbeat returns the component to OK. |
| OK | 200 | "ok" | 0 (ok) | All heartbeats are succeeding. |
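Scripted probes should key off the HTTP status code rather than parse the body. A minimal shell sketch against the diagnostic address used above:
# Map the /readyz HTTP status code to a ready/not-ready decision.
code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:3000/readyz)
if [ "$code" = "200" ]; then echo "ready"; else echo "not ready (HTTP $code)"; fi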
The numeric process_state values are stable across Teleport versions and form part of the public metrics API.
Use them directly in Prometheus alert rules, for example process_state == 2 to detect a degraded process, as in the sketch below.
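A sketch of such a rule; the group name, alert name, and for duration are illustrative choices, not Teleport defaults:
groups:
  - name: teleport-health
    rules:
      - alert: TeleportProcessDegraded
        expr: process_state == 2
        for: 2m
        annotations:
          summary: "Teleport process on {{ $labels.instance }} is degraded"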
Heartbeats run approximately every 60 seconds when healthy and are retried
approximately every 5 seconds after a failure. Depending on heartbeat timing,
it can take 60-70 seconds after connectivity is restored for /readyz to
report OK again. Note that custom intervals may apply if health_check_config is defined in the configuration file.
The same state information is also available via the process_state metric
under the /metrics endpoint.
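For example, to read the current value directly:
curl -s http://127.0.0.1:3000/metrics | grep '^process_state'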
Querying /readyz from inside a running agent
If exposing the diagnostic endpoint on a network address is not practical (for
example, on agents running in containers without an exposed port), use
teleport debug readyz to query the local /readyz endpoint over a Unix
socket served by the Debug Service. This requires
exec access to the agent's host or pod but no additional network
configuration:
teleport debug readyz
{"status":"ok"}
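In Kubernetes, this pairs naturally with kubectl exec; the namespace and pod name below are placeholders:
kubectl exec -n <namespace> <agent-pod> -- teleport debug readyz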
Monitoring agent join from the control plane
/readyz answers the question "is this agent process healthy from its own
perspective?" The complementary question, "does the Teleport cluster see
this agent as joined?", is answered by the tctl inventory family of
commands, which report on what the Auth Service knows about each connected
instance.
Use these commands when you need to monitor a fleet of agents from a central location, audit which agents are reachable from the cluster, or troubleshoot a join that appears successful on the agent side but doesn't surface as a resource in the Web UI.
List the cluster's instance inventory
tctl inventory ls lists every agent currently connected to the Auth
Service, along with the services each agent is running, its version, and its
upgrader configuration:
tctl inventory ls
Server ID                            Hostname        Services Agent Version Upgrader Upgrader Version Update Group
------------------------------------ --------------- -------- ------------- -------- ---------------- ------------
671f3c6b-f9ef-4821-a895-1ce7193be3aa teleport-node-1 Node     v18.7.3       none     none
dac31781-af88-46a0-9f18-01f3c2af7152 macos-node      Node     v18.6.4       none     none
Inventory is heartbeat-based and updates periodically. In large fleets, an agent that has just joined or just disconnected may take several minutes to reflect in the output.
Show only currently connected agents
tctl inventory status --connected shows the agents that are currently
connected to the Auth Service instance handling this request, along with
the services they are running and the services that have heartbeated
successfully:
tctl inventory status --connected
In high-availability deployments with multiple Auth Service instances, each
Auth instance only sees the control streams of the agents directly connected
to it. An agent connected to a different Auth instance will not appear in
this output, even though it is healthy. Successive tctl calls may land on
different Auth instances and return different sets of agents.
For ad-hoc troubleshooting on a single-Auth deployment or when run directly on a specific Auth host, this is a useful "what is connected to me right now" view. For cluster-wide monitoring in HA deployments, use:
tctl inventory ls
which returns the full heartbeat-based inventory the cluster shares across Auth instances.
Ping a specific agent
tctl inventory ping <server-id> sends a request through the agent's existing
connection to the Auth Service and reports the round-trip latency. This
verifies bidirectional reachability, which /readyz does not check:
tctl inventory ping <server-id>
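For example, using the first server ID from the inventory listing above:
tctl inventory ping 671f3c6b-f9ef-4821-a895-1ce7193be3aa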
A failed or timed-out ping means the Auth Service does not currently have
a live control stream to this agent. This can happen because the agent has
disconnected, because of a network problem in either direction, or because
the agent is connected to a different Auth Service instance in an HA
deployment. inventory ls may continue to show the agent for up to the
instance heartbeat TTL (20 minutes by default) after a disconnect, so its
presence there does not by itself confirm the agent is currently reachable.
Recommended monitoring approach
No single signal answers every question about agent health. We recommend combining the following:
| Question | Use |
|---|---|
| Is the agent process running? | /healthz (Kubernetes liveness probe, or a host-level check such as systemctl status teleport) |
| Has the agent joined the cluster and completed a recent heartbeat? | /readyz (Kubernetes readiness probe) |
| Has the agent heartbeated to the cluster recently? | tctl inventory ls (heartbeat-backed, with a TTL of around 20 minutes) |
| Does an Auth instance currently have a live control stream to this agent? | tctl inventory ping <server-id> |
| Did an agent join, disconnect, or fail to authenticate? | Audit events: instance.join, bot.join, join_token.create, cert.create |
For Kubernetes deployments, configure /healthz as the liveness probe and
/readyz as the readiness probe on agent pods. Do not use /readyz as a
liveness probe — it returns HTTP 400 during the normal Starting and
Recovering states, which would cause the kubelet to restart pods during
normal join and recovery windows. /healthz returns HTTP 200 whenever the
process is running, which is the correct signal for liveness.
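A minimal sketch of these probes, assuming the pod runs teleport with
--diag-addr=0.0.0.0:3000 so the kubelet can reach the endpoint; the period
and threshold values are illustrative:
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  periodSeconds: 10
  failureThreshold: 6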
Pair these probes with a periodic check from your monitoring system that
runs tctl inventory ls against the control plane and alerts on agents
that disappear from the inventory. Note that inventory ls is
heartbeat-backed with a TTL of around 20 minutes, so a missing agent
indicates the cluster has not heard a heartbeat from it within that window
— useful for catching outages but not real-time. For faster detection of
local-process failures, rely on the /readyz readiness probe.
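One sketch of such a check, assuming a file of expected server IDs and a hypothetical notify command for raising alerts:
# For each expected agent, alert if it is absent from the current inventory.
inventory=$(tctl inventory ls)
while read -r id; do
  echo "$inventory" | grep -q "$id" || notify "agent $id missing from inventory"
done < expected-agent-ids.txt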
Known limitations of /readyz
/readyz reports the result of the agent's most recent heartbeat with the
Auth Service. It does not currently reflect:
- Per-service health. If an agent is configured to run multiple services
  (for example, app_service and ssh_service), /readyz can return 200 OK
  even when one of those services has failed to start. See #43440.
- Auth backend connectivity. The Auth Service can transition out of a
  degraded state quickly enough that /readyz returns 200 OK between
  heartbeat failures, even when the backend remains unreachable. See #52273.
- Bidirectional cluster communication. /readyz reflects what the agent
  knows about its own outbound heartbeats, not whether the cluster can
  reach back to the agent. See #2276.
For higher-confidence monitoring, combine /readyz with the cluster-side
checks described in Monitoring agent join from the control plane.