Diagnostics Service
Date: 2025-09-25
The tbot process can optionally expose a diagnostics service. It is disabled by default; once enabled, it allows useful information about the running tbot process to be queried via HTTP.
Configuration
To enable the diagnostics service, you must specify an address and port for it to listen on.
For security reasons, you should ensure that access to this listener is restricted. In most cases, the most secure thing to do is to bind the listener to 127.0.0.1, which will only allow access from the local machine.
You can configure the diagnostics service using the --diag-addr CLI parameter:
tbot start -c my-config.yaml --diag-addr 127.0.0.1:3001
Or directly within the configuration file using diag_addr:
diag_addr: 127.0.0.1:3001
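As a rough illustration of the advice to bind to the local machine, the following Python sketch checks whether a host:port value of the kind passed to diag_addr points at a loopback address. The function name is hypothetical and not part of tbot, and treating the hostname "localhost" as loopback without resolving it is an assumption.

```python
import ipaddress


def is_loopback_listener(diag_addr: str) -> bool:
    """Return True if a host:port listener address uses a loopback IP.

    Hypothetical helper for illustration only; tbot does not ship it.
    """
    host, sep, port = diag_addr.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {diag_addr!r}")
    host = host.strip("[]")  # accept bracketed IPv6 such as [::1]:3001
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Hostnames are not resolved here; assume only "localhost" is loopback.
        return host == "localhost"
```

With this sketch, is_loopback_listener("127.0.0.1:3001") is true, while an address such as 0.0.0.0:3001 is reported as not loopback.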
Endpoints
The diagnostics service exposes the following HTTP endpoints.
/livez
The /livez endpoint always returns a 200 status code. This can be used to determine if the tbot process is running and has not crashed or hung.
If deploying to Kubernetes, we recommend using this endpoint for your Liveness Probe.
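Outside Kubernetes, a similar liveness check can be scripted. The sketch below is one possible approach using only the Python standard library; the function name and the two-second timeout are assumptions, and the base URL depends on the address you configured for the diagnostics service.

```python
import urllib.error
import urllib.request


def is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/livez answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/livez", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status:
        # treat all of these as "not alive".
        return False
```

For the configuration shown earlier, this would be called as is_alive("http://127.0.0.1:3001"). The timeout matters because a hung process may accept the connection but never respond.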
/readyz and /readyz/{service}
The /readyz endpoint returns the overall health of tbot, including all of its internal and user-defined services. If all services are healthy, it will respond with a 200 status code. If any service is unhealthy, it will respond with a 503 status code.
curl -v http://127.0.0.1:3001/readyz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "unhealthy",
  "services": {
    "ca-rotation": { "status": "healthy" },
    "heartbeat": { "status": "healthy" },
    "identity": { "status": "healthy" },
    "aws-roles-anywhere": {
      "status": "unhealthy",
      "reason": "access denied to perform action \"read\" on \"workload_identity\""
    }
  }
}
If deploying to Kubernetes, we recommend using this endpoint for your Readiness Probe.
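When the overall status is unhealthy, the response body identifies which service is at fault. The following Python sketch extracts the unhealthy services and their reasons from a /readyz body. The function name is hypothetical, and the field names are taken from the example response above, abbreviated here to two services.

```python
import json


def unhealthy_services(readyz_body: str) -> dict:
    """Map each unhealthy service name to its reported reason (may be empty)."""
    report = json.loads(readyz_body)
    return {
        name: info.get("reason", "")
        for name, info in report.get("services", {}).items()
        if info.get("status") != "healthy"
    }


# Body abbreviated from the /readyz example response.
body = r'''
{
  "status": "unhealthy",
  "services": {
    "identity": { "status": "healthy" },
    "aws-roles-anywhere": {
      "status": "unhealthy",
      "reason": "access denied to perform action \"read\" on \"workload_identity\""
    }
  }
}
'''

for service, reason in unhealthy_services(body).items():
    print(f"{service}: {reason}")
```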
You can also use the /readyz/{service} endpoint to query the health of a specific service.
curl -v http://127.0.0.1:3001/readyz/aws-roles-anywhere
HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy"
}
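The same queries can be made from a script. The sketch below, using only the Python standard library, fetches either /readyz or /readyz/{service} and returns the status code together with the parsed JSON body. The function name is hypothetical. It assumes error responses carry a JSON body, as the 503 example for /readyz does; whether the per-service endpoint also returns 503 for an unhealthy service is an assumption here.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import quote


def fetch_health(base_url: str, service: Optional[str] = None, timeout: float = 2.0):
    """Return (http_status, parsed_json_body) for /readyz or /readyz/{service}."""
    path = "/readyz" if service is None else "/readyz/" + quote(service, safe="")
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on statuses such as 503, but the response body
        # can still be read from the exception object.
        return err.code, json.loads(err.read())
```

For example, fetch_health("http://127.0.0.1:3001", "aws-roles-anywhere") would query the service shown in the curl example. Connection failures are deliberately not caught, so that an unreachable listener is not mistaken for an unhealthy service.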
By default, tbot generates service names based on their configuration, such as the output destination. You can override this by providing your own name in the tbot configuration file.
services:
  - type: identity
    name: my-service-123
/metrics
The /metrics endpoint returns a Prometheus-compatible metrics snapshot.
See Prometheus metrics below for more information.
/debug/pprof
These endpoints allow the collection of pprof profiles for debugging purposes. You may be asked by a Teleport engineer to collect these if you are experiencing performance issues.
They will only be enabled if the -d/--debug flag is provided when starting tbot. This is known as debug mode.
Prometheus metrics
The tbot process exposes a number of Prometheus metrics via the /metrics endpoint of the diagnostics service.
In addition to exporting the standard Go runtime metrics, tbot also exports custom metrics that reflect the health and performance of the various configurable services.
Advice
When monitoring the health of tbot, there are three categories of metrics you should consider:
- The health of the tbot process itself. For example, how much CPU time and memory is it using? These can be strong indicators of overall health and provide early warning signs of potential issues (e.g. memory leaks).
- The health of the internal services that tbot relies on. For example, has tbot been able to successfully renew its internal identity? If these internal services have become unhealthy, then it is likely that user-defined services within tbot will also become unhealthy.
- The health of the services you configured within tbot. This will indicate whether tbot has been able to successfully perform its intended functions.
For monitoring the health of the tbot process itself, a large number of metrics are provided by the Go runtime.
For monitoring the health of internal and user-defined services, there are two key metrics:
- tbot_task_iterations_failed: the total number of task iterations that have failed. This will have a service label indicating which service within the tbot process the task belongs to.
- tbot_task_iterations_successful: the total number of task iterations that have succeeded. This will also have a service label. This metric is a histogram, and will also indicate the number of retries that were required before the task succeeded. For a perfectly healthy service, you would expect this number of retries to be zero, or close to zero.
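In practice these metrics are usually scraped and queried by Prometheus itself. To illustrate how the counters relate, the Python sketch below computes a per-service failure ratio from a /metrics snapshot in the Prometheus text format, using tbot_task_iterations_failed together with tbot_task_iterations_total from the Generic metrics table. The function name, the task name "renew" and the sample values are hypothetical, and the parser is deliberately simplified (it assumes label values contain no "}" character).

```python
import re
from collections import defaultdict

# Matches labelled samples such as: metric_name{a="x",b="y"} 12
_SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def failure_ratio_by_service(metrics_text: str) -> dict:
    """Return failed/total task iterations for each value of the service label."""
    failed = defaultdict(float)
    total = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue  # comments, blank lines, and unlabelled samples
        name, raw_labels, value = match.groups()
        service = dict(_LABEL.findall(raw_labels)).get("service")
        if service is None:
            continue
        if name == "tbot_task_iterations_failed":
            failed[service] += float(value)
        elif name == "tbot_task_iterations_total":
            total[service] += float(value)
    # Services with no recorded iterations are skipped to avoid dividing by zero.
    return {svc: failed[svc] / count for svc, count in total.items() if count > 0}


# Hypothetical snapshot; the task name and values are invented for illustration.
sample = """\
# TYPE tbot_task_iterations_total counter
tbot_task_iterations_total{name="renew",service="identity"} 10
tbot_task_iterations_failed{name="renew",service="identity"} 0
tbot_task_iterations_total{name="renew",service="aws-roles-anywhere"} 8
tbot_task_iterations_failed{name="renew",service="aws-roles-anywhere"} 2
"""

print(failure_ratio_by_service(sample))
```

A ratio that stays near zero suggests a healthy service, while a rising ratio for an internal service is a hint that user-defined services may soon be affected.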
Metrics
Generic
These metrics are generated by more than one service within tbot or may be generated by the core supervisor within tbot itself.
| Name | Description |
| --- | --- |
| tbot_task_iterations_total | The total number of task iterations that have been performed. This will have a service and name label to specify which task. |
| tbot_task_iterations_failed | The total number of task iterations that have failed. This will have a service and name label to specify which task. |
| tbot_task_iterations_successful | The total number of task iterations that have succeeded. This will have a service and name label to specify which task. This metric is a histogram, and will also indicate the number of retries that were required before the task succeeded. |
| tbot_task_iterations_duration_seconds | The duration of the time taken to perform an iteration of the task. This will have a service and name label to specify which task. This metric is a histogram. |
ssh-multiplexer
These metrics are generated by the SSH multiplexer service.
| Name | Description |
| --- | --- |
| tbot_ssh_multiplexer_requests_started_total | The total number of SSH multiplexing requests that have been started. |
| tbot_ssh_multiplexer_requests_handled_total | The total number of SSH multiplexing requests that have completed. The status label indicates whether the request completed successfully (OK) or with an error (ERROR). |
| tbot_ssh_multiplexer_requests_in_flight | The number of SSH multiplexing requests that are currently in progress. |