Machine ID Troubleshooting Guide
A bot failed to renew a certificate due to a "generation mismatch"
The bot will log an error like this:
ERROR: renewable cert generation mismatch: stored=3, presented=2
Subsequent connection attempts by the bot may see errors like the following:
ERROR: failed direct dial to auth server: auth API: access denied
"\tauth API: access denied, failed dial to auth server through reverse tunnel: Get \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://example.com:3025/webapi/find\": x509: cannot validate certificate for example.com because it doesn't contain any IP SANs"
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://example.com:3025/webapi/find\": x509: cannot validate certificate for example.com because it doesn't contain any IP SANs"
In particular, note the message "auth API: access denied".
The Teleport Auth Service will also provide some additional context:
[AUTH] WARN lock targeting User:"bot-example" is in force: The bot user "bot-example" has been locked due to a certificate generation mismatch, possibly indicating a stolen certificate. auth/apiserver.go:224
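When scanning Auth Service logs for this condition, it can help to pull out the name of the affected bot user. The following is a minimal sketch that assumes the warning keeps the shape shown above; `locked_bot_user` is a hypothetical helper, not part of Teleport.

```shell
# Hypothetical helper: print the user named in a "lock ... is in force" warning.
# Reads log lines on stdin and prints one user name per matching line.
locked_bot_user() {
  sed -nE 's/.*lock targeting User:"([^"]+)" is in force.*/\1/p'
}

# Example using the warning shown above; prints "bot-example".
printf '%s\n' '[AUTH] WARN lock targeting User:"bot-example" is in force: The bot user "bot-example" has been locked due to a certificate generation mismatch, possibly indicating a stolen certificate. auth/apiserver.go:224' | locked_bot_user
```

Lines that do not contain the warning produce no output, so the function can be fed an entire log file.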
Machine ID (with token-based joining) uses a certificate generation counter to detect potentially stolen renewable certificates. Each time a bot fetches a new renewable certificate, the Auth Service increments the counter, stores it on the backend, and embeds a copy of the counter in the certificate.
If the counter embedded in your bot certificate doesn't match the counter stored in Teleport's Auth Server, the renewal will fail and the bot user will be automatically locked.
Renewable certificates are exclusively stored in the bot's internal data directory, by default /var/lib/teleport/bot. It's possible to trigger a generation mismatch by accident if multiple bots are started using the same internal data directory, or if this internal data is otherwise being shared between multiple bot instances.
Additionally, if a bot fails to save its freshly renewed certificates (for example, due to a filesystem error) and crashes, it will attempt a renewal with old certificates and trigger a lock.
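Both accidental scenarios come down to a bot presenting a stale generation. The following toy model illustrates the counter check for two bot instances that share one data directory. It is a sketch of the behavior described above, not Teleport's implementation, and all variable and function names are hypothetical.

```shell
# Toy model only: names are hypothetical and this is not Teleport's code.
stored=1      # generation counter kept on the Auth Service backend
locked=no     # whether the bot user has been locked

# renew PRESENTED: succeeds and sets $issued when the presented generation
# matches the stored one; otherwise locks the bot user and fails.
renew() {
  issued=""
  [ "$locked" = no ] || return 1
  if [ "$1" -ne "$stored" ]; then
    echo "generation mismatch: stored=$stored, presented=$1"
    locked=yes
    return 1
  fi
  stored=$((stored + 1))
  issued=$stored
}

# Two bot instances start from the same data directory, so both hold generation 1.
bot_a=1
bot_b=1

if renew "$bot_a"; then bot_a=$issued; fi   # bot A renews; stored becomes 2
if renew "$bot_b"; then bot_b=$issued; fi   # bot B presents the stale generation 1
echo "stored=$stored bot_a=$bot_a bot_b=$bot_b locked=$locked"
```

The second renewal fails and locks the bot user even though no certificate was stolen, which is why each bot instance needs its own internal data directory.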
Before unlocking the bot, try to determine whether either of the two scenarios described above applies. If the certificates were stolen, there may be underlying security concerns that need to be addressed.
Otherwise, first ensure only one bot instance is using the internal data directory. Multiple bots can be run on a single system, but separate data directories must be configured for each.
Additionally, ensure the internal data is not being shared with or copied to any other nodes, for example via a shared NFS volume. If you'd like to share certificates between nodes, only copy or share content from a destination directory (such as /opt/machine-id) rather than the internal data directory.
Once you have addressed the underlying cause, follow these steps to reset a locked bot:
- Remove the lock on the bot's user
- Reset the bot's generation counter by deleting and re-creating the bot
To remove the lock, first find and remove the lock targeting the bot user:
tctl get locks
message: The bot user "bot-example" has been locked due to a certificate generation
mismatch, possibly indicating a stolen certificate.
version: v2
tctl rm lock/5cee949f-5203-4f3b-9805-dac35d798a16
Next, reset the generation counter by deleting and recreating the bot:
tctl bots rm example
tctl bots add example --roles=foo,bar
Finally, reconfigure the bot with the new token and restart it. It will detect the new token and automatically reset its internal data directory.
tbot shows a "bad certificate" error at startup
The tbot process outputs logs like the following:
INFO [TBOT] Successfully loaded bot identity, valid: after=2022-07-21T21:49:26Z, before=2022-07-21T22:50:26Z, duration=1h1m0s | kind=tls, renewable=true, disallow-reissue=false, roles=[bot-test], principals=[-teleport-internal-join], generation=2 tbot/tbot.go:281
ERRO [TBOT] Identity has expired. The renewal is likely to fail. (expires: 2022-07-21T22:50:26Z, current time: 2022-07-25T20:18:33Z) tbot/tbot.go:415
WARN [TBOT] Note: onboarding config ignored as identity was loaded from persistent storage tbot/tbot.go:288
ERRO [TBOT] Failed to resolve tunnel address Get "https://auth.example.com:3025/webapi/find": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs reversetunnel/transport.go:90
ERRO [TBOT] Failed to resolve tunnel address Get "https://auth.example.com:3025/webapi/find": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs reversetunnel/transport.go:90
ERROR: failed direct dial to auth server: Get "https://teleport.cluster.local/v2/configuration/name": remote error: tls: bad certificate
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": remote error: tls: bad certificate, failed dial to auth server through reverse tunnel: Get \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://auth.example.com:3025/webapi/find\": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs"
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://auth.example.com:3025/webapi/find\": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs"
In particular, note the log line: "Identity has expired. The renewal is likely to fail."
Token-joined bots are unable to reauthenticate to the Teleport Auth Service once their certificates have expired. Tokens in token-based joining (as opposed to AWS IAM joining) can only be used once, so when the bot's internal certificates expire, it will not be able to connect.
When a bot's identity expires, certain parameters associated with the bot on the Auth Service must be reset and a new joining token must be issued. The simplest way to accomplish this is by removing and recreating the bot, which purges all server-side data and issues a new joining token.
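To confirm from a saved log line that an identity expired before the bot started, the before= timestamp can be compared against the current time. The following is a minimal sketch; `identity_expired` is a hypothetical helper, and it assumes the log line keeps the before=<UTC timestamp> field shown in the tbot output for this error.

```shell
# Hypothetical helper: succeed if the "before=" timestamp in a tbot identity
# log line is earlier than NOW (second argument, default: current UTC time).
# Both timestamps are UTC in the same fixed-width format, so a plain string
# comparison orders them correctly.
identity_expired() {
  local line="$1"
  local now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  local before
  before=$(printf '%s\n' "$line" | sed -nE 's/.*before=([0-9T:Z-]+).*/\1/p')
  [ -n "$before" ] && [ "$before" \< "$now" ]
}

line='INFO [TBOT] Successfully loaded bot identity, valid: after=2022-07-21T21:49:26Z, before=2022-07-21T22:50:26Z, duration=1h1m0s'
if identity_expired "$line" '2022-07-25T20:18:33Z'; then
  echo "identity expired: remove and recreate the bot"
fi
```

Passing the comparison time explicitly, as in the example, makes the check reproducible against archived logs; omit it to compare against the current time.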
Remove and recreate the bot, replacing the name and role list as desired:
sudo tctl bots rm example
sudo tctl bots add example --roles=access
Copy the resulting join token into the existing bot config, either the --token CLI flag or the onboarding.token parameter in the configuration file, and restart the bot. It will detect the new token and rejoin the cluster as normal.
SSH connections fail with ssh: handshake failed: ssh: unable to authenticate
When attempting to connect to a node via SSH, connections fail with an error like the following:
ssh -F /opt/machine-id/ssh_config USER@HOSTNAME
ERROR: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
ERROR: unable to execute tsh
executing `tsh proxy`
exit status 1
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
In particular, note the "ssh: unable to authenticate" message.
This can occur when attempting to log into the node as a user not listed as a principal on the SSH certificate.
You can verify this by viewing the tbot logs and looking for the log message emitted when impersonated certificates for the matching destination were renewed.
In the following example, the only login principal listed for the identity in the destination directory is alice (via the access role):
INFO [TBOT] Successfully renewed impersonated certificates for directory /opt/machine-id, valid: after=2022-07-21T21:49:26Z, before=2022-07-21T22:50:26Z, duration=1h1m0s | kind=tls, renewable=false, disallow-reissue=true, roles=[access], principals=[alice -teleport-internal-join], generation=0 tbot/renew.go:630
However, the SSH command attempted to log in as a different user.
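The comparison between the requested login and the certificate's principals can be sketched as a small shell function. `principal_allowed` is a hypothetical helper, the sample line is a shortened copy of the renewal message above, and the login root is used purely for illustration.

```shell
# Hypothetical helper: succeed if LOGIN appears in the principals=[...] list
# of a tbot renewal log line.
principal_allowed() {
  local line="$1" login="$2" principals p
  principals=$(printf '%s\n' "$line" | sed -nE 's/.*principals=\[([^]]*)\].*/\1/p')
  for p in $principals; do
    if [ "$p" = "$login" ]; then
      return 0
    fi
  done
  return 1
}

line='kind=tls, renewable=false, disallow-reissue=true, roles=[access], principals=[alice -teleport-internal-join], generation=0'

for login in alice root; do
  if principal_allowed "$line" "$login"; then
    echo "$login: listed as a principal"
  else
    echo "$login: not a principal on this certificate"
  fi
done
```

A login that is reported as missing here is one the node will reject with the authentication error shown above.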
Ensure the bot identity is allowed to log in as the requested user by taking any of the following actions:
- Changing the SSH command to log in as an allowed user
- Modifying the access role to allow the requested login
- Adding a role that grants the requested login
Note that if roles are added or modified, the certificates must be renewed for the changes to take effect. The bot renews certificates on its own after the renewal interval (by default, 20 minutes), but you can trigger a renewal immediately by either restarting the tbot process or sending it a reload signal:
# If using systemd, you can restart the process:
systemctl restart machine-id
# Alternatively, you can send `tbot` a reload signal directly:
pkill -sigusr1 tbot
Database requests fail with database "example" not found, but the database exists
When requesting Database Access certificates, the certificate request fails with an error like the following:
ERROR: Failed to generate impersonated certs for directory /opt/machine-id: database "example" not found database "example" not found
However, the database exists and is visible to regular users via tsh db ls:
tsh db ls
Name       Description Allowed Users Labels  Connect
---------- ----------- ------------- ------- -------
example                [alice]       env=dev
Unlike regular Teleport users, Machine ID bot users are granted only minimal Teleport RBAC permissions and are not allowed to view or list databases by default unless granted permission via one or more roles.
Per the Machine ID Database Access Guide, ensure at least one role providing database permissions has been granted to the destination listed in the error.
For example, note the rules section in the following example role:
kind: role
version: v5
metadata:
  name: machine-id-db
spec:
  allow:
    db_labels:
      '*': '*'
    db_names: [example]
    db_users: [alice]
    rules:
      - resources: [db_server, db]
        verbs: [read, list]
Ensure the bot has a role that grants it at least these RBAC rules. If desired, you can examine the bot's roles with tctl to confirm that the necessary rules are present:
tctl get role/machine-id-db
If the role is missing database permissions, it can be modified:
# save the role to a local file
tctl get role/machine-id-db > db-role.yaml
# edit the role as necessary
nano db-role.yaml
# replace the existing role with the modified copy
tctl create -f db-role.yaml
By default, destinations (like /opt/machine-id) are granted all roles provided to the bot via tctl bots add --roles=..., but it's possible to grant only a subset of these roles using the roles: ... parameter in the bot's configuration.
If permissions are unexpectedly missing, ensure tbot.yaml requests your database role, either by relying on the default behavior or by adding the role to the roles: ... list.
Once fixed, restart or reload the tbot clients for the updated role to take effect.
If the bot was not granted the role initially, the simplest solution is to delete and recreate the bot, being sure to include the role in the role list:
tctl bots rm example
tctl bots add example --roles=foo,bar,machine-id-db
Kubernetes connections are failing with Unable to connect to the server: x509: certificate signed by unknown authority
When Machine ID connects to Kubernetes clusters through a self-hosted Teleport cluster, requests can fail with errors like the following. This can happen on Teleport clusters that are not configured for TLS routing.
E0322 22:53:31.653051 1699 memcache.go:265] couldn't get current server API group list: Get "https://teleport.example.com:443/api?timeout=32s": x509: certificate signed by unknown authority
To confirm the TLS routing mode, check the TLS routing setting in the output of this command, substituting your proxy address:
curl https://teleport.example.com:443/webapi/ping | jq
If the value is false, then this is a non-TLS routing configuration.
Proxies configured with non-TLS routing use specific ports for various types of traffic, so a Kubernetes connection must use its designated port. Currently, Machine ID requires that the Kubernetes public address is set to use the correct port. Otherwise, it will use the Proxy web port, which can cause these types of errors.
Administrators set the Kubernetes public address with kube_public_addr in the proxy_service configuration. The proxy requires a restart after the configuration is updated.
proxy_service:
  enabled: true
  kube_listen_addr: 0.0.0.0:3026
  kube_public_addr: teleport.example.com:3026
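As a quick sanity check, the port of the Kubernetes public address can be compared with the port of the proxy web address. The following sketch uses the example addresses from this section; `port_of` and `kube_addr_has_own_port` are hypothetical helpers, and the check only confirms that the two ports differ, not that the Kubernetes listener is reachable.

```shell
# Hypothetical helper: print the port portion of a host:port address.
port_of() {
  printf '%s\n' "${1##*:}"
}

# Hypothetical helper: succeed if the Kubernetes public address (first
# argument) uses a different port than the proxy web address (second argument).
kube_addr_has_own_port() {
  [ "$(port_of "$1")" != "$(port_of "$2")" ]
}

if kube_addr_has_own_port teleport.example.com:3026 teleport.example.com:443; then
  echo "Kubernetes public address uses its own port"
else
  echo "Kubernetes public address points at the proxy web port"
fi
```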
Retrieve the configuration listing from the proxy web address to confirm that the Kubernetes public address is populated in the output:
curl https://teleport.example.com:443/webapi/ping | jq