How We Built Machine ID

Apr 8, 2022 by Sakshyam Shah

Machine ID Architecture

The DevOps workflow is all about automation driven by machine-to-machine access. To keep an automated DevOps pipeline running, engineers configure service accounts with credentials such as passwords, API tokens and certificates. The problem is that engineers often give service accounts long-lived credentials to simplify automation and reduce manual intervention. This is risky because a compromised long-lived credential gives an adversary access for as long as the credential remains valid, which may be indefinitely.

Teleport Access Plane already provides an automated way to provision short-lived certificates for all infrastructure access requirements, including SSH, Windows, Kubernetes clusters, databases and web applications. However, this automated certificate management was only available to users and servers enrolled directly with Teleport. With Teleport 9 and Machine ID, we bring a fully automated way to provision machine-to-machine access with short-lived certificates, even for user accounts and server resources that are not directly managed by Teleport. Think of EFF's certbot for infrastructure access management at scale.

This blog post will explain how we've implemented Machine ID.

How does Teleport Machine ID work?

The core functionality of Teleport Machine ID is a certificate renewal bot that communicates with the Teleport cluster and securely renews access certificates in an automated fashion. The tool that provides this feature is implemented as a lightweight agent, tbot. Below are some important points on the tbot implementation.

tbot authentication

To handle certificate issuance and renewals, tbot must first be authenticated with the Teleport cluster. This can be achieved with either of the following two methods:

  1. Using AWS IAM join: A secure joining method based on AWS IAM identity verification, which does not require provisioning security tokens to join Teleport clusters. This method works similarly to our existing AWS Node joining method. A tbot instance that joined the Teleport cluster using this method continuously re-authenticates to receive new short-lived certificates.

  2. Using one-time join tokens: During the initial setup, administrators need to manually authenticate tbot with the Teleport cluster using a one-time join token. Upon successful authentication, tbot receives a short-lived renewable certificate. Based on this renewable certificate, tbot generates secondary end-user certificates (access certificates) and places them in a directory accessible by programs such as SSH clients, API clients and database clients.

Renewal cycle

tbot supports and initiates certificate renewal in four scenarios:

  1. When some fraction of the certificate's TTL has elapsed and the certificate is nearing expiry.
  2. When a user and/or host CA rotation is taking place.
  3. When a renewal is requested manually.
  4. When tbot starts up.
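The four triggers above can be sketched as a single decision function. This is only an illustration, not tbot's actual code: the function name, the parameter names and the 0.5 default for the TTL fraction are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def should_renew(issued_at, expires_at, now, *, renew_fraction=0.5,
                 ca_rotation=False, manual_request=False, just_started=False):
    """Return True if any of the four renewal triggers applies.

    All names and the default fraction are hypothetical, chosen for illustration.
    """
    # Triggers 2-4: CA rotation, a manual request, or bot startup.
    if ca_rotation or manual_request or just_started:
        return True
    # Trigger 1: a chosen fraction of the certificate's TTL has elapsed.
    ttl = expires_at - issued_at
    return (now - issued_at) >= ttl * renew_fraction


if __name__ == "__main__":
    issued = datetime(2022, 4, 8, 12, 0, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=1)
    # 20 minutes into a 1-hour TTL: below the 0.5 fraction.
    print(should_renew(issued, expires, issued + timedelta(minutes=20)))
    # 40 minutes into a 1-hour TTL: past the 0.5 fraction.
    print(should_renew(issued, expires, issued + timedelta(minutes=40)))
```

Checking the event-driven triggers first keeps the time-based check as the fallback, so a CA rotation or restart forces a renewal regardless of how fresh the current certificate is.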

Security considerations for Machine ID

Because tbot is designed to renew certificates automatically, a certificate stolen by a rogue service account or taken from a compromised machine could be used to renew certificates indefinitely, bypassing the security benefits of a short TTL. To counter this, we have built the following primary security measures into tbot:

Locking compromised tbot

Whenever an administrator suspects that tbot has been compromised, Teleport allows locking that tbot (i.e. session locking), which prevents it from making any further certificate renewal requests and contains the threat quickly.

Separating renewable and access certificates

To prevent a situation where a compromised application could steal a renewable certificate and use it to renew certificates indefinitely, tbot manages two distinct sets of certificates: access certificates and renewable certificates. Access certificates can only be used for authenticating with servers (e.g., OpenSSH server, CI server) and are re-issued rather than renewed. Renewable certificates are only used by tbot to authenticate with the cluster. By restricting the application to the non-renewable access certificates only (which can be implemented with file access control in Linux), we keep the renewable certificate out of the application's reach while still facilitating access through a separate access certificate.
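As a rough illustration of the file access control mentioned above, the following sketch creates two directories with different POSIX permissions. The directory names, the function name and the specific modes are assumptions for the example and do not reflect tbot's actual on-disk layout.

```python
import os
import stat
import tempfile


def prepare_cert_dirs(base):
    """Create separate directories for renewable and access certificates.

    Directory names and modes are hypothetical, chosen for illustration.
    """
    renewable = os.path.join(base, "renewable")
    access = os.path.join(base, "access")
    os.makedirs(renewable, exist_ok=True)
    os.makedirs(access, exist_ok=True)
    # chmod explicitly so the result does not depend on the process umask.
    os.chmod(renewable, 0o700)  # only the bot's own user may enter
    os.chmod(access, 0o750)     # the application's group may read, not write
    return renewable, access


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for path in prepare_cert_dirs(base):
            print(path, oct(stat.S_IMODE(os.stat(path).st_mode)))
```

With this layout, an application running under a different user that belongs to the directory's group can read the access certificates but cannot even list the directory holding the renewable certificate.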

Certificate generation counters to detect compromised certificate renewals

It is challenging to detect compromised certificates. Besides separating renewable and access certificates, we have also implemented certificate generation counters that help us detect and lock out compromised tbot certificates. The certificate generation counter increments each time a certificate is renewed. The counter value is attached to the certificate as a certificate extension and also stored in Teleport’s backend. If a renewable certificate is stolen and renewed, the stored counter moves ahead, so the counter values will not match when the next renewal takes place with the older certificate, and Teleport will lock out the bot. For this reason, the shorter the renewal period, the sooner a mismatch is detected and the more secure the setup.
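The counter check can be modelled in a few lines. This is a simplified in-memory sketch under stated assumptions, not Teleport's implementation: the class, method and exception names are hypothetical, and a real certificate would carry the counter in a signed extension rather than as a plain integer.

```python
class GenerationMismatch(Exception):
    """Raised when a presented counter differs from the stored one."""


class AuthServerModel:
    """Toy stand-in for the backend that stores generation counters."""

    def __init__(self):
        self.stored = {}     # bot name -> generation of the latest certificate
        self.locked = set()  # bots that have been locked out

    def issue_initial(self, bot):
        self.stored[bot] = 1
        return 1

    def renew(self, bot, presented_generation):
        if bot in self.locked:
            raise PermissionError(f"{bot} is locked")
        if presented_generation != self.stored[bot]:
            # Someone else already renewed with a copy of this certificate.
            self.locked.add(bot)
            raise GenerationMismatch(bot)
        self.stored[bot] += 1
        return self.stored[bot]


if __name__ == "__main__":
    server = AuthServerModel()
    legit = server.issue_initial("ci-bot")  # generation 1
    stolen = legit                          # attacker copies the certificate
    stolen = server.renew("ci-bot", stolen) # attacker renews first
    try:
        server.renew("ci-bot", legit)       # legitimate bot is now behind
    except GenerationMismatch:
        print("mismatch detected, bot locked:", "ci-bot" in server.locked)
```

Note that in this model the mismatch only surfaces at the next renewal attempt by whichever party holds the older certificate, which is why a short renewal period shrinks the window in which a stolen certificate goes unnoticed.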

Audit events

Given the sensitivity of the renewal process, we have added tbot audit logging, which captures the events in the following cases:

  • When a new tbot user is created
  • When a new renewable certificate is issued for the first time
  • When a certificate is renewed
  • When tbot is locked
  • When tbot is removed (tctl bots rm)
  • When a certificate generation counter conflict is detected (certificate possibly compromised)

Since tbot is assigned a user and a role (applying RBAC to bots), most of the audit events are emitted under that user and role. For example:

  • tctl bots add emits role.created and user.create
  • tctl bots rm emits user.delete and role.deleted
  • There are no specific audit events for the initial certificate or for renewals; cert.create and user.update are emitted whenever a new certificate is generated, which happens in both cases
  • Locks emit lock.created, just as user, role and other locks do for human users

Where does Machine ID fit in modern cloud infrastructure?

The primary use case for Teleport Machine ID is to facilitate scoped and short-lived certificate-based access for machine-to-machine communications. This means Teleport Machine ID can be used to protect service account access to CI/CD pipelines, API server access and automated database access, and to secure automated access by remote configuration management tools such as Ansible, Chef and Puppet.

By assigning short-lived certificates and supporting an automated certificate renewal process, Machine ID helps apply the just-in-time access and just-enough-privilege principles to machine-to-machine access. For example, suppose a service account needs to access the CI/CD server between 9 AM and 6 PM. In that case, administrators can create a short-lived certificate that renews every 6 hours, then automate role assignment during working hours and revocation during off-hours.

Additionally, all the security features of Teleport privileged access management, including RBAC and audits, can now be enforced on machine-to-machine access.

What's next for Machine ID?

With Teleport Machine ID, we are a step closer to consolidating privileged access management requirements for both user-to-machine and machine-to-machine access. Teleport 9 is available from our downloads page. Read the documentation and watch an introductory video on Machine ID to get started. For community support, join the Slack channel where Teleport users and developers hang out.

Teleport is an open-source project, and everything we design and develop is discussed in the open. If these kinds of problems and solutions sound interesting to you, consider joining us at Teleport.
