Anatomy of a Cloud Infrastructure Attack via a Pull Request

CI/CD security disclosure

In April 2021, I discovered an attack vector that could allow a malicious Pull Request to a Github repository to gain access to our production environment. Open source companies like us, or anyone else who accepts external contributions, are especially vulnerable to this.

For the eager, the attack works by pivoting from a Kubernetes worker pod to the node itself, and from there exfiltrating credentials from the CI/CD system. Critically, this vulnerability exposed production AWS credentials that could be used to alter release artifacts and access production infrastructure.

Once discovered, we immediately fixed our CI system to prevent this attack from occurring. Our response team’s analysis found no evidence this vulnerability was exploited or any data was tampered with. Even so, we encourage customers to upgrade to the recently released Teleport 4.4.11, 5.2.4, 6.2.12, or 7.1.1.

In this post, I will walk through the vulnerability and focus on some key areas where we are improving our CI/CD tools and practices.

Context

Teleport is an open-source company; we maintain public GitHub repositories where the general public can interact with us and contribute code. When a developer — whether internal or external — wants to contribute, they open a Pull Request (PR) and our Continuous Integration (CI) system validates, builds, tests, and runs the contributed code.

By definition, running CI on Pull Requests is remote code execution: A contributor submits code and our CI system runs those contributions on Teleport infrastructure. Automatically executing CI against unapproved PRs from external contributors was our first mistake. From that initial remote code execution, we uncovered several isolation errors allowing a PR to escape its isolation mechanism and pivot to the Drone server and its secrets.

Technical details

To understand the attack, it is first helpful to understand Teleport’s former CI deployment. We deployed a Drone CI instance and its jobs as pods inside a Kubernetes cluster. Drone Service pods, CI job pods, as well as Release automation pods were configured to run with common scheduling parameters, resulting in a mix of workloads across homogenous nodes.

Diagram of Drone deployed inside Kubernetes

Within the above Kubernetes pods, our CI jobs run code in unprivileged containers. However, much of our automation is dependent on running Docker commands for improved determinism and cross-platform development. For jobs that require Docker, we utilize a privileged docker-in-docker (DIND) sidecar container. Code running within the unprivileged build container can talk to this DIND sidecar, allowing CI to use the same automation that developers run on their workstations.

Diagram of a build container running workflows inside DIND

With this setup, an ill-intentioned Pull Request can request that DIND run a privileged container crafted to escape Docker’s sandbox. Using any number of privileged container escape techniques, a Pull Request job could use this privileged container inside DIND to escalate to root access on the node itself.

Diagram of a privileged container escaping DIND

After the attack escapes to the root on the node, secrets from other pods on the same host can be harvested by inspecting container environments. While not all pods are “lucky” enough to land on the same node as a Drone Server pod or a Release pod, due to the shared scheduling parameters, it is a waiting game until a pod with valuable secrets is scheduled on a compromised node.

Exfiltrating secrets from the Drone Server pod is a juicy target to begin with. However, because our Drone instance served internal and external CI/CD workflows across the company, the secrets it held could pivot to a wide array of critical infrastructure: release artifacts for Teleport and Gravity, Teleport Cloud backend services, legacy billing infrastructure, private GitHub repositories and the web servers backing this blog post.

Response

We used this internally discovered vulnerability as an opportunity to practice our security incident response in a controlled manner. This tests our processes and keeps employees’ reflexes ready for a real incident. During this exercise, our incident response team completed the following items:

Mitigation

Analysis

Remediation

In total, the vulnerability in our Drone configuration existed between May 4th, 2020 and April 22nd, 2021, with potentially exposed secrets rotated by July 27, 2021.

Prevention

In addition to the mitigation steps taken above, we continue to work on the following longer-term preventative items:

If there are other critical items you’d like to see included in our response, please contact [email protected].

Advice to others

Based on our experience, here are three pieces of security advice to consider when deploying CI systems:

Vet external contributions before running CI

Require a trusted party to read and approve untrusted code before running it. This may seem obvious; however, it is not default behavior for many tools. GitHub Actions, for example, lacked approval features in their initial rollout; Jenkins and Drone require plugins to add this functionality.

When implementing approval, be careful to require re-verification each time the code changes, and avoid races related to fetching a git branch or tag that could change between approval and execution.

Reduce blast radius

Assume part of the system will be compromised at some point, and strive to keep the initial compromise from compromising the entire system. For instance, the service and jobs cohabitate in the default Jenkins setup. This creates a path from hacking a CI job to compromising the Jenkins service itself. Close this hole by running jobs via agents on a separate machine. If you use an orchestration substrate like Kubernetes, be sure to separate at both the container and machine level. Contemporary containers aren’t a strong sandbox.

Furthermore, as the recent Travis CI vulnerability shows, CI vendor security isolation is often insufficient (for example: separating release secrets vs PR secrets) . Examine all the secrets and access that a CI system has as well as potential pivots. Consider the worst case for each workflow, and design to minimize impact.

Review CI and release architecture for security

Often, CI is seen as a secondary function — especially at young companies — but the systems and habits established early may live on for many years. Treat CI and release security the same as you would your product and customer’s security, they may be one and the same.

At Teleport, this means staffing full-time engineers, completing Requests for Discussion for significant changes, and hiring external security experts to audit our CI, just as we do for our product.

Teleport cybersecurity blog posts and tech news

Every other week we'll send a newsletter with the latest cybersecurity news and Teleport updates.

Wrap-up

We hope this write-up has been helpful — understanding how an open-source security company can make mistakes, proactively find issues, and improve security posture. We learned much in the past months, and have substantial work ahead of us.

Although this vulnerability was not exploited in the wild and had no impact on customers, we choose to disclose it because transparency and openness are core to our engineering team. Security is built on trust, and trust is built upon open communication.

If you have questions or concerns about how this vulnerability impacts your use of Teleport or if you know of other vulnerabilities at Teleport, please contact [email protected].

Thanks to Kevin N., Russell J., Gus L, and Adam G. for their contributions to this post.

Related Posts

security engineering
 

Try Teleport today

In the cloud, self-hosted, or open source

View Developer Docs

This site uses cookies to improve service. By using this site, you agree to our use of cookies. More info.