Hacking Kubernetes - Overview
Key topics on Hacking Kubernetes
- Evaluating a Kubernetes cluster can occur on several levels. Standard isolation questions include examining how traffic gets into a cluster, how people can access the nodes, and whether the API server is public or private.
- The three common sources of compromise for Kubernetes clusters are supply chain risks, threat actors, and insider threats.
- Most hosted Kubernetes systems, especially cloud provider systems, come with a hardened node image.
- When companies get into feature delivery tunnel vision, security takes a back seat, and at some point, they might be left running an outdated node version.
- Without smooth continuous delivery pipelines, the responsibility of managing your own infrastructure can be too much for an organization.
- One preferred way of updating a Kubernetes cluster is to do a blue-green deployment, whereby there are two clusters behind the load balancer.
- Misconfiguration is a main cause of security incidents, and preventing misconfigurations is about testing.
- A Kubernetes namespace is not a security boundary in itself because there are things that are not namespaced, so there is no way to accurately correlate security criteria to the namespace.
Expanding your knowledge on Securing Kubernetes
- Kubernetes API Access Security Hardening
- Teleport Kubernetes Access
- Teleport Application Access
- Teleport Quick Start
- Teleport Access Platform
Ben: Welcome to Access Control, a podcast providing practical security advice for startups: advice from people who have been there. Each episode, we interview a leader in their field and share best practices and practical tips for securing your org. For today’s episode, I’ll be talking to Andrew Martin, CEO of Control Plane. Control Plane is a London-based Kubernetes consultancy helping architect, install, audit, and secure Kubernetes clusters using cloud-native technologies. Andrew was previously a DevOps lead at the UK Home Office and has helped lead teams implementing high-volume critical national infrastructure projects for the UK Government. We’ll deep-dive into securing Kubernetes and strategies for partnering with the public sector.
Andrew is co-author of O’Reilly’s Hacking Kubernetes, a great book in progress and due out in November 2021. This book will help you better understand the Kubernetes defaults, threat models for Kubernetes clusters, and how you can protect against those attacks.
Ben: Hi, Andrew. Thanks for joining us today.
Andrew: Hello. Thank you very much for having me.
Evaluating a Kubernetes cluster
Ben: Control Plane is a consultancy, so you must see a lot of Kubernetes clusters. What’s the first thing you look at when you evaluate a customer’s cluster?
Andrew: That is a good question. There are a few different levels. Sometimes we provide a kind of holistic audit service, and we will ideally start with the infrastructure-as-code deployment. I mean, hopefully somebody’s used Terraform or, if not, some kind of cloud-specific tooling. Looking at how traffic gets into a cluster, how people can access the nodes, and whether the API server is public or private: those are the standard isolation questions. And then from the cluster perspective itself, when trying to assess the risk level, there are a few different angles. Mark Manning said something really great in a talk, which is that the threat model really depends on where you are in the abstraction. So if we’re in the pod looking to try and get out, well, then we care a lot about the security contexts.
Andrew: Like the old, hackneyed, oft-repeated advice: don’t run as root, drop capabilities, because those things are the precursor to many attacks and many container breakouts. Then we step back a level and look at the network. And then it’s the standard policies. We don’t want the flat Kubernetes network and the pod network to allow routing to the underlying nodes or to the API server if there’s no operator-like requirement or if workload identity is not being used. And network policy is one thing; admission control policy is really the other. Once we know what should be coming into the cluster, we want to maintain that high bar and be sure that people are not circumventing controls or indeed finding creative ways to deploy things, which is perhaps the same thing. So yeah, definitely the way the cluster is built, the way traffic gets to the cluster, and then the way that administrators interact with the cluster. And from there, really, the next most standard thing is: how is RBAC configured? Because once those things are in place and we agree that we have a secure baseline, if we’re then shipping service accounts that have the ability to create pods in any namespace, we’re probably looking at being able to circumvent some of those bits of policy. And yes, those are the first few layers of the Kubernetes onion.
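The pod-level and network-level controls Andrew describes could be sketched roughly as follows. This is a minimal illustration, not from the conversation itself: all names, namespaces, and images here are placeholders.

```yaml
# Illustrative pod spec: non-root, no privilege escalation, all capabilities dropped.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example        # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
---
# Default-deny ingress for a namespace: pods accept no traffic until a
# more specific NetworkPolicy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: example-ns         # hypothetical namespace
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
```

The empty `podSelector` matches every pod in the namespace, which is what makes this a baseline that later, more permissive policies can carve exceptions out of.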
Ben: And then how long does it take to fix or secure up a Kubernetes cluster when you work with your clients?
Andrew: That’s a very good question. Well, if you watch Rawkode do it on YouTube, it takes a matter of minutes or hours. From our perspective, when we provide a more holistic audit service, we will always go through and build a threat model. We will ingest as much information as we can about the way the systems are built and then use that to go back and run a workshop with the team. That means that we’ve got a shared understanding. We can take some of the existing threats that we know, use sources like Microsoft’s Kubernetes attack matrix, MITRE ATT&CK for Containers, and some of our own threat models and attack trees that we’ve generated previously, superimpose those onto the cluster architectural diagram, and build attack trees. That gives us something that can potentially take a lot longer to fix, because some of those findings, especially things like open-source ingestion and the supply chain angle, are generally a lot more difficult to fix.
Andrew: If we’re talking about just locking down bits of policy in the cluster, it can be fixed more quickly. It really depends upon the — and I’m going to hesitate to say DevSecOps, and instead say — defensive DevOps, which is the new small flag that I’m attempting to fly. It depends upon the DevOps patterns and practices. Because if an organization has acceptance tests for their container builds, if they have configuration testing for every piece of YAML that they throw towards a cluster, including Dockerfiles and all that good stuff, then if you want to make a substantial configuration change, it can be automatically verified. And really, the difference between an organization that can remediate an audit or pen-test report in days or weeks and one that might take a couple of months to fix is that level of automation and testing in the middle. Because we present everything in this uniform, threat-modelled manner, that normally gives enough confidence for a security team to say: “Well, we understand exactly what this attack is and why it makes sense to fix it. That control is fine. That makes sense.” And it removes some of the abstract, he-said-she-said, ships-in-the-night elements that security is often plagued with.
Protection from sources of compromise for Kubernetes clusters
Ben: And so I know in your book you talk about the three common sources of compromises in the Kubernetes supply chain or for Kubernetes clusters, and that starts with supply chain risks, threat actors, and insider threats. And I think you’ve touched a little bit on supply chain. But can you talk about how a company can protect against these three different types of risks and also what they are?
Andrew: Yeah. Absolutely. The supply chain is the producer-consumer problem. Everything that exists must start somewhere. At some point, it’s a raw material. And whether that’s a kind of farm-to-table analogy, where we are picking the fruits of our labors in the field and then shipping them to a fruit bowl on someone’s table and hoping that nobody interferes with them in a negative way between those two points, or we are downloading source code from GitHub building it into a container and running it in production, we have exactly the same class of problem. And that problem is nobody checks what’s going on at each link of that chain. Now, obviously, farm-to-table, we kind of trust that the farmer’s field is in some way safe or secure. But really, that’s more of an assumption than —
Ben: Well, I guess it’s the organic, what chemicals classed as organic and not —
Andrew: Well, yeah, very much so.
Ben: You may have different trust in the product.
Andrew: Yeah. Precisely. Who then certifies that between the field and the warehouse, for example, or the flash freezing [inaudible] harvester or whatever it be? Yes. The flash freezing device. So the idea of supply chain security is to say, for each link in the chain, I will use some stamp of authenticity. And I guess historically that would have been a signet ring and some wax. These days, obviously, it’s all cryptographic. And with those cryptographic signatures, while it’s quite easy to sign things, re-validating them and making sure that the signature relates to somebody that we trust, that we trusted at that point in time, and that it is still fresh enough to be a valid signature is a far more difficult problem than it sounds.
Andrew: We have the situation where, again, the financial incentives are the same as for Node.js, the same as for Kubernetes. Open source offers this huge incentive for us, as an enterprise organization, not to build all our own code, not to buy licenses for libraries, but instead to find out which community-sponsored effort has risen to the top by virtue of social proof, other people using it, the quality of their GitHub and their issue triage, and how they fix bugs and deal with security vulnerabilities, and then we choose the best one. And then we say: “Okay, let’s bring that into our organization.” In doing that, we are suddenly implicitly trusting everybody who has access to raise pull requests and have those pull requests merged into that source code. So suddenly pull requests become a security boundary, which is not really something we were configured for. Obviously, Git, by design, is meant for patches to come via email. The Linux kernel doesn’t accept pull requests. They’re a GitHub concept. They don’t exist in Git itself.
Andrew: So suddenly we have this new security boundary that’s not really very well considered or monitored. And we only have to look at things like the event-stream vulnerability, an npm issue from three or four years ago, where there was a maintainer who had built event-stream, and they were maintaining hundreds of different npm repos. And they accepted help from somebody who openly and generously offered their help in open source. That person then, after a period of months, went rogue and shipped a payload that tried to steal crypto wallets. Now, okay, fine. That’s not so bad. If that payload had attempted to steal GPG keys or SSH keys and perpetuate itself via different registries as a worm, well, that could have infected a huge amount of build infrastructure. And because a lot of build servers will build things on timers or on demand and don’t pin their dependencies, this is a really effective way to get malicious code into an organization.
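The dependency pinning Andrew mentions is simple in practice. As a sketch (the package name and version are placeholders), an npm manifest that pins an exact version rather than a range looks like this:

```json
{
  "name": "example-service",
  "version": "1.0.0",
  "dependencies": {
    "left-pad": "1.3.0"
  }
}
```

The exact `"1.3.0"` (rather than a range like `"^1.3.0"`), combined with a committed lockfile and `npm ci` on the build server, means a timer-triggered build will not silently pull in a newly published, potentially malicious release.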
Andrew: Obviously this year, we’ve seen the SolarWinds attacks. Suffice to say, supply chain security is here to stay and a necessity to defend against this new breed of attacks. I’m very pleased to say that one of the communities that I thoroughly enjoy being a part of, the CNCF’s Security TAG (TAG Security), have written a supply chain security white paper this year, and it goes into a lot of detail on how to effectively defend your systems, cloud-native systems specifically: source analysis, signing, verifying identity, isolating builds, keeping builds hermetic, avoiding shared resources, and revalidating all the pipeline steps once we actually hit production. It’s probably safe to say that unless an organization ticks off the majority of those boxes, there are still potential supply chain areas of intrusion. And therefore, it is a difficult problem. It’s not intractable, but it’s troublesome.
Ben: Oh, yeah. We could do a whole podcast on supply chain. And then what about threat actors?
Andrew: So threat actors take a number of different forms. It’s useful to categorize the potential threat actors to a system, because if I’m running a WordPress site for my mom-and-pop shop, then the kind of people who might attack me are probably very different from if I’m running a cryptocurrency exchange, or if I’m actually in the traditional financial system. So we grade the types of people who may attack our systems. We start with the script kiddie, the graffiti kid of the modern internet, and then we move through disgruntled individuals: people who have maybe felt hard done by for some reason, or who have a vendetta or some particular view on the line of business undertaken by the organization. Those kinds of people are likely to try and pop a WordPress site, because anyone can download Kali Linux and run proof-of-concept attacks against published CVEs or misconfigurations. That means that to defend our WordPress site from those kinds of threat actors, we probably use a hosted WordPress that patches for us. That’s probably a reasonable kind of defense.
Andrew: But above that level of threat actor — and actually, I’m very pleased to say we defined an archetypal eight-bit nemesis, Captain Hashjack, for Hacking Kubernetes — we’re dealing with organized crime, who may or may not have either state-sanctioned behavior or state intelligence service links. They have a far greater capability; they have financial resources, they actually have a higher level of skill, and they are often closely affiliated with educational institutions. Dealing with that level of attack, we’re talking about things like the Colonial Pipeline hack from earlier this year. That was a state-sponsored group. They were able to use not particularly advanced techniques, in all honesty, but with more efficiency and better management. Once the crypto ransom demand goes in, they then provide you with a professional-looking chat portal, so you can negotiate on the —
Ben: With customer support to get your ransom paid?
Andrew: Yeah. Precisely.
Ben: And they’re normally very polite in getting their crypto ransom.
Andrew: Precisely. I mean, you can’t exactly throw a wad of cryptocurrency at someone’s feet. Yes. They certainly make you feel that you’ve been politely swindled.
Ben: And then I’m guessing there’s probably one level above that which is probably nation-state?
Andrew: Yes. You have it. And there we’re talking about a very different kind of security, the kind that most of us probably don’t consider it worth being involved with, because those organizations have a different type of power. So when Edward Snowden fled, for example, from Hawaii and, I believe, started in Hong Kong, he would log onto his laptop from underneath his bed sheet in his hotel room. Because he knew that the way he would be observed was not by tapping his machine; it would be a camera just watching him type in his password. It’s at that point that the whole security question gets turned on its head, because physical access is no longer guaranteed to be impossible or actually secure. So modelling foreign intelligence service behavior is a very different kettle of fish. And again, once we try to defend against these kinds of things, the cost of implementing some of the controls can be so high that an organization that is not genuinely affected or concerned, based on its own risk profile, has no reason to try to implement some of those more advanced controls.
Ben: I guess the last type of threat is insider threats?
Andrew: Yes. I heard on the Risky Business podcast — shout-out — that 1 in 40,000 employees is potentially a hostile insider. Now, I think that number could probably flex up and down, I don’t know, maybe by 50% or more either way. But suffice to say, in a large enough organization, there will be one, two, or more hostile insiders. And where they come from really depends upon the nature of the organization. We can have the traditional sleeper-agent view, where somebody has moved to a country, ingratiated themselves, and naturalized, or we can have a more academic espionage perspective, where somebody comes from a different educational institute and then joins, say, a meteorological institution or somewhere conducting advanced research, and is then able to exfiltrate data by virtue of the trust that they have from —
Ben: And that will be sort of intellectual property theft — would be the classic case?
Andrew: Precisely. Yeah. Each of these things is difficult to manage. Kubernetes, as the runtime, is obviously subject to all of the same issues and concerns as any computer system would be. But with a well-known declarative interface and a well-segregated and understood runtime, it gives us a different opportunity to model these threats and try to provide a more generalized solution.
Ben: Can you talk more about how it’s specific to Kubernetes and what Kubernetes provides?
Andrew: Yeah. Absolutely. So one of the great innovations in Kubernetes land was admission control being added to access control. We authenticate, we authorize, and then the payload of the request is passed to a third system. And that system can have its own webhooks. It can run all of these validation steps in parallel. We can also mutate the request at that point, to enforce secure defaults if we require, or perhaps additionally annotate or add metadata. And this gives us something that’s a bit like a web application firewall for orchestrator-level interaction. The payload of the request, which at this point is a YAML, or actually a JSON, blob in the API server, can then be inspected. So it’s like having a RESTful service, except that before the application receives the request, we have the opportunity to interrogate the headers, to look at the path, and to actually see the payload that is going to the service. And we can then enforce different types of policy.
Andrew: This in itself is an innovation that should be added to all access control systems, in my opinion. So we have this well-known interaction style and format, and the structure of the data in the JSON, and then we can apply any arbitrary policy to it. We’ve gotten as far in the past few years as having tools like OPA, the Open Policy Agent, which gives us a generic, expressible policy language, so we can define more or less whatever we like. Or we can just pull an off-the-shelf policy that would say, for example, “No privileged pods. You must run this pod as non-root.” Or, in fact, something a little bit more esoteric, like, “It’s past 7:00 PM. Why are you still at work? You shall not pass.”
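An off-the-shelf “no privileged pods” rule of the kind Andrew describes might look like this in OPA’s Rego. This is a minimal sketch for a validating admission webhook; the package name and message text are illustrative, not from the conversation:

```rego
package kubernetes.admission

# Reject any Pod that requests a privileged container. The admission
# webhook passes the AdmissionReview as `input`.
deny[msg] {
    input.request.kind.kind == "Pod"
    some i
    input.request.object.spec.containers[i].securityContext.privileged
    msg := sprintf("privileged container not allowed: %v",
        [input.request.object.spec.containers[i].name])
}
```

Any message collected in `deny` causes the webhook to refuse the request, which is exactly the “maintain the high bar at admission” behavior described above.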
Running Kubernetes yourself vs. using a hosted solution
Ben: Well, many customers have a range of options for running Kubernetes. You can run Kubernetes the hard way or use a hosted solution. If you do run Kubernetes yourself, how much more responsibility are you taking on?
Andrew: I think, firstly, most hosted Kubernetes systems, especially cloud provider systems, come with a hardened node image. They come with health checking and monitoring of that node, and they will upgrade the thing for you as well. Those are the three big ones. If someone can break out of the container, they need to land somewhere on the host operating system or the host filesystem. And if they’re dealing with a mostly immutable base image, with well-applied SELinux controls or AppArmor, there are not many places that you can actually break out onto. In fact, SELinux blocks the majority of container breakouts. So having a hardened base node is super-important. Secondly, having the thing anywhere near the current patch level is, of course, basic security hygiene.
Andrew: But when organizations get into — and I say this as somebody who’s worked in startups and big enterprises — when people get into feature delivery tunnel vision, security takes a back seat. To some extent, it makes sense, because the raison d’être of the organization is to stay afloat and stay alive. And if people are running out of VC runway or competitors are launching better products, then perhaps that makes some sense. But at some point, that well of credit expires, and we’re left running an outdated node version. And then if we look at some of the historical vulnerabilities — well, the API server, sorry, the control plane — is the important part. And Kubernetes is awesome: I would much prefer that we have a fast-paced, fast-moving platform that has bugs because of innovation, rather than something that is calcified and stationary and only secure because it has no features. That balance is accepted.
Andrew: But if we’re running an outdated Kubernetes version and one of these big internet-melting, or at least cloud-native-melting, bugs lands: the Billion Laughs attack could take down any API server with recursive YAML deserialization. If you were behind your patch cycle and you had a publicly exposed API server, well, that was your cluster down for as long as somebody thought it was funny. We had WebSocket handling issues, where there was essentially a privilege escalation for anonymous users on the API server. And we’ve had things like the netfilter vulnerability recently, which allowed a CAP_NET_ADMIN-enabled container to break out onto the host. While the first two of those require a control plane fix, that last one requires a kernel update. But if we don’t have these well-lubricated continuous delivery pipelines (and again, it’s just a function of the quality of infrastructure and application testing we have on the way to production), then the responsibility of managing your own infrastructure is probably too much for an organization. So it is predicated on that solid DevOps action.
Andrew: On the flip side, obviously I’m going to extol the majesty of managed services: fully integrated with a suite of software tools, observability, networking controls, all managed for you. Removing access to the control plane also brings an extra level of stability. And default integrations with KMS services mean that even if people are able to get to etcd by some means, the whole thing is well protected, and those keys are rotated frequently. The security model, the topology, the observability, and, frankly, the ease of scripting in infrastructure as code mean that I’d much prefer managed services, but of course I spend half my time on hybrid infrastructure anyway.
The responsibility involved in running Kubernetes yourself
Ben: What kind of client would you recommend going for running Kubernetes themselves?
Andrew: That’s a great question. There are real cost savings associated with bare metal. If an organization, as most do, still has a data center, it’s probably quite tempting to run your own clusters. But those economies of scale really only turn up with very widely scaled workloads.
Importance of keeping Kubernetes updated
Ben: Yeah. Kubernetes has changed to a different release schedule. Can you talk about how often and how quickly you should update your clusters based upon the official Kubernetes releases?
Andrew: Yeah. Absolutely. My preferred way of updating a cluster is to do a blue-green deployment. So, if it’s possible, have two clusters behind the load balancer and maintain as little state in the cluster as possible. At the point that we want to upgrade our clusters, we deploy exactly the same application stack onto the secondary cluster. So if we’re on blue already, we deploy onto green. Ideally, we’re deploying with something like GitOps, so we can have an identical deployment from the same repository. Now, obviously, there are some questions here around shared state. If we are backing off onto cloud provider data stores, which is my strong preference because managing data stores is that bit more difficult, we want to be sure that we can, for example, share the storage; that we have appropriate locking; that we’re not making schema changes as part of our updates. So we’ve built a truly compatible system that we can run from multiple places at the same time. And we can obviously test that by having the same version of the cluster deployed twice. Then, to perform the upgrade, we just change the load balancer’s target, and it can [inaudible] gradually across to the secondary cluster.
Andrew: That will prove to us, first of all, that the application can deploy onto that new cluster. That we haven’t got any breaking API changes. That there are no fundamental underlying orchestrator changes or security configurations, or new feature gates that we either need or are testing or experimenting with. And once we’ve got them both running happily in parallel, but with 100% of traffic to the old blue system, we start to move that traffic across slowly and monitor our vitals, our cluster health checks, and identify if that new cluster can take the traffic and take the load and that our update has worked correctly. The big advantage of doing things in this way is that we can move over to that new cluster and we can leave traffic there for one, two, three days. If at any point we start to see signs of degradation of performance or something has changed or you suddenly see errors, we can just flip back to the original cluster that ran happily for the past three or four months.
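How the gradual traffic shift happens depends on what sits in front of the two clusters. As one hedged sketch, an nginx stream load balancer could weight blue and green like this (the hostnames are placeholders, and the weights are just a starting point):

```nginx
# Front door for a blue-green cluster upgrade: adjust the weights
# (e.g. 90/10, then 50/50, then 0/100) as confidence in green grows.
stream {
    upstream clusters {
        server blue.example.internal:443  weight=90;  # existing cluster
        server green.example.internal:443 weight=10;  # upgraded cluster
    }
    server {
        listen 443;
        proxy_pass clusters;
    }
}
```

Flipping back after a bad signal is then a one-line weight change rather than a cluster rollback, which is the safety property Andrew highlights.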
Andrew: My preference is to — and again, I will underpin everything I’m saying with: it helps to have a crack DevOps team. I was going to say you can’t buy it; in many ways you can, but really these are the infrastructure-experienced professionals who know how to wire everything together, who know how to apply tests, who know what a technical compromise is and where it’s best made. And it’s those kinds of people who are best off running Kubernetes. So on that update cycle, my preference is to always have the capacity to easily upgrade between versions. What normally stops people updating or upgrading is application stasis. They are unable to do a blue-green deploy, or it will cost them downtime to, not physically lift and shift, but actually move StatefulSets between nodes, for example. As soon as data is involved, we pay a much higher price. We have to shut down write access to the data. We have to back it up. We have to move it to the new place, or maybe constantly keep it synchronized and then change where the primary is pointing. If we have a platform that can run from multiple places against the same data and state stores, it makes the upgrade story significantly easier.
Solutions for reducing misconfiguration errors
Ben: And I kind of liked that you touched on competent DevOps professionals as one of the key elements, which is kind of a good segue to my next question. So in RedHat’s 2021 survey about the State of Kubernetes Security, it said 94% of respondents admitted to a security incident in the last 12 months. And of those incidents, the main cause was misconfiguration. So what are some of the solutions to reducing misconfiguration errors?
Andrew: I’m very glad you asked. Control Plane have a sideline in training. My head of training, the ineffable Lewis, is my co-teacher for SANS SEC584: Attacking and Defending Cloud Native and Kubernetes. The whole course is based on popping systems and then building the controls that prevent you from doing so, and also putting defense in depth in place to make sure that those controls are effective on the developer’s machine, in the pipeline, and again at admission control. From an AppSec perspective, which is where DevSecOps originally grew from, we would call that shift-left. And that’s the process of having tooling running in the developer’s IDE that gives them the ability to perform static analysis against their code.
Andrew: We are strong proponents of the infrastructure shift-left as well. And again, OPA’s a great example. Gareth Rushgrove built Conftest for OPA, which basically takes a policy and runs it against your local files: Dockerfiles or Kubernetes manifests or whatever they may be. So we can run Conftest locally, which can run the same policy as production. And that means that when I’m building a new service in Kubernetes and I’m writing a deployment manifest, and I forget to set a security context, or I try to grant a capability that is not permitted at runtime in production, then when my pre-commit hooks run, Conftest says: “This will not get into production. Check your privilege.” And at that point, the developer is in the same cognitive space. They haven’t gone through the 2-to-20-minute push cycle to see whether CI returns true or false.
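A Conftest policy of the kind Andrew describes might look like this. It is a sketch; the file layout and rule are illustrative, not taken from Control Plane’s actual policies:

```rego
# policy/deployment.rego
package main

# Fail any Deployment whose containers don't opt in to runAsNonRoot.
deny[msg] {
    input.kind == "Deployment"
    some i
    container := input.spec.template.spec.containers[i]
    not container.securityContext.runAsNonRoot
    msg := sprintf("container %q must set securityContext.runAsNonRoot: true",
        [container.name])
}
```

Running `conftest test deployment.yaml` locally, or from a pre-commit hook, then fails fast with that message before any CI round trip, which is the shift-left loop described above.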
Andrew: Really, preventing misconfigurations is about testing. And it’s not just about testing from a QA team perspective. It’s about bringing the software development rigor that we’ve learnt over the past 20 years, and I guess the past 15 years of DevOps, and saying: “Okay. We know how these things work. We’re good at doing them for applications now. Why are we not building more tests for infrastructure, policy, and security code?” And in my opinion, the answer is because it takes a long time to run them. The answer to that is to parallelize things and to give people access to build infrastructure that will run these things on their behalf. And again, that becomes a DevOps problem. So there’s a kind of enablement catch-22, where we need good DevOps teams to build out good infrastructure development platforms to allow security teams to ship policy and changes faster and more safely. So it’s a little bit difficult. And that’s our sweet spot, of course.
On writing the book Hacking Kubernetes
Ben: You’re also currently working on a book on hacking Kubernetes. Can you tell me why you started this book?
Andrew: Had I known what it entailed, I may not have commenced in the first place. I started because my delightful and lovable co-author, Michael Hausenblas, just wondered if it would be a good idea. And I was deeply enticed by the prospect. Obviously, this is a passion of mine. I thoroughly enjoy what I do. Being given the opportunity to write a book with somebody who is already well established, who took me under his wing, showed me the proverbial ropes, and was also unafraid to embrace my love of piratical metaphor meant that we were able to write this book, as I say, with this nonsensical Captain Hashjack, with all these stories and subplots woven into the narrative, but also a lot of technical demonstration, a lot of red team tools and techniques. And it’s because, ultimately, it’s the book that I would have liked to have read. And certainly, working with Michael and having the wealth of his experience on some of the chapters that I was less familiar with, and less keen to write by default, has brought out the best of our strengths. So I’m very pleased with it. It has just gone into production editing today. There’s a degree of trepidation. I feel like I’ve buried a time capsule and I can’t remember what I put inside.
Approach to Kubernetes training and best practices
Auditing the config file
Ben: A book is one way of learning about Kubernetes, and you also do a lot of training professionally. When people are getting started with Kubernetes, many how-to guides give you a config file or some Helm charts: “Take this. Run it. See what happens.” Do you have any advice on how to audit these sorts of scripts you take off the internet, to see whether they are safe to run?
Andrew: That is an excellent question. Okay. So I guess let’s start with curl bash. Archetypal, “Let’s get this thing up and running nice and quickly.” What should we do instead of curling an unknown script and piping it straight to a shell? We should pass it through some sort of hash tool that gives us an identifier and then we should compare that to an identifier that we’ve got from a side channel, from the same provider, to make sure that it is the thing that we expect. But that doesn’t really mean anything. We should probably also check the actual code. And herein lies a good supply chain question. Because open-source and this intrinsic web of trust has grown from a place where we have Debian, for example, my favorite distribution. We have Arch Linux. We have Nix at this point as well. With huge communities of people who are putting code into something central, we trust it because there are so many eyes upon it. Even though there have been some attacks against these things generally over the course of sort of 15 to 30 years across a few of them, we haven’t seen any internet-melting attacks. So we just trust that by default.
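The checksum flow Andrew describes can be sketched in a few lines of shell. This is a minimal illustration, not any vendor’s real installer; the script contents and the side-channel digest are simulated locally so the mechanics are visible:

```shell
#!/bin/sh
# Instead of: curl -sL https://get.example.com/install.sh | sh
# download the script to disk first (simulated here with a local file).
printf 'echo "installing..."\n' > install.sh

# The digest the provider publishes via a side channel (docs page,
# signed release notes). Precomputed here purely for the demo.
expected=$(printf 'echo "installing..."\n' | sha256sum | awk '{print $1}')

# The digest of what we actually downloaded.
actual=$(sha256sum install.sh | awk '{print $1}')

if [ "$actual" = "$expected" ]; then
    echo "checksum OK: now read the script, then run it"
else
    echo "checksum MISMATCH: do not run" >&2
    exit 1
fi
```

As Andrew points out, a matching hash only proves you received the file the provider intended; it says nothing about whether that file is safe, so the code-review step still matters.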
Andrew: With curl bash, we don’t read the Bash script before running it. Because we’ve been conditioned to say, “Well, it’s open source and generally that’s been safe and secure.” Now, we get to the point that we have Helm charts. We have operators, which necessarily come with highly privileged RBAC configurations because they are acting on behalf of a human, and the resources they control may be inside or outside the cluster. In these cases, this is not analogous to open source. It’s not even that close to curl Bash. Because if we’re curling and piping to Bash, we’re probably just impacting the node that we’re on. With just the secrets that are on there, maybe an SSH key. Maybe we would accidentally install a rootkit. We’re still kind of only affecting one node. To err is human. To truly fubar requires one or more computers.
Installing Helm charts
Andrew: When we’re installing Helm charts or an operator on Kubernetes, we’re installing into a distributed system. And potentially that system is also connected to the cloud, and we have a workload identity. So the potential for compromise for a malicious operator or Helm chart is huge. This is a piece of research work that we’re actually undertaking at Control Plane. My dear friend and colleague Kevin Ward will be talking about tweezering Kubernetes operators at KubeCon North America. It’s actually a multitiered answer, and I’ll let him go into more detail. But again, there’s static analysis. There is identifying a steady-state for the thing that we’re bringing in. Once the manual review has been performed, we have our static analysis profile built for the thing. We need to make sure that it stays within the tolerances of what we’ve allowed without, for example, also being very noisy and saying that every time a small thing changes, we require another manual review. It’s also a case of running intrusion detection.
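A crude sketch of the static-analysis idea Andrew mentions: render a chart’s manifests to plain YAML and flag high-privilege RBAC for manual review before anything reaches the cluster. The manifest below is a made-up example of what an operator chart might render; in practice the input would come from `helm template <release> <chart>`:

```shell
#!/bin/sh
# Normally: helm template my-operator ./chart > rendered.yaml
# Simulated here with the kind of RBAC an operator chart often ships.
cat > rendered.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
EOF

# Flag cluster-wide role bindings and cluster-admin grants;
# any hit goes to a human before the install proceeds.
if grep -Eq 'kind: Cluster(Role|RoleBinding)|name: cluster-admin' rendered.yaml; then
    echo "REVIEW: chart requests cluster-wide or admin privileges"
fi
```

This is only the first tier of the multitiered answer he describes; the steady-state profiling and intrusion detection sit on top of a check like this.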
Ben: So be wary of just downloading any Helm chart off the internet?
Andrew: If I was to make a recommendation to everybody to analyze everything they download from the internet, you know as well as I do, it’s an impossibility. But for highly privileged things going into distributed systems, there is an elevated level of risk.
What Kubernetes Simulator does
Ben: Another tool that you have as far as teaching and learning is your Kubernetes Simulator.
Andrew: Yeah. Absolutely. After running various toys and — after having a parallel life of running highly resilient, often air-gapped, difficult-to-manage-and-maintain Kubernetes clusters from 1.2 — that was the first version I was in production with — I was also spinning up my own clusters at home all the time in my own hobby lab. At some point, it kind of became clear that this was still not an easy enough thing to do for most people. So Control Plane began hosting Capture the Flag events and debugging events where we would spin up 20 clusters on DigitalOcean, and people would work in groups of sort of 2 to 5. And I love that, because I’ve always taken so much from pairing with better engineers and being in groups and sort of mobbing, because you see so many different cognitive styles expressed via the medium of the keyboard, and it really opens the mind to the different approaches and mental models that we can bring to a problem.
Andrew: So we were doing these workshops. Dragging Kubernetes up for every version, sometimes through point releases, just took a lot of time. We were very lucky to be commissioned by one of our clients to actually build the thing into an all-singing, all-dancing simulator. It stands up Kubernetes in an offline, kind of air-gapped configuration with a bastion [inaudible] and then deploys 1 of, at this point, kind of 20 or 30 different scenarios. What initially began as quite nefarious cluster-debugging scenarios, where we would put typos in label selectors and have intermittent network policies, grew into us running operators, in inverted commas, that were more like chaos monkeys. And they were just interesting, kind of almost like escape rooms for Kubernetes. We took that, and we rebuilt the thing with a load of security-focused scenarios. So what began as production debugging became essentially red-teaming Kubernetes.
Andrew: We were then very lucky to have the opportunity to run the Capture the Flag events at the SIG Security Day for the various KubeCons for the past couple of years. We’ll also do KubeCon North America. If anybody’s interested, SIG Security have a day-long Capture the Flag. It is going to be epic this time. Unicorns, glitter, and really quite irritatingly difficult problems. We use this tool. It’s all open source. People can spin it up and learn these things by themselves. But I think the real benefit comes from joining a team, maybe of people that you don’t know, and understanding how different people approach an unknown problem. An intractable issue. So, yeah. That’s how I love to learn. My own personal arc is optimized for failing very quickly. By typing fast and building tests, I know when I fail, and then I just iterate and move, and that’s how I learn. To encourage people to kind of find their own learning style in the context of the Capture the Flag, the simulator is there for them to do their worst or their best.
Ben: It’s the perfect tool, especially if you have Kubernetes at work and you don’t necessarily have the opportunity to cause chaos or wreck it without getting in trouble for it.
Andrew: I mean, that’s an excellent point. We were actually commissioned to build it by a regulated organization that couldn’t spin up anything vulnerable in anything that they were financially attached to. It’s a perfect point. We actually have a hosted version that we’ve been running in beta for a little while.
Kubernetes namespaces as a security boundary
Ben: So to prepare for the podcast, I sort of pinged our Dev team for some questions. And I got one back from Andrew, who actually completed our Kubernetes integration and sort of improved it greatly. Can a Kubernetes namespace be a strong security boundary if configured correctly?
Andrew: That is an excellent question. And I think I’ve learned more from the excellent reviewers of the book, who gently and softly questioned some of the things that I wrote around Kubernetes namespaces, than I have in the past few years. The agreement that I’ve come to with myself, and the shared understanding of the reviewers, is that a Kubernetes namespace is not a security boundary in itself, because there are things that are not namespaced, so there is no way to accurately correlate security criteria to the namespace. Unfortunately, many security functions are namespace-bound. So we’re talking about things as mundane as network policy, as admission control, as RBAC. Of course, there’s also cluster-wide RBAC. But we’re also talking about application-level things that affect the way that we architect our application topologies.
Andrew: So for example, if I have a web-facing workload and a batch workload, the web-facing workload probably wants to scale up and down with the horizontal pod autoscaler based upon incoming traffic. If that is within the same namespace as the batch workload, and they’re constrained by a limit range that prevents them from claiming too many resources, through their limits and requests and ultimately cgroups (control groups), then that batch workload is going to be constrained by the width that the horizontal pod autoscaler scales the web pods out to. So even from an availability perspective, kind of CIA triad style, the availability of the system, and therefore its performance and its security and its conformance to expectation, is affected by the way that we architect on namespaces.
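To make that contention concrete, here is roughly what the shared constraint looks like. This sketch uses a namespace-wide ResourceQuota, a close cousin of the limit range Andrew mentions, since it is the object that caps the aggregate that the web and batch workloads compete over; the namespace name and numbers are illustrative:

```yaml
# Every replica the HPA adds for the web tier consumes CPU headroom
# that the batch workload in the same namespace can no longer claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: mixed-workloads
spec:
  hard:
    requests.cpu: "8"
    limits.cpu: "16"
```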
Andrew: So the answer is somewhere in between. It’s definitely a grey area. There is no strong linkage. I suppose the other useful example is the network in general. Networks have no capacity to appreciate namespaces until we start to use CNI-specific things. Obviously, network policy was introduced by Calico, kind of abstracted from the interface that Calico generated. As I say, I was challenged, conflicted, and ultimately reconciled with some better opinions. There is really no strong namespace tenancy, let’s say. We can’t say that a namespace creates a perimeter of security, but we can use namespace-bound security functionality in order to give us isolation between tenants. All that being said, Kubernetes multitenancy is really difficult. And if we’re going to go that far, we would have to run our own DNS servers within namespaces because otherwise, we leak records between them. Probably, we get down to node pool isolation, and actually node-selecting groups of pods onto nodes. At that point, we’re kind of a bit closer to hard multitenancy, but we’ve bound namespaces to node pools. And then, arguably, which is actually the stronger isolation boundary? It’s probably the fact that we’re isolated on dedicated hardware.
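As a concrete example of the namespace-bound security functionality being discussed, a default-deny NetworkPolicy only ever applies within its own namespace; the namespace name here is illustrative:

```yaml
# Denies all ingress to every pod in "tenant-a" unless another policy
# in the same namespace allows it. Pods elsewhere are untouched,
# because NetworkPolicy is a namespaced resource.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-a
spec:
  podSelector: {}   # empty selector matches all pods in this namespace
  policyTypes:
    - Ingress
```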
Leveraging open-source projects
Ben: Cool. That’s a great answer. So you’ve leveraged lots of CNCF projects. Why do you think it’s important to leverage open-source projects?
Andrew: Because the open-source community has revolutionized the way that software is consumed, maintained, and security-patched. When we saw internet-melting things like Shellshock and Heartbleed, any number of closed-source hardware vendors and box sellers suddenly just patched and upgraded their systems, but they didn’t produce software bills of materials. They didn’t adhere to the licenses that said that they should be publishing the fact that they were using this code within their systems. They just all kind of uniformly upgraded their OpenSSL version or their Bash version at the time. Open-source software essentially runs the internet. It powers the world. And it’s built by people who genuinely believe in — I was going to say intellectual curiosity, but it’s not quite that, is it? It’s —
Ben: Open-source Kubernetes projects or CNCF’s projects, are there any ones that you’re really excited about? Either that are incubating or that have been around for a while?
Andrew: Man. Yeah. I mean, it was a surprise, I think, to everybody that Isovalent only just put Cilium into the CNCF. It’s been an exciting eBPF-based networking technology for, I guess, three or four years, and I’m super pleased to see it make its way into the CNCF now.
Ben: You said Open Policy Agent, but that’s been around for a while.
Andrew: It has, yes.
Tips for companies looking to provide services to the public sector
Ben: We can just segue. You work with lots of public sector companies. And then this can be often a very difficult industry for startups to get into either through consulting services or selling, whatever, software. Do you have any tips for companies looking to provide services to the public sector?
Andrew: My personal arc was guided by a desire to just keep on figuring out what was next. And I moved to London as a consultant for the opportunity, and I always believed that there was something more interesting. And to be honest with you, I always wanted to be in security operations centers with the pewpew maps. Yeah. That James Bond-esque vision of London was very much my kind of Dick Whittington, streets-paved-with-gold experience. Well, projection. So I got to London and started to work around some places. Did a bit of work for media organizations just as paywalls didn’t exist yet, but Google was introducing all the news aggregation. And then kind of saw how things operated and just kept on digging, thinking about, “Well, what’s the next most secure thing?” In this naïve imagination that at some point I would get to the kind of super-secure thing.
Andrew: And just as a one-man consultant, as I started to move into financial services and credit agencies and that kind of thing, it became clear to me that actually security is really difficult. I found security fascinating, but I never wanted to get into it, because I was conscious that the buck has to stop somewhere, and it’s got to stop with the security team. If people are going to do their jobs, ultimately, somebody bears the responsibility. And I just thought: “Oh, I don’t think I could deal with that weight of pressure.” Once I actually got into the industry and saw how difficult it was to secure things, and how risk balances out over multiple departments and approaches (it’s really not a case of kind of one man standing there with a keyboard, swinging a mouse, trying to bat away the pewpew maps), then I thought: “Well, actually, maybe this is slightly more approachable.” And so I started to put my best foot forward in more difficult security contexts, I suppose.
Andrew: It was that that eventually led to me moving through financial services to work in government. And it was from being there, working on critical national infrastructure projects, that I came to understand, again, that everything is a compromise. It’s all very well to physically isolate your servers and sort of work on an air-gapped basis, but then the risks change and they’re different. And actually, you’re looking at how you protect boot sectors of USB disks more than kind of worrying about how your network is configured. It was doing that with Kubernetes 1.2 that put me in a position of understanding whereby I realized everything is difficult. There is no one true way. Everything is a compromise. And then understanding that the difficult journey that we’d been on with the critical national infrastructure would be echoed and amplified by organizations for the next 10 or 15 years put us in a good position to start Control Plane, my cofounder and I.
Andrew: And from there, it was a case of ultimately just demonstrating that we were aware that these things were tricky, but we’d been through the mill. We understood what it’s like from inside those other organizations. So I’d say that’s probably the easiest way — to put yourself through the hardship of having to work under the constraints of a regulated organization or [inaudible] and once you can kind of — yeah, once you can share war stories about how a lack of access made you sort of skirt firewalls, or about middleware boxes pulling critical patches to prevent a CVE from being exploited in production, I suppose that’s the kind of lingua franca of battle-worn, regulated-industry DevOps folk.
Ben: Yeah. So to summarize, your tip is start in the heavily — the security problems are everywhere. Start in the heavily regulated ones and then you can sort of take those learnings elsewhere to other industries? The same patterns sort of apply across all industries.
Andrew: Yeah. Absolutely. And it’s so much more difficult to do it in a regulated fashion that once that sort of competency is demonstrated, it’s broadly transferable. I think also part of it is an innate resilience. It is difficult to work through layers of risk-based bureaucracy. Large organizations are often calcified or glacial in process. And I think a cheerful optimism is probably the most devastating foil to that level of regulation.
Ben: Yeah. I’m sort of surprised that they even pick Kubernetes to start off with.
Andrew: Yeah. That’s an excellent point. It is very much... I don’t want to say a dart in the quadrant, but the sea change, I think, is clear, or was clear, in the past four or five years. People knew containers were coming. Then, there were the orchestrator battles. Were we going to go Mesos? Was Nomad going to make an appearance? Did Swarm hold the middle ground? And actually, all of those things ceded ground to Kubernetes. That, combined with the supposed cost-saving benefits of cloud —
What’s next for the Kubernetes ecosystem
Ben: As Kubernetes grows in adoption, what do you think is next for the Kubernetes ecosystem?
Andrew: Awesome question. Okay. So the Kelsey Hightower view, of course, is that Kubernetes is a platform platform. It’s for building other platforms on top of, and it’s the fundamental distributed systems control theory exercise of reconciliation loops. There’s this amazing quote from Tabitha Sable, which is that Kubernetes is a friendly robot that uses control theory to make our hopes and dreams manifest, so long as your hopes and dreams can be expressed in YAML. And I love this because it means that Kubernetes is just an eventually consistent system. You give it a request, and you wait for the request to be made reality. And if we decompose it just to that basic level, then the Kelsey Hightower platform platform aspect is, well, why don’t we just run everything on Kubernetes? Why don’t you reconcile your external systems’ state with Kubernetes and say: “Okay. I want to spin up a database, or I want to run this application stack, or I want to run this network configuration”?
Andrew: So you have an eventually consistent, Terraform-esque infrastructure as code. I mean, you’d call it an operator in Kubernetes. I love that idea. I think there is a lot of complexity there, and in the inversion of control that is required to run an application in a cloud-native way. No longer is the application king; actually, the application is one workload amongst many that is potentially going to churn. It may not survive the next 1 minute, the next 10 minutes. That inversion of an application from the most important thing to just one of a herd is the same thing that needs to happen for cloud infrastructure in general. Of course, that’s how we should treat it, but it’s often stateful and snowflake-esque. So I really love that as a control theory metaphor.
Andrew: The other thing I’m really stoked by is Knative. Out of all the serverless systems — we have Lambda, we have Azure Functions — we also have Google Cloud Run. Now, Google Cloud Run is a Knative-compatible container runner that actually runs on Borg, which is Google’s global data center management software and the progenitor of Kubernetes, via an experiment in another orchestrator called Omega. But Cloud Run is actually compatible with Knative. So Knative is functions as a service for Kubernetes. So we have this hybrid, best-of-all-worlds situation. Because we kind of see the cloud-native evolution as: lift and shift into virtual machines on EC2 or GCP; then repackage and reconfigure our workloads for Kubernetes and operate in a distributed and bin-packed style with increased availability and scalability and resilience guarantees.
Andrew: And then the next stage for many organizations is “Let’s run serverless.” Serverless is not my favorite thing. I find the lack of observability difficult. Security monitoring is less easy. I mean, there are now sort of more options to do this. For me, the best of both worlds is running something like Knative on Kubernetes. We can still run intrusion detection on the underlying hosts. We can still take full-stack observability because we’re in control of the underlying virtual machine, but we have scale to zero. We have functional decomposition and sort of single responsibility for each of the things that we’re running and deploying. So there we go. Across the range: Kubernetes for controlling the cloud, Kubernetes for sticking in small boxes at the edge, and Kubernetes for running functions.
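A sketch of the Knative-on-Kubernetes shape being described: a Service that scales to zero when idle while the underlying nodes stay observable. The image name and scaling bounds here are illustrative:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-fn
spec:
  template:
    metadata:
      annotations:
        # Scale to zero when idle; cap the burst size.
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: ghcr.io/example/hello-fn:latest
```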
Closing tips for staying secure
Ben: Cool. There’s a lot happening. To just close it out, do you have any closing tips for staying secure?
Andrew: Minimum viable cloud-native security is scanning your container images before they get to production. That means that we can scan the operating system package dependencies, your programming language package dependencies, and anything else that’s been installed. That’s a good way of protecting the first entry point of an attacker. As the inestimable Brad Geesaman likes to say, “Workloads are the soft underbelly of Kubernetes.” We can use the declarative configuration, and we can lock things down because it’s well-known, and it’s well-tested. Our applications are built from code that is constantly changing. We’re shipping things because we’re trying to build value for the businesses that we’re part of. So making sure that we scan those container images is the first port of call, because those web-facing sockets are where people will attack us. Secondarily, making sure we have correct policy so that when things come into the cluster, and when they operate within the cluster, they conform to a well-known baseline. If the baseline’s wrong, well, that’s okay. We can fix it in the future. But at least we have a baseline. Building out a secure supply chain is a tricky thing to do, but by signing the things that we build in our CI/CD, we know that we trusted them at that point in time. And even generating and signing a software bill of materials is better than not having one at all.
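One common way to wire that image-scanning step into a pipeline is a CI job that fails the build on known vulnerabilities. This sketch assumes GitHub Actions and the Trivy scanner; the registry path is illustrative, and any scanner with a non-zero exit on findings would serve:

```yaml
# CI step: block the pipeline if the image carries critical or
# high-severity known vulnerabilities, before it can ship.
- name: Scan container image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: "1"
```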
Ben: Thank you, Andrew. That’s a perfect closing tip. It’s been good to have you today. Thank you so much.
Andrew: Much appreciated. I’ve thoroughly enjoyed being here.
Ben: [music] This podcast was created by Teleport. Teleport allows engineers and security professionals to unify access for SSH servers, Kubernetes clusters, web applications, and databases across all environments. To learn more, visit us at goteleport.com.