Navigating Access Challenges in Kubernetes-Based Infrastructure

Published: September 19, 2024

Organizations often find that as they deploy their Kubernetes infrastructure into production and across the company, what worked well for managing access during development does not scale efficiently. Research shows that this often leads to serious security risks, including breaches.

So, new access challenges emerge, particularly as teams scale. Join us for a 30-minute deep dive into how to secure access to Kubernetes-based environments including clusters, databases, and applications in a scalable way.

Join Dave Sudia, Senior Product Engineer at Teleport, on September 19 to learn more about:

Today’s most common Kubernetes access approaches
Access obstacles that could hinder your team’s efficiency and security as your Kubernetes footprint grows
Setting up Role Based Access Controls (RBAC) across mixed infrastructure and multi-clouds
How the Teleport Access Platform offers a streamlined solution to these challenges, ensuring secure and seamless access to your critical infrastructure

Whether you’re just beginning to navigate Kubernetes or looking to refine your current approach, this session will equip you with the insights and tools needed to future-proof your operations.

Transcript - Navigating Access Challenges in Kubernetes-Based Infrastructure

Dave: All right. We'll get started. Just want to respect everybody's time. And welcome to Navigating Access Challenges in Kubernetes-Based Infrastructure. My name's David Sudia. I am a senior product engineer here at Teleport. Before here, I was CTO at a nonprofit where we used Kubernetes, and I was a platform engineer at a startup before that where we transitioned all of our infrastructure from Heroku over into Kubernetes, and I led that effort. So it's a topic that's near and dear to my heart. And one of the things that I like about working at Teleport is I saw this, and I was like, "I wish I had that, especially for this use case." So I'm excited to talk about it today. So let's just talk briefly kind of what's the problem that we're trying to solve here. I mean, anyone watching this is probably aware of all the breach reports that happened. I just got an envelope two days ago from a healthcare company about my son's data being breached. And so that's something that is very close to us here at Teleport, that this is what we try to prevent. And Kubernetes, in particular, is a very difficult thing to secure. It's not secure by default, by any means. I think a lot of people think it is. And you can definitely make it secure, but at its base, it's not. It's really easy to deploy pods that have way too many privileges, including access down to the actual underlying host. It's really easy to accidentally have your credentials exposed in a code repository. There's a lot of people out there trying to get into your systems.

Dave: And I think the most insightful, impactful thing ever told to me about attackers versus defenders was said by Ian Coldwater, who's a very prominent Kubernetes and cloud-native security person. And in a talk at KubeCon 2019, they said, "Blue teams think in lists, and red teams think in graphs or maps." And that really stuck with me. It's sort of when you're on defense, you're sort of thinking, "Well, I'm hitting these points for compliance. I need to protect these things." And attackers, red teams, are going, "Okay. I'm here. Where else can I get to?" And it really stuck with me in a way that kind of made me think, "Ah, that's why I'm not in pen testing because that's just not the way I default to thinking." But for me, it illustrates why these are problems. Right? If you have access into a cluster, if you then have, via that access to the cluster, access to a pod, and that pod has privileges, then you're going to be able to get down into something else. And from there, you can figure out where else you can get to. And all of a sudden, you're everywhere in someone's infrastructure.

Current industry trends in Kubernetes security/access and common access challenges

Dave: And so this number down on the bottom left here, where 61% of Kubernetes signals (this is telemetry sent back to Elastic in one of their research studies) were potentially attackers trying to expand their access. The detail there is those are service accounts that were trying to access things that those service accounts were not supposed to be accessing. And that can just be someone misconfiguring something, but it can also be a sign of someone poking around, seeing, "Okay. I'm here. Where else can I get to?" And that happens a lot in Kubernetes, and there are a lot of ways to poke around in there, and that's part of what we're trying to solve. So the challenges we have around providing access are these: you need to provide credentials to developers and infrastructure folks to work in a cluster. Sometimes those are hard-coded. Sometimes they are something like you're logging in through the AWS or gcloud or Azure CLI, and you get a credential dropped onto your system, and that sits there for a long time. It's hard to know who has that, especially when you're trying to manage lists within those cloud providers, especially across cloud providers. And the number down here, that 76% of security professionals are not confident that ex-employees no longer have access to infrastructure, I think a core thing there is really this: if I could see you all raise your hand, which I can't, I'd sort of do it. Raise your hand if you can tell me right now who's got access to what. Right? And that is a really hard question to answer a lot of the time.

RBAC mistakes lead to breaches

Dave: So what we end up with are clusters that are misconfigured, have more exposure than they necessarily need. Some need to be accessible via the public internet but maybe just the services and not the actual control plane API. And you've got secrets that end up getting leaked, whether that is on employee hardware or in a code repository, whatever. You've got these long-lived secrets that end up being exposed. And at Teleport, what we're trying to solve across the board is this idea of long-lived secrets and getting rid of those. So you have role-based access controls that can lead to breaches. Like I was saying earlier, securing a cluster, we say it's manageable. Even securing a single cluster, it's kind of a pain — and I'm going to get into that a little bit later — but especially when you're running multiple clusters across multiple clouds. When you're in one cloud, I think you can even rely on the RBAC controls within that cloud provider. When I was doing that migration at the startup, we were completely within Google Cloud, and so we just used Google Cloud's RBAC system. Great. But if you've got clusters across multiple ones, now you need to use different file formats in different security systems across those clouds, try to make sure that all the permissions are lining up across them. It's not fun.

Secrets that are not so secret, and security vs. ease of use

Dave: And if you have a single misconfiguration, it can lead to a breach, especially if the things inside of the cluster are not using the correct security policies and pods are overly privileged. You end up with secrets that are not so secret. Things get shared. And again, for us, it's really we don't want you to have to manage secrets. Right? We don't want to have these long-lived things, these API keys, these certificates stored. It's about shrinking the attack surface by getting rid of those entirely. Things can be really hard to use. Right? There's always tension between security and productivity. You can lock things down completely and make it impossible for even the people who need to work on them to work on them. And so for us at Teleport — again, also about finding the balance of security versus ease-of-use versus operability versus letting people use the tools that they need to use to do their job. So when we were configuring our own RBAC solutions, it's about making them purpose-built. It's about making them easy to configure for the teams that need to do that, whether that's the DevOps team or the security team, whatever, and allowing the end users to use tools that let them access things using those rules in an easy way without having to send them down a different path, a different tooling solution, that kind of thing.

Dave: And I'm going to show this in a very specific instance later here, but creating users and groups in Kubernetes is not a great experience. Kubernetes does not actually really natively have the concept of users and groups. It really just has the concept of a certificate. You send an X.509 certificate up to the cluster that is associated with some kind of user group, but then you have to get that certificate to people's devices for them to use. And tracking what certificate is assigned with which user group is not very fun. And so in the end, a lot of places end up just kind of using default configurations. Especially if you're not able to use the native RBAC solution in your given cloud, it becomes a huge pain to distribute certificates. If you can just sort of do AWS login, all right, that might work for your EKS cluster, but especially if you're multicloud, it becomes a really big pain. And that's another thing that we're trying to solve in terms of just having a centralized location to manage this stuff.
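
To make the certificate point concrete: when Kubernetes authenticates a client certificate, it reads the user name from the subject's common name (CN) and group memberships from the organization (O) fields. The following is a minimal Python sketch of that mapping. It works on a hand-written subject string rather than a real certificate, and the `identity_from_subject` helper and the sample names are hypothetical.

```python
def identity_from_subject(subject: str) -> tuple[str, list[str]]:
    """Split a subject such as 'CN=jane,O=dev-team,O=viewers' into
    (username, groups): CN becomes the user, each O becomes a group."""
    user = ""
    groups: list[str] = []
    for part in subject.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN":
            user = value
        elif key.upper() == "O":
            groups.append(value)
    return user, groups


# Hypothetical subject for illustration only.
user, groups = identity_from_subject("CN=jane,O=dev-team,O=viewers")
print(user, groups)
```

Because the identity lives inside the certificate, changing a person's groups means issuing and distributing a new certificate, which is the tracking burden described above.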

Teleport: reducing friction between engineering and security

Dave: So I'll take a second, for those of you who are not familiar with Teleport as a platform, to go over just briefly what we're about, and then take it more concretely into what it looks like for Kubernetes. What we're trying to do is make your infrastructure easier to access for the right people and the right machines while making it more secure. We're trying to really strike that balance, right, of security versus productivity. So for your workforce, we're trying to increase productivity. We're trying to close gaps in onboarding and offboarding: make it easier to get people on and make it easier to make sure that people who have exited the organization no longer have access. Make it really easy to find who has access to what, and simplify that access by making the tooling simple for the developers and secure for the security team, eliminating long-lived secrets and bringing it back to short-lived certificates, cryptographic identity for people. On that security, again, removing credential-based attack surfaces by keeping it to short-lived certificate access, eliminating standing privileges. We have a lot of other webinars and information about how we provide access requests.

Dave: It's one of my favorite things about our platform because I am the person who accidentally deleted prod one time when I thought I was working in dev. And I had standing access to prod. It wasn't a breach. It was just me making a mistake. And I really would have loved to not have that access and be able to quickly say, "Hey, can I have access to prod for an hour to do this thing?" Giving auditable data that you can use either in our system or in an outside system, and being able to instantly respond to breaches through locking people out, locking systems out, and knowing when that's going on. And then on the compliance side, providing tools that make it really easy to prove compliance and, again, kind of going to the blue team/red team thing, not just checking things off but really knowing that you have truly protected systems. But when you're in those audits, making sure that you can quickly answer audit questions by just pulling up the data that you need.

Teleport Access Platform

Dave: And this is kind of — this is the big picture. Right? Of all the things we secure, we kind of have three core pillars to the platform. We've got identity, which is who's getting in, whether that's people or machines; policy, making sure that the correct things are protected, are accessible by the right people, and people can't change what they shouldn't be able to change; and then just the access component of—I can very easily access the things that I should be able to access. And we can do that across many, many platforms, right, all the major clouds, major databases. Machine ID allows us to do this from continuous integration and continuous deployment tools and out to Kubernetes clusters, regular SSH servers, etc. To address the problems we talked about earlier, here are the concrete things that Teleport really provides for that. So first is a single entry point for all your clusters. You can go to your Teleport cluster, right, and have every Kubernetes cluster that you have enrolled in the Teleport cluster, and everything flows through Teleport for access. So you're no longer having to manage multiple entry points, whether it be multiple CLI tools for the clouds or whatever. You can just go through this.

Unified and scalable secure access controls

Dave: The architecture here is that you have an auth server for Teleport which provides the certificate signing in that, but you also have these proxy servers that proxy the traffic. So when you're communicating with the control plane API, that's going through a proxy, and we can manage who has access to what within a cluster. If you have Auto-Discovery, we have a Discovery Service you can set up to automatically enroll clusters as you create them. You're able to lock down endpoints. And this is a really key thing for me, right, is that when you're engaging with that control plane API, like I said earlier, you might want the services to be public, but you don't want that control plane API to be public and exposed and reachable. So the Teleport Proxy system works through a reverse tunnel. So once you install Teleport into a cluster, you can completely close off that control plane API from public access, everything flows through Teleport, and you really reduce your attack surface through that. And then again, cryptographic identity, that's way easier to manage because, technically, Kubernetes uses cryptographic identity as well. You have certificates, but we make it way easier to manage and distribute those certificates out to devices. With all that, then that's the security side. Let's talk about the productivity side.

Integrated with DevOps workflows and streamlined least privileged access

Dave: It's really still easy to access things then. You log in with Teleport, and from there, you just use kubectl. You have access to the correct clusters. Within those clusters, you have access to the correct resources and namespaces, and you don't have to — developers don't have to change their tooling, which we'll see a little bit in a demo. We can provide automatic discovery of Kubernetes apps. So if you have internal applications that are running in a cluster, you don't even have to expose those services publicly or even into a VPN. You can get rid of the VPN and go through the Teleport Proxy to talk to those applications. And because you can use SSO to log into Teleport, we have that set up. Whether it's as simple as GitHub or as complex as a very complex Active Directory or Okta integration, you can bring over all your rules, all your groups, all that sort of thing. And we can provide mTLS connections for machines like those CI systems. It really streamlines having, again, least privileged access. Right? It's not just about the right access to the right clusters. It's the right access to the right pods, the right namespaces. And as I was mentioning before, I really wish I'd had this when I was doing this work. We have just-in-time access requests, so you can very easily say, "Oh, I need to debug something in this namespace in prod. I just need to get access for an hour. I think that's all I need," right, and get that approved very quickly by a team member, come back in, do that work, but know that you're never going to accidentally switch contexts in kubectl over to the prod cluster and accidentally delete something.

Demo: scalable secure access for Kubernetes

Dave: We have per-session MFA. So even when you try to access a given pod, we can make sure that you need a hardware key or some other WebAuthn token. Within our Access Graph piece, you can find access patterns. And within our access auditing, you can find access patterns, pods that are overprivileged, people that have too many privileges within a cluster. And again, a key thing, I think, for us versus other solutions here, is namespace separation. You can really make sure that people only have access to specific namespaces within a cluster, not an entire cluster. So even within dev, people might have full permissions in a dev environment but only for the things that they need to work on. So let's look at it in practice. I have decided here not to try to appease the demo gods, but rather we've got a prerecorded thing that I've got so that it's definitely going to work.

Dave: So let's look at it first from a user perspective. This is a developer user I've got in a Teleport cluster I made just, again, for kind of showing this. All they can see is a Kubernetes cluster that is in the dev environment. I have more clusters attached here, but they don't need to see production on a regular basis. So from here, the developer can work with whatever resources they have access to, click Connect, and we're going to give them instructions for how to access it in a really simple way. I can copy that command, come over into my terminal, paste it, enter my password. I'm going to go through MFA here. And at this point, I'm connected into the cluster. I now have a short-lived cryptographic identity on my device. You can see that it's valid for eight hours only. So it's my workday. I come in. I log in. When I leave at the end of the day, that is no longer valid. If I lost my device overnight, no problem. Once I'm logged in, I can come back over and copy this `kube login` command, paste that. I now have a short-lived certificate representing my identity in my kubeconfig. So now I can just use kubectl. I don't have to do anything special. At this point, I'm back to my regular toolchain, and I haven't had to change any of my practices. So I can run version, make sure that I can talk to the API. And then I'm going to try to get the pods in a namespace that I have permission to talk to. So I'm going to look at `workload-id-demo`. This is a demo environment I'm running for something else. And I can see pods there. I could work with them. I can edit them. We'll see how that all works in a moment.
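
The eight-hour validity window from the demo can be illustrated with a small Python sketch. It only models the time check. The `is_valid` helper and the sample timestamps are hypothetical, and a real certificate carries its own not-before and not-after fields that the server verifies.

```python
from datetime import datetime, timedelta, timezone

# Eight hours, matching the validity shown in the demo.
CERT_TTL = timedelta(hours=8)


def is_valid(issued_at: datetime, now: datetime, ttl: timedelta = CERT_TTL) -> bool:
    """Return True while `now` falls inside the certificate's lifetime."""
    return issued_at <= now < issued_at + ttl


# Hypothetical login at 09:00 UTC on the day of the webinar.
issued = datetime(2024, 9, 19, 9, 0, tzinfo=timezone.utc)
print(is_valid(issued, issued + timedelta(hours=7)))   # still within the workday
print(is_valid(issued, issued + timedelta(hours=15)))  # overnight, after expiry
```

A credential that stops working on its own is what makes the lost-device case low-risk: nothing has to be revoked for it to expire.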

Dave: Now I'm going to try to get something in a namespace that I shouldn't have access to, kube-system. Really shouldn't be able to access that as a developer. And we can see that I don't have permission to list resources there, pods in particular, but really anything. Now, if I think I should, it gives me instructions on specifically what I should ask for access to, and I can run some commands to check if I have those permissions. But this is mediated through the Teleport Proxy that I'm not able to do that. So I can exec in. I can run commands. Maybe I'm trying to debug something in my dev environment. I can exec into this pod. And I can run some commands, check, "Is the code that's in there the code that I think it should be?" right, kind of do those things. And we're going to see the import of this in a moment when we look at this from the security side. And I'm just going to check some stuff out.

Dave: Now, on the management side, I'm now in a window where I'm running as an admin sort of setup user. And we can see that the user admin here now has “access” and “audit” and “edit” privileges. These are kind of the standard things you have as an admin in Teleport. But the user I was before just has this Kubernetes dev role. And so again, role-based access control, familiar to everyone but centralized within Teleport, where this role would be true whether I was trying to access an EKS cluster or a GKE cluster, any kind of thing there. So let's look at the role itself. Within `allow`, the thing I really want to point out here (and this kind of comes back to the “it's a huge pain to make users and groups in Kubernetes”) is that the Kubernetes group here is cluster admin. Now, Teleport has to map your user to some user or group within a Kubernetes cluster, and we've mapped this role to cluster admin. Now, I'm giving an extreme example here just for illustration, but I think it works because it's an extreme example. And you just saw a few moments ago that even though I mapped to the cluster admin group, I wasn't able to list pods in the kube-system namespace. And that's because what we kind of found is that people have a horrendous time correctly configuring Kubernetes RBAC, internal RBAC roles, with setting up cluster roles, setting up namespace roles, mapping users and groups to those roles.

Dave: And so what we've built is the ability to do the RBAC in Teleport. And I can map myself to cluster admin, but then I can say, "Only show this user Kubernetes clusters that have the label environment, development, and dev, and only allow them to access Kubernetes resources in the namespaces `workload-id-demo` and `tbot-attestation`." Again, things from my cluster, but you can extrapolate out. Right? And here, I've given this user global privileges within those namespaces, but we can tighten this down, too, to say, "They can only look at deployments and pods or services. They can't touch config maps or secrets, etc." Right? Maybe I want my secrets within a cluster to only be managed by my continuous deployment system, and I want them to be able to view those secrets but not edit. All of that is configurable here within this Kubernetes resource block.
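
The layered check described here (cluster labels first, then namespace and resource rules) can be sketched in Python. This is a simplified model for illustration, not Teleport's implementation. The dictionary loosely mirrors the role fields discussed in the talk, the label values are one reading of what was said aloud, the group string is a placeholder, and `cluster_visible` and `request_allowed` are hypothetical helpers.

```python
from fnmatch import fnmatchcase

# Simplified stand-in for the role shown in the demo (values are assumptions).
role = {
    "kubernetes_groups": ["cluster-admin"],  # placeholder for the mapped group
    "kubernetes_labels": {"environment": ["development", "dev"]},
    "kubernetes_resources": [
        {"kind": "*", "namespace": "workload-id-demo", "name": "*", "verbs": ["*"]},
        {"kind": "*", "namespace": "tbot-attestation", "name": "*", "verbs": ["*"]},
    ],
}


def cluster_visible(role: dict, cluster_labels: dict) -> bool:
    """A cluster is shown only if every required label has an allowed value."""
    return all(
        cluster_labels.get(key) in allowed
        for key, allowed in role["kubernetes_labels"].items()
    )


def request_allowed(role, cluster_labels, kind, namespace, name, verb) -> bool:
    """Allow a request only if the cluster is visible and some rule matches."""
    if not cluster_visible(role, cluster_labels):
        return False
    return any(
        fnmatchcase(kind, rule["kind"])
        and fnmatchcase(namespace, rule["namespace"])
        and fnmatchcase(name, rule["name"])
        and any(fnmatchcase(verb, v) for v in rule["verbs"])
        for rule in role["kubernetes_resources"]
    )


dev_cluster = {"environment": "dev"}
print(request_allowed(role, dev_cluster, "pod", "workload-id-demo", "web-1", "list"))
print(request_allowed(role, dev_cluster, "pod", "kube-system", "coredns", "list"))
```

In this model the broad Kubernetes group never widens access on its own: a request for `kube-system` is rejected because no resource rule names that namespace, which matches the behavior seen in the demo. Tightening a rule's `kind` or `verbs` list would narrow access further, for example to read-only access on secrets.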

Dave: So the next piece here, as an admin or as someone on a security team, is that we have a built-in audit log, and once someone has gone and done things, I can go check this audit log. The audit log gives me information about every event that has occurred through the Teleport Proxy. So I can see session data. I can see users connecting and disconnecting, certificates being issued. That certificate being issued is when I just logged in to get in as this user. I can see commands being executed on a cluster. If I open these up, I can get all kinds of detailed information about it: what cluster, what user, what login were they using, what namespace, what cloud was it. And this can all be exported into Sumo Logic or Splunk or whatever larger SIEM you have to go through these and process alerts and stuff. For certificates, similarly, I can see when does it expire, when was it issued, what privileges does it allow, like logins and Kubernetes groups that are mapped, that sort of thing. And so you really have fine-grained reporting over what is happening in your system for not just Kubernetes but SSH, for databases, etc.
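
As a rough illustration of working with exported audit data, the Python sketch below filters a list of event records for commands executed in one namespace. The event type names, field names, and sample records are all hypothetical. The real export schema should be taken from the Teleport documentation.

```python
# Hypothetical audit records, shaped loosely after the details listed above.
events = [
    {"event": "cert.create", "user": "dev-user", "cluster": "dev-cluster"},
    {"event": "exec", "user": "dev-user", "cluster": "dev-cluster",
     "namespace": "workload-id-demo", "command": "ls"},
    {"event": "exec", "user": "dev-user", "cluster": "dev-cluster",
     "namespace": "tbot-attestation", "command": "cat app.py"},
]


def exec_events(records: list[dict], namespace: str) -> list[dict]:
    """Return only the command-execution events for one namespace."""
    return [
        r for r in records
        if r.get("event") == "exec" and r.get("namespace") == namespace
    ]


for record in exec_events(events, "workload-id-demo"):
    print(record["user"], record["command"])
```

The same kind of filter is what a SIEM rule would express when alerting on unexpected `exec` activity.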

Dave: The other really cool thing we have is we have session recording. So when I exec’d into that pod earlier, maybe that's something that you expect to happen, but maybe it's not. And maybe you're thinking, "Wait a second. Why is someone exec-ing into this pod to check things out?" And you want to see what they did. Right? There's a potential that there's a breach going on here. And so we can actually come in and replay the session that occurred in that pod exactly as it happened, right, and see just, "Okay. He was just coming in and debugging code. Great. We're going to let that happen." But maybe I view this, and it's not just that, but the person is LS-ing and checking things out and popping around the file system and trying to execute commands that are not commands they should be executing. And at that point, I might want to go in and take some action.

Dave: So we also have the idea of locks. And I see that the person is taking action I don't want them to be taking. I can come in and add them as a target for a lock. And I can set time on this. I can put in a message just saying, "Hey, we're investigating a potential breach. There's some suspicious activity here. I locked this person." And then I might Slack them and be like, "Hey, I just want to talk to you real quick about what was going on. Were you the person who was exec-ing into that pod?" If so, then great, then maybe we remove the lock right away. Otherwise, we can set it to expire at a certain time, and we can instantly protect our environment from any further intrusion if someone has gotten in. You see it's locked. And now back in my terminal, I try to run anything and access is immediately denied because, again, things are traveling through that proxy, and we can manage all of that communication. There's a bunch of other really cool features within the platform, but those were some I just want to highlight around Kubernetes specifically.
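
The lock behavior can be modeled in a few lines of Python: because every request passes through the proxy, the proxy can refuse any request from a locked user until the lock expires or is removed. This is a conceptual sketch only. The `Lock` class, its fields, and `proxy_allows` are hypothetical and do not reflect Teleport's internal types.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Lock:
    target_user: str
    message: str
    expires: Optional[datetime] = None  # None: stays until removed by an admin

    def in_force(self, now: datetime) -> bool:
        return self.expires is None or now < self.expires


def proxy_allows(user: str, locks: list[Lock], now: datetime) -> bool:
    """Deny the request if any lock targeting this user is still in force."""
    return not any(l.target_user == user and l.in_force(now) for l in locks)


now = datetime(2024, 9, 19, 12, 0, tzinfo=timezone.utc)
locks = [Lock("dev-user", "Investigating suspicious exec activity",
              expires=now + timedelta(hours=2))]
print(proxy_allows("dev-user", locks, now))                       # during the lock
print(proxy_allows("dev-user", locks, now + timedelta(hours=3)))  # after it expires
```

Checking locks on every request, rather than only at login, is what makes the denial immediate even for a user who already holds a valid certificate.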

Client testimonials and Q&A

Dave: So a couple quotes just blowing up how great we are. You can learn more. We've got some great case studies that you can check out on the website and just see more specifically how real companies are implementing this and getting advantages from it. And the thing I'm going to say right now so that I don't get reminded later is Lexi, who's my wonderful backstage producer, will be putting up a survey either now or towards the end of this, and I would really appreciate it if you do the survey. And now I'm going to just check the Q&A and see what's going on. So from Mario, "We recently changed the licensing for the open-source version. Are the capabilities and functionality discussed in this presentation part of the open-source Teleport software?" So I will say that some of them are. There are portions, like Access Requests, that are in the Enterprise-only version. But the core functionality of role-based access control to Kubernetes is absolutely included in the Community Edition, like the audit log is there. You can check the website for more details on which pieces are Enterprise versus Community. But protecting your infrastructure is in Community. A lot of the things that are made more for checking compliance and for larger teams that don't just inherently trust each other internally are in the Enterprise Edition.

Dave: "Can you walk us through deploying to Kubernetes from a CI/CD tool, e.g., GitHub Actions?" Yes. I'm not going to be able to show it necessarily here with a demo, but I'd love to show you maybe in another context. Reach out in the Community Slack to me. So Teleport has this feature called Machine ID, and basically, there are a couple of different binaries that are part of Teleport. One of them is tbot. And tbot is what runs Machine ID and Workload ID. And the idea is that tbot can authenticate into the Teleport cluster using metadata about where it's running. So that could be an AWS EC2 instance using metadata about that. From GitHub Actions, you can create a join token for Teleport that is authenticated to a specific GitHub or a specific GitHub repo. So only if an action reaches out that is associated with this repo, may it do these things. And the permissions for those bots are fully controlled through the same RBAC system as people. So you could say, "This bot is only allowed to take action for this system or this cluster, this SSH server, etc." That bot will then drop certificates into place. So from a CI system, if you are running Helm, if you are running Ansible, anything like that, then you basically configure that tool to use those certificates, and you're able to execute commands and take appropriate action on infrastructure from that machine, whether it's GitHub Actions or a server or whatever it is. I hope that answered it. Please let me know if it didn't.
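
The repo-scoped join described in this answer can be sketched as a claim check in Python. GitHub Actions issues each workflow run an identity token whose claims include `repository` and `repository_owner`. A join rule then accepts the bot only if those claims match. The rule format, the `claims_match` helper, and the organization and repository names below are hypothetical and simplified.

```python
def claims_match(allow_rules: list[dict], claims: dict) -> bool:
    """Accept if any non-empty rule has every field equal to the token's claim."""
    return any(
        bool(rule) and all(claims.get(key) == value for key, value in rule.items())
        for rule in allow_rules
    )


# Hypothetical rule: only workflows from this one repository may join.
allow_rules = [{"repository": "example-org/deploy-tools"}]

trusted = {"repository": "example-org/deploy-tools", "repository_owner": "example-org"}
untrusted = {"repository": "someone-else/fork", "repository_owner": "someone-else"}

print(claims_match(allow_rules, trusted))
print(claims_match(allow_rules, untrusted))
```

No long-lived secret is stored in the CI system in this scheme. The workflow proves where it is running, and what the resulting bot may do is then decided by its role.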

Dave: "A lot of external tools need access to the cluster, like Snyk, for scanning purposes. Does your RBAC support that?" Yeah. So I think that kind of comes to — it depends on how you run it. So I think if the scan is being run on your hardware, then you can use the mechanism that I just mentioned. You can run Snyk from a CI system, run the binary, have that system authenticated via tbot. If it's running automatically on Snyk's servers, then that's something I'd love to talk to you about in more detail. We're always looking for potential integrations to create and wanting to know users' use cases. I'd say, again, reach out in the Slack and we can talk more about that.

Dave: "Some users have reported issues with session expiration when running long commands and scripts in Teleport. This can disrupt workflows and lead to unexpected interruptions. Could you help us resolve this issue?" So I'd say if you are a customer, I'll talk briefly about what could mitigate this, but I think for resolving the issue, reach out either to support or if you're a community user, just reach out in the Slack, and we'll try to get your question answered. I think generally here, session expiration is set within Teleport by admin saying, "The max time that something is allowed to have a session for is X hours." And you can set defaults. When you log in or authenticate in some way, if it's a bot, you can request a shorter time than that max, but the max is set. So I think just generally, if you're having issues with expiration, you just need to visit what the timing is set for, like this user, this bot, whatever — this role can only have the max duration of this. And again, there's that trade-off between security and productivity, where the longer you set it for, theoretically, the larger your attack surface is. But if things run for four hours, that probably just needs a four-hour or a five-hour default maybe or a max that you can set.
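
The relationship between a requested duration and the admin-set maximum can be shown with a short Python sketch. The names `granted_ttl` and `job_fits` are hypothetical, and `max_session_ttl` is used here simply as a label for the role's maximum. Check the exact option name and semantics against the Teleport documentation.

```python
from datetime import timedelta


def granted_ttl(requested: timedelta, max_session_ttl: timedelta) -> timedelta:
    """A client may ask for less than the maximum, never more."""
    return min(requested, max_session_ttl)


def job_fits(job_duration: timedelta, ttl: timedelta) -> bool:
    """True if a job of this length finishes before the credential expires."""
    return job_duration <= ttl


role_max = timedelta(hours=2)  # hypothetical admin-configured maximum
ttl = granted_ttl(timedelta(hours=8), role_max)
print(ttl, job_fits(timedelta(hours=4), ttl))
```

With a two-hour maximum, a four-hour script will be interrupted no matter what the client requests, so the fix in that case is on the role (for example a five-hour maximum), with the attack-surface trade-off mentioned above.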

Conclusion

Dave: I'll give it a couple minutes just to see if any other questions roll in. But if not, then — and if anyone's got a follow-up from the answer that I gave to their question, I'm happy to see to that too. But otherwise, thank you for attending. Thanks for watching. It's always appreciated when people come and hear what we've got to say, so. And I hope you learned something. And again, I'm in the Community Slack, Dave Sudia. You're welcome to reach out if you've got any questions beyond this, and I'd love to hear from you.
