SRE-Powered Dev Productivity - Overview
Key topics on SRE-Powered Dev Productivity
- The DevOps movement is about how we can all work together to build a better pipeline for building, running, and shipping software and give our customers a better experience.
- In the financial world, security is probably number one, followed by latency, and then resiliency/reliability.
- It’s important to determine what aspects one gets by default from their cloud platform and what the blind spots are.
- Security in the context of Kubernetes has become an important issue.
- Teleport is a Certificate Authority and an Access Platform for your infrastructure.
- By working closely with developers, SRE teams can make sure that developers are getting what they expect.
- By giving service ownership, service empowerment, and confidence to developers, SRE teams can enable them to manage their own applications.
Expanding your knowledge on SRE-Powered Dev Productivity
- Kubernetes API Access Security Hardening
- Teleport Kubernetes Access
- Teleport Application Access
- Teleport Getting Started
- Teleport Access Platform
Transcript
Ben: 00:00:00.415 Welcome to Access Control, a podcast providing practical security advice for startups. Advice from people who’ve been there. Each episode, we’ll interview a leader in their field and learn best practices and practical tips for securing your org. For today’s episode, I’ll be talking to Mario Loria. Mario is a senior SRE at Carta and has been leading their move to Kubernetes and other cloud native technologies. Carta helps companies and investors manage their cap tables, valuations, investments, and equity plans. As a user of Carta, I’m happy that their security is top-notch. Today, we’ll be chatting about orchestrating Kubernetes, training teams on cloud native technologies, and optimizing the developer experience. Hi Mario. Thanks for joining us today.
Mario: 00:00:39.715 Hey, man. Thanks so much. I’m really happy to be here. I love everything that Teleport is working on and doing. I’m so glad you’re a customer of Carta. We’re working to make sure that your shares are very well secured as you grow as a company. I’m so happy to be talking about cloud native and security and the developer experience. So many great topics we’re going to touch on today. So let’s get started.
The DevOps movement’s appeal
Ben: 00:00:58.699 Yeah. So to sort of kick things off, can you tell me a little bit about what drew you to the DevOps movement?
Mario: 00:01:03.483 Yeah, for sure. I think it was definitely very organic. I think starting in my career as more of a system administrator back in college and having such a strong interest in systems and how we run... and at that time, it was very much focused on the lower-level infrastructure components. And I think you saw... I remember learning virtual machines very early and then the kind of changeover to AWS and the cloud taking over how workloads are run. I think in that regard, I very much was interested in containers at a very early point. I actually remember when Docker’s first webpage with the video from Solomon Hykes came up and just being in college and seeing that and being so fascinated by this whole thing. And instead of doubling down on the things that I was taught, that were told to me, that were in a book, I decided to start exploring more and more.
Mario: 00:01:54.458 And I’m so glad I did because I learned containerization. I learned maybe the point of developers and how the software development, maybe life cycle works. And getting to that kind of nirvana of shipping things quickly and all, kind of, the DevOps methodologies that we have come to know and love now. Those were very much something I had to learn, and I enjoyed that a lot more than just managing systems day-to-day. Nothing wrong with doing systems work. And I learned a lot there, but I think the DevOps movement — and I’ve actually helped out with the DevOps Days in my locale here, Detroit, Michigan, in the US. And I think the culture — it’s a different sort of mentality. Instead of, “I just need to run my systems to do my business.” It’s more of a “How can we actually all work together to build a better pipeline for building, running, and shipping software and give our customers a better experience?” I think that’s what turned me on to it. And I’m so glad to have followed more of a route into engineering as I’m doing now, versus just maybe IT or whatever, your traditional running systems and managing many computers. I think my love of the cloud and all things cloud native and really running resilient infrastructure is what — with Kubernetes and orchestrating, I wouldn’t have it any other way.
How Carta is run and where Kubernetes fits in
Ben: 00:03:16.212 Actually, I was looking on your LinkedIn profile, and it says your focus is around distributed service-oriented architecture, both on bare metal and cloud platforms. Does that mean that you’re currently running this kind of Kubernetes-the-hard-way setup, or do you use cloud providers?
Mario: 00:03:31.781 Throughout my career, I’ve done bare metal. Of course, I remember racking servers in my basement when I was doing startup work and running Tectonic, which at that time was from CoreOS and helped you provision Kubernetes clusters. This is very early days. I forget, 1.5, 1.8, somewhere around there, of Kubernetes. And I think having that experience gives you foundational knowledge and you get this understanding of what doing it the hard way looks like, and you build on top of that with further skills. And then you understand when you go to the cloud world, if you look at EKS, you don’t actually have access to those control plane nodes. But you know what they’re doing. You know what they can do. You know if you need to work with AWS. You know kind of what’s going on there. So I think it’s incredibly important to have the foundational pieces and build on top of that.
Mario: 00:04:17.716 I think that’s what doing bare metal and working with containers earlier on very much gave me, that sense of, “This is what a life cycle looks like. This is what some of the nuanced pieces of configuration look like. This is how Docker runs from a runtime perspective and configuration options.” Things like that. There’s a lot that you don’t consider. And I think that there’s so much when you come into this world... you go look at the Kubernetes docs and there’s just so much. Like what’s a PSP or a PDB or an HPA? You’ve got all these pieces. And I think if you just look at all of that and try to learn it there, that’s going to be one of the hardest things you can do. So I think it’s really important to start with those fundamental pieces. In terms of Carta, I won’t get too much throughout this talk into what we exactly do in terms of systems we use generally, but we actually are relatively cloud. What we’re discussing here is, how do we keep security top of mind as one of our number one pieces? How do we keep that stature in a cloud world? So, yeah. We’re pretty much, I think, for the most part, AWS. There’s a few nuanced pieces of infrastructure that do some different things. And, of course, the traditional monolith and some microservices and some not so micro and things like that, so.
Why Carta uses Kubernetes
Ben: 00:05:30.461 I know you said you can’t get into too much detail, but I sort of wouldn’t expect Carta to have too much throughput of processing. There’s kind of funding announcements and some papers go in. Why use Kubernetes? Why not use another technology?
Mario: 00:05:43.911 Yeah, that’s a great question. I remember when I joined Carta, I actually didn’t have very much background in running a financial platform before and worrying about that world of financial services. I very much come from a scale world where we have so many millions of users every second, and we might have certain push notifications that — this is e-commerce. We send out a push notification to millions of our customers. There’s a new drop or there’s something else going on. And in the scope of five minutes, you’ve got millions of people making a request. And so —
Ben: 00:06:16.995 Yeah. Yeah. So it's kind of DDoS'd for —
Mario: 00:06:18.774 Exactly. It’s your own homegrown DDoS. It’s your great QA. I think that’s the world I come from. So there’s a very different set of concerns. And in that world, it’s scale. We were doing a lot with auto-scaling, both of the cluster and of the individual workflows. There’s a lot of throwing compute at the problem. In the short term, at least. There’s a lot of optimizations. You’re always making optimizations and looking for the next set of things you can do. When you pivot over to the financial world, I like to think of it as: security is probably number one, then latency, and then probably resiliency/reliability, of course. I think that’s kind of fundamental to what an SRE is always thinking about and always trying to optimize for. But I think the other side of this is there’s actually periods where, it’s not that we don’t care about production necessarily, but we are willing to make certain trade-offs if it impacts production for some very time-limited piece. We’re okay doing that as long as it’s managed well.
Mario: 00:07:17.598 And I think if you look at a production cluster in an eCommerce company that’s scaling massively and getting so many requests, you don’t want anything to happen. You’re not scheduling maintenance anytime remotely during the actual day. You don’t want to be doing that because everything is about those customers and them coming in and those services being available. And you change anything, and there’s the possibility that you’re going to have an issue. Whereas our platforms, they’re always on, but there isn’t... I think this is more on the CartaX side of where I’m at, where we’re building kind of a private equity world, and we don’t have kind of a constant user need at every given point.
Mario: 00:07:59.068 So I think when you log in to look at your shares, I’m not talking about the general Carta platform. I’m talking about more of some of the projects and other initiatives that we’re working on, as we build new financial services and new product offerings. We have a little bit more flexibility there with that. We can focus on security, latency, resiliency. And really, I don’t want to say anti-fragility necessarily, but resiliency is not just staying up; it’s also being able to understand what’s going on with the system. And we can get into this as well, like observability and what are the best tenets there and the best dashboards and the best visibility aspects for us as a team?
Top security concerns
Ben: 00:08:37.817 Since this podcast is focusing on security, what’s sort of your top concerns regarding security?
Mario: 00:08:44.275 Yeah, of course. There’s always your standard perimeter security. And of course, we’re doing everything we can there. It’s something that most of us have a decent understanding of. I actually come from the DDoS world. I worked at Arbor Networks. It was a DDoS mitigation company. And I think that’s helped me understand the network layer really, really well, and what it looks like from an ingress, from a WAN perspective of the wider internet traffic. So we’re doing your general hardening in that world. I think one of the big things that we’re starting to think about is security in the context of the systems that we use. And primarily, of course, that’s Kubernetes. And so we’re starting to think of, “What are the pieces that we need to worry about as a team where we actually need to expend our effort to build or install or manage security solutions for our Kubernetes environments? What are the things that we get by default from our cloud platform? And what are the other pieces? What are our blind spots?”
Adopting Teleport Access Platform
Mario: 00:09:40.513 Being cognizant of the areas that you’re not very well off in understanding and bringing in the talent to handle those for you, to help you one-up in those certain areas, is huge. And so we’re trying to be cognizant of that in all of those different planes and make sure that we’re very well-rounded. And I think this is kind of where Teleport and what Teleport is offering comes in. I have been solving for access almost my entire career. And I actually worked doing a start-up for a short time building kind of a cloud-based network VPN solution. And this is when BeyondCorp and some of the other solutions from Google and papers, etc., were coming out and talking about zero trust and inching into what it looked like to do least privilege and other concepts that we were building on top of. And I think Teleport is doing one of the best jobs I’ve seen in being able to offer the flexibility and the feature set all under this zero-trust, completely unified, SSO-enabled pattern, if you will.
Mario: 00:10:46.414 And when we talk about scaling — I mean, we’re hiring left and right. Please go to our jobs page. We’re looking for every type of role — we look at that. We need to be able to onboard people relatively quickly and have some assurance that they’re going to be able to access only the systems we intend them to access and they’re going to be able to do it securely in terms of transport wherever they’re at. Whether they’re working at home or they’re working abroad. That is something that we need to get inherently right. And we’d have our own team on all of that if we wanted to. Not just InfoSec, but an access team. That could be a whole thing. And I think with Teleport, we don’t need that. And of course, there are other solutions that we’ve looked at and played around with other solutions for a myriad — application access or database access, Kubernetes access. But I think Teleport makes it relatively so easy. And also, being in finance, some of the FINRA or SEC and other requirements that we have. Auditability, right? The repeatability of playing back sessions and understanding exactly what was happening at any given moment is incredibly critical.
Ben: 00:11:52.185 At Teleport itself, we help customers access their infrastructure. In many ways, this is sort of an anti-pattern in sort of modern DevOps. People getting routed to machines or even using kubectl locally is sort of an anti-pattern. What happens when your engineers get access to these machines, and what do you let them do on these hosts?
Integrating security into developer experience
Mario: 00:12:11.829 Yeah. Great question. You know what? I talk about least privilege. But in reality, least privilege is actually incredibly difficult to not just do at the point of setup, but actually maintain. A lot of the time we’re in crunch periods. We’re in periods where we’ve got so much going on, and we don’t want to block developers. You never want to block a developer. Sometimes developers come to us and say, “I just need access to this. Get the access.” And instead of going into AWS, and this is just an example, and just saying, “Okay. You have read-write full access to S3.” There aren’t policies really in place that are perfect around read-only for this bucket and can’t list or anything like that. You have to build that. And I think getting really good at templating out some of the things that we infer that we’ll need, making sure that we are logging what we’re doing, and then, if there is room to make that better in the future, making sure we log that, so that when we do get the cycles we can come back and make things more honed down or give people better levels of access, or instead of dealing with AWS, we abstract that away and give them a layer to work on.
Ben: 00:13:20.738 Do you find a pattern and then you build tooling to sort of facilitate that pattern in a [crosstalk]?
Mario: 00:13:25.048 Exactly. 100%. I think this is what we’re talking about. From an SRE standpoint, instead of SRE being like a support team or admin team that just does tickets and gives people access to things in a general manner, we actually want to write software and build systems that can automate this process. Because we actually don’t want to be in the — going back to blocking developers. We don’t want to be in that critical path. We want a lot of that, which can be automated with — a lot of tools like Okta and, of course, Teleport enables us to automate that chain. And with that, we can — we need to be more software-centric. We need to write code to solve these problems. We need just people to stop solving these problems. And in my head, I go to — because I’ve been doing this for as long as I have and I remember what it was like back in 2007 and in 2010 and 2013, in my head, I’m like, “Well, it’s 2021. How are we still doing this the manual, old-fashioned way?” We should be making a lot more progress. And that’s very difficult in every sort of organization based on your priorities and needs. That’s the way we look at it. And maybe it’s a unicorn and rainbow sort of scenario that we’re trying to get to, but that’s what we strive for, so.
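The templated, least-privilege access described in this exchange could look roughly like the sketch below. It is only an illustration, not Carta's tooling: the function name `read_only_bucket_policy`, the `Sid` labels, and the example bucket name are hypothetical. The IAM actions `s3:GetObject` and `s3:ListBucket` and the ARN formats are standard AWS ones. Listing is left off by default, mirroring the "read-only for this bucket and can't list" case.

```python
import json


def read_only_bucket_policy(bucket: str, allow_list: bool = False) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket.

    Hypothetical helper for illustration; names are not from the transcript.
    """
    statements = [
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Object-level actions apply to the keys inside the bucket.
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        }
    ]
    if allow_list:
        statements.append(
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # Listing is a bucket-level action, so the ARN has no "/*".
                "Resource": [f"arn:aws:s3:::{bucket}"],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Print the generated policy so it can be reviewed before it is attached.
    print(json.dumps(read_only_bucket_policy("example-reports-bucket"), indent=2))
```

Generating the document from a small function, rather than hand-editing JSON per request, is one way to make the narrow policy as quick to hand out as the broad one.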
Integrating observability with security
Ben: 00:14:32.661 And you sort of touched a bit on another focus is observability. How do you sort of integrate observability with security?
Mario: 00:14:40.477 Yeah. That’s a fantastic question. We’re actually, of course, still looking for the best way to solve that. I think that’s never really 100%. One of the things that we think about is that, as an SRE team, let’s say, we need to be able to have that visibility. So let’s optimize for getting us that visibility. Whether it’s one or more dashboards, whether it’s looking at streams. And of course, we’re enabling things like audit logs for Kubernetes. That’s, of course, something we’re doing. The next question is: where do those go? Are they secure? How long are we keeping them for? Who has access to see them? There’s all of these questions around that. At Carta, most of those questions have been answered, for the most part. At CartaX, we’re also working to figure out what the best strategy is. But beyond the SRE team, we actually try to give this idea of service ownership, service empowerment, and confidence to developers so that they can manage their own applications.
Mario: 00:15:33.216 In that realm, we actually want to figure out ways to bubble up this information as it pertains to them and their services to them directly. We don’t want to give them any more than they need to, but we want to make sure that they have everything that they need so that they can make those decisions. And a good example is I’ve been in many organizations where cost is a concern at some point. Some quarter that year cost is a concern. The AWS bill becomes something that you start honing in on. And what I’ve seen is leaders that will go and they’ll do a review, and then they’ll start pinging people individually on Slack. That doesn’t scale, and no one likes that. That’s bad. If you think about a developer and a company, they don’t want to just spend all the company’s money, doing whatever they’re trying to do, playing around with their serverless.com framework, and spawning new infrastructure. They actually want to be cognizant of these. The problem is that they don’t have the information, the data, in front of them. So if in a dashboard for a particular service a developer owns, if there’s a graph showing them their cost usage and the change in the past seven days and they see something crazy —
Ben: 00:16:33.067 They’d be like, “Oh, I left this instance running.”
Mario: 00:16:34.701 Exactly. Exactly. Yeah. They will be like, “Oh, wait a second. That’s not quite right.” And they will fix it. It’s saying, we don’t actually need to be stripping things out and giving them very, very tiny views, like, “Oh, you can only look at these couple of metrics.” We actually want to give them everything we can in the context of what they are trying to do and the services that they own. And I think that actually is going to help them, it’s going to help us, and it’s going to help the overall organization over time.
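As a rough sketch of the per-service cost graph idea above, the following compares the most recent seven days of spend with the seven days before them and flags a large jump. Everything here is an assumption for illustration: the `CostChange` type, the `weekly_cost_change` function, the 25% threshold, and the idea that daily cost figures are already available as a list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CostChange:
    service: str
    previous_week: float
    current_week: float
    percent_change: float
    flagged: bool


def weekly_cost_change(
    service: str,
    daily_costs: Sequence[float],
    threshold_percent: float = 25.0,
) -> CostChange:
    """Compare the last 7 days of cost with the 7 days before them."""
    if len(daily_costs) < 14:
        raise ValueError("need at least 14 days of cost data")
    previous = sum(daily_costs[-14:-7])
    current = sum(daily_costs[-7:])
    if previous == 0:
        # Avoid dividing by zero: any spend after a zero-cost week is a jump.
        percent = 0.0 if current == 0 else float("inf")
    else:
        percent = (current - previous) / previous * 100.0
    return CostChange(service, previous, current, percent, percent > threshold_percent)


if __name__ == "__main__":
    # Hypothetical data: spend rises from 10 to 15 per day in the second week.
    print(weekly_cost_change("example-service", [10.0] * 7 + [15.0] * 7))
```

Surfacing a number like this next to a service's other metrics gives the owning developer the "wait a second, that's not quite right" signal without anyone pinging them on Slack.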
Ben: 00:17:03.822 I mean, this is another part of the developer experience which crosses over to sort of DevOps — is the packaging containerization. Where do pod policies or Dockers — do you run code scanners on containers and sort of whose responsibility is that? Is that the developers, or is that the SRE teams?
Mario: 00:17:22.704 Yeah. That’s a fantastic question. We’re actually right in the midst of trying to figure that out. I am on the side that we should be, again, building the tools and leveraging things like even Ambassador’s Developer Control Plane that they’ve just started releasing. There’s Spotify’s Backstage offering, which is really, really cool. Upbound has Crossplane, which is kind of a way to provision things a lot easier in kind of a Kubernetes native way with controllers. These are the sorts of tools that myself and the team are looking at and saying, “Well, this removes us entirely from being in that critical path.” At the end of the day, each repo has a Helm config in it. And I think this is many organizations. There’s configuration around how to deploy the service. And there’s going to be further configuration on canary or feature flags or other pieces that are nuanced that the SRE team has brought into the mold and set up a system for. And then maybe you set up a template or some docs on how you do that. Do we tell developers, “Hey, it’s your repo. It’s your service. You are responsible for that code. That is code just like anything else in your repo. You’re responsible for it.” In which case, they have to learn Helm. They have to learn why they should enable a pod security policy and everything in between. Is that fair?
Mario: 00:18:39.398 What we actually do in reality is we work very closely with the developers to ascertain what they are trying to achieve. And because so many of the services are very similar, the SRE team has, for the most part, taken it upon themselves to copy in and change that configuration as needed and get it to a point where it’s executable and things are working as expected for that service. Many of our services look pretty similar. Not exactly the same. Every service has different needs, though. And I think the SRE team works with developers, even workshops things and programs with them, not just to one-up them, but to make sure that they are getting what they expect. So a lot of developers will be heads-down programming and coding something, and they don’t really know what the life cycle of their application looks like in a Kubernetes world. And that might be for a variety of reasons. Not because they don’t want to, but because they’re not actively watching pods as they deploy; they’re not looking at the health checks. They don’t really know the difference between a readiness and a liveness probe. Nor should they really need to. But that can cause surprises later on.
Mario: 00:19:46.667 So I think, right now, what we try to do is educate. I just did a session on educating in Kubernetes. Just a 101. Very introductory. I had lots of questions to help people one-up their knowledge because everyone’s at a different point. So education is one. I think two is very much working hand in hand with them. And that’s actually better for us as a team because we learn their needs and what the services that make up our platform are doing. And so we learn those fundamentals, the systems that they interact with the most, their expectations for what their services should be able to do, and how the overall platform should work. And actually, as an SRE team, this is not just you build systems, build Kubernetes clusters, and then you’re hands off. You need to actually know: “What are we trying to solve here? What is the customer experience like here? What is the platform we’re trying to build?” And I think that gives us a great insight into that.
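For readers unfamiliar with the readiness and liveness probes mentioned above, here is a minimal sketch of the distinction. In Kubernetes, a failing liveness probe causes the kubelet to restart the container, while a failing readiness probe only stops Service traffic from being routed to the pod. The `AppState` type, the `probe_status` function, and the `/healthz` and `/readyz` paths are assumptions for illustration. The paths follow a common convention but are whatever the pod spec configures.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    # False when the process is wedged and only a restart would help.
    process_healthy: bool = True
    # True once dependencies (database connections, caches, ...) are usable.
    dependencies_ready: bool = False


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe endpoint would answer with."""
    if path == "/healthz":
        # Liveness: "should this container be restarted?"
        return 200 if state.process_healthy else 500
    if path == "/readyz":
        # Readiness: "should traffic be sent to this pod right now?"
        ready = state.process_healthy and state.dependencies_ready
        return 200 if ready else 503
    return 404


if __name__ == "__main__":
    state = AppState()
    # Just after startup: alive, but not yet ready to serve traffic.
    print(probe_status("/healthz", state), probe_status("/readyz", state))
    state.dependencies_ready = True
    print(probe_status("/healthz", state), probe_status("/readyz", state))
```

The surprise this guards against is a service that is alive but still warming up: if it reported ready too early it would receive traffic it cannot serve, and if it failed liveness instead it would be restarted in a loop.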
Managing the data layer and databases
Ben: 00:20:38.121 Talking about running applications, what about sort of the data layer and databases? How do you manage those?
Mario: 00:20:43.276 Yeah. I hate this topic, so thank you for asking about it. [laughter] Databases are so much fun. I think like most companies, we’re generally using the cloud offerings in terms of databases. I think the big thing when I think about databases is, how do we handle database demo data or example QA data in those environments? So we obviously have to be very careful with production data. We actually don’t really want to use production data if we don’t have to. The next thing is — how do we have dumps of data that we can enable people to use? We can make it useful for what they’re trying to achieve. We have developers that will make some changes in their code, and then they actually need to throw in a few different rows or add some columns or add some tables to test some of their changes. So how do they do that? Not just on their laptop, but then in an environment where there’s many other services, and the platform is deployed there and they can test those changes? What is the instrumentation that we use to make that happen?
Mario: 00:21:46.385 The big thing is I know we have an internal working group. The SRE team is very much involved in that. Because while we aren’t the ones using the databases, we want to be very much in the kind of critical path of how our data is being used and what we can do to move data around and handle data just like a container or just like any other resource. And so when you talk about demo data, I think about it as containers. It’s just a ball of data. We can move that data around. It’s portable. We can load it on things. Maybe we use it in a container to help load that for you automatically on your laptop. You can start at a base layer of data to use with your application or a set of applications that you’re testing. We think about it in context. In financial services, and this is something I’ve come to realize in a very short period of time, in many organizations, actually every organization I’ve been a part of, there’s a concept of dev, staging, production. And there might be some nuance there, two-way, or a data cluster, whatever it might be, but that is the general pattern. And I think when you look at that, most environments, the dev and the staging, are very, very different. And there’s nothing quite like prod. And that’s exactly the case here as well because you’re never going to get anything exactly like prod. It’s a fallacy.
How to integrate security into DX without becoming a blocker
Mario: 00:23:03.256 So I think the next best thing is to zoom into the developer experience and how we actually make those environments as useful as possible as early on in the process. And so we’re already looking into and building our strategy for local development, remote development, hybrid development, virtual clusters, and ephemeral environments. Part of kind of the pipeline. The CI/CD pipeline. When someone pushes a PR, can they create an environment that many people have access to? And of course, security. There’s those components. But it’s more about giving the autonomy and the trust to our developers while maintaining those strongly held guard rails. But not in a negative, condescending, “You can’t do this because we’re scared, you’re going to screw it up.” But in a, “This is a regulatory need.” Again, we’re all working toward the same goal, and remembering that and working with people to solve for all of those goals is really important. We have many working groups on that premise that we just want to have these wider discussions. Learn what people’s pain points are and help them understand what we’re facing as well. And I think when we have those open discussions and provide that transparency team to team, leader to team, leader to leader, whatever it might be, it just has helped us so much.
Ben: 00:24:20.998 I was kind of interested in your remote vs. local development. And I know GitHub came out last week saying that they’re moving the majority of GitHub developers to the cloud and much larger machines. What’s your sort of thought process internally about local vs. remote development?
Mario: 00:24:35.738 Yeah. Actually, I remember my leader was sharing that in our Slack. There wasn’t much discussion on it. I think that’s not something that surprises me. I think part of that’s a little bit of the marketing side as well. I actually do think eating your own dog food as a company is incredibly important. So I love GitHub for doing that. And I think Codespaces is actually a really powerful platform. I think being able to abstract away the repetitive pieces of setting up the development environments across different sorts of clients (I’ve got a MacBook; someone might have a Windows machine or Linux) actually helps from that perspective. It keeps things unified. It helps one developer work with another developer, even if they’ve never worked together before. It helps in all those cases. But what I will say is, I don’t think that we, and when I say we, maybe the SRE team or even the engineering organization, should be the ones that say, “Hey developer, you need to use this. Hey engineer, this is the tool that you need to use.” In cases where it’s more the mechanical things, the just writing of the code. Like I would never tell a developer they have to use PyCharm.
Ben: 00:25:41.976 Like which [inaudible]?
Mario: 00:25:42.906 Exactly. Yeah. Exactly. It doesn’t matter. In the scheme of everything at the company, the thing a developer’s using on their laptop, down to whatever linting tool they’re using, doesn’t really matter too much so long as they get the result that we’re all looking for that runs in the CI/CD pipeline. I’ve talked with a lot of our developers. We have a developer experience working group, actually, that I started. We have discussions about these things because I want to know. In every org I’ve been to, there’s actually kind of an aura of how developers work. And you just have to talk to them. You just have to work with them. Do some co-chair sessions in Zoom and learn what their workflow is. And we actually have some great developers who love running their own CentOS VM and they do a bunch of things in there. And you know what? That’s fine. I actually would rather them do that. And we actually did this recently. We’re honing in on what we want our development environments to look like.
Mario: 00:26:38.802 I actually told them that, “What we’re thinking is, meet us in the middle. We’re going to provide you the way to deploy your services locally. We’re going to solve for that. What I need you to solve for is running Kubernetes locally. Get Kubernetes running on your laptop. I don’t care how you do that. You can run MicroK8s, minikube, k3d. There’s a ton of different ways. Your CentOS VM. You can do it manually the hard way. I don’t care. But give me a Kubernetes cluster, and we will help get your services or instrument them so that they can be deployed in the local context.” You just have to have the conversations. You have to have the understanding. And I think when you come in heavy-handed, that’s a sure-fire way to get people really not understanding why you’re forcing them into a position of doing something that isn’t natural to them. I think everyone has a way they work. I know I do. If someone came and told me that they wanted me to use a different terminal emulator, I’d be a little bit annoyed and perturbed. So think about the things that matter in your organization and think about the areas that you really should expend your effort discussing and putting in policy and regulation, I guess, if you will.
Ben: 00:27:45.566 Do you have any other topics that have come to the developer experience working group?
Mario: 00:27:49.355 Working groups to me are a forum for open, lively, transparent discussion. Real, I call it, heart-to-heart discussion. This is where we need to be real. This is not a “Yeah, I think I can do it.” This is a, “If you want to treat it as a session to just kind of get everything out there in the open.” Because something you’re thinking of — I, as an engineer, I might be having this thing where I don’t understand why our VPN is like this. It’s given me so many issues with connectivity and it’s just, it’s terrible. You’re not the only one having that thing. And what I find is I actually ask dumb questions. I ask some of the dumbest questions ever. And the reason I do that is because, again, I like to understand from that kind of foundational level what’s going on. And I like to hear other people explain it too. And in doing that, I actually learn a lot from that. And the other reason is because a lot of the times other people have similar questions.
Mario: 00:28:43.773 There’s so much that we do as engineers, there’s so many different tools and systems now and processes, and there’s so much time we spend in Slack, it’s hard to keep your mind kind of 100% in tune to every initiative, every project, every reasoning. And so I think talking about these things openly, and I think having — we don’t have too many meetings. It’s not like something that’s every week that we do, but what I’ll do is I’ll actually raise a discussion in our Slack channel. I’ll say, “I’ve been thinking about this a lot lately. I’ve seen some areas. Here’s some links to what I’ve been thinking about.” And we’ll just have some light-hearted maybe discussion in a Slack thread. And I’ll actually say, “You know what, we have something we can be focused on. So beyond fixing the VPN, we’ve actually zoomed in on a few things that might actually work for what we’re trying to do. And let’s have 30 minutes and let’s just throw some things around. Because you’ve got context, you’ve got perspective. And I’ve got my feeling on it. And maybe it’s a little more emotional at first, but let’s just have a general discussion.” And almost every time we come out of those meetings and I have three pages of notes, but beyond that, we have an action plan that we can kind of go and talk with leadership about.
Mario: 00:29:51.771 And more so me being an SRE engineer, I can actually provide tangible research or put cards into JIRA where I can actually figure out, “Let’s look at this spike and see what happened.” And that’s kind of how Teleport came to be, I think. We had to solve for access. And we’re not 100% there. It’s not perfect. We’re not using Teleport for every single thing yet. But we figured out what we wanted. And zero trust was really important to us. Something that was flexible. We figured out what those things are that matter to us. One of the next steps after that is writing a proposal or a project initiative document. Getting people to rally kind of around that for a few weeks and figuring out what the next step is. And that helps us get our thoughts out there, but there’s something to be said about writing something versus just vocally talking about it. There’s a dense, more spirited level of understanding that you actually have to have and optimism you have to have for that thing that you’re proposing when you write about it.
Ben: 00:30:46.523 I mean, as I imagine for Teleport, the technology isn’t really the problem. There’s multiple ways to solve it. But the real problem is you have all these people who you’re trying to onboard. Where’s the central source of identity? And then when someone leaves, how do you know that Bob's —?
Mario: 00:31:00.517 Like you said, the technology is not really that difficult. Honestly, a lot of it is just taking steps out of the process so that we do less. We have to worry about it less. I’ve been in organizations where they manage the access entirely themselves in a very manual way and it’s actually crushing to engineers. People are like, “Oh, I don’t want to deal with this anymore.” Just so much of their day gets spent doing this manual labor. I think something Teleport does — I have heard this quote before in the kind of startup realm. You don’t actually sell something and do very well with something that adds another step in someone’s process. Actually, where you add the most value is removing extra steps and making things more automated, in a lot of realms. Not every realm, but in a lot of the sort of — this is going to the experience, the UX, right? And I think that is equally true in technology. That is equally true in the day-to-day operations of a company and an engineer, the things they have to deal with. And especially, and this is why I love developer experience, I think the developers are by default doing way too much that is not coding and we hire them to code. Let them code and get the other stuff out of the way, so.
Secret management at Carta
Ben: 00:32:08.301 One topic that often comes up is dealing with secrets. I’m sort of interested what’s your thoughts on sort of general secret management at Carta?
Mario: 00:32:17.573 I think secrets is a difficult one because there’s so much to think about if you really want to deep dive that. Vault’s actually incredibly, not tricky, but it’s dense for sure. It’s not something that you just whip up in a day and everything’s managed for you. You’re 100% secure. There’s so much to think about. I think Vault does a great job, but I think there’s other solutions. I know there’s External Secrets. There is SOPS from Mozilla, which is you’re basically using your client, and you’re encrypting before it gets to a repo. That’s just how you manage things. There’s other considerations too, of where those secrets are going. Kubernetes still stores them in etcd, and the default is that those are just plaintext in etcd. So I think you have to think about — when I think about secrets, I think about obviously user experience because the users are the ones creating the secrets or managing those secrets. The kind of long-term storage where they’re going to be and some of the trail. We’ve had stuff where it’s like just throw it in 1Password or whatever, or we’ll use Keybase, etc. Well, do we unify on one thing? Do we give people more flexibility? So where’s the secret going, and how are people sharing it?
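Editor's note: the point about etcd is worth making concrete. Values under the `data` field of a Kubernetes Secret are base64-encoded, which is an encoding and not encryption, and unless encryption at rest is configured they are stored unencrypted in etcd. The following is a minimal Python sketch of why that matters; the key name `db-password` and its value are hypothetical and chosen only for illustration.

```python
import base64

# Hypothetical contents of a Secret's `data` map, as it would appear in a
# manifest. The key name and the value are made up for this illustration.
secret_data = {
    "db-password": base64.b64encode(b"s3cr3t-value").decode("ascii"),
}


def decode_secret_data(data: dict) -> dict:
    """Decode every base64 value in a Secret's `data` map back to text."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


# Anyone who can read the stored object can recover the original value
# without any key material, because base64 is reversible by design.
decoded = decode_secret_data(secret_data)
print(decoded["db-password"])
```

This is why tools that encrypt before a value reaches a repository or the cluster, and encryption at rest for etcd, come up in discussions like this one.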
Mario: 00:33:25.904 And then the other side of that is the whole life cycle of the secret. So if we move away from a pets model for our Kubernetes clusters, and we go to, let’s say, a cattle model and we have everything declarative, what does it look like when I go and spawn new environments? And when I’m an SRE engineer, I don’t know every nuanced secret that a developer might have for their service. I expect that to be defined with their Helm or whatever configuration that they have. I can’t really worry about that. That’s a little bit like — so we have to build a set of abstractions or other software pieces or components that can handle that in more of an automated fashion. I think Vault’s a great solution. I don’t think there is a super amazing, best low effort solution out there necessarily. I think a lot of this is, what size is your organization? How important is security to you? Because I know for a fact — many companies’ secrets are stored in GitHub. And maybe that’s a startup that’s 20 people. Hopefully not. I’m not saying you should be doing that, but that’s sometimes the reality.
Mario: 00:34:28.091 You can’t be 100% perfect at everything. You have to pick your battles. I’d say focus on: What is security to you? How do you instrument security? What does reviewing security look like? Do you actually have any standards you need to meet? Do you have an external company coming and reviewing or doing pen testing for you? What is that like? And then the other big one is: How do your employees both interact and share secrets? Because at some point, secrets might be needed. You might need to share them. Do you want them uploading them in Slack? Is that a good idea? Maybe it is. Maybe you say, “You know what? Slack is trustworthy. We’ve signed the contract with them. We’ve reviewed their security.” So you have to consider these pieces.
Mario: 00:35:09.665 Now, obviously in financial services, we have a very strong InfoSec presence. This is something I would urge many teams to do is — a lot of the time, I’ve seen SRE receive requests from a security team. And the request is like, “You have to do this or this has to be met.” And there’s no collaboration. It’s more of a one-way sort of model. I don’t think that scales and that’s not going to work in a modern cloud native world. If you want to keep moving, if you want to keep growing, if you want to keep scaling your platform and your users, you have to work with security one-on-one. So, I mean, I meet with our head of security every couple of weeks. We’re talking in Slack constantly. He’s got some ideas around, “Let’s look at Splunk, or Cloudflare, or this or that.” And I will definitely work with him and say, “Okay. From an SRE standpoint, here’s what I’m seeing, here’s what works, here’s what we need to consider.” And he loves that feedback. I love his feedback on, “Hey, we’re going to do this to work on hardening our Kubernetes clusters, or we’re going to look at Sysdig, or we’re going to look at some of these other solutions to solve for this problem.” So it’s very problem-oriented as well. We don’t just do security to tick the checkboxes.
Ben: 00:36:13.867 You bring security in early, so —
Mario: 00:36:16.489 100%, yeah. You have to work from the ground level up. You can’t just work whenever it feels like it’s the right time to send that email or this requirement. We need —
What to look for in new hires
Ben: 00:36:27.715 Yeah. We’ve reached the next piece, or we can no longer use secrets in whatever. Carta, I mean, you have lots of job openings you sort of pitched already obviously in information security. What do you look for in new hires, and sort of how do you approach hiring new teammates?
Mario: 00:36:42.842 This has been very iterative for us. We’re trying to learn — and we actually used to do live coding interviews. We don’t do that anymore. We do a take-home model. That’s one of the major things that changed very recently. And our CTO has a lot to say on that topic. My good friend, a leader at Carta, Nabeed, has some blog posts on how engineering talent and the way you search for it is very, very broken. I think what happened is that the recruiting model — we try to apply to every sort of type of developer or engineer in technology. We tried to just say, “Let’s go through and do the normal things and then let’s throw in a live coding exercise as well.” I don’t think that scales, however. So there’s kind of key points of this is — what scales? How many people from our team actually need to be involved in that interview? Because interviews are actually taxing. There’s a research portion, there’s doing the actual interview, and then there’s feedback. They’re actually pretty time-consuming. So how many people need to actually be involved?
Mario: 00:37:40.574 How much signal to noise are we getting as early as possible? So this is one where we actually were getting very little signal until the third person talked to them, the third person in the chain, let’s say. And so how do we optimize for more signal earlier on? And you don’t want to spend too much — it’s double-sided. It’s a double-edged sword. You’re never going to get it perfectly right. But also, you don’t want to give somebody a take-home from the second person they’ve talked to and the take-home is an eight-hour take-home. Because how invested really are they in joining your organization when they’ve only talked to two people and one was the HR leader and the other one was a hiring manager? I don’t know anything about the team. I don’t know anything about the day-to-day. I don’t know anything about the people I’m going to be working with. I really don’t know anything.
Ben: 00:38:23.988 Yeah. Then, you have an eight-hour test.
Mario: 00:38:25.244 Exactly. Yeah. So why would I spend eight hours on this take-home test? So being cognizant of that. And we’re not doing that now. I think we try to minimize them to two to three hours, something relatively simple. Sometimes it’s not even really coding as much as it is the soft skills. And you’ve heard this a lot where you can’t really teach soft skills as well as you can teach hard skills. We don’t focus as much on hard skills earlier on. The hard skills are something that we relatively infer in further discussions for the most part. And again, I’m talking about like how we’ve done some interviews. Many are very different depending on who we’re bringing in and depending on Carta or CartaX and the type of team. Every team is doing it slightly differently, which is fantastic. I think those tenets and those principles are kind of the way we think about it.
Mario: 00:39:12.829 I’ve been quoted telling people that in the first couple of minutes of talking to someone, I have a decent idea of if we’re going to bring them in. Because by the time I talk to them, I’m the third, fourth, or fifth person talking to them. They’re someone that’s going to be on my team. There’s multiple dimensions to that. Do they seem like someone that I would love to work with? Do they seem like someone who is driven and ready to learn? And do they have the right level of assertiveness where they actually can be super beneficial and they also can get their ideas, get project proposals, and really lend to the discussions in a lively and serious way? Because there’s a lot of things mentally and emotionally that our jobs put on us and we want someone who can handle that stuff and not just kind of be in the background. We’ll be okay with someone who’s a little bit less on the hard skills and more so incredibly strong on some of the soft skills because, like I said, a lot of that can be taught. And if they’re driven enough, that shows us that they are kind of ready to go and ready to take something they’re not very used to or not very comfortable with.
Tips for getting into DevOps or SRE
Ben: 00:40:15.607 And so for people who are quite early in their careers — maybe started college — do you have any tips for how to sort of get into DevOps or SRE?
Mario: 00:40:24.594 I was thinking about this the other day is — I try to think about my mindset when I was looking for a job out of college and when I was looking to get into DevOps. And the way things looked back then were definitely different than now. So you’ve got SRE is such a big focus. SRE didn’t exist back in that day. You were still getting — and you still have them now — is DevOps engineers, right? Yeah, exactly. And I think, yeah, that’s a hard one because obviously, the landscape has changed. You have to look at, if I was that age again and I could go back and tell myself a few things, maybe there’s a lot of things that seem like a waste of time. And there’s a lot of things that don’t directly seem like they plug into you, maybe making a lot of money or getting the title you want or getting the right amount of compensation that you want. Something that I wish I would have been more privy to is this idea of having to do things hard for a period of time because you grow as a person, as an individual.
Mario: 00:41:26.595 And it’s not just about growing technically and being able to code your way out of anything or learning the latest hype programming language or whatever it is. It’s not about that at all. It’s actually about more of a human piece. You’re going to have to endure, and you’re going to have to build relationships, and you’re going to have to be shut down many, many times. One of the worst things that can happen when you’re younger is that you get a whole bunch of success and you get win after win after win because you’ve never failed. So when you do fail later on in your career, it’s actually harder to get back up. I’ll say I had a very interesting background. I went and started a company at 25 or at least tried to. Ended up getting sued doing that. And I went through a lot of different periods, a lot of different phases in my twenties. Those things were all incredibly risky. And definitely don’t worry, my father did a great job of yelling at me about why I was doing X, Y, and Z. But I think all of those things — and I’m glad I did all of those things. And actually, I hate my twenties. I feel like I wasted them. But when I think about it, actually, a lot of what I’ve learned is the reason that I’ve gotten to where I am right now — is because I had to go through and I had to endure so much different wide varieties of emotions, mental conditions, and just trying to make things happen. Having that belief at the end of the day, that kind of methodology or philosophy of thinking, kept me focused.
Mario: 00:42:52.406 So maintaining focus, trying new things, taking risks because you’re young and you need to learn these things and failing and learning that failure is a natural part of living. And you actually learn more from failing than you do winning all the time. So I’d say that’s — and maybe that’s a little bit deeper than you were looking for. In the DevOps space — sorry. In the DevOps space, learning everything you can. Soaking up everything you can. Talking to as many people as possible. Getting acquainted with the major figures on Twitter in the community. I think we’re seeing, both in our world, cloud native, and in cryptocurrency, that community is a huge thing. Relationships, learning about different sorts of people and the things that they’re working on and just trying so many different things. You’ll definitely be bound to understand DevOps. Understand why it’s such a big deal for a lot of companies. How they are trying to ascertain what that means for them and how you can fit in the mold there, so.
Building vs. buying a project
Ben: 00:43:51.187 That’s a great answer. Thanks for going deep on that. And so when do you — obviously, Kubernetes is a CNCF project for cloud native. When you’re looking to sort of bring in technology, what makes you decide sort of the — should you use a CNCF project or should you build or buy it? What’s your sort of thought process?
Mario: 00:44:13.687 That’s a fantastic question. I’ll get this out there. I personally think that there’s so much that we abstract away that if we have the money and it makes sense logistically, we should try to use third-party vendors a lot of the time. A good example of this, I guess, and my team’s going to hate me for saying this, is I think that using like Datadog or New Relic or something in terms of observability is one of the best things that you can do because they are so good at it and the controls and the method by which you can hone and tune dashboards or bring in custom metrics, or even scale on any metric you want in Kubernetes. I think they’ve done such a good job of making it feel like you’re in control and you can do whatever it is you need to do. You’ve got tons of flexibility there. I think those things pay for themselves.
Mario: 00:45:06.105 And I will say that in some parts of Carta we do run things ourselves. Let’s stay on observability. We run Prometheus, Grafana. We’re looking into Loki. We do some great log. We’re running those things ourselves in the cluster that these systems are monitoring and watching for. And in doing so, there’s nothing wrong with that. I think a lot of it is that’s what was there. We don’t really have time to change it right now. We have a plan. Let’s just make sure it’s delivering the value and the needs that we have. And so that’s the big part of it. And making a change like that, when we talk to a vendor and we start to figure out the security, the compliance pieces, the long-term relationship, that’s a lot of work. That’s non-trivial. And so our leaders are very good to come to us and say, “Do we really need this thing? Is this going to make an incredible amount of value?” Because if it is, if we sell it — and that’s why we like to write project proposals and like to rally around kind of these discussions on everyone having a little bit of input, everyone having some loosely held opinions, but providing some good input — we can help everyone understand the core reasons for doing this thing that are going to matter for the company long-term.
Mario: 00:46:19.979 So of course, if we say, “Yes. This Datadog or New Relic or this third-party vendor is going to provide us so much massive value,” then we’ll go through the work and we will build the relationship with them. Or we’ll try or evaluate some other options as well. We’re not turned off to it at all. And I think it’s really important that we stay on this idea of — what’s out there in the totality? Kind of this idea of local maximums and global maximums. And I don’t want us to ever be in a sense of, “Well, in our circles, this is the right solution.” I want us to always be exploring and trialing out new options out there. I want us to understand and have an intimate relationship with what the possible global options are because there’s always more options than you’re seeing in front of you always. We have to work together to really drill down and figure out what works best for us as a team. Not just as Mario. Not just as the SRE team. But if we take on Datadog as —
Ben: 00:47:22.584 [crosstalk].
Mario: 00:47:23.342 Right. Right. What do the developers get out of it? What does the QA team get out of it? So it’s very multifaceted. There’s lots of stakeholders, which we can get into. It’s tricky. But yeah.
Ben: 00:47:33.244 You obviously spend a lot of time in the cloud native community. Are there any security products under the CNCF that you recommend now?
Mario: 00:47:39.247 Man, there’s a ton. I think the hotness right now is Open Policy Agent which we’re very much looking into. I think the main company with that is Styra, which is doing some really great work. I know a few people that work there. They believe that it’s kind of this next chapter in how we manage workloads. Being Carta, we really like what they offer. I think Sysdig, Falco, a lot of these things are not the best thing ever. They’re not perfect. But really in terms of intelligence and in terms of us making sure we are feeling good about things that we install and that we have in our infrastructure and that we understand, I think Sysdig’s doing a great job. I know SPIFFE. We haven’t really looked into SPIFFE, but there’s a lot on offer there as well. I have not actually looked at the CNCF landscape, the huge image document, with all the —
Ben: 00:48:34.706 The diagram’s always growing all the time.
Mario: 00:48:35.059 It’s killer. I have not looked at it, and I have not been super in tune with everything in the security space. I’m actually going to load it up right now because I’m very interested in some of the new things. And it’s probably going to crash my browser because it’s so freaking massive. Those are some of the key areas that we’re looking at. Sysdig is going to provide a lot of help there. I think Aqua Security as well is doing some great things around inherently having security kind of built into your pipelines as well. And so we’re looking at that. Orca is another one. I know we’ve been reviewing them as well. I’m not sure if we’re actively using them yet. So what does onboarding look like when you come in and you need access to things? How do we kind of inherently make sure you’re getting the access you need as well? A few others in here that — I think Snyk is definitely — I mean, I’ve heard great things all around. There’s a podcast with, I think, their founder. And I think they’ve got a great way of approaching —
Ben: 00:49:41.356 Like for container security?
Mario: 00:49:41.972 Oh, yeah. I think they’ve got a great way — I think it’s closed source. But either way, still got a great offering. We’re looking into what they can offer us there. And then you got some other more, maybe let’s say open-source utility tooling if you will, or scripts that you can run that help you. And I think Fairwinds Ops — I have to say, we just did a cluster competition where Carta engineers versed the Fairwinds engineers on breaking clusters and I think they killed us. But they have Polaris and Pluto as well, which they’re helping — RBAC Lookup I think is from them as well. They’re at least providing you a way — like, “Here, run this binary against your cluster, and here’s just a bunch of information.” And is it perfect? Does it take that information and audit it and remediate things for you and 100% solve the problem so you don’t have to do anything? Not exactly. Is it a major step in the right direction and giving you more intelligence? Absolutely. It’s something that we use actively. And we know that if we need to review, we have the tools we can use to review.
Mario: 00:50:45.299 If anyone else is reading the CNCF interactive landscape, there’s a lot of things and you’re never going to account for all of them, but I think this is a great starting point if you’re just looking at what’s out there. I know there’s also the awesome lists on GitHub. I think there’s an awesome Kubernetes list as well. There’s a lot of great facilities, tools, articles, things like that, and resources that you can use to get a sense of what’s out there and what you — every organization’s different. Some people, Polaris and RBAC Manager is all they need. Some people, they’re ready for Snyk and they’re ready for Aqua and kube-bench. But again, I think you have to figure out what your size is, what matters to you, what you need to expend effort on, and naturally, try these things. It’s very incremental.
Tips for keeping companies secure
Ben: 00:51:27.786 Well, as we sort of come to wrap this up, do you have any closing sort of tips, keeping companies secure?
Mario: 00:51:33.496 I think security has to be something that is, I don’t want to say everybody’s problem, but it kind of is. You’re going to have a security team, and they are going to do what they need to do to make sure that the perimeter and the security boundaries and guard rails and things are in place and are being accounted for. The thing that I just worry about is that so many people infer security as just this heavy-handed, “Screw you. You have to listen to us.” And everyone needs to understand — if you’re a security engineer listening to this, or if you’re an SRE, or if you’re just a developer, maybe you’re a product person, security is — it’s the company’s problem. And you work for the company. You are a representative of that company. And to offer the services or product that you are trying to offer, your company is saying, “Security is important to us.” And by having that InfoSec team, by having these tools being brought in, by using these third-party platforms, by making you sign into Okta every day, it’s part of an initiative. It helps everybody stay safer at every juncture.
Mario: 00:52:42.342 I have a trip to Chicago this weekend with some really good family friends, and I just know they’re going to ask me about password managers. Something I was thinking about last night is — because — sorry, password managers always comes up, and most people are still using the same passwords — they’ll use the same core password and they’ll just add the service name next to it, like Google email, and then the core password they use. But I just always get those sorts of questions, of course, around family. And I was thinking about this last night. The things that I remember doing from a security standpoint just for my own, like using a computer 10 years ago, are so different from now. I mean, I could have never dreamed that we’d have an SSO like Okta with two-factor authentication going to my iPhone. It’s just crazy. And the reason is because the stakes have changed. The internet is a different place than it was 10 years ago and it’s going to keep transforming. We have to do these things to keep not just ourselves, but everyone secure.
Mario: 00:53:35.662 And your company, for the most part — they’re not doing this to make your life a living hell. They’re doing this either because there’s regulation, there’s kind of a duty to their customers and they want to be able to say it. Maybe it is kind of checkboxes or some marketing to say, “We’re at a certain level.” But those visibility pieces matter. And the third part, though, is they do actually care about everybody in their ecosystem. In the Carta ecosystem, it’s not just me because I’m an employee I get Okta, but we want you Ben to feel comfortable that anything you’re doing on Carta or with Carta or reading about or interacting with Carta is an amazing experience. And in that experience, inbuilt is a focus on security, and you should be able to kind of comprehend that in just using our platform. And I think that’s a very healthy mentality.
Mario: 00:54:30.262 I’ve been a part of companies that have had security breaches. You can do some googling. It’s not fun at all. And it’s actually not even the employees — it’s not terrible from a, “Oh, we have to remediate this.” It’s terrible because it’s like we’ve let down our team. And our team being everybody. Our customers, the people we work next to every day, everybody. Our investors. We let people down. So I’ll end with this: security is not, anymore, just something we farm out. It’s not something we consult about. It’s not something that we just go through the motions on anymore. It is something that we have to fundamentally, realistically consider at every layer of whatever we’re trying to do. And if we just be open, transparent, communicative, and work with people, work together, we can do so much.
Ben: 00:55:20.385 That’s a great ending. All right. Thank you, Mario. Thank you for your time today.
Mario: 00:55:23.327 Thanks so much, Ben. This was so much fun. Good luck to you and the Teleport team. Love the podcast. Thank you so much.