Panel: Access at Scale at Jump Trading Group
This podcast is a live recording of a panel from Teleport Connect 2022. It's a discussion about how teams have scaled Teleport to support thousands of users and hundreds of servers.
Transcript - Panel: Access at Scale
Ben: 00:00:00.000 For this panel, I'm going to be talking about access at scale. To my left here, we have Joe and Chris from Jump. I've worked with Jump Trading and the crew for a few years, and honestly, post-COVID, it's great to be in person and discuss some of their items about accessing infrastructure at scale. And before we sort of dive too deep into things, Joe, you want to give a quick introduction?
Introduction
Joe: 00:00:30.643 Yeah, sure. So my name is Joe Conti. I've been with Jump Trading for a little over nine years. Prior to that, I worked for Barclays, and prior to that I worked for Lehman Brothers. My beginning experience with finance was Lehman. I joined Lehman about a year before the bankruptcy, so it was fun. And prior to that, I did a bunch of small business consulting. I started with Linux in 1993 with Slackware 3.6. I've been a Linux enthusiast for many, many years.
Chris: 00:01:02.610 My name is Chris. I've been with Jump for just over a year, but I've been in finance just like Joe for quite a while. Just over 10, 11 years. I did a brief stint in between proprietary trading firms at some place called AWS. All right, I guess, I don't know. And then prior to that, actually oil and gas. I'm from Texas, so unlike Joe, I've not been a Linux expert my entire life. I actually used to be a Windows engineer. The guy [inaudible] earlier was making me really sad. But no, I've just been a systems engineer and a systems leader my whole career. So cool to be up here talking about Teleport.
Overview of Jump Trading
Ben: 00:01:42.615 Nice. And for people unfamiliar, can you just give a little bit of background on what Jump is, Jump Trading?
Joe: 00:01:47.281 Sure, sure. So Jump Trading is a world-class, research-based trading firm that serves as a market maker, enabling stocks, bonds, and currencies to be bought and sold in an orderly manner in markets worldwide. We often think of ourselves as a technology-first company because we are often on the bleeding edge of technology, and we work with Teleport as well.
Ben: 00:02:08.441 And you do traditional finance, and people may be more familiar with Jump from the crypto angle as well? [crosstalk].
Joe: 00:02:14.978 Yes, I mean, we are a vocal player in the crypto markets today. We haven't always been, yeah.
Scale at Jump Trading
Ben: 00:02:25.184 And so today's sort of fireside chat is about access at scale. Can you sort of paint some pictures about what scale means at Jump?
Joe: 00:02:33.907 Sure. So we're often talking with many hundreds of users and many thousands of servers. And working across those large scales, we often find interesting problems. One of which is if you look at a slice across all of our users, we have people that are engineers that are accessing servers. We have people that are quants that are accessing servers. We have someone in purchasing logistics who barely knows how to use SSH, but they're using that to access the system and get some data out of that. So we have lots of different access patterns at Jump. And trying to take that and lift that and move that into Teleport from our legacy systems is certainly a challenge.
Chris: 00:03:18.959 And to be clear, I mean, Jump is a consumer of the remote SSH piece of Teleport, right? We've looked at it for Kubernetes and database access, but when we're talking about systems and we're explaining what we've done, it's strictly that we've replaced SSH with Teleport for interactive access.
Ben: 00:03:35.954 And are these systems mainly on-prem, or are they cloud?
Chris: 00:03:39.442 It's a mix. Yeah, we've got —
Joe: 00:03:40.259 Yeah. It's at 99.8% on-prem.
Chris: 00:03:44.210 We've got the need for certain systems to be colocated geographically around the world near electronic exchanges. But then we have a lot of back-office systems and things that kind of handle the traditional business that everyone has. And some of those are virtualized, some of those are physical, some of those are cloud. Cloud's been kind of, I would say, laggard for a lot of financial firms, so we're still kind of getting into it. But obviously, as you mentioned, we're in the crypto space and a lot of the crypto stuff happens in the cloud. So that's been a huge accelerator for us to get that stuff going.
Some systems used at Jump Trading
Ben: 00:04:20.279 Yeah. And you said you don't use Teleport Kubernetes Access, but you use a sort of container orchestration system such as Kubernetes, so —
Joe: 00:04:27.891 Yeah, so we use many different systems at Jump. We like to say that we have all of the standards and all of the systems because we do have lots of various things deployed at Jump. But yeah, we've been using Kubernetes since about 2016, 2017. And we've built our own kind of access pattern for Kubernetes, and that works well for us today. The goal is to eventually move to Teleport, though. That's a day-two objective.
Ben: 00:04:54.284 And so when you're operating at sort of Jump's scale, we've gone a long way from pets to cattle, and you have very large fleets. And I guess when you have many thousands of cattle, interesting things are going to happen. Can you tell me about some interesting things that you see operating at the scale that you do?
Joe: 00:05:11.223 Yeah, sure. I mean, there's lots of interesting things. Yeah, and I think one of the more recent things in memory that was fascinating is as you're deploying many, many servers and you've got many, many thousands of CPUs, we had one particular server where a particular workload ran on it and it would fail every once in a while. And it ended up being a specific core on that specific system, and it ended up being a bad CPU. So that's one interesting problem. But tracing that across our massively parallel workload is really tough to do, so we've built tools around that to try and enable us to fix these problems at scale. I know Chris, you've got some other things to talk about.
Chris: 00:05:51.568 I mean, but extending that to Teleport, the simplicity of onboarding a node with Teleport allows us to treat those systems like cattle. I think for us, our user community are the pets, right? They all have different access patterns. They all don't —
Joe: 00:06:09.112 They might take offense to that.
Chris: 00:06:10.012 Don't take that the wrong way. Good call. No, no, no. I mean, only on Halloween, they dress up. But no, the uniqueness because we support multiple trading teams inside of one big trading firm, and a lot of them have grown organically over decades, and you end up with really weird access patterns. And so it is nice for us that we can kind of set it and forget it on the server side. Occasionally, we run into some interesting scaling challenges, but I'm sure we'll talk about those at some point while we're up here.
Managing a development team
Ben: 00:06:44.632 Great. So I think that kind of is a good segue in another aspect of managing thousands of servers or however many you have, is also managing a development team. And according to LinkedIn Tools, it says you have 400 engineers at Jump. I don't know if that's true or not. Let's take that number.
Chris: 00:07:04.536 I think every employee at Jump thinks they're an engineer in some way, shape, or form, but —
Joe: 00:07:10.175 I can just say that we've doubled in size over the past two years. So we have lots of users of Teleport at Jump and lots of different unique use cases, right? So to Chris's point earlier, where we've got lots of trading teams. Some of them have been with Jump for many, many years, since inception. Some of them are new. They all seem to have different patterns. And I think as we've gone through this migration to Teleport, it's been really interesting to see all these different patterns and things that, as you kind of go and gather requirements before you execute a massive project like this, all these things are missed because people just assume things like this would work, right? Like SFTP, X11, things that are native to SSH that have worked for ages. As you kind of lift that and move that to Teleport, some things just break, and it's been an interesting discovery.
Chris: 00:07:55.646 And because they're not engineers by trade — so I mean, we obviously do have software devs and infrastructure engineers and people, but we have a lot of quants, people that come in with a PhD in physics, scientific background, and they just expect things to work like it did at Rockefeller or like it did at Berkeley or wherever they came from. And it's like, yeah, I guess we can make that work. And we have to kind of shoehorn things then to fit the access patterns that they're used to. They're not all just Jupyter notebooks and everyone runs the same access pattern. It's very, very different. Some of these people, frighteningly, know as much about SSH or more than we do at times, where they're like, "Oh, here's my bashrc file." I'm like, "Oh, my God, what are you doing? Does that even work?"
Joe: 00:08:43.399 Or yeah, they'll come to you with a problem and they'll give you this snippet of OpenSSH code path that they've been used to for so many years like — why is this not behaving like this?
Ben: 00:08:52.599 And are there many people who come from high-performance computing, like the HPC space?
Joe: 00:08:58.325 We do have a few of those, yeah. And it's certainly interesting to watch them kind of complete this journey to Teleport because they have such niche use cases. And they've used OpenSSH for 20, 30 years, so it's fascinating to watch this pattern that emerges. And each team is different, and each person within that team is different, so there are so many different use cases as you're kind of going through this migration path.
Philosophy of access
Ben: 00:09:25.696 And then what's your philosophy as a production engineering team for giving your people access to infrastructure?
Chris: 00:09:33.499 Step one is assign a ticket to me.
Joe: 00:09:37.685 It's a loaded question. So yeah, it's really tough, frankly, because we have, I mean, even after the HPC use case, we have lots of developers, we have lots of engineers, and they all have very prescribed access. And oftentimes, they'll be in their normal bucket of access. And then you'll often get a ticket like, "I need access to this one server here because that one server has a new FPGA in it or a new ASIC in it." And managing that type of access at scale with Teleport is a bit difficult. It's a challenge. It's something that is evolving every day, and we're trying to learn from that. What we find most of the time is that we don't really want just RBAC. We want RBAC plus ABAC or RBAC plus some other type of access control mechanism.
Chris: 00:10:30.101 But I think one of the ways that we've gotten to the bottom of things — there are a lot of days that I joke that we're like Scooby Doo and the gang trying to figure out what's the big mystery behind this trading team's problem today. And we've taken some of the groups that have struggled a bit with onboarding and put them into Slack rooms. So we've got Slack channels with these folks. We've got several people on the product engineering side where we can give a little extra kind of white-glove treatment to get them on. And then what we found is across the large number of employees that we have, all of a sudden, we get internal advocates. So you have someone that's well-respected within the quant community, the trader community, the dev community. And they're talking about, "Oh, well, you just need to do this in your config file or let me show you how to get this to work with that Python script." And then now I've got an advocate sitting in with that team day in and day out that is very familiar with what we're doing with Teleport. So it's been fun to see that transformation.
Joe: 00:11:24.328 Yeah. And I think to expand on that, one of the things that we did was we created a public — because we have lots of private —
Chris: 00:11:30.411 Way too many.
Joe: 00:11:31.172 Yeah, we have way too many private channels in Slack. We're doing Slack wrong. But we created a public channel for Teleport specifically with the goal of achieving this users-helping-users community. And I think just the other day, we had one of our first examples, or hopefully the first of many, where one user helped another user with a similar problem that they had over time.
Chris: 00:11:53.892 Well, because as you can imagine, we were both on a plane coming here, so no one was responding in the Teleport help channel.
Joe: 00:12:00.721 They figured it out themselves.
Chris: 00:12:02.042 Yeah, it was great.
System used for source of identity
Ben: 00:12:03.169 And what kind of system do you use for source of identity? Do you have a SSO provider or something else much —?
Joe: 00:12:10.626 Yeah, so we're using OIDC right now, and that's been an interesting path. One of the reasons why I say that is because we deploy many proxies globally for latency reasons. We want to provide our users the lowest latency to their proxy, for obvious reasons. And one of the things that wasn't being handled particularly well was the redirect URL in OIDC. So if anyone's familiar with OIDC, you need a static redirect URL. And Teleport didn't really handle that too well. It does now. There's a feature that we kind of championed for that exists today, but we're actually using some GeoIP-based redirect effectively to point people at the right geographically close proxy. There's an upcoming chat on misusing Teleport. We have misused Teleport since day one. There are lots of things that we do that are nonstandard. Most people probably put a bunch of their proxies behind a single FQDN. We have many globally distributed proxies and they're all kind of unique.
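As an illustration of the GeoIP-based redirect Joe describes, here is a minimal sketch of picking the geographically closest proxy for a user. It assumes the user's approximate coordinates have already been resolved from their IP address; the proxy hostnames and coordinates are hypothetical, and this is not Jump's or Teleport's actual implementation.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical proxy hostnames and (latitude, longitude) pairs, for illustration only.
PROXIES = {
    "proxy-nyc.example.com": (40.71, -74.01),
    "proxy-lon.example.com": (51.51, -0.13),
    "proxy-tyo.example.com": (35.68, 139.69),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_proxy(user_location, proxies=PROXIES):
    """Return the hostname of the proxy closest to the user's (lat, lon)."""
    return min(proxies, key=lambda name: haversine_km(user_location, proxies[name]))


# A user in Chicago would be pointed at the New York proxy.
print(nearest_proxy((41.88, -87.63)))
```

A redirect service built this way could keep a single static OIDC redirect URL and forward the user to the chosen proxy afterwards, which is one plausible reading of the setup described above.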
Chris: 00:13:14.888 But we've also gotten to drive a lot of the roadmap for Teleport which has been fun. We've partnered with you guys on so many different just feature enhancements and things because we've been hammering at this thing from 14 different directions.
Ben: 00:13:27.828 Yeah. And how is that different, working with a sort of open core company, from your other vendors?
Joe: 00:13:33.545 I think it's beneficial, frankly, because some of the things that we are doing — I watch the GitHub issues pretty religiously, and I see lots of people using features that we've come to you guys and said, "Hey, we really need this." X11 being one of them. It's one of those things where we knew there was X11 usage internally. And as we've kind of started this migration path, we were surprised to see how much X11 usage is happening around the firm and —
Chris: 00:14:03.453 In 2022.
Joe: 00:14:04.317 Yeah, in 2022. And it's mostly HPC, right?
Ben: 00:14:07.177 I mean, can you describe to the audience who aren't aware of what X11 is?
Joe: 00:14:11.642 Yeah. So it's using SSH as a tunnel for graphical applications. So there'll be an X server running on a local endpoint, whether that's a desktop, laptop, whatever. The X server could be Xming for Windows, XQuartz for Mac, or there's a bunch of them for Windows. I won't list them all, or obviously, X11, Xorg for Linux. But then basically, you run an application on the remote end, and you're presented with a local display of that application.
Chris: 00:14:43.520 And sometimes these are data engineering applications where people are doing data mining. There was this one team where they're like, "No, we use xclip." I'm like what? "Well, it's a clipboard in X." And I'm like, "Why don't you just copy paste out of the terminal?" They're like, "We use xclip." And I'm like, okay.
Ben: 00:14:59.780 Yeah, we actually provide the service in Europe that uses it with X11 as well, so yeah, the high-performance computing use case.
Joe: 00:15:05.939 Yeah, we're familiar with them.
Ben: 00:15:09.228 And then going back to the... you use OIDC. You have groups. You have teams. Do you have any other attributes that you sort of categorize, sort of teammates, that you bring externally from your identity provider?
Chris: 00:15:21.899 I mean, the main thing from our identity provider is that we're just using groups. I mean, occasionally, we've explored some more creative options. But one of the things that's been nice is we've been able to, so far, despite sometimes cursing and other interesting words tossed around, we've been able to solve pretty much all of the access requests with some combination of labels and groups in RBAC, which speaks a lot to the flexibility there. So yeah, I mean, on the claims side, the only gotcha there is that race condition where it's like, "Oh well, I got added to this group, and I still can't log into the server." And it's like, "Well, because your cert is encoded with your roles from five hours ago."
Joe: 00:16:10.317 The last time you logged in.
Chris: 00:16:11.888 Yeah, don't forget to pull a new cert. But yeah, no, otherwise, I mean, it's been okay.
Joe: 00:16:18.949 It's been okay.
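The "race condition" Chris and Joe describe comes from roles being baked into a short-lived certificate at login time. The following is a minimal sketch of that behavior, not Teleport's actual code: the group-to-role mapping, the `Cert` type, the 12-hour TTL, and the user name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical mapping from identity-provider groups to roles.
GROUP_TO_ROLE = {"quants": "quant-access", "fpga-devs": "fpga-access"}


@dataclass(frozen=True)
class Cert:
    user: str
    roles: frozenset
    expires: datetime


def issue_cert(user, groups, now, ttl_hours=12):
    """Snapshot the user's roles into a short-lived certificate at login time."""
    roles = frozenset(GROUP_TO_ROLE[g] for g in groups if g in GROUP_TO_ROLE)
    return Cert(user, roles, now + timedelta(hours=ttl_hours))


def can_access(cert, required_role, now):
    """Access is decided from the certificate, not from live group membership."""
    return now < cert.expires and required_role in cert.roles


login = datetime(2022, 11, 1, 9, 0)
groups = {"quants"}
cert = issue_cert("alice", groups, login)

groups.add("fpga-devs")                        # added to the group after logging in
later = login + timedelta(hours=5)
print(can_access(cert, "fpga-access", later))  # False: the old cert has the old roles
cert = issue_cert("alice", groups, later)      # log in again to pull a new cert
print(can_access(cert, "fpga-access", later))  # True
```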
Impact of access requests on access philosophy
Ben: 00:16:19.850 How have access requests sort of changed your access philosophy? Are you talking about just-in-time access or principle of least privilege?
Joe: 00:16:27.210 Awesome. Yeah, so just-in-time access is going to solve a lot of our problems, frankly, because going back to that point I made earlier, we have these well-prescribed buckets for access. Then we have some buckets that work well with access requests, native access requests, but resource requests — I think as it was originally known or just-in-time access as it's known now — solves a lot of those problems for us. Because we can say this particular set of developers can access any of these servers with this label, and we can even go so far as to auto approve some of those, right, if we want to build our own plugins, which we're going to do. But frankly, I mean, it takes us out of the decision-making process, and we can kind of encode a lot of — we can basically describe all of our policy and handle that, whether that's approval via Slack bot or auto-approval, depending on a certain set of metadata criteria.
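The idea of auto-approving some requests based on metadata can be sketched as a small policy function. This is an illustration only: the `AccessRequest` shape, the rule format, and the group and label names are assumptions, and a real plugin would receive requests from Teleport and post its decision back rather than return a string.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRequest:
    user: str
    user_groups: set
    node_labels: dict = field(default_factory=dict)


# Hypothetical policy: which groups may be auto-approved for which node labels.
AUTO_APPROVE = [
    {"group": "fpga-devs", "labels": {"hardware": "fpga", "env": "lab"}},
]


def decide(request, rules=AUTO_APPROVE):
    """Return 'auto-approve' when a rule matches, else route to a human reviewer."""
    for rule in rules:
        in_group = rule["group"] in request.user_groups
        labels_match = all(
            request.node_labels.get(key) == value for key, value in rule["labels"].items()
        )
        if in_group and labels_match:
            return "auto-approve"
    return "needs-review"


req = AccessRequest("alice", {"fpga-devs"}, {"hardware": "fpga", "env": "lab"})
print(decide(req))
```

Anything that falls through to "needs-review" would go to a human approver, for example through the Slack bot mentioned above.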
Chris: 00:17:21.578 But the other approach we've taken to that though is on the internal side with those groups, right, the security groups, as many times as we can — we've given the owning team the ability to add and remove users. So as long as the Teleport role is built, we can kind of step away. We find ourselves more involved when it's like the Teleport role doesn't scope quite enough for what they're trying to do.
Joe: 00:17:47.790 Right. And I mean, to add to that, labels only really take us so far, even dynamic labels, right? I think one of the things we're really excited about is Predicate and where that's going. We've talked quite a bit about building our own DSL and kind of layering that on top of Teleport. But we very much want our vision of what this should look like to be a first-class feature in Teleport so that everyone can use it. And it also helps us for supportability over time, right? We don't want to build something that is subject to the Teleport API changing over time.
Chris: 00:18:17.184 I like writing YAML. I don't know.
Recommendations for access at scale
Ben: 00:18:19.843 Yeah, there's lots of Kubernetes-inspired things [inaudible] one of them. And Predicate. There'll be a session later today about that. It's a policy-as-code project from Teleport. So as you were sort of operating these sort of large systems, what have you found has worked really well for you?
Chris: 00:18:41.653 So I mean, one of the things that I can't take credit for because Joe did it. One of the things that I would recommend anyone else trying to do this at scale is put your RBAC into some sort of repo, right? Whether you're using Terraform, whether you're using in-house code, whether you're just logging into a box and running tctl manually — please don't do that. Whatever your access pattern is, the truth should be — oh, the running truth should be your auth environment, but the written truth should be in a repo somewhere. And so all the —
Joe: 00:19:19.274 All protected.
Chris: 00:19:19.642 All the YAML for all the roles, I mean, we have tens of roles. Are we at triple digits?
Joe: 00:19:25.401 We're getting there.
Chris: 00:19:26.544 We have tens of roles — we'll just say that — that we've created and you have to map all those back. And we have a review process that we go through. We have some automated checks that we go through. And then we push that in with some scripts to commit to Teleport. And —
Joe: 00:19:42.774 We pull that in so we actually have nothing pushing to auth for obvious reasons.
Chris: 00:19:47.400 No, it's pull —
Joe: 00:19:47.916 So everything is done with pulls from auth from our sources of truth.
Chris: 00:19:52.522 It's a web hook that, yeah, then — yeah, so and that saved us so many times because when you've got multiple operators going in and trying to solve a problem — the gentleman from DoorDash was talking about doing it in the GUI, and I almost fell out of the chair. Sorry.
Attendee 1: 00:20:14.221 Have changed it.
Chris: 00:20:15.106 Yeah, of course. Of course.
Joe: 00:20:17.279 Yeah. And to add to that, I mean, it helps you from a disaster recovery perspective. I mean, we've run into a couple of gnarly etcd bugs, some of which were Teleport's fault, some of which were not. And we had to restore a snapshot from months ago because we hadn't realized that silently under the covers, all of our etcd snapshots were corrupted. So we restored like a three-month-old backup and all of our state was essentially in our IDM provider and encoded in Git, so we were able to just restore that snapshot. We had some interesting errors for a small period of time, but we just quickly roll that back to that state and that moment effectively.
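The "automated checks" on role definitions kept in a repo could look something like the sketch below. It assumes each role file has already been parsed into a dictionary following the general shape of a Teleport role resource (`kind`, `metadata.name`, `spec.allow.logins`, `spec.allow.node_labels`); the specific house rule it enforces and the role names are hypothetical examples, not Jump's actual checks.

```python
def lint_role(role):
    """Return a list of problems found in one role definition (empty means OK)."""
    problems = []
    if role.get("kind") != "role":
        problems.append("kind must be 'role'")
    if not role.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")

    allow = role.get("spec", {}).get("allow", {})
    labels = allow.get("node_labels", {})
    logins = allow.get("logins", [])
    # Hypothetical house rule: never allow root login on every node.
    if labels.get("*") == "*" and "root" in logins:
        problems.append("root login with wildcard node_labels")
    return problems


good = {
    "kind": "role",
    "metadata": {"name": "fpga-access"},
    "spec": {"allow": {"logins": ["dev"], "node_labels": {"hardware": "fpga"}}},
}
bad = {
    "kind": "role",
    "metadata": {"name": "everything"},
    "spec": {"allow": {"logins": ["root"], "node_labels": {"*": "*"}}},
}
print(lint_role(good))
print(lint_role(bad))
```

Running a check like this in review, before anything is pulled into the auth environment, keeps the written truth in the repo reviewable and makes a restore from Git less risky.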
Experience with Teleport deployment
Ben: 00:20:58.660 So this is a good segue to my next question which is: what are some of the other rough edges that you've experienced with Teleport or rolling out Teleport?
Chris: 00:21:06.494 I think one of the biggest challenges for us is because we have users all around the globe, Asia Pacific, Europe, here, and they're accessing resources all over the globe, we've tried to deploy proxies and things. And it starts to take you — and these are largely on-premises machines, so it takes you out of some of the access patterns that I think Teleport was written for. Some of the cloud-native things that you can do with application load balancing and autoscaling groups don't work when there's a machine in Tokyo and there's a machine in South America or something. So finding solutions to those things has been a challenge, I think, for us. And I think we're still learning, and we're still figuring things out there.
Joe: 00:21:59.406 Yeah. And I think one of the other big rough edges that we've had is trying to take our existing policy. And when we first started this journey, it was like these lofty aspirations of like, sure, we're just going to review everything we have. We're going to take it. We're going to cut it all into nice and neatly defined roles, and that's all we'll ever need. And the reality of it is that's never the case. And I think I spoke earlier about reducing silos and kind of bringing everything into one source of truth. That's great in a perfect world, but the perfect world doesn't always align with reality, right? In the real world, we're going to have many sources of truth. We've talked about making decisions within Teleport from multiple different providers, whether that's our identity provider or Workday or CMDB. So it's kind of like gluing all these different external sources of truth together and using that within Teleport to make decisions. And some of that is going to happen externally via Teleport plugin, and some of that is going to happen natively within Teleport, hopefully over time, especially with Predicate.
Ben: 00:23:05.037 Yeah, that's good. So I think both of you are in engineering. Can you talk about how you work with your partners in security and how Teleport sort of fits into that?
Joe: 00:23:17.547 Sure. I mean, we're kind of glued at the hip. It's one of those relationships that is incredibly valuable. Our infosec team has grown over the years and encompasses people of many different talents from red team, blue team. We've got lots of different varied talent on our infosec team. And I think we meet with them regularly, and we solve a lot of problems that they want to solve. And conversely, they're actually using Teleport in lots of ways for hardened systems that we don't even have access to, so it's a good relationship, I'd say. You want to add to that, Chris?
Chris: 00:23:54.384 No, I was just going to add that there's actually people from security, they come to us and they're like, "Hey, can you help us with this thing that we're not going to give you access to?" And it's like, sure, yeah, because they've found that Teleport's pretty flexible for granting access to one-off things. So yeah, we actually use it for a lot of things beyond the large number of systems that are centrally managed.
Ben: 00:24:15.510 And do you export your logs to a central SIEM solution in-house?
Joe: 00:24:19.537 We do, yeah, we do. And being able to kind of look at patterns — I know there was a chat earlier about exploring patterns in your logs, and we are every day finding interesting patterns, whether it's user accesses or just being able to kind of provide tooling so that given a user, we can look at — are they trying to log in with an expired cert? When did they last log in? Some —
Chris: 00:24:46.481 What are the roles when they last logged in?
Joe: 00:24:47.770 What are the roles?
Chris: 00:24:48.538 Which is something I built into a —
Joe: 00:24:49.790 It's something that Chris had built into a reporting tool.
Chris: 00:24:52.528 A log dashboard, yeah.
Joe: 00:24:54.679 But those centralized logs are really helpful to kind of look at patterns over time, both from a security auditing perspective and from an operational perspective.
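The per-user questions mentioned above (when did they last log in, what roles did they have, are they failing to log in) can be answered with a small pass over exported audit events. This sketch assumes one JSON object per line; the field names `event`, `user`, `time`, `success`, and `roles` are assumptions about the exported format and should be checked against the real log schema.

```python
import json


def summarize_logins(lines, user):
    """Summarize one user's login events from JSON-lines audit records."""
    last_ok, failures = None, 0
    for line in lines:
        event = json.loads(line)
        if event.get("event") != "user.login" or event.get("user") != user:
            continue
        if not event.get("success"):
            failures += 1
        # ISO 8601 timestamps in the same format compare correctly as strings.
        elif last_ok is None or event["time"] > last_ok["time"]:
            last_ok = event
    return {
        "last_login": last_ok["time"] if last_ok else None,
        "roles_at_last_login": last_ok.get("roles", []) if last_ok else [],
        "failed_attempts": failures,
    }


sample = [
    '{"event": "user.login", "user": "alice", "time": "2022-11-01T09:00:00Z", "success": true, "roles": ["quant-access"]}',
    '{"event": "user.login", "user": "alice", "time": "2022-11-01T14:00:00Z", "success": true, "roles": ["quant-access", "fpga-access"]}',
    '{"event": "user.login", "user": "alice", "time": "2022-11-01T15:00:00Z", "success": false}',
    '{"event": "session.start", "user": "alice", "time": "2022-11-01T15:05:00Z"}',
]
print(summarize_logins(sample, "alice"))
```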
Challenges faced as production engineers
Ben: 00:25:07.019 Great. And so what are some other challenges you face as production engineers outside of Teleport?
Joe: 00:25:13.643 Oh, I was going to mention some challenges we're facing with Teleport, but to expand on that last question, the other question: I think one of our big challenges is observability. We're constantly talking to Teleport, trying to get more metrics out of the proxies and the auth servers and make sense of some of those metrics and make sense of some of those logs. But outside of that, I think one of the bigger challenges for us is identifying where problems are. And also as we go to roll out fixes for problems, how we can do A/B testing, right? When we're dealing with the scale that we're dealing with, first of all, it's very hard to test for bugs at that scale. And secondly, if we want to roll out a change to a proxy or we get a debug build to test some fix to a proxy, how do we roll that out? Anytime we touch a proxy, we're impacting anywhere from 200 to 4,000 sessions. It's a real challenge.
Managing upgrades at scale
Ben: 00:26:17.981 How do you think about upgrades then, in that case?
Joe: 00:26:21.002 They're very carefully orchestrated.
Chris: 00:26:23.491 We get a lot of flack any time we try to do any upgrades. That's why patches are painful. But it's being transparent with your user community, I think. We've gone on a rocky road where I think we've earned a lot of trust with them at this point because we do respond quickly to things. We do let them know what's coming. We do explain, okay, we've got to take this down because remember, you've been asking for SFTP for six months. You're finally going to get it. Let's do this maintenance. But the part where we do run into challenges is, like Joe said, sometimes you just can't test every scenario when you've got thousands of systems in one cluster. But rolling upgrades, I don't know. Maybe we could find a better way of doing this.
Joe: 00:27:19.247 Yeah. The challenge is impacting users, right? Everyone likes their long-running sessions, right? They've got their connected box or they're connected to some remote box, tmux open. They've got a bunch of X11 windows.
Chris: 00:27:34.605 If they were using tmux, they'd be less angry.
Joe: 00:27:36.197 But they got a bunch of X11 windows open. So the second you kill that proxy, you're killing all that connectivity. And people are always unhappy. I mean, we tried mitigating. We provide tons of advanced warning, and we usually tend to coordinate upgrades with some sort of [inaudible] like this feature is going to drop, or you're going to get this additional thing, so that usually worked.
Chris: 00:27:56.450 But the cool thing is Teleport tells them when we haven't upgraded.
Joe: 00:28:01.525 That will be stopped soon. Everyone knows the v11 came out.
Ben: 00:28:09.423 Yeah, it's got a notification this morning [crosstalk].
Joe: 00:28:12.668 Should upgrade to Teleport cluster.
Chris: 00:28:14.712 Make sure to message somebody in the Teleport help channel.
Joe: 00:28:17.119 Yeah, we've been getting a lot of comments from our users, "What should I do? Should I upgrade something? You should upgrade your cluster." There's lots of really interesting feedback from users.
Chris: 00:28:27.521 Next person that messages me is going to upgrade the cluster.
Joe: 00:28:29.685 Yeah, exactly. Here you have it.
Other systems used at Jump Trading
Ben: 00:28:33.069 So apart from Teleport, what other systems do you two manage and maintain?
Joe: 00:28:38.438 So for me, I've been at Jump for nine years. I mean, I've touched a lot of various internal systems, a lot of infrastructure. We've got lots of moving parts and lots of infrastructure, lots of automation. So without going into too much detail, yeah, I've touched a lot of Jump's infrastructure.
Chris: 00:28:59.139 Yeah. I mean, config management, IPAM, all the basic stuff. And Joe's also a server hardware wiz. And then recently, I've been doing more cloud stuff too, so that's been fun.
Practical tips on handling access at scale
Ben: 00:29:11.693 Yeah, nice. So just to close things out, I have a podcast called Access Control. Everyone should subscribe. And this will probably go up on it. And it's about giving practical tips for startups in the security and infrastructure realm. So I always like to close out — what's one practical tip that you'd like to give to the audience here who's in person?
Joe: 00:29:32.775 Sure.
Chris: 00:29:33.367 Can I start? Sorry.
Joe: 00:29:35.233 Yeah, go for it.
Chris: 00:29:35.830 Otherwise I'll forget my train of thought. Don't underestimate your users, right? The one thing that I can't say enough, whether it's 100 people or 1,000 people moving or 10,000 people moving to use Teleport, try to understand their use cases because there was a really great comment this morning. If they don't see their use case, they're just going to work around what you're doing. And I think that rings true.
Joe: 00:30:02.429 Yeah, for sure. For me — I mean, I'm going to give some general advice, not necessarily specific to Teleport — but invest in hardware keys, YubiKeys, Nitrokeys, whatever you want to do. It's incredibly important for your personal self, for your company, for your organization. We kind of give them out like candy for both personal use and for professional use. There's a reason they exist, and they will protect you from ransomware or any other sort of attack.
Chris: 00:30:33.072 If you implement them correctly.
Joe: 00:30:34.229 If implemented correctly. And please, please do implement them correctly. Yeah.
Q&A
Ben: 00:30:39.075 Awesome. All right, thank you two. So we have about 10 more minutes left. Do we have any audience questions? I think we may have to get the mic.
Chris: 00:30:47.548 No, Jay's not allowed to ask questions.
Joe: 00:30:48.718 Yeah, no Jay.
Chris: 00:30:51.330 You're supposed to be answering our questions.
Attendee 2: 00:30:55.224 It's on? Yes, it is. I first want to say thank you both, gentlemen, Chris, Joe. You guys are a big part of our community, and we appreciate you being here and sharing your expertise with everybody. I will share a trade secret with you that I won't tell anybody else: when the support guys ask me for help, which happens pretty regularly, I always say, go see what Joe posted in his channel because he probably broke it. So I have a question around scalability and a specific feature, which is proxy peering. Are you using proxy peering today?
Joe: 00:31:28.908 We're not. We're not, but we actually hope that's going to solve some challenges. So, like I said, we misuse Teleport regularly. There is an assumption when you're in IoT mode that that IoT device is going to be able to talk to all the proxies. It's going to just do brute-force discovery. We don't expose all our proxies to that IoT device, so that is one thing that we're kind of failing at. Well, I mean, it's by design, so we're not failing at that. But when that discovery mechanism fails, we'll often end up with an IoT node that's not connected to the proxies that we expected it to be connected to. So first of all, understanding that that's actually happened, with metrics, and this kind of goes back to the observability that I mentioned earlier. It's tough to do today, so that'll hopefully get better soon. But secondly, proxy peering would solve that problem. If a user is in, let's say, New York, and that IoT node is not connected to the New York proxy for whatever reason, theoretically the New York proxy would just peer to an adjacent proxy and be able to hit that IoT node. So yes, that's going to solve the problems, hopefully. We're not using it today, but we hope to soon.
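Editor's note: the routing behavior Joe describes can be illustrated with a toy model. This is only a sketch of the idea, not Teleport's implementation; the `route` function, the `tunnels` mapping, and the `chicago` proxy name are hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Optional, Set


def route(user_proxy: str, node: str,
          tunnels: Dict[str, Set[str]], peering: bool) -> Optional[List[str]]:
    """Return the proxy hops used to reach `node`, or None if unreachable.

    `tunnels` maps each proxy name to the set of nodes that currently
    hold a reverse tunnel to it (hypothetical data model).
    """
    # Direct case: the node is connected to the proxy the user landed on.
    if node in tunnels.get(user_proxy, set()):
        return [user_proxy]
    # Without peering, a missing tunnel means the node cannot be reached.
    if not peering:
        return None
    # With peering, forward through another proxy that holds the tunnel.
    for proxy in sorted(tunnels):
        if proxy != user_proxy and node in tunnels[proxy]:
            return [user_proxy, proxy]
    return None


# The IoT node failed to connect to New York but did connect elsewhere.
tunnels = {"new-york": set(), "chicago": {"iot-node-1"}}
print(route("new-york", "iot-node-1", tunnels, peering=False))  # None
print(route("new-york", "iot-node-1", tunnels, peering=True))   # ['new-york', 'chicago']
```

In this model, the node only needs one working tunnel to any proxy, which is the property Joe hopes proxy peering will provide.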
Attendee 2: 00:32:38.104 All right. So if nobody asks another question, I do have a follow-up question around etcd. A lot of on-prem customers are forced to use it today. Do you have any advice for them on how to set it up and how to use it going forward?
Joe: 00:32:53.530 Yes, that's a very loaded question.
Chris: 00:32:55.366 Validate your backups?
Joe: 00:32:57.232 Yes, definitely. It's very easy to automate snapshot validation. We learned the hard way that you should, so we're doing that today. Outside of that, I mean, one of the things that we kind of learned from various etcd outages or memory leaks is that if, generally speaking, everything is done via GitOps, so infrastructure as code, and every other bit of source of truth you need is in IDM, and you don't have any local users, that etcd cluster is effectively disposable. Assuming that you're either using HSMs or you have your CA backed up somewhere, you can effectively throw that etcd instance away, bootstrap a brand new one, and put in the keys and values that you need, like your CA. So oftentimes, it's actually a faster path to resolution when etcd is screwed up, because it will inevitably get screwed up. I think outside of that, testing etcd at scale, right, ensuring that what you're doing... etcd has a great operations guide published online. Disk IOPS are hugely important. Network latency is hugely important. Pay attention to those things. Don't try and run etcd globally. Those are some lessons that we've learned. And I think, yeah, follow the etcd operations guide. I think it's the best advice I can give you.
Chris: 00:34:21.634 And don't upgrade them all at the same time.
Joe: 00:34:24.687 Yes, you can do minor upgrades and leave them in an inconsistent state.
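Editor's note: as one possible starting point for the automated snapshot validation Joe mentions, the sketch below checks a snapshot file's integrity trailer. It assumes the file was produced by `etcdctl snapshot save`, which, to the best of our knowledge, appends a SHA-256 digest of the database contents to the end of the file. The function name `snapshot_hash_ok` is hypothetical, and the panel does not describe how Jump actually validates snapshots.

```python
import hashlib
import sys
from pathlib import Path

# Length of the SHA-256 trailer assumed to be appended to the snapshot.
HASH_SIZE = hashlib.sha256().digest_size  # 32 bytes


def snapshot_hash_ok(path: str) -> bool:
    """Return True if the trailing digest matches the rest of the file.

    Reads the whole file into memory, which is fine for a sketch but
    should be replaced with chunked reads for large snapshots.
    """
    data = Path(path).read_bytes()
    if len(data) <= HASH_SIZE:
        return False  # too short to contain a body plus a digest
    body, trailer = data[:-HASH_SIZE], data[-HASH_SIZE:]
    return hashlib.sha256(body).digest() == trailer


if __name__ == "__main__" and len(sys.argv) == 2:
    # Exit non-zero on failure so a cron job or CI step can alert on it.
    sys.exit(0 if snapshot_hash_ok(sys.argv[1]) else 1)
```

A hash check only catches truncation or corruption. A fuller validation would also restore the snapshot into a scratch cluster and confirm that the expected keys are present.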
How HSMs are used
Ben: 00:34:31.637 Can you talk a little bit about how you use HSMs?
Joe: 00:34:36.331 Yes. So yes, I mean, obviously, we lobbied for the HSM functionality within Teleport. HSMs are plugged into the server. Teleport uses them. It's that simple, I swear. Now, there's some CA migration, depending on if you're starting with HSMs or not, but the documentation is really good. Test it first, obviously, please. But yeah, it just works. Yeah.
Ben: 00:35:03.676 Any other questions from the audience? Okay, so please give me a warm round of applause for — [applause]
Joe: 00:35:15.042 Thank you.