
Identity-Based Data Security on AWS - overview

Modern data drives business value. But the speed with which it is created and accessed across a global AWS footprint increases risk considerably. The old ways of securing data – VPNs, shared credentials stored in a secure vault, offsite backups – are no longer sufficient and don't work at cloud-scale. Join Teleport CEO Ev Kontsevoy and Open Raven CEO Dave Cole as they present a practical view of modern data security in two parts. First, they will look at protecting data infrastructure on AWS to ensure that compromised applications or credentials can't be used to escalate data access privileges. Second, they will look at protecting data itself on AWS by answering fundamental data questions such as where it is located, what types there are, and how it is secured.

Key topics on Identity-Based Data Security on AWS

  • The challenges of securing access span both human access and machine access.
  • Companies’ typical approach to security often begins with securing infrastructure, then data, then the larger attack surface.
  • Classifying data is important in order to identify what type of data is lost during a breach.
  • In many cases, breaches happen due to human error, by "the good guys" making unintentional mistakes.
  • The cloud has powerful capabilities that come at a cost: complexity.
  • We're so dependent upon technology that protecting it through cybersecurity has become vital.


Introduction - Identity-Based Data Security on AWS

(The transcript of the session)

Ev: 00:00:03.114 So the topic of today's webinar is data security and accessing data in the cloud. I'm Ev Kontsevoy, the co-founder, CEO of Teleport. For those of you who don't know much about Teleport, we are the easiest and most secure way to access infrastructure. But the topic of today's show is not infrastructure. As I said, we're going to talk about data and identity-based data security in the cloud. And to help me talk and reason about this complicated and very important topic, we have Dave Cole joining us. Dave is the CEO of Open Raven. And Open Raven is a cloud-native data classification platform for security. For those of you who don't know what this means and why it matters, you are about to find out. And as Adelia said earlier, if you have any questions for us, please ask them using the Q&A option on Zoom. And we will have a Q&A session in the second half of this conversation, which I will be moderating. But before we start talking about data security and data classification, maybe we can kick this off with talking about access because at the end of the day, when hackers are trying to break into your infrastructure, it's the data they most likely are interested in. It's very unlikely that they're trying to hack you just to mine Bitcoin. So data is what we're protecting.

The challenges of securing access

Ev: 00:01:35.927 And I wanted to kick this off with kind of enumerating the problems and challenges that we see that organizations face to enable secure access to their data. So the first concept I wanted to discuss is the concept of a surface area and why this matters. So when you have your data set, which is extremely valuable to you, if it's valuable to you, it means that your employees or your customers, they need to have access to this data because otherwise, it's useless. If no one has access to data, then the data cannot actually be put into good use. So if you have this extremely valuable data, let me demonstrate the importance of the surface area concept using just one example. If the data that you're protecting is hosted in, let's say, a MongoDB database — it could be anything, it could be Oracle, SQL Server, MySQL, Postgres, whatever is used. Like the picture won't change. I'm using Mongo just because it's a cool word. So let's start counting how many different ways that data can be accessed. First of all, MongoDB listens on its own network socket. So you can access MongoDB data by simply connecting to a MongoDB address, network address, that it listens on. Which is kind of obvious. That's one way. Which means that you need to figure out how you enable encryption, connectivity, authentication, authorization, audit, role-based access control, and all of these fancy things for protecting that network socket. That's one.

Ev: 00:03:24.192 Second. If you think about it, MongoDB can be accessed actually via SSH. So it could be an engineer. Or it could be an attacker. It could be a hacker. They could actually gain access to that data by not talking to Mongo, but instead they could just establish an SSH session into a machine where Mongo is running, and then just do like a Mongo dump of a database. That's access method number two. But if the Mongo is running on a VM, on a cloud account, then you can actually do a snapshot of that VM or EBS volume using a cloud API. So that's a completely different door, if you will, to access that data. So that's number three. And then now imagine if your MongoDB is running on a Kubernetes cluster. So Kubernetes itself has an API. So you could get into the cluster, into the Kubernetes cluster. And then from being inside of Kubernetes, you can get access to that Mongo volume. And you could steal data that way. So that's number four. But also, plenty of organizations, when they use databases like Mongo or anything else for that matter, they may have some kind of web application for managing that database, the web UI, which has its own remote access method. And I can go on and on and on, but I do want to demonstrate that the surface area to access just this one database is actually quite sizable. And it will only grow. So that's one concept that I wanted to highlight.
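
To make the "different door" concrete, here is a minimal sketch of the cloud API path Ev describes: with nothing more than permission to call the EC2 API, the volume backing the database can be snapshotted and its contents copied without ever touching MongoDB or SSH. The region and volume ID below are placeholders, and this assumes boto3 and AWS credentials are available in the environment.

```python
# A minimal sketch of access "door" number three: copying data via the cloud API.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# "vol-0123456789abcdef0" is a placeholder for the EBS volume backing the database.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="copy of the volume holding the MongoDB data directory",
)

# Anyone allowed to make this one API call now effectively has the data,
# regardless of how well the database's own network socket is protected.
print(snapshot["SnapshotId"])
```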

Human access

Ev: 00:05:00.538 And the second dimension that you could use to reason about the surface area of who has access to your data is the dimension between like — from where these connections will come from. On one hand, you have humans. And on another hand you have machines. So let's call them robots. And that is also really important to understand. So for example, when we talk about humans expanding the surface area for accessing your data, so that could be your software developers who may need to access data simply because they're building applications. Or it could be your DevOps engineers, the teams that manage infrastructure, and actually scale infrastructure. Which means that they obviously need to have like logins, identity, roles, certain roles, who can do what, and whatnot. And you need to multiply all of these people in different roles, what they do, by all of these different methods that infrastructure would be accessed because that defines the surface area. But it becomes even more interesting if we're talking about robots, like what machines do. So you have your CI/CD pipelines. You have tools like Jenkins. And Jenkins essentially is a superuser who could do a lot because these CI/CD pipelines often are stuffed with credentials because they're constantly accessing infrastructure. So if you need a piece of automation to do a backup of a database, that piece of automation has a login. Or if you have like a backup solution, so that backup solution usually has credentials injected into it.

Machine access

Ev: 00:06:42.046 And then also on the machine side, you have your actual application. You have microservices that are actually doing something with the data. And those microservices also use some form of credentials to be accessing this data. So now, if you combine all of the above, so you have the kind of different entry points into your data, and then you have humans, and then you have machines. And now all of this is kind of scaling with time. That is the challenge that we see organizations face. It is like how do you compress that surface area and make it easier to reason about, especially as organizations grow. So that is the [inaudible]. I wanted to kick off this conversation with kind of framing this access challenge first simply because we — even before we get into the nature of the data, and before we get into use cases, how data is used, it's just important to keep in mind that simply enabling, making data accessible is a huge challenge in itself. So I'm looking forward to having a conversation with you, Dave, about that. But first, let me pass the mic on to you. And can I invite you to introduce a few important concepts that matter to organizations simply because accessing data is not a binary proposition? Not all data is created equal. So how do you guys see this problem, and maybe it could be a good opportunity to introduce the notion of data classification for security.
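
As a rough illustration of the machine-access pattern described above, the sketch below shows a piece of automation — a nightly database backup — with a standing credential injected through its environment. The environment variable name and the mongodump invocation are illustrative assumptions, not a recommendation; the point is that the secret sits on the machine for as long as the job exists.

```python
# A minimal sketch of automation with a long-lived credential baked into its environment.
import os
import subprocess

# A static credential, injected at deploy time and rarely rotated.
MONGO_URI = os.environ["BACKUP_MONGO_URI"]

def nightly_backup(destination: str = "/backups/latest") -> None:
    """Dump the whole database using the standing credential."""
    subprocess.run(
        ["mongodump", "--uri", MONGO_URI, "--out", destination],
        check=True,
    )

if __name__ == "__main__":
    nightly_backup()
```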

Companies’ typical approach to security

Securing infrastructure

Dave: 00:08:31.073 Yeah. And thanks, Ev. And I'll start with just a statement that we know that people often don't start with protecting the data. Typically, there's sort of an implied maturity model in the cloud, if not a stated one, where you start with your infrastructure. And the cloud security posture management tools have been around for a long time. That's typically where folks begin, is what's internet accessible, what needs to be locked down, where are there obvious misconfigurations and so forth. That's kind of step one. People move from there, often, into, "All right. I'm tackling things in production. I'm handling those issues. Let's shift left, as they say, and see how we can catch things before they go out in production." Whether it's in the application itself, whether it's at deployment, Terraform, or so on, they make it easier so they're catching it before it goes into production. And then what we did — and we went through the same progression, whether it was — when it was at CrowdStrike, Tenable, or now at Open Raven, then you look at it and say, "Wow, there's a lot of different ways that access can be obtained to these." I think you did a great job of laying out sort of the access attack surface. And folks use Teleport. And they use solutions like Teleport. We use Teleport in order to shrink down that attack surface and make it easier to audit, make it easier to manage, make it simpler. SREs aren't falling off trees, especially right now. We want them to be as productive as possible. So there's an obvious win there, both in terms of productivity and security.

Securing data

Dave: 00:10:07.940 And after that, you end up having to take on the data challenge. And typically, it happens because you've had a leak, you've had a breach. And you've ended up in a situation. And I can't tell you how many times I've seen this as part of an incident response, where an organization has had data they've tracked. They've followed the kill chain back. They know there's been a breach. And all of a sudden it says, "Okay. What's in that bucket? What's in that database? Because if there's European data in it, it's a GDPR response. And if there isn't, it's a different playbook." That's some of the times where we start to get — that's one of the catalysts for starting to get serious about data. The second one could be GDPR. And when GDPR hit, I can't tell you how many uncomfortable questions, all of a sudden, you're asked about where your data sits, what type of data you have, how it's being protected, whether or not you should even have it at that point, and whether it's just creating costs for you. So whether it's compliance or a leak or a breach, that's often the catalyst for that final step in the maturity model, which is getting serious about your data security.

Securing the larger attack surface

Dave: 00:11:15.523 And as you said at the beginning, yes, there's other reasons why people hack stuff, whether it's service disruption, whether it's Bitcoin mining. But every breach and data leak has exactly one thing in common, the data itself. So we looked at this and said — we basically turned inside out everything we were hearing from the security leaders when we started talking to folks about this and kicked off the company in 2019. Their key questions were, "Where's the data? What data do I have? And how is it being protected?" And it's different than on-prem. It's fundamentally different. I'm not sure we had great answers to these questions on-prem. But it didn't matter as much because it wasn't as exposed. Just like you were saying, the surface now is so much greater. You have all these different means of accessing something. How many people handle data in the cloud now? There isn't like a DBA that's sitting there jealously guarding your Oracle server. Now it's your salespeople, your product people. It's a data science team. There's a whole cadre of folks who have access to the data itself and need access to it. But having said that, it's so easy to make a mistake, to copy, to change, to misconfigure. And if you look at the leaks that have happened out there inside the public cloud, which is where 60% of the world's data is, and increasingly more, the bulk of these are — it's entropy. It's people making mistakes at scale. And just as you laid out how hard it is to actually get configuration right for access, take something like S3, which has been around since 2006. It's got 16 years of fervent development, energy, and progress behind it. Which is to say it's complex as hell. It's hard to get it right.

Dave: 00:13:04.576 And here we sit. People are calling it the shared responsibility model. Phil Venables, amazing guy, and an investor in Open Raven, has recently called it the shared fate model, which I think is a little more appropriate. And what survives the shared fate model at the end of it? Well, it's identity, and it's data. Those are the areas that kind of fall squarely on you as a customer. And it's hard. It's hard to get it right. And it's hard to get it right for intrinsic reasons that aren't going to go away.

Complexity as the cost of cloud capabilities

Ev: 00:13:37.504 Dave, I just caught myself listening to you and obviously nodding and agreeing. But don't you think that, you and I, we might come across as like anti-cloud people — because you mentioned on-prem as like good old days. Like it used to be much nicer back then because it was simpler. Like I think, you and I, we both love the cloud. The cloud is a good thing.

Dave: 00:14:03.045 Yeah.

Ev: 00:14:03.369 We just need to acknowledge that all of these wonderful capabilities — they come at a cost.

Dave: 00:14:09.279 They do.

Ev: 00:14:09.806 And the cost is complexity. But going back to something you said earlier, when you try to understand what kind of data you have, so you've provided two examples. And I think both of them were kind of obvious. Like data for production versus data for staging and data for testing, obviously, completely different risk profile for those. You also mentioned data that contains information from European customers that make you exposed to GDPR. Do you have any other examples of data classification that might be useful for the audience to understand like in this context?

Finding sensitive data to identify would-be targets

Dave: 00:14:48.497 Yeah. And it's important that when we talk about data classification now, we're really specific about what we mean. What I mean is finding the sensitive data automatically across an organization's environment. So this isn't going in and saying, "What's top secret? What's confidential," which is one type of data classification admittedly. This is going through, and assuming you have swaths of data, billions of objects, if not hundreds of billions in objects, hundreds of terabytes, if not petabytes, going through that and finding the sensitive data that's out there. It might be regulated data like PII or PHI, patient health information or personally identifiable information or personal data. That's one type. It might be crown jewel data. It might be source code. It might be manufacturing files. Or it might be developer secrets that were left — SSH creds, API keys that were left in the environment mistakenly written to a log or so forth. So it's going across the environment, finding that sensitive data so that you know exactly where the would-be targets are. And you're seeing the cloud more and more becoming the target. I mean, it's like that old maxim, right. It was like, "Well why do you rob banks? It's where the money is." Like the data is in the cloud. It's not a big secret.
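
A simplified illustration of what "classification" means in this context: rules that recognize sensitive data types in raw text. The patterns below are deliberately minimal examples for the data classes Dave mentions; a real scanner (Open Raven's included) uses far richer rules and validation.

```python
# A toy rule-based classifier: map data class names to recognizable patterns.
import re

CLASSIFICATION_RULES = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def classify(text: str) -> set:
    """Return the set of sensitive data classes found in a chunk of text."""
    return {name for name, pattern in CLASSIFICATION_RULES.items() if pattern.search(text)}

# Finds both an email address and an AWS access key ID in one log line.
print(classify("user=jane@example.com key=AKIAABCDEFGHIJKLMNOP"))
```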

Dave: 00:16:04.049 And it's in the cloud because the tools are amazing. I mean, try and find great data science tools on-prem versus what you can get, the scale and flexibility that we enjoy in the cloud now. And going back to your previous point, we did a webinar a while ago because there was some of this talk. It was like, "Gosh, is cloud more secure?" Well on-prem people, like your on-prem data centers, when's the last time you updated your firmware? It's nice to have someone else update your firmware and the physical security things. That's not a bad thing. There's an amazing bunch of advantages in the cloud. But like you said, complexity and scale — they're real challenges we have to take on. It's very, very different.

The importance of classifying data

Ev: 00:16:46.748 So a couple of things you mentioned made me realize that some people might confuse the data classification problem as, "Oh it's a product for people who don't even know, themselves, what kind of data they have," but I would argue that most of us don't know what kind of data we have. Even as an engineer, last time, when I was involved in creating software myself, I remember that we were pleasant — let's say unpleasantly surprised that application logs that we were collecting, like just for our own development purposes, accidentally contained personally identifiable information from customers. And it's just so easy to do. Like you catch an exception in your code, and you just dump the entire error object into the log so you can inspect what is happening later. And then it goes into Elasticsearch or whatever. And it turns out that you're actually dumping like a complete user profile, for example. It's just for debugging purposes. But this makes you exposed. And that's the kind of data that needs to be protected. So discovering and classifying data is really important. You also mentioned something interesting too when you said that the humans make mistakes, and then breaches happen. This actually is probably the most frequently asked question that we would have with organizations that reach out to us trying to understand like what are the risks, how are we exposed, like what sequence of events will lead to us losing data or getting our data encrypted for ransomware purposes.
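
The logging mistake Ev describes is easy to reproduce. The contrived sketch below shows the difference between dumping a whole user object into an exception log and logging only an opaque identifier; the field names and the failing function are hypothetical.

```python
# A contrived sketch of PII accidentally flowing into application logs.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("billing")

user = {"id": 42, "name": "Jane Doe", "email": "jane@example.com", "ssn": "123-45-6789"}

def charge(account: dict) -> None:
    raise RuntimeError("card declined")

try:
    charge(user)
except RuntimeError:
    # Risky: the full profile, PII included, now lives in Elasticsearch indefinitely.
    log.exception("charge failed for %r", user)
    # Safer: log an opaque identifier and look up the details only when needed.
    log.exception("charge failed for user_id=%s", user["id"])
```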

Dave: 00:18:30.077 Yeah.

Ev: 00:18:30.689 And the story we always share is this — again, I don't want to sound like an absolutist. But I will say that most breaches start with a human error. Human error is just the fundamental foundation, the cornerstone of all attacks. So they find ways to exploit a human mistake just to get in. And the second thing that always happens is pivot because usually, you only hack into one little thing, and then you try to expand from there. It's usually done through trying to elevate privileges. Or you try to exploit — you go sideways. You try to find vulnerable workloads nearby. But going back to the origins of every attack, it's a human error. So then it pays to pay attention to the probability of that happening. So what is the probability of a human making errors? And what goes into calculating those probabilities? And the reason why the industry right now is moving away from passwords and any other forms of secrets, like private keys or even API keys, is because most of these human errors, that's about a secret leaking. So there are marketplaces on the dark web where credentials are actually being sold and bought. So that's almost like an intentional error. So you have some employees that are selling their credentials.

Ev: 00:20:09.787 In fact, infrastructure engineers are getting targeted by, let's just say, the outreach campaigns by these hacking groups where they're basically reaching out to them. It's like, "We will pay you if you work at Uber and you have access to some specific thing." But also, human errors happen every time there is encryption and decryption involved, so if you, for example, have some kind of encrypted vault or solution for storing secrets in the organization. Which means that decryption takes place somewhere. So there is a moment in time where a human has possession of an unencrypted secret. And that exposes you to that probability of that secret leaking. Which automatically means that the probability of you getting hacked goes up over time as you scale. So the more secrets you accumulate in your infrastructure — and I already gave an example why that would be the case simply because you'll have more and more and more entry points, because your infrastructure is getting more complex, you're getting more and more and more components that could be used to access data. So the probability of you accumulating more secrets goes up. And then the more humans you have involved in the process of computing in general, that also drives the probability up. So this is why we now believe that the presence of secrets of any kind in a data infrastructure is an ongoing liability. So you need to move away from secret-based access to identity-based access. And identity-based access basically means that there is one place where you store your secret, your identity of all of your employees. And then you have a system in place that kind of propagates that identity across the infrastructure. So the surface area doesn't grow as your organization grows. That's the value.

Dave: 00:22:02.351 Yeah. Yeah. Yeah, it's a great concept. You have something. You have a cloud environment that's horribly complex. And how do you deal with it? How do you get it down to something manageable? The only way you can get it down to something manageable with scarce human resources — which by the way, in the economic environment we're entering into, human resources are going to be even more scarce, more than they've been before. We're going to be pushed to do more with less in this environment. The mistakes, the speed at which people are moving is only going to accelerate. It's way too easy to make mistakes. How do you get out in front of it? Reduce the surface, cut down the problem set that you have to deal with. And that's part of the mantra with why you would do data classification across your environment. Why would you take a data-centric approach? Part of it is because you can't protect everything the same. You don't have time to. So why not find out where the sensitive data is, reduce it to where it should solely be, make sure it stays there, and then make sure you lock down the stuff that really matters to the greatest extent possible. You can't afford to treat everything the same. And it's just bonkers that it's taken us so long to get to this place as an industry with identity and even more so with data. But I think it's the only way to win the game with the scarce resources, is you've got to shrink the surface down for identity and data, the last two things you own in the shared fate model. You've got to shrink it down. If you try and deal with all of it and treat it all the same, you don't have a prayer of staying out in front of it, especially in the environment we're going into.

Ev: 00:23:37.156 That actually makes me think. Like when you go on Hacker News or any other place where you get your tech news, and there's this kind of news about a company being exposed, I always ask myself, why are we reporting this accident in such a binary way. Like there is an organization, and they get popped. And they lost data. But not all data is the same. Wouldn't it be great if we had a kind of standard classification, class one, class two, class three, class five? And then when you see a report about some company being hacked, then it will just say, "Well the investigation has determined that only class three data has been leaked. But class two and class one —"

Dave: 00:24:14.435 Hurricane ratings.

Ev: 00:24:15.152 Yeah.

Dave: 00:24:15.908 Yeah.

Ev: 00:24:16.641 Which is probably unrealistic to expect that to happen. But if we are dreaming about standardization, do you see any changes in kind of regulatory environment that we live in that almost forces organizations to keep track of kind of different types of data, different buckets, and at least they would have a good picture internally to have kind of very similar conclusion, like, "We got hacked. And we lost class two and class three data," or whatever, "but not class four"?

Increasing attention to the laws in place

Dave: 00:24:50.897 Yeah. I mean, GDPR has been the gold standard. And if you look across the globe, the more [inaudible] — all of the better-done privacy, data privacy and data security regulations are roughly drafted off of GDPR. And it'll be interesting to see what happens in the US. It's starting to feel like there's actually an appetite for a federal data privacy law and security law because the two are kind of commingled, right. It starts with privacy. It primarily looks at consumer data. That's what they're trying to protect. But yet it blends right over into security when we talk about things like breach response and so forth. So it's interesting. I don't think we're going to see — I think what we'll see is just the continued drumbeat of GDPR. GDPR fines, I think, were at 1.3 billion last year, just a massive increase. During 2020, it kind of took a little bit of — there were fines. But there were a couple of big ones. But it wasn't as noisy because we had the pandemic to deal with and other things. And I think what we're seeing now is a steady drumbeat of fines and really not something new and crazy taking hold. But GDPR is starting to get very serious. I mean, at the point to where John Oliver is doing a segment on data brokers, which by the way, was brilliant if anybody hasn't seen it and just wants to understand the problem. I mean that's really what a lot of these laws are targeting — companies like that, at least the ones in the US, because they're the most egregious offenders. But yeah, I think we're seeing the laws that are in place starting to really take hold and get serious as opposed to new regulations. I think the older ones are getting teeth. And it does feel like we're starting to head towards a federal data privacy law. It won't happen during this administration. But it starts to feel, for the first time, that things are moving in the right direction.

Regulating cybersecurity

Ev: 00:26:49.541 So you think it's the right direction. Well like some people might say like, "Well you two are CEOs of security companies. Of course, you should be interested in more regulation because that drives more business." But let's kind of put our personal self-interest aside and think about it like, let's say, from first principles. We just recently had a conversation with famous security researcher, Bruce Schneier. And he had this point that he made, and in fact, he wrote a book about it too, which you could probably find, where his point is that security is not something that is done with commercial interest. Like sometimes, people have a kind of similar argument for health care, that if you rely on capitalism to build secure solutions, it's just not going to happen simply because people tend not to value security. They don't pay for security. And the industry that he was making an analogy to would be aerospace. Like it actually would be cheaper to design airplanes, but they would be kind of less safe. But if you are a passenger on an airplane, you have no idea how the airplane is built. This is why we have all these regulations. This is why it takes a lot of time and money to certify a new aircraft to be viable for transporting people in it. And his argument was that we need something similar for basically all of software simply because capitalism is failing. So what is your take on this? Do you believe that security regulation is a good thing? And if it's a good thing, how much is a good thing? Because probably too much of anything is not necessarily good. So what's your take on kind of government stepping in and helping us develop better standards for understanding how secure or insecure a certain solution is?

Dave: 00:28:43.737 I think we've reached a point where cybersecurity is just — we're so dependent upon technology. It's such a vital part of our lives. I don't think we have any choice. These things that touch — they're no longer cybersecurity issues. They're human issues at this point, right. When you've got a point where you have a device that's attached to you 90% of the day and can demand your attention anytime, it intersects into your health, intersects into how you digest information, how you get your job done, how you interact with your kids, at that point, these aren't cybersecurity issues anymore. They're human safety and privacy issues. And I don't think we have any choice. And I think we've — the tech industry has done a really crappy job making the case that we can regulate ourselves at the end of the day. I don't think we've made a cogent argument in that direction. And I'll say that at the end of the day, practically speaking, do we think we're better off with 40 different state privacy laws that are all different? Or are we better off with one federal privacy law that we can actually understand? I mean, you've got things like the CCPA and now CPRA in California, which are becoming de facto data privacy laws for the US. And if you understand how these laws were put together, you would not be impressed. CCPA was famously quoted as being GDPR written in crayons. CPRA is better. But CPRA doesn't come into place until next year. And even then, they've got to ramp up and actually have people who can enforce these fines and so forth. As much as I'm not an advocate of big government, I think at the end of the day, these are fundamental human issues. And we're better off with one federal law with respect to data privacy and security, provided that it's reasonable as opposed to an avalanche, a tangle of legislation coming out of the states.

Ev: 00:30:41.169 Yeah, all valid points. Especially about fragmentation of regulations, that's incredibly annoying. I totally agree with you. My concern with the government stepping in is like if they try to regulate the "how" instead of the "what." Like one example, maybe it's — I don't believe there's a specific compliance standard that fits it. But for example, Gartner, when they classify you as being a privileged access management solution, like a good one, they would say like you have to have like password rotation implemented. If that is official regulation, that's not a good one, because we, now as an industry, realize that any form of password, rotated or not, is a liability. Which means that if you have a regulation that requires password rotation, that just automatically makes your solution less secure. So that would be very unfortunate. We actually have a few — questions are starting to trickle in. And this one is for you, Dave. So Vincent is asking, "At what points do you think data should be classified for better incident management?" And I'm assuming that, "At what point," that's point in time.

Dave: 00:32:02.944 Yeah.

Ev: 00:32:03.231 But if that's not the intent, then Vincent could probably correct us in chat.

Dave: 00:32:07.219 Yeah. Yeah. And I think it's at the point to where you have enough data, to where you feel like you don't know where it all is or exactly what type of data you have. It's the point to where you feel like you may not know. There's zero harm. Like we do data risk assessments for people. Just let us show you. Give us a cross-section of your environment, give us a number of buckets. We'll go through it and tell you what's there. And it's like you said before, Ev, it's so easy just to make a simple mistake and dump credentials and put PII and put sensitive data and accidentally expand the scope, increase risk, but also expand the scope of something like HIPAA if you're dealing with patient health information, or personal data, and all of a sudden, you've got a GDPR issue or a PCI or so forth. It's incredibly easy. So at the point to where you feel like you might want to check things out, great.

Dave: 00:33:03.958 And it's easy to do. It's not like it used to be. The old-school data classification projects and even the early cloud ones were expensive and painful. We've come up with — you have to respect data gravity. There's so much data there today. If you try and move or back up all data at scale, there's no way it works. We use a Lambda-based serverless approach for doing it. And we think it's not easy. It's hard to develop. It's the big place we've invested a lot of our time and energy. But we think it's the only way to do it at scale. And it can be done cost-effectively like that. So yeah. It's at that point. And especially if you've done an IR, and you felt the pain, it's — yeah. It's really nice to know if you get a message like that fateful message from AWS, "Your account is showing signs of being compromised," if you've ever gotten that. It's really nice to be able to say, "This is exactly the data that's in that account and what might have happened and what we should be concerned about." And like you said, if you know your data is there, "Hey, it's a class five. There's nothing to see here. All they got were cat pics. Beautiful. Move on with your day."

Classifying data at rest or in motion

Ev: 00:34:13.075 Just a follow-up question, so when you were talking about doing it at scale, did you mean scanning data at rest or doing this classification while data is on the wire or both? How does it work?

Dave: 00:34:25.296 Yeah, you can do data at rest or data in motion. We chose data at rest because that's typically where you have problems, right. Like as you said before, typically, it's a mistake you made or something you did quickly that you intended to clean up. I mean, how many of us incentivize our engineering teams on cleanup and hygiene? How many of us incentivize them on getting clean burn downs and just getting stuff done and releasing things as fast as possible? When you do that, which everyone does, you make mistakes. Things get out there. That's data at rest. That data may never be in motion. But it's sitting there creating risk. The dark data, people estimate dark data, unused, untouched data at 70 to 90 percent of what's out there. If you deal with data in motion, like there's valuable use cases there too. But you're dealing with like 10 to 20 percent of the data that's out there if you're just looking at it in motion. So we look at data at rest because we think that's the bulk of the problem.
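
As a rough sketch of the data-at-rest, serverless approach Dave outlines (not Open Raven's actual implementation), a function like the one below can be pointed at a single S3 object, stream it in place, and report what it finds, so the data never has to be copied out of the account. The event shape and the single pattern are assumptions for illustration; boto3 and AWS credentials are assumed to be available.

```python
# A minimal sketch of scanning one S3 object at rest from inside the account.
import re
import boto3

# One example rule; a real scanner carries many classification rules.
SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

s3 = boto3.client("s3")

def handler(event, context):
    """Lambda-style entry point: scan a single object in place and report findings."""
    bucket, key = event["bucket"], event["key"]          # assumed event shape
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    hits = 0
    for line in body.iter_lines():                       # stream; never copy the object out
        hits += len(SSN.findall(line))
    return {"bucket": bucket, "key": key, "ssn_like_values": hits}
```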

Productivity vs. security

Ev: 00:35:20.091 Yeah. And also, I'm fully aware that people who are listening to you and me right now, they're probably thinking, "Well like these two guys, they're telling me that I need to do more for the sake of security." And that's probably a fair assessment. And which reminds me that security always comes with a tradeoff, like productivity versus security. Like you will achieve more if you don't care about security. You will launch new features faster if you just build things without thinking about security. So this tradeoff of productivity and security, just overall, I'm starting to like — maybe we should address it head-on.

Dave: 00:35:56.444 Yeah.

Ev: 00:35:56.698 Like what do you think about productivity versus security, Dave?

Dave: 00:36:00.309 Well yeah. Well I think the greatest example of this is another security company where people always thought like, "Oh well this 2FA stuff sucks. And you can't get better security without impacting usability." And then you introduce something like Duo. And I think Duo did an amazing job of an iterative improvement that was meaningful from a UX perspective. And all of a sudden, we got better security, at the time, with better usability as well. And I'm not sure that those bygone tradeoffs — I can tell you, if I took away Teleport from my head of Ops, he would scream that not only his security was impacted, but his productivity too. He would complain about both of them. And I think that's also true for what we're advocating for data security. If you go out and you find out where your sensitive data is so that you can focus there, yes, you have to do a little more upfront. You have to do something proactive. But afterwards, think about all the crap you don't have to worry about and the things you don't have to respond to because you know exactly where the important data is. It allows you to dial up monitoring in certain areas and dial it down in other places. So in order to do less, yeah, you have to do a little bit more upfront. But having said that, I think this maxim of, "Oh well you have to trade off usability for security," I think it's past its expiration date.

Ev: 00:37:24.942 Yep. Yep. Vincent in chat says that "A tradeoff should be balanced."

Dave: 00:37:33.019 Yeah.

Ev: 00:37:33.688 Absolutely, easier said than done. Just to share some stats that we have on this problem is that obviously, companies dial in different tradeoffs. So the balance is different. Like if you look at a really small company, startups that recently started, they're basically in the business of not dying. They haven't found the product-market fit. They haven't built the MVP. So they're basically coding, coding, coding, and just like trying to get something out as soon as possible. Security always takes — well almost always takes the back seat. So that's one form of balance. And then you probably can extrapolate into like you are a 50-year-old tech company. You're a giant. You have offices all over the world. You have solid processes. And so we have some stats to describe both of these kind of extremes. So on the kind of lightweight security side, we recently just did a survey. So we hired a company that went out and interviewed VP and CTO types at SaaS companies that basically run software for other people, build and run software for other people. I'm not going to mention any companies, obviously. This is all anonymous. So 86%, 8-6, 86% of organizations, they have shared with us that they cannot guarantee that their former employees, engineers can no longer access production data, 86. If you ask me, I think it should be on CNN. That is news because we all rely on software as a service to do everything. We file taxes online. We do online shopping. We trust these companies with our data. And 86% of them actually have no control over it because that's really what it comes down to. So if your former engineers probably — because they basically said, "We don't know, maybe." It means that probably, former engineers still have access to data. So that's one stat.

Ev: 00:39:30.627 But then also, kind of what happens if you tighten security too much, when security gets in the way of people getting things done, so we have another stat. And you know what, it's just like — it escapes me right now. But I believe that over half of companies, they stated that, "We implemented new security measures for accessing data. But they failed to be adopted by employees." On the surface, it's kind of a meaningless thing. It's like you're saying, "Hey, I have this building. I'm a building manager." Like building is the infrastructure, and the data is inside, okay. And we just installed a new door. But 56% or whatever percentage of tenants don't use the door. That is not possible because how else will they access the data. The answer is there is another door that you don't know about. So what this stat is telling us is that companies that make it really, really hard to access data for their own workforce, they are asking their engineers to build back doors into the infrastructure. And when we look deeper into it, like how this actually happens, well it's easy to understand why this is happening with this cloud environment. It's because you're running on Amazon, on AWS. You are provisioning infrastructure with code using Terraform, using CloudFormation, whatever. Okay. Who writes that code that provisions infrastructure? Well engineers do. Those are the very same engineers that you then are trying to lock out of that infrastructure using these draconian measures. You're making their life miserable. But they actually want to get work done. So what they do, they provision infrastructure because provisioning is actually like the number zero security event, because whoever provisions infrastructure effectively has all control over it.

Dave: 00:41:22.396 Yeah.

Ev: 00:41:22.823 So they just basically plant another way of accessing it. So that was quite a revelation. So this is what you will do to your own organization if you over-tighten security to the point where it's impossible to use.

Dave: 00:41:37.365 Yeah. And your analogy, I mean, the people who are building, who are provisioning are making the damn building today in a software-defined infrastructure world. And at the end of the day, they're incentivized. And intrinsically, they want to build stuff and get stuff done. If you put blockers in their way, that same surface, the expanse of the surface that you talked about before, is where they go. There's so many different ways of getting around it. And it comes down to their motivation, which is they want to get stuff done. That's a good thing. The trick for us, as people building security products and security professionals, is we have to make it easier for them to do the right thing than the wrong thing. You've got to make it simpler for them. And I think that's fundamentally what it comes down to. And this is why that old tradeoff of like, "Well we'll just make it less usable," no, no, it doesn't fly anymore. You'll never get there because there's too many ways to get around it. There's far too many ways.

Ev: 00:42:39.927 Yep. Sometimes, I even think that — like it's interesting that if I meet engineers who are not in the security space, and they start asking me questions about what we do and how it's getting done, I'm always amazed at how the focus is misplaced. Like everyone is interested in crypto, not the cryptocurrency, but the crypto for security purposes, like the real crypto. Or they would start — like it's always like a technology-centric conversation. Meanwhile, what we're finding more and more of, and I think it kind of came up in this conversation, is that at the end of the day, security is about minimizing probability of a human error. It's almost like this psychological mindset that you need. Like you have thousands of people within your organization. They are generally good people who are trying to get things done. So how are you organizing their workday so they're not exposed to accidents? So if you're forcing them to rotate passwords all the time, eventually, you will have someone who is under stress, and they will just write down the password on a sticky note and put it on the monitor. So it all comes down almost to this simplified view: security is basically user experience. And it goes back to what you guys do, like understanding what data matters more versus what data matters less. It makes it easier —

Dave: 00:44:04.603 Yeah.

Ev: 00:44:05.206 —to reason about data and data security. And whatever is making it easier, that reduces probability of human error. Also, a random question came up through Q&A. How is data discovery and classification different from DLP? I'm assuming it's related to Open Raven, or it's just generally about the industry. And what is DLP also, for those who don't know?

Dave: 00:44:32.839 Yeah, the origins of DLP go back to — god, probably the modern origins of it go back to the late '90s, early 2000s, where we were very concerned about insiders leaking data out of an organization. And the classic DLP alert is Bob's trying to send a spreadsheet that has social security numbers in it, should this be allowed out of the organization? And the most common answer to alerts like that from DLP was, "I don't know, hell if I know. Is Bob in finance? Is it Susie in HR? Is it Joe in marketing? Is it Marissa in sales?" Basically, it's this category of technology that was an endpoint agent which focused a lot on leaking office documents out of an organization and the insider threat. Contrast that with where we are today, 60% of the data is in the public cloud. The bulk of that is in IaaS and PaaS and enormous mounds of JSON, CSV, Avro, Parquet, and so forth. The world is just very, very different. So we are not trying to reinvent the DLP model, which was focused on the insider threats and office docs. We reimagined data security for the biggest issues that are out there today, which is the vast quantities of data that are sitting in the cloud where there's way too much of it to keep on top of it, too many people handling it, and there was no means of getting the baseline visibility and controls around it. So we don't use the word DLP, because we don't think it really fits. It's quite a bit different. But that's what we're after.

Minimizing human error

Dave: 00:46:13.481 And, Ev, what you said is so apropos. We spent so much time as a security industry worrying about the bad guys, right, and the bad guys being highly relative depending on where you sit, whether it's China or the NSA or — choose your adversary of choice. But the problem, the biggest issue we have in the cloud, and this comes through in every report that's out there, is it's the good guys. It's the good guys who just make mistakes. That's the key problem we have. And this huge concern and obsession we have with the boogeyman inside security, I think oftentimes, kind of obscures the bigger issue — we just have to make it easier for our own people not to make those mistakes. And we need an equal emphasis on that as we've had in the past of keeping the adversaries out.

Ev: 00:47:06.310 Yep. Yeah. It also reminds me that it's helpful to have this framework in mind that security is all about minimizing human error. When you go through this soup of buzzwords that we, as any other industry, have, and the one that's been driving me crazy lately, like the last few years, is zero trust. People constantly throw zero trust around, asking like, "Are you guys zero trust? Like we need to be zero trust, zero trust this, zero trust this." But then you start to realize that in half the cases, the term is used in the wrong context. So let me just — to step back, what is zero trust, and why did it come around? Well zero trust basically says that in the old days — and computer security and physical world security, by the way, they have always been linked, right, so just the fact that we use words like keys to describe the way to open things. So in the old days, before we had cloud, so we had physical security around our data centers, right. So if you couldn't get into the building, so that was security. Or you couldn't get physical access to something, that was your security. So that perimeter-based security that existed in a real world, we transferred that concept into a computer world. So now we have network. And there is internal network. So you're inside the building. That's like local, like LAN or VLAN or VPC, if you will. And then you have the outside world, that's wide area network, that's the internet. And there's a firewall in between, which is again, like physical security concept that we have. So that's the old model that doesn't really work in the cloud because that perimeter is eroding. So you have devices that run code like in different parts of the world. You no longer have office network. You have employees on the outside. Then you have these workloads sprinkled all over.

Ev: 00:49:04.706 And most importantly, the perimeter-based security makes it easy to pivot. So if the human error is exploited by a bad actor, it makes it easy for them once they're on the inside, to go sideways because everything is wide open. So zero trust says that perimeter security is no longer relevant, period. And it's actually a very simple concept in terms of how you think about your systems. It means that every computer, be it a laptop or a server in a data center, needs to be configured with the assumption that it runs on the internet. Like there is no local network. Like every single connection, it's coming from the outside world. That's all there is. It's actually a very simple concept. And the name comes from zero trust. Like you don't trust anyone on the network. In fact, it basically says networks don't matter [inaudible]. And also why it's useful — because it helps people to avoid making mistakes. It's just like you have this culture within your organization where when engineers design systems, configure databases, configure SSH, whatever, they just assume that every machine in the organization is running wide open on the open internet. Like every single port is wide open. Again, it's a simple mental exercise that makes it easy to avoid mistakes. That's why it's useful.
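
A minimal sketch of the zero trust mindset Ev describes, in configuration terms: the service is set up as if it listens on the open internet, so every connection must present a client certificate over TLS regardless of which network it arrives from. The file paths below are placeholders for certificates issued by an internal CA; this is an illustration, not a hardening guide.

```python
# A minimal sketch: a server that trusts nothing about the network it runs on.
import http.server
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.crt", keyfile="server.key")   # placeholder paths
context.load_verify_locations(cafile="internal-ca.pem")                # placeholder CA
context.verify_mode = ssl.CERT_REQUIRED                                # no client cert, no connection

server = http.server.HTTPServer(("0.0.0.0", 8443), http.server.SimpleHTTPRequestHandler)
server.socket = context.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```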

Ev: 00:50:27.539 But also, at the same time, I see VPN companies that basically implement [inaudible] security. They brand themselves as zero trust network access. So that makes no sense whatsoever. It's like saying —

Dave: 00:50:39.651 Yeah. I mean, at the point where it comes out in an executive memo where Joe Biden says, "You need to do zero trust," like —

Ev: 00:50:47.576 Yeah. Yeah.

Dave: 00:50:48.460 —it's a thing. It's a thing. Yeah. And I think going back to what you were saying before, I mean, the nice thing about zero trust is what does it do? It makes it easy to do the right thing. It really does. All right, we're going to get rid of the VPN. Name one person who liked the VPN, I mean, other than the consumer VPNs that people have used in order to access content in other geos. I remember needing to see Game of Thrones in Mexico. It was the one time that I liked my VPN. Other than that, zero trust is wonderful because you get better security by removing a usability problem. Zero trust has been taken to this place of being an internet meme. But having said that, at the same time, like you said, the core principle is very good. It comes from Google BeyondCorp and some of the stuff they were doing. It's an awesome concept that gives you better security and better usability when it's done right. But dear god, the marketing buzz on this is like — I'm a vendor, and I'm tired of hearing about it. I can only imagine how [inaudible] and other people feel.

Q&A

Ev: 00:51:56.737 Yep. We also have a question for you from the audience. Larissa is asking, "Are there any technologies that you've heard about that can significantly compress data to make it easier for corporations to hold onto and maintain more data behind their network as opposed to in the cloud other than Pied Piper?"

Dave: 00:52:17.839 No. No. I mean, honestly, I think the biggest bang for the buck isn't compression. It's knowing what you have and where it is and then getting rid of the stuff you don't need. At the end of the day, I wouldn't be — the bulk of it, I wouldn't be necessarily compressing, I'd be getting rid of to the extent that you can and simply holding on to what you need or the time-honored tactic of just moving it off to something like Glacier or so forth. But I'll confess, not an expert in that area. How would you answer that?

Ev: 00:52:51.986 I'm not a data expert. But my friends in the field, they generally say that like data has a tendency of growing. And generally, we just have to be okay with it. Like what has been happening though is this kind of race in how quickly data sets are growing versus cost of storage. And historically, the cost has been dropping faster than the data volumes, which is basically a good thing. So maybe the really short and concise answer to the question that popped up is just wait. And like at some point in time for your particular business, the cost of storing this much data or transporting it will become a non-issue. That's kind of the general industry trend. But a kind of interesting fact, so I was talking to someone who was doing a warehouse project for a company that sells to law enforcement. And I was asking him, "So if I get a ticket driving a car, and then I go take a defensive driving class, and they say my ticket is dismissed, like is it truly dismissed? Or are they just flipping a bit somewhere?" And he said it's way worse than this because these databases, they constantly get replicated. So if you get a ticket with like a city police, then that ticket will travel to like a state database, and from there, to god knows how many databases. So even if they mark you as you took a defensive driving class — now they don't delete any data, of course. That was a naive question. But the point is that that ticket is actually going to be around. It will probably outlive you. It's going to be in so many different databases. It's just like a really hard, almost like a computer science problem to track data that has been replicated, and there is no single kind of ID anymore.

Dave: 00:54:31.937 Yeah.

Ev: 00:54:32.493 We have another question pop up about kind of shrinking attack surface area by using network filters, if it's a valid strategy. So just to elaborate on that question a little bit, yes, you can reduce attack surface area by paying attention to the address of a client who is trying to access it. However, we would not recommend putting this at the top of your to-do list if you think about things that are more impactful for reducing the attack surface area. So the network dimension is a good dimension. And yes, some companies do employ certain filters on the origin of a connection. But most organizations, they overlook other dimensions. And the most important one that is overlooked is time. So let me give you a couple of examples. So assume you have an individual, let's say a DevOps engineer, who's supposed to have access to a certain system. But what does it mean? It means that they do have some form of credentials somewhere that allow them to access this system. But if they have access, ask yourself, are they using this access at this particular moment in time? And the answer would be, in 99.99% of cases, no, they do not. Which means that you have a laptop somewhere that has credentials that are not even being used at the moment. Which means that you are exposed throughout this time. And there are different techniques you can use to revoke this access because it's not used right now.

Ev: 00:56:14.032 And there are two techniques that are very popular at this time. First is to move from static credentials to dynamic credentials. So when someone says, "I need to access this Kubernetes cluster," for example, or this database, they're given a certificate that expires after an hour, for example. So when they're done using that workload, that credential just expires by itself, nothing needs to be done. So you see your exposure in time is dramatically reduced. That's one technique. The other one, it's called dynamic access requests or access workflows. So this means that by default, no one has access to anything, which is actually a very comforting thing to have. So you know that right now, in your organization, no one has access to anything. But then when an engineer tries to access, let's say again, a database, an access request is generated and sent into Slack to their team, kind of similar to a pull request. And they will see — the other engineers, they will see in Slack that like Dave, for example, is trying to access this database, and he provided the reason why. There is like maybe a ticket he's working on. And someone or maybe several people need to click an approve button. Like we're all letting Dave access this thing. And it could actually be done transparently. And if you combine the two techniques, that creates a very secure situation where the access, by default, never exists. And when it's given, it's given for a specific reason. And when that reason goes away, the access goes away. And it introduces very little friction. Going back to kind of security versus productivity, so to remove incentives for Dave to have a backdoor, all of this needs to happen automatically.
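
A conceptual sketch (not Teleport's implementation) of combining the two techniques above: access exists only after an approved request, and the credential it produces expires on its own, so standing access never accumulates. The names, TTL, and approval check are illustrative assumptions.

```python
# A conceptual sketch of access requests that mint short-lived credentials.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Credential:
    user: str
    resource: str
    expires_at: datetime

    def is_valid(self) -> bool:
        # The credential revokes itself: past the expiry, it is simply useless.
        return datetime.now(timezone.utc) < self.expires_at

def request_access(user, resource, reason, approvers, ttl=timedelta(hours=1)) -> Credential:
    """Raise unless someone approved the request; otherwise mint a short-lived credential."""
    if not approvers:
        raise PermissionError(f"no one approved {user} -> {resource}: {reason}")
    return Credential(user, resource, datetime.now(timezone.utc) + ttl)

# Example: Dave asks for the database, a teammate approves, access expires in an hour.
cred = request_access("dave", "orders-db", "ticket #123", approvers=["ev"])
assert cred.is_valid()
```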

Conclusion

Ev: 00:58:02.167 So I'm actually watching time. And we are at 11:00 am. Which means that the show is over. So it has been fantastic, Dave. Thank you so much for joining me. And thank you to the audience for being here with us this morning. This is a reminder that this webinar was sponsored by Teleport, the easiest and most secure way to access infrastructure, and also by Open Raven, a platform for cloud-native data classification. You can find us at goteleport.com and openraven.com. And that's the end of the show. Thank you, everyone, for joining us.
