Chaos Engineering Experimentation vs Testing With Casey Rosenthal, manager for the Traffic, Intuition, and Chaos teams at Netflix

Transcript for episode 28 of the Test & Code Podcast

This transcript starts as an auto generated transcript.
PRs welcome if you want to help fix any errors.

Welcome to Test and Code, Episode 28. I’m your host, Brian Hawkin. Today we have an interview with Casey Rosenthal of Netflix.

One of the people making sure that Netflix runs smoothly is Casey Rosenthal. He’s the man manager for the Traffic, Intuition and Chaos teams at Netflix. We’re going to talk about experimentation versus testing, among other things. He’s got a great perspective on quality and large systems and distributed systems. I think you’ll really enjoy it. But first, let’s thank our sponsors.

Welcome to Test and Code, a podcast about software development and software testing.

I’ll just jump into it. So today we’ve got Casey Rosenthal from Netflix. And what is your title again, Casey?

So I’m an engineering manager. I manage three teams, the Traffic Team, the Chaos team, and the Intuition team. Okay.

And I met Casey in my ever long and complicated attempt to get a Chaos monkey sticker. And we’ll get into it too much.

But you did get the sticker sticker.

And I think I found out about the team through the talk. Python group had somebody from Netflix on there’s some cool stuff going on at Netflix and your team is doing some neat things. So I thought I’d have you come on and talk to my audience about whatever you wanted to talk about. You said chaos engineering and experimentation versus testing. So how should we start this?

So let’s start by fleshing out that distinction. So when you think of testing in the classical sense, in the top side classical sense, you have some function or piece of code that given certain inputs, you assert what the output should be. So pretty straightforward and very useful. But we find that in distributed systems we’re in a territory now where that kind of model isn’t sufficient. For example, we have a lot of systems where we don’t know what the output of the system should be, so we can’t make an assertion on it. So instead we look to experimentation and more kind of the follow through of principles of Western empirical science to uncover properties of distributed systems and explore outcomes that we don’t like without setting up a test that has an assertion. So I can describe an experiment in more detail, if that would help.

I guess, to clarify a little bit, you’ve got something you’re watching, right? There’s some property of the system or many properties, or do you discover those also while you’re doing the experimentation?

Depends. Usually a solidly constructed experiment has variables that you’re studying.

Okay, let’s give us a concrete example of something, some kind of experiment you might do and what you would watch.

Okay. So to set up an experiment, we want to go through a couple of steps. So we want to start by defining a steady state for distributed system, in this case, some measurable metric or output that indicates what normal behavior looks like for the system.

And then we create a hypothesis that the steady state will continue in both the control group and the experimental group. And again, I’m describing a chaos engineering experiment here. But then we want to introduce variables that reflect something that could actually happen in the real world to a software program, usually outside of the scope of the software itself, like a server crashing a hard drive that malfunctions, network connections that get severed, big increase in traffic, Ram goes bad, anything along those lines. And then we try to disprove the hypothesis that steady state will continue in both the control and the experimental group by looking for differences in the steady state of those two groups, the experimental group being the one that has the real life conditions applied to it. So Netflix is a micro service architecture. If you picture a distributed system or imagine a distributed system that has a bunch of distinct features, and one of those features might be, we don’t exactly have this, but it’s easier to imagine one of those features might be a service that just stores all information about movies.

And we want to see if you can still play video, if that service can’t connect to its database. So we can set up an experiment by code or creating two nodes of that service. We’ll call it the movie database service, and taking some production traffic and round robining it between those two nodes. And for one of the nodes will apply this failure condition of that node can’t talk to its database. And then we look at those request responses as they’re emitted from the distributed system at the boundary of the system as they’re sent back to whatever devices requested them. And we can look at those and see, are they still successful? Or we can take even a higher level look at steady state behavior and say, okay, for the people that were subjected to the control group versus the people that were subjected to the experimental group, are they watching any less movies?

Okay, there’s a couple of things that this reminds me of. Partly, it reminds me of A. B. Testing for even like SEO marketing or something, changing the title of something and seeing how many people click on the link or something in a couple of different groups. The other thing it reminds me of is medicine, like trying to figure out whether or not some sort of behavior or physical like do sit ups help people versus just taking five minutes out of their day doing a control group and somebody actually doing it. That actually sounds a lot more like science than most computer science does, to tell you the truth.

Yeah. And certainly from a proper Esque background it definitely leans more in the hard science direction and that’s one of the things that I like about it that I appreciate about it. So I would say A B testing is usually a form of experimentation. The way I’ve seen it most employed is a little bit looser because usually there’s the hypothesis is something along the lines of these two different ways of presenting a feature are equal in their effect, their effectiveness in some way like engagement might be a metric that they measure and then any difference in engagement will disprove that hypothesis. So it’s definitely an experimentation. But I often see AB testing not being used with the control.

There’s just like multiple variable groups, two or more variable groups, which is fine. I think it still gets at the same thing. But in A B testing, I would say it definitely relies more on the method to surface things that features that have better engagement and you’re looking more for more easily identifiable strong signal that one is better. Whereas in chaos engineering we have a stronger hypothesis, but that means it’s harder to prove. So it’s more in the sense of a traditional Western science of the degree to which we’re able to disprove competing hypothesis adds to the strength of our certainty for the hypothesis that we think is true, it’s a little bit more convoluted what we have to go through to prove that something specific is happening, or rather that something is happening in a very specific way. Again, to go back to that example, if we find that there are people who are an experimental group who are watching less movies, then that disproves the hypothesis. That our original hypothesis that the steady state will continue in both the control and experimental group.

But what that tells us is that there’s another area of concern for investigation. It doesn’t tell us that exactly what we failed is what caused that dip in people watching movies. It basically tells us that whoever’s responsible for that service has more to look at and we kind of pass off this validation problem back to them and they might use testing, which is more of a form of verification to tackle it to figure out what the actual problem is.

So does this happen with every change to the micro service?

No. We might head in that direction, but the way Netflix is set up, we have hundreds of micro services, and we don’t orchestrate the release of any of them. So for any given micro service, the team that owns that might do 100 deploys a day or they might do one deploy a year.


So depending on how important they think it is to run these experiments, they can decide the frequency that they run them.

So do teams development teams then get involved with your team to orchestrate some experiment that you’re monitoring? Or how does your team get involved?

Yeah. So the Chaos team originally had a consultative role where they would help set up experiments to run with the different microservice teams. And of course, that doesn’t scale very well. So we did that for about a year with some of the more prominent important we call them tier one services here on Netflix. And that allowed us to get a good sense of what a good cast experiment would look like and what the common patterns for testing, for setting up that experiment are. And once we have that expertise, we were able to automate that and we built what we call the Chaos Automation platform Chap.

That’s more of a self service model now. So the micro service owners can now set up their own experiments for their service and decide things like how often they want to run it and what failure modes it actually exercises.

Okay. Now I just briefly watched your talk from a couple of years ago in Portland. I’m forgetting everything about it, like the title of the talk and everything.

But it’s very forgettable at Pnsqc.

Yeah. Gosh, no wonder I forgot that. That’s just so easy to remember.


Pacific Northwest Software Quality Conference, right?

Yeah, it was a lot of fun.

Yeah. So there’s these great visualizations that you guys have there that you show off, and I think it’s showing the traffic between the different Amazon regions.

Now, does visualization fit into any of these other experiments, or is that just a one off? You have this really cool visualization tool, for one thing.

So the common thread between those three teams that I manage is that whereas a lot of teams here are there’s no big engineering organization at Netflix, it’s much easier to conceptualize as 100 or so small engineering teams of, I don’t know, five to seven engineers each that own their own functionality or their own domain in the micro service system.

A lot of those micro service owners, I mean, they know their functionality very well, and, of course, upstream and downstream from them, they probably know very well, but they don’t necessarily have a holistic picture of the entire system. So one of the common threads for the Traffic, Chaos and Intuition teams is that they are concerned with that holistic picture macroscopic level of how the system interacts and performs. And they’re concerned with that to the end of improving Netflix’s availability. So the traffic team has some remediation functions to improve availability.

If we have an outage or self inflicted software issue in one of the AWS regions that our control plane is deployed in, the Chaos team surfaces these threats to resiliency in the micro service level with platforms like Chap, and we run exercises where we simulate failing out an entire region on the other end of the scale, failing individual servers. The Intuition Team was created as a response to the need for the engineers on the traffic and the Chaos team to understand the state of the system holistically in a very short amount of time. So we have metric systems where we have very large metric systems that will measure just about anything, application level metrics down to server level metrics, average CPU load across a cluster, for example.

And so we have all this data.

And if we want to know, is the entire system healthy right now?

We could have dozens of charts up that show us various things about different components of the system. And what the Intuition Team has itself with doing is finding UIs that can instantaneously give us this understanding of the state of the entire system. And so I think what I showed off in that conference was our first Intuition tool, which is called Visceral, shows traffic moving between the three Amazon regions that were deployed in for the control plane. And it’s essentially a bunch of moving dots of different speed and different color.

And because the brain is very good at processing visual information, a lot of it in parallel, you can kind of just see these dots move and very quickly figure out what quote unquote normal looks like. And then if there’s a different pattern of movement or speed or color, you can just look at it and know that something’s wrong.

Well, one of the things I liked about seeing that is that the direct application of that particular tool, of course, doesn’t apply to very many people. However, the notion that we do need gradations of good and bad. There isn’t a lot of test tools now, or it’s just green or red. That’s not enough. This idea of somehow and you kind of have to play with it. I think having a lot of data about a system and being able to determine trying to help some visual tools to help people figure out whether or not this is a good thing or a bad thing. I struggle with that myself because we’ve got multiple branches working on the code base against different hardware sets, and the test development is happening in parallel. Also, I still haven’t come across some visualization tools to tell me, like the health of the software just right now over all of that. And I’m sure all systems have a similar problem. I mean, the same thing with cars. We’ve got like a check engine light, and that’s it. We need more than that sometimes.

It’s also just really cool that you have an Intuition team.

I don’t know if I’ve heard of anybody else working on an Intuition Team before.

Yeah. We like it.

One of the important things that one of the important features that we added since I gave that presentation was the ability to play back the Intuition interface at any point in time. And we found that that was really important in helping us understand the system so we could look at a time when we knew it was going through a particular failure and kind of see this pattern of, again, the dots moving between, in this case, second level view between different micro services.

It’s hard to sometimes you just have to see something to get it.

We can look at all the quantified information and nausea, and it just wouldn’t give us the same understanding that we get from having this thing that has completely arbitrary visualization, but it just makes more sense. And so I think that’s an interesting thing that’s lost in a lot of software, particularly on the Dev tool side.

Yeah. And there’s a lot of people coming into software and testing that are like in big data problems with basically large, huge data sets that are coming in at a constant rate, and they’re not slowing down. They’re speeding up and how to deal with that and how to make it into something that people can process.

I know that you’ve developed smaller scale things as well in your past. I believe this idea of experimentation as a complement to testing, traditional type testing.

How can other people can you think of a way that other people can apply this sort of thing to other realms other than network traffic?

Sure. So I think the comparison or the application to distribute systems is fairly straightforward because a lot of times distributed systems are. Well, particularly in software, they’re complex distributed systems, meaning it’s very difficult for a human to mentally model the entire thing at once. So we’re seeing, at least here at Netflix, move away from having architects or relying on chief architects as the people who understand all the different pieces of the puzzle and how they fit together so that they can bless new architectural decisions because they have context that the engineers don’t. We’re moving away from that. And certainly Netflix does not have that kind of model. We don’t have the chief architects here, architects at all, really. And by decentralizing the concerns to small engineering teams, you lose that. And with it might go some confidence. Right. Because you don’t have one person you can turn to and say, hey, this is what my team is working on. Does that make sense with respect to the whole system?

So chaos engineering, those experiments are designed to shore up that confidence. So if you have these hypotheses that subjecting your system to these real world conditions won’t change the output of the system, and you are unable to disprove that hypothesis, then you have more and more confidence that it’s true. So you’re slowly building up confidence that this distributed system is more and more resilient to things that happen in production and in a lot of ways that takes the place of certainly like an availability architect.

So that can apply to anything from any distributed system, any micro service, any distributed database, anything like that.

There’s kind of two paths that I could see this going. One is running experimentation with live traffic, usually live traffic production traffic or production like traffic similar to a B testing similar to this. The other way that analysis of these complex systems could go is in simulation, like discrete event simulations and that sort of thing. I don’t know that one of those paths is inherently better than the other. We just seem to be in a point of time in computer engineering where most of the resources have been put into the experimentation side rather than the simulation side. So, for example, I don’t see any reason why we couldn’t record the processing capabilities of all of our servers here and use that to create a simulation of how these different services respond under certain conditions and then just plug that into simulation program that will run thousands of runs of the simulation under different configurations and use that to surface threats to resiliency.

Oh, interesting.

Yeah. Just for whatever reason, the computer engineering as an industry hasn’t moved in that direction at least. I don’t know of any reason why we couldn’t. We just seem to have invested heavily in this other path instead.

Well, I’ve seen some models where there’s, like recorders that take one or two interfaces into a system and record the if I send out this, I get back this piece that sort of record and playback, but on a large scale, trying to simulate lots of parts of the system and just changing out one in a simulation. Yeah. I’ve not heard of anybody doing that.

Seems like a hard problem to do, though.

I don’t know that it’s inherently any more difficult than what we’re currently doing. I don’t know that it’s inherently more difficult than setting up cash experimentation. The way that we have this set up, it would be a little bit more difficult with AB testing or impossible with a B testing, because there what you’re trying to figure out is an effect outside of the system.

You’re looking to see if people in a sense, you would have to simulate humans. And so that’s hard to see if they react to an interface in a better way.

One of the things I was thinking about is if you’re doing an experiment, do you do your control group and your experiment group, are they just like half of the Netflix people, or do you do a smaller percentage of the Netflix users to do these?

Oh, no. Yeah. So if we were using a significant chunk of our production traffic, then we won’t be able to run many experiments at the same time. So no, a typical experiment will be zero 2% of traffic or something like that. Very small amount.

I’m not a network engineer person, so I’m not sure if these sorts of tools are obvious to other people, but is that idea of taking like putting a new service in place and redirecting some of the traffic to that service or a percentage? Is that something that Netflix had to come up with a way to do that, or is that something everybody already knows how to do that?

I know that some other companies have ways to do this, but that is something that we had to build for our particular stack. We actually open source the core components and the Netflix OSS stack. So there’s a couple of things that allow us to do this. We essentially have request context for the requests coming in, like who it’s from and things like that. Follow that request as it fans out through the micro service for the whole of its traversal through the system.

So we can say anytime this user makes a request to the service, put this flag in the request and that flag will propagate through the whole system. And that flag might say, hey, when I get to the movies service, pretends like you can’t reach the database, and so then at the movie service we can have an injection point that says, hey, if this flag comes through, we’re going to pretend to fail the connection to the database. Okay, it’s not unimaginably original idea since there is request context in Hcp headers, but it’s very similar to that, and we rely on that pretty heavily. Along with knowing what our IPC are interrogated communication injection points are and building in there the logic to look for these different flags and take different action based on whether or not those flags exist.


Also, your groups of people, those are actual people, so their behavior isn’t constant. How different do the two sets need to be for you to determine that? Do you apply statistics to this and to try to figure out whether one set is statistically significantly different than the other?

Yeah, exactly. We need a sample size that’s large enough to be statistically significant because we could have somebody who’s just behaving oddly and we wouldn’t use just one person as a signal. Again, if you were able to surface that kind of adverse condition or error based on just one person going through, then you can write an integration test that does that, because you know what your assertion is, right? If somebody ever comes through here and the database is down, then assert that that will fail the inverse of that. So that is more appropriate for testing. We really need samples and the interactions are less deterministic.

Now, my introduction to your group altogether, and where the chaos totally made sense, is this notion of either completely killing off services and seeing how the recovery happens or killing off regions. After this discussion, I’m guessing that you do a lot more than just killing services.

Killing services is kind of a unique case from the point of view of the whole service. It’s a unique case of latency going to Infinity. So we have kind of a spectrum of things that we can do. One of them is just increasing a small amount of latency in the connections and seeing if that messes things up. Just making it completely unresponsive is the equivalent of killing the service. We don’t actually have to turn it off if nobody can talk to it.

So you do things like just slow things down sometimes.

Or sometimes we’ll just slow down things down sometimes we’ll just return different response codes, status response codes like 500 or 400 instead of 200. Okay. And other things like that with database connections is a little bit more binary. Usually we just shut off access to that database or not. We find that that is a lot more common in our deployment than actual data errors, like a binary being corrupted or something like that. Particularly for a micro service that’s sending small requests around, it’s very rare that data gets corrupted in there somewhere.

But the reasons why I like thinking about this is because, like you said, there’s a traditional testing model of if I put in this data, I get out this other piece of information, or I can monitor some side effect or something that I can see changed. The other way of testing is I don’t know what’s supposed to happen, but I know what happens now when I run the system on this data and I’ll change the code or refactor or something and make sure that the output is the same as it was before or something we’re talking with. The steady state is even different than both of those. And I don’t know what’s good or bad, but we can monitor what the steady state of the system is and detect changes to the steady state and play around with different variables what we’re changing and also what we’re monitoring. And yeah, I don’t think there’s a lot of concrete ways to apply that across every different fields. But that sort of thinking, I think, is an interesting thing that people should be thinking about in your unique particular situation. So I know that I have in the past year tried to hire lots of people, and I know that you have had the opportunity to try to find lots of people to work there. I’m going to change the direction of the conversation a little bit and ask when you’re bringing people in, is there particular skill sets that you are finding it difficult to find in people that you were surprised that it was difficult to find?

That’s a really interesting question. Yeah. So for anybody listening, I’m hiring.

Netflix did not pay for this episode.

Remind me to send you more stickers afterwards. It’s an interesting space in that for the Cast team in particular, we’ve had people apply coming from a QA quality assurance or test engineer background, and there’s something helpful there in that. And having the skill set of being able of looking at a system and trying to find edge cases where it fails, that’s useful.

Obviously, we don’t want to do manual testing that doesn’t scale, and we also don’t want to do testing, so we don’t want to be looking at the micro services and writing tests to verify their correctness.

So really, we need the same attitude of looking at edge cases. But the skill set has to be more of a distributed system and maybe even automation engineering, at least that more distributed systems engineering to build these services that will run 24/7 and build experiments and create results and tear them down and provide the context to the microservice owner that under these conditions, your system will fail. It’s kind of a good distributed systems engineer with the right attitude.

Okay, now what sort of tool pieces of the distributed stack do you employ for Chap?

The application is written in Java. It’s a service that integrates with our continuous delivery application called Spinnaker, which is a big open source project.

And yeah, that just happens to be written mostly in Java. Chaos Monkey, which turns individual servers off. That’s written in Go, also integrates with Spinnaker. And really, these were just the decisions of the engineers who are working on those particular things. Chaos Kong is our simulation of failing a region, and that is mostly written in Python because the libraries that we were using to communicate with AWS were originally Python libraries. So we kind of just continued with that momentum.

So you’re not particularly picky about if somebody is like a Go expert versus a Python expert that comes in.

No, we just want good engineers. Yeah.

Netflix only hires senior software engineers, so by and large, senior engineers have worked in a few languages, and if they have to pick up a new language for something, then we expect it when that would not be an obstacle, a significant obstacle.

Okay, cool. Well, I’m sure we just scratched the surface of this, but I don’t want to drag it out too long. I enjoy having you on. And thanks a lot for talking to me today.

Likewise. Thank you so much. This was a lot of fun. Thanks, Brian.

All right, thanks.

I hope you enjoyed that interview with Casey Rosenthal.

Thank you, Casey, for coming on the show.