Flaky tests, tests that sometimes fail and sometimes pass, are a huge problem for teams maintaining large test suites. We discuss flakiness, what effects it has on the team, how to manage flaky suites, and how to fix and avoid flaky tests.

Transcript for episode 50 of the Test & Code Podcast

This transcript starts as an auto-generated transcript.
PRs welcome if you want to help fix any errors.

Welcome to Test and Code, a podcast about software development and software testing.

In this episode, Anthony Shaw and I discuss flaky tests. What are they? Why are they bad? And what do we do about them? I’m proud to have DigitalOcean sponsor this episode. Thank you, DigitalOcean.

Hey, man, how are you?

I’m doing great. How about you?

I’m working on a new package called Wily.

Wily, yeah. Like the coyote.

Yeah. And the idea is that it’s looking at complexity. There are lots of packages which look at the complexity of your code, so I was thinking about making a tool that measures the complexity of your tests and then reports on the history of that over time. So what I’m adding is basically a command line tool, and you can give it your test command, like pytest or python -m unittest discover. And basically, if it’s a Git repository, it’ll go and check out each revision and go and run the tests against it and then measure the complexity.

I think that’s cool.

It’ll give you like a historical graph about the complexity of the different tests and the test modules and how it’s going up or down over time. And yeah, that’s kind of the idea.

Yeah. Because tests should not be complex.

No, they shouldn’t. They shouldn’t be flaky, either. Was this a question that somebody raised, or is this something that you wanted to talk about?

Yeah, it’s just something I wanted to talk about. It’s something that I think a lot of people pretend doesn’t exist, or pretend that only people who write bad tests have flaky test suites. But I think it’s a little bit more prevalent than that. And a couple of articles came out not too long ago, one from Microsoft and one from Dropbox, about how they deal with flaky tests. So that kind of inspired me to try to tackle this topic.

So do you have any flaky tests?


Flaky? No. But definitely, over time, I’ve experienced flaky test suites and flaky tests as well. Trying to pin them down is pretty tricky. I was actually just reminded the other day of when I was managing a small team of .NET developers. One of the rules on the team was that if you broke the build, you had to wear a T-shirt with my face on it for the day, which is probably not the best policy, but it was supposed to discourage people from not checking the test suite before they published to the source code repository.

Right. So in order to do that, you have to have a test suite you trust. And so a flaky test is a bad thing. I’m going to define a flaky test as a test that sometimes passes and sometimes fails. And a flaky test suite, it might be obvious, is a suite that has one or more flaky tests inside of it. The distinction between the flaky test and the flaky suite is important because one of the fixes is to just move all of your flaky tests into one suite, and then most of your suites are fine and you just have one suite of flaky tests. I wrote down a whole bunch of notes. I’d like to talk about why they’re bad and some ways to deal with them. And if we look into some of the aspects of flaky tests, we can tease out of that how to fix them and how to avoid them.

Does that jibe with you? Is it a test that sometimes passes and sometimes fails, or is there something else? Yeah.

Is that the same as a fragile test then? Because I’m definitely familiar with fragile tests.

I’ve also heard it referred to as a non-deterministic test.

That sounds like a test where you’re trying to measure something which has an element of randomness to it. Whereas I’m thinking of, for instance, where there’s part of the system state in there which could change, or where it’s too closely coupled to the exact situation in which the test was written and passed for the first time, and then if the environment is different, it doesn’t pass. So, like, it worked on my machine, the test passes, and then I check it in and someone else runs it and it doesn’t work for them. That would be a fragile test.

I guess that is different, then. I would think of a flaky test as: if you run the exact same thing twice, sometimes it’ll fail and sometimes it will pass.

What situation would that happen in? Because computers, they’re not fuzzy.

Something either works or it doesn’t.

Often with a nondeterministic test, something changes between one run of the test and the next. Some of the causes would be, possibly, an external system that could change. Like your test talks with an external service that perhaps has latency changes, so sometimes it answers in four milliseconds, and sometimes it answers in seven milliseconds. Noise in the system.
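To make the latency example concrete, here is a minimal sketch of how a tight limit on a value the test does not control produces mixed results. The service is simulated, and the names `fake_service_latency_ms` and `latency_check` as well as the 5 ms limit are assumptions made for illustration, not anything from the episode.

```python
import random


def fake_service_latency_ms(rng):
    """Stand-in for an external service that answers in 4 ms or 7 ms."""
    return rng.choice([4, 7])


def latency_check(rng, limit_ms=5):
    """A tight limit on a value the test does not control: a recipe for flakiness."""
    return fake_service_latency_ms(rng) <= limit_ms


def outcomes(runs, seed=0):
    """Run the same check repeatedly with nothing changed except the service's mood."""
    rng = random.Random(seed)
    return [latency_check(rng) for _ in range(runs)]


results = outcomes(200)
# Same code, same check, mixed results: this is what "flaky" means here.
is_flaky = (True in results) and (False in results)
```

Nothing about the code under test differs between runs; only the simulated latency does, which is why the check passes on some runs and fails on others.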

In some cases, there’s actual noise. Like you’ve got a system that deals with, I don’t know, an input system from a camera or a microphone or something that could be actual noisy inputs, things like that.

So I’m hoping these are integration tests you’re talking about. Like, I hope you wouldn’t end up in a situation where you’ve got flaky unit tests.

I think that’s true.

The larger the system-level test, the more flakiness creeps in, and that’s part of it. So let’s look at some of the causes. What we’re talking about really is lack of isolation. So a test that tests a large piece of the system is going to be more likely to be flaky than something that just tests one function. You would hope that there weren’t actual random bits flipping in your computer to make something go wrong.


One of the things you have to deal with is trying to figure out if the test is possibly at the wrong level. The other thing might be the state of the system: the starting state of the test isn’t completely controlled by the test environment.

Yeah. So how do you know if it’s a flaky test or if it’s a flaky system?

For example, the example you gave, the thing responded slower than it should have done or it wasn’t working or whatever. Now you’d hope that the application was more resilient than just bombing out and failing a test. Like, how do you know whether a flaky test is not actually just an indication that the system is not very resilient?

That’s one of the problems: you don’t. If you’ve got a handful or many flaky tests, every time you run the suite you’ve got to go through and examine every one to make sure that it’s not possibly a new failure. And that’s fatiguing. And a couple of things happen. The people responsible for the test suite get tired of doing that, and so they will see the same test fail every time and just assume it’s the same failure. It might actually be a new failure hiding in there. So you want to address this quickly so you’re not passing through new failures when you shouldn’t. The other thing is the development team starts not trusting the tests. They don’t trust the test suite anymore. These are all problems, and so trying to address them is very important.

You test quite a lot of physical stuff in your job.

I’d expect that when you’re testing physical equipment, things would not work for other reasons, like environmental reasons, like there’s a meteor shower or a neutrino that hit the motherboard or something.

We have better isolation than that. But for instance, right now I’m working with three or four different versions of the hardware in different shapes and sizes and bandwidth. And for me, sometimes when I’m talking about noise, we’re actually talking about electrical noise, like when we’re measuring electrical signals, there is electrical noise within the environment itself.

So that raises an important question as well, which is: is it good practice to split up your tests? Because I know you’d have the unit tests and integration tests, and then you’d have other types of tests. In terms of how often you run the unit tests, you’d want to run them as frequently as possible, because if you just changed something which has broken expected behavior, you introduced a bug or something, you’d want to know as soon as possible so you could identify the root cause. But you wouldn’t want to go and run a full integration suite every time you save a file or something, because if it takes half an hour, half a day, whatever, to run the tests, you’d want to run it less frequently, I guess.

Is that pretty common practice, that you’d split the two things up? And then if the flaky tests happen more in the integration testing and you don’t run them as often, then I guess it would be harder to actually identify what the root cause was. Because if the test wasn’t flaky before and now it’s started to be flaky, but there’s been 25 changes over that day by five different people, how would you identify what the issue was?

That’s an issue, right? So I would want as far as the levels of testing go.

This episode is sponsored by DigitalOcean. DigitalOcean is the preferred cloud platform of hundreds of thousands of innovative companies. DigitalOcean makes it easy to deploy, manage and scale applications with an intuitive control panel and API designed for developers. Get started with a free $100 credit towards your first project on DigitalOcean and experience everything the platform has to offer, such as cloud firewalls, real-time monitoring and alerts, global data centers, object storage and the best support anywhere. Join the over 150,000 businesses already creating amazing things on DigitalOcean. Claim your credit today at do.co/testandcode.

As far as the levels of testing go, if I introduced an error in my code, I would want to immediately run the test that would catch that error.

The problem is, I don’t know how to do that. You have levels, though. There’s the idea of a smoke test: high-level, happy-path testing through the entire system, and a handful of those through lots of different parts of the system, so that if anybody totally broke something, we would know right away. I think those are good things to run all the time. Let’s say you’re in a situation where you’ve got five different people working on code and checking in things during the day. I think a good practice, even if you have a test suite that runs for an hour, is to have it just sit around and wait for somebody to push code onto particular branches. And if you’ve got the freedom to run multiple test suites in parallel, you can do this.

Run everything every time somebody changes some code. If you can’t do that, then you run less.

You run the amount of tests that you can. I think that’s completely fine and a reasonable strategy, but at some point you’ve got to run everything. And the sooner you find out, the better. Even with five people changing code all day long, at least they know what they did yesterday if they have to wait until tomorrow morning to find out that the system is broken. If you have to wait till the end of the sprint to run your entire suite, that’s bad, because you may have changed the code two weeks ago and who knows what else you’ve added on top of it.

Yeah. Plus, if it doesn’t pass, then you’re at the end of the sprint, you’ve got no deliverables. So that’s the issue.

So we also break up tests into different areas, because most of the time a developer can know the part of the system that they may have mucked up. Like if they’re working in the area where you’re controlling the input frequency, for instance, you’re probably not worried that you’re going to mess up the file printing service.

They’re completely unrelated systems, so you don’t have to run the entire test suite of everything, but you could run the focused tests on that subsystem, and that’s usually what we do. I’m amused by an idea that you wait for lengthy system tests because they are lengthy. So you run the unit tests because it’s possible to run the entire suite fast.

However, they’re so focused, why are you running the entire suite? The only thing you could possibly be breaking is the tests that touch your exact code.

In theory, yeah, but if I had a penny for every time I changed something and then broke something else, which was completely unexpected.

Yeah. I don’t know. In terms of determining the scope, it’s pretty difficult. I guess in unit tests it should be clearer. But there’s plenty of times where I’ve made a change and then I’ve broken a test in some other way that I could never have anticipated, which is why it’s just nice to have them there as a safety net.


The place where I really like unit tests the most is when there are design decisions and assumptions that aren’t obvious all the time. And then when you go in to add some functionality or change something, especially if there’s a workaround in there for a particular reason, to fix a particular bug, you go, wow, this is complicated. I’m going to simplify this. You take it out and the code looks cleaner, but suddenly your test case breaks, and then you’re reminded of why that’s in there: oh yeah, we had to put that workaround in place because Customer X complained about blah. Those are great ways to maintain some knowledge of why decisions were made. One of the things I wanted to address was what do you do with flaky tests? A lot of people will just rerun the test and hopefully it will pass the second time.

Yeah, have you tried turning it off and on again?

Rerunning is something that everybody will do anyway. There’s even a pytest plugin that will rerun failures a certain number of times, and actually I find it useful, but used sparingly. You know you’re going to rerun it anyway, so why not just rerun it right after it fails, and then you already have that data. It’s one of those things worth automating because you are going to do it anyway.
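As a rough illustration of rerunning right after a failure, here is a plain-Python sketch. The helper names `run_with_reruns` and `make_fail_once_test` are made up for this example. One real plugin of this kind is pytest-rerunfailures, which is driven with `pytest --reruns 2` or `@pytest.mark.flaky(reruns=2)`; the code below only mimics the idea and is not that plugin's implementation.

```python
def run_with_reruns(test, reruns=2):
    """Run `test`, then rerun it up to `reruns` more times while it keeps failing.

    Returns the outcome of every attempt, so the rerun data is kept
    instead of being thrown away.
    """
    outcomes = []
    for _ in range(1 + reruns):
        try:
            test()
        except AssertionError:
            outcomes.append(False)
        else:
            outcomes.append(True)
            break  # stop at the first pass
    return outcomes


def make_fail_once_test():
    """Hypothetical test that fails on its first call and passes afterwards."""
    calls = {"n": 0}

    def test():
        calls["n"] += 1
        assert calls["n"] > 1, "first attempt fails"

    return test


history = run_with_reruns(make_fail_once_test())
# history == [False, True]: failed once, then passed on the immediate rerun.
```

Keeping the full history rather than only the final result matters: a fail-then-pass record is exactly the signal that the test deserves a closer look.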

In the articles we’re going to link to, there’s a Dropbox article and a Microsoft article, but I’ve heard it from other places, too. There’s this idea of quarantine. We really want to trust the test suite, so flaky tests need to be taken out of the general test suite. The Dropbox article is interesting in that they have this automated system where, if any test fails, they rerun stuff. And if the test that failed before passes the second time, they quarantine it automatically. They completely take it out of the system and put it into another area because they can’t trust the test. It didn’t tell them whether or not something is broken. However, there’s a danger there in that you’re reducing your test coverage right away, and it may be drastic. So trying to keep the number of tests in quarantine, and the time that they’re quarantined, really small is good.
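The quarantine rule described here can be sketched in a few lines. This is a hypothetical illustration of the decision only, not Dropbox's actual system; the bucket names and the example test names are assumptions.

```python
def triage(outcomes):
    """Classify one test from its run history (first run, then any reruns)."""
    if not outcomes:
        raise ValueError("need at least one run")
    if outcomes[0]:
        return "trusted"      # passed first time, stays in the main suite
    if any(outcomes[1:]):
        return "quarantine"   # failed, then passed on rerun: cannot be trusted
    return "failing"          # failed every time: treat as a real failure


def split_suite(histories):
    """Sort test names into buckets based on their run histories."""
    buckets = {"trusted": [], "quarantine": [], "failing": []}
    for name, outcomes in histories.items():
        buckets[triage(outcomes)].append(name)
    return buckets


# Hypothetical run histories for three tests.
histories = {
    "login_works": [True],
    "upload_finishes": [False, True],
    "report_totals": [False, False],
}
buckets = split_suite(histories)
```

Because every quarantined test is lost coverage, a real system would also want to track how many tests sit in the quarantine bucket and for how long.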

Once you have them separated, there are some things you can do. You can fix it. Or sometimes it’s a good idea to take a look at the test and find out whether or not it’s a redundant test. There is redundancy in test systems just like there is redundancy in code sometimes, and it may be that it’s a duplicate test. It tests a feature that somebody else is already testing in a different way, and it could just be deleted. So, like refactoring code, don’t forget about your delete button. One of the things that was interesting is Dropbox’s solution has this thing where they have a function called under test.

What are those things called? We use the with clause. Context manager. Yeah, it’s a context manager. So they’ve got test code where only the parts that are actually in the under test clause can fail the test, and everything else will throw a different error. I looked at that and thought, well, that’s exactly what pytest fixtures are, because if you push as much code as you can for setup and teardown into fixtures, pytest will separate those, and you get errors for fixture failures and fails for test failures. But it’s a reasonable solution to minimize the code that’s actually in the test.
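A context manager along these lines might look like the sketch below. It is a guess at the shape of the idea, not Dropbox's code: `under_test`, `UnderTestFailure`, `run`, and the fake `connect` helper are all hypothetical names. The outcome labels mirror how pytest reports a problem in a fixture as an error and a failed assertion in the test body as a failure.

```python
from contextlib import contextmanager


class UnderTestFailure(Exception):
    """Raised when the code inside the under_test block fails."""


@contextmanager
def under_test():
    """Mark the part of a test that is allowed to fail the test."""
    try:
        yield
    except Exception as exc:
        raise UnderTestFailure(str(exc)) from exc


def run(test):
    """Tiny runner: separates real test failures from setup/environment errors."""
    try:
        test()
    except UnderTestFailure:
        return "FAILED"
    except Exception:
        return "ERROR"
    return "PASSED"


def connect(available):
    """Fake setup step standing in for, say, a database connection."""
    if not available:
        raise ConnectionError("could not reach test database")
    return {"answer": 42}


def make_check(db_available, expected):
    def check():
        db = connect(db_available)       # setup: outside under_test
        with under_test():
            assert db["answer"] == expected
    return check


passed = run(make_check(True, 42))
failed = run(make_check(True, 41))
errored = run(make_check(False, 42))
```

Only the second case says something about the code being tested; the third points at the environment, which is the distinction the speakers are after.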

Then you’ve got to go fix your tests. So we talked about some of the causes of flakiness, like the system state might be different, or it depends on noise, or there’s not enough isolation.

We didn’t talk about isolation between tests, though. An interesting thing is that sometimes when you have a test suite and one test always fails and then you rerun it and it passes, it isn’t because anything outside of the test suite changed. It might be an order dependency.

If you run your tests in one order they pass, and in another order they fail. That’s one of the reasons why pytest has a randomizing plugin, so that you can run your tests in random order. I think we’ve talked about a lot of the causes. You just undo those. Sometimes you can determine why a test is flaky, but sometimes you just have to guess. You can say maybe it’s that the test isn’t isolated from the rest of the system. So if we just don’t understand what system state it’s depending on, try to completely reset the system to a clean state between each test.
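An order dependency is easy to reproduce in miniature. In the sketch below, two hypothetical checks share module-level state, and a small helper runs them in every ordering to see whose result changes. Trying all permutations only works for tiny examples; plugins such as pytest-randomly or pytest-random-order shuffle the order instead, and this code is not how either of them is implemented.

```python
import itertools

shared_cache = {}  # module-level state leaking between tests


def writes_cache():
    shared_cache["user"] = "alice"
    assert shared_cache["user"] == "alice"


def reads_cache():
    # Hidden dependency: only passes if writes_cache has already run.
    assert shared_cache.get("user") == "alice"


def run_in_order(tests):
    """Run tests in the given order from a clean state; return name -> passed."""
    shared_cache.clear()
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = True
        except AssertionError:
            results[test.__name__] = False
    return results


def order_dependent(tests):
    """Names of tests whose outcome changes depending on execution order."""
    seen = {}
    for ordering in itertools.permutations(tests):
        for name, passed in run_in_order(ordering).items():
            seen.setdefault(name, set()).add(passed)
    return sorted(name for name, outcomes in seen.items() if len(outcomes) > 1)


suspects = order_dependent([writes_cache, reads_cache])
```

Here `reads_cache` is the suspect: it passes only when `writes_cache` happens to run before it, which is the kind of coupling a random ordering exposes.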

That’s really hard to do.

It’s hard to do on some systems, and it adds a lot of time to your tests.

I’m working with hardware, and if I had to completely start from square one for every test, it adds a lot of time.

Yeah. I remember lots of flaky tests which were due to the system state, and the system state was related to, I guess, Windows servers and the way they’d been configured and people jumping in and fixing things to make a certain test pass and then therefore breaking somebody else’s test.

If it’s non-deterministic and you don’t reset the environment between each deployment, then it can be really difficult. And I think that’s one of the great features of things like Docker: you can basically try and get as much isolation in the environment state as possible, like what I call immutable systems. The actual container that you’re deploying is supposed to be unchangeable, and the only thing it has in its state is the temporary data. There’s no way to pause a Docker container.

If you stop it, it deletes itself effectively. So you avoid all the issues you used to have with installing patches on Windows Server, where that breaks something else and you get different issues between deployments. If you’ve got a deterministic Docker container and you know that it’s immutable, then you can deploy that component of infrastructure to different environments, but it’s just a different piece of tin hosting the same bit of code. So it should work exactly the same way. Yeah.

And that’s the case where I think it would be fine to have your test system depend on that same fixed environment as well. It is a little bit of a problem if you’ve got a test environment that is fixed and it always is the same state, but your actual deployment isn’t fixed, you’re deploying into lots of different environments. You at least should have some tests that test something in multiple different environments, if that’s where you’re deploying to.

Yes, because your production environment and your staging environment should look the same. But when your production environment costs two or three million dollars, and you try to convince the person to sign a check for $2 million to get a staging environment that just sits there 99% of the time, it’s a pretty hard conversation.

Yeah, definitely. One of the things that one of the articles talked about was proactively rooting out flaky tests. Basically, every time they’ve got a clean system test run, so all the tests pass, they rerun everything with nothing else changed. You’ve got the same tests, the same everything, and you just rerun them.

You’ve already said that the software is good, but now you’re looking for flaky tests, and if any of the tests fail the second time, you automatically flag them as flaky tests and put them in your quarantine system.

Again, you’ve got to try to get them out as fast as possible, but that sort of automatic quarantining is an interesting idea. By proactively rerunning, we can look for things like test flakiness ahead of time, instead of wondering if something’s flaky when we actually do want to know whether or not a new change caused a failure.
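The proactive hunt can be sketched as follows: only after a fully green run, rerun the unchanged suite a few more times and flag anything that fails. The function `find_flaky`, the `extra_runs` parameter, and the example checks are hypothetical, and the articles' real systems are certainly more involved.

```python
def find_flaky(tests, extra_runs=5):
    """Rerun a green suite unchanged and report tests that fail on a rerun.

    `tests` maps a test name to a callable. Returns None if the first run
    was not green, because then ordinary failure triage applies instead.
    """
    def passes(test):
        try:
            test()
        except AssertionError:
            return False
        return True

    if not all(passes(test) for test in tests.values()):
        return None

    flaky = set()
    for _ in range(extra_runs):
        for name, test in tests.items():
            if not passes(test):
                flaky.add(name)  # nothing changed, yet it failed: flaky candidate
    return sorted(flaky)


def make_every_third_call_fails():
    """Hypothetical intermittent test: fails on every third call."""
    calls = {"n": 0}

    def check():
        calls["n"] += 1
        assert calls["n"] % 3 != 0

    return check


suite = {
    "stable": lambda: None,
    "intermittent": make_every_third_call_fails(),
}
candidates = find_flaky(suite)
```

The names returned here would be the ones handed to the quarantine, ideally with an issue filed so they come back out quickly.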

Is it okay to add degrees of tolerance to the assertions, then, to, I guess, make them a little bit less strict, if that fixes the issue? If you say it’s a timing issue or the state is slightly different or something, could you put degrees of tolerance into the assertions?


One example is asserting for equality of floating point numbers. Don’t do that. There is an almost-equal assertion somewhere in the standard library, but pytest has approx. So you have two numbers you’re comparing, and for one of them you say approximately this value, and by default it accounts for basically the floating point rounding errors that happen, and the defaults are pretty darn good if you just think maybe the number was represented a little bit differently and you want to make sure that it’s equal. But it has tolerance built into the function, too, so you can pass in a tolerance, and I think that’s a good idea.
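A short sketch of the floating point point, using only the standard library so it runs anywhere. The pytest spellings are shown in comments; `measured_ms` and the 0.5 ms tolerance are made-up values for illustration.

```python
import math

total = 0.1 + 0.2

# Exact equality fails because of floating point rounding.
exact_match = (total == 0.3)

# Comparing with a small relative tolerance succeeds (default rel_tol=1e-09).
close_enough = math.isclose(total, 0.3)
# pytest equivalent:   assert total == pytest.approx(0.3)
# unittest equivalent: self.assertAlmostEqual(total, 0.3)

# An explicit tolerance for a noisy measurement (hypothetical numbers).
measured_ms = 5.2
within_spec = math.isclose(measured_ms, 5.0, abs_tol=0.5)
# pytest equivalent:   assert measured_ms == pytest.approx(5.0, abs=0.5)
```

The default tolerance is meant to absorb rounding error only; anything wider, like the 0.5 ms here, should be a deliberate choice tied to what the test is really checking.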

It’s also really a question of what you are testing. Is it really important that it be five milliseconds, or are you really looking for catastrophic failure? Let’s say you’re trying to test to see if the connection to an external database service is gone.

Well, five milliseconds is probably too quick. A five-second timeout, like a huge tolerance, is still going to tell you whether or not the system is completely not there. And most of the time you’re not going to hit that five-second penalty unless there’s a real problem.
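One common way to apply a generous timeout is to poll until a condition holds, so the full wait is only paid when something is really wrong. The helper `wait_until` and the fake service below are hypothetical names for this sketch; the five-second default simply echoes the number used in the conversation.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True as soon as the condition holds, so a healthy system
    never pays the full timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def make_service_ready_after(polls):
    """Fake service that reports ready on the given poll number."""
    state = {"polls": 0}

    def is_ready():
        state["polls"] += 1
        return state["polls"] >= polls

    return is_ready


came_up = wait_until(make_service_ready_after(3))    # returns quickly
never_up = wait_until(lambda: False, timeout=0.2)    # gives up after the timeout
```

A test built this way asks "is the service there at all?" rather than "did it answer within five milliseconds?", which is usually the question that matters.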

We deal with this all the time with hardware. We have to build tolerances into almost all of our measurement tests because there is real noise in the system, so we have to accommodate that. And then there’s self-calibration, which we do a lot of times at the beginning of a test. If a test really has to care about exact latencies, then we measure the latency before we start the rest of the test, so we know some latencies to take out of the system, or level offsets. A cable will drop the signal a certain amount, and so will all the connections and all that stuff. And if we swap cables, we don’t want to have to calibrate all of them. It’s easy at the start of a test suite to do a quick calibration of the entire system to see how much loss or frequency offset is in the system. But then we also still have to have tolerances. You have to know the domain that you’re testing as well, because for some tests we need to know the exact number within a very small tolerance. Yeah, definitely tolerances are good. So before we started, you were talking about something you’re working on that’s dealing with test code that might be too complex. Do you have any advice for good test and bad test design?
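The calibrate-then-tolerate approach can be illustrated with a simulated signal path. Everything here is made up for the sketch: the 1.5 dB cable loss, the ±0.1 dB noise, the 0.25 dB tolerance, and the function names. It is not the speaker's actual test setup.

```python
import random

rng = random.Random(1)  # seeded so the simulated noise is repeatable


def measure(true_level_db):
    """Simulated measurement path: a cable that loses 1.5 dB plus a little noise."""
    return true_level_db - 1.5 + rng.uniform(-0.1, 0.1)


def calibrate(reference_level_db):
    """Measure a known reference once to estimate the loss in the path."""
    return reference_level_db - measure(reference_level_db)


def level_ok(measured_db, expected_db, loss_db=0.0, tolerance_db=0.25):
    """Compare a measurement to the expected level after removing known loss."""
    corrected = measured_db + loss_db
    return abs(corrected - expected_db) <= tolerance_db


loss = calibrate(0.0)        # done once at the start of the suite
reading = measure(10.0)      # later, inside a test

uncalibrated = level_ok(reading, 10.0)              # cable loss exceeds tolerance
calibrated = level_ok(reading, 10.0, loss_db=loss)  # loss removed, noise tolerated
```

Calibration removes the systematic offset, while the tolerance absorbs the remaining random noise; the two solve different problems, so both are needed.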

Yeah, just keep it small. Try and test one thing. Try and test one behavior in a unit test, obviously. And then for an integration test, try testing a few components first. I’ve been working on this article for Real Python, which is about to be published.

Hopefully by the time this podcast episode is out, it will be public. But it’s called Getting Started with Testing in Python. And it’s like a beginner tutorial for basically how to start writing tests in Python, why you should write them, and basically all the steps you should go through if you’re not familiar with testing in any way. And one of the analogies I was using at the beginning was that if you wanted to test whether the lights work in your car, you’d get in the car and turn the lights on.

And then if you go outside the car and you can see the lights are on, then that means the lights work. If they don’t work, then you’d assume that the bulbs are broken. But you don’t necessarily know that that’s the case, because the switch could be broken, the battery could be dead, the computer could be dodgy. Like, I’ve got the issue at the moment in my car where the blinkers don’t work randomly. Like, that’s a flaky test.

They’re just blinking at a really slow pace.

Yeah, an example of a unit test would obviously be to take the bulb out and test it by hot-wiring it to the battery, maybe. But it’s not a fair test to go through all those integration components, because if you only test the whole system, then you don’t know which part fails. So when it does break, it’s impossible to tell. And what ends up happening is that you either put tolerances into the tests which shouldn’t necessarily be there, like, oh, let’s just run it twice and see if it passes. Yes, it passes the second time. Okay, ship it. That’s a really classic thing that happens. Or people end up saying, okay, we’ll just take out that assertion because it’s too fragile.


And then, okay, now it passes. Let’s ship it. So you need to have a good balance between unit tests that are not too fragile and integration tests which are fairly resilient, but which also identify where the system is not very resilient against failure, because failure does happen.

Yeah. For expediency’s sake, if I know most of the time the lights usually work, the fastest test to write is to turn the lights on and go look to see if they’re on.

Yeah, absolutely.

I want to question your test environment. It’s a little more complicated than it needs to be. If you do that test in the dark, you don’t have to get out of the car.

That’s a good idea.

The hardest one is testing brake lights.

I found that three bricks is the perfect amount.

Three bricks.

Yeah. To put on the brake pedal so that you can run around the back of the car and see whether the brake lights are on.

You have kids. Can you take one of the kids to go step on the brake pedal for you?

I don’t trust them to step on the right one.

Okay. One of the things that we talked a little bit about was not flaky tests, but fragile tests. And a lot of times when people talk about fragile tests, they’re talking about user interfaces. I think my only advice really for that sort of stuff is to try to keep GUI-level tests to just testing the presentation. If you want to do that, I think it’s a good idea. But doing your entire system testing for all corner cases through a graphical user interface is going to be problematic.

Yeah. There’s a couple of ways to do that as well. If you’re using Selenium, you can use cloud providers which host Selenium tests for you, because setting up Selenium with multiple browsers is extremely painful and a lot of hard work. So there are actually companies you can basically pay to go and run the tests for you, and you just give them the test steps and the URLs. It does it all for you. One of those is called Ghost Inspector, which I’ve used quite a bit. And the cool thing about Ghost Inspector is that in the tests, you can actually write a visual test and you can put a degree of tolerance in. So you say, if you take a screenshot of this page before you try and click on anything, does it look more than 30% different to the last time? And if it does, then fail the test, and you can decide what that percentage difference is. So, like, 2% or 5% is fine. But if somebody broke the CSS style sheet for the application or somebody screwed up one of the HTML tags or something, then that’s something that Selenium would just ignore and continue anyway. But Ghost Inspector and a few other tools like this would catch that kind of visual error.
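The idea of a visual test with a tolerance can be sketched by counting how many pixels changed between two screenshots. This is only an illustration of the concept, with images modeled as lists of pixel rows; it is not how Ghost Inspector or any specific tool computes its difference, and the function names are made up.

```python
def fraction_changed(before, after):
    """Fraction of pixels that differ between two equally sized images."""
    if len(before) != len(after) or any(
        len(row_b) != len(row_a) for row_b, row_a in zip(before, after)
    ):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in before)
    if total == 0:
        return 0.0
    changed = sum(
        pixel_b != pixel_a
        for row_b, row_a in zip(before, after)
        for pixel_b, pixel_a in zip(row_b, row_a)
    )
    return changed / total


def visual_check(before, after, max_changed=0.30):
    """Pass if no more than `max_changed` of the pixels differ."""
    return fraction_changed(before, after) <= max_changed


# A 10x10 baseline "screenshot" of identical pixels.
baseline = [[0] * 10 for _ in range(10)]

# Small change: 3 of 100 pixels differ (3%).
small_change = [row[:] for row in baseline]
small_change[0][0:3] = [1, 1, 1]

# Broken stylesheet: 60 of 100 pixels differ (60%).
broken_css = [[1] * 10 if y < 6 else [0] * 10 for y in range(10)]

small_ok = visual_check(baseline, small_change)
broken_ok = visual_check(baseline, broken_css)
```

The threshold plays the same role as a numeric tolerance in an assertion: loose enough to ignore harmless rendering noise, tight enough to catch a broken layout.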

Yeah. And these are essentially then change detector tests.


I assume that there are ways to seed them so that you have something to compare to, right?


The other thing is different browsers. Sometimes browser changes, I’m guessing, could affect you. And if all of the developers are using Firefox but most of your users are using everything, yeah, you’re never going to see the problems during development.

Yeah. Things work with Firefox because getting Firefox to work with Selenium is really easy and getting it to work with Chrome is a pain.

There are so many automated UI tests for Firefox that’s really helped.

One of the things I like to do for redundancy in tests is to have my tracer bullet tests run at all different levels. So I’ll have a happy path test case that is one of these long, rambling test cases. Like, I’m going to go log in and then go do this action and do some other action: an entire act-like-a-customer test. But not very many of these, just a handful that touch all parts of the system, and then run those at every layer that’s possible: through the GUI, through the API, through a subcutaneous layer if you have one, or subsystems. Having the same data go through all layers in all these tests seems silly and redundant, but it can isolate whether your light bulb is burnt out or it’s some cable or something. So the last thing I want to touch on is just that I think it’s really cool if people have awesome test practices and they never have to deal with failures or flakiness, but a lot of people do. Know that you’re not alone. There are some big companies and people all over the world that deal with it, and I think it’s cool for people to talk about their solutions. So if there’s a solution you have for dealing with flaky tests, or for how you’re doing automatic quarantining, let me know. And if it’s a really cool solution, maybe we’ll get you on the podcast to talk about it, or we’ll bring it up later, because I think it’s important for us to learn from each other. So that’s why we’re bringing this up, and I’m really looking forward to your write-up. More people writing about testing is good.

Yeah, hopefully soon, hopefully soon next week.

All right, well, thanks a lot and we’ll hopefully talk to you later.

All right. Bye, everyone.

Thanks again to DigitalOcean for sponsoring this episode. Claim your $100 credit at do.co/testandcode. That link is also on the show notes page at testandcode.com/50. Thanks to Anthony for the great conversation, and thank you for listening, for sharing this podcast with friends and colleagues, and for supporting the show through Patreon. That’s all for now. Now go test something.