Prayson Daniel, a principal data scientist, discusses testing machine learning pipelines with pytest

Transcript for episode 170 of the Test & Code Podcast

This transcript starts as an auto-generated transcript.
PRs welcome if you want to help fix any errors.

00:00:00 [Brian] pytest is very flexible. It’s great for testing packages, of course, but also much more. Prayson Daniel was on Python Bytes recently, actually episode 250, which we’ll link in the show notes. Prayson is a principal data scientist and he does a lot of machine learning, and he mentioned that he loves pytest and uses it to test machine learning pipelines. So obviously I wanted to know more and asked him to be on the show. This is a pretty cool discussion. I hope you like it.

00:00:30 [Brian] Thank you to PyCharm for sponsoring this episode. PyCharm helps me to understand and play with my code. The refactoring tools are amazing. A simple one is just to rename a method and it just gets renamed everywhere. There’s a whole bunch of other cool refactoring tools as well. If I changed a bunch of code, I can visually see the diff of my code and the Git repo code, and I can even visually walk through the local history to see all of my changes. I actually love refactoring and PyCharm helps me have fun while I’m doing it. Try PyCharm Pro for four months by going to Test And Code PyCharm.

00:01:22 [Brian] Welcome to Test and Code. We end up talking about testing with data science and machine learning, but these are all vague terms, at least to me.

00:01:42 [Brian] What does data science and machine learning mean for your role? What does that mean?

00:01:49 [Prayson] Yeah. So I think we can start with some kind of clearing things. What do we mean by data science and what do we mean by machine learning and all those stuff?

00:02:04 [Prayson] So when it comes to data science, I think the easiest way to define it is this: so many people say start with data and then you do the science. For me, I go like, no, it actually starts with the business need. Like, what is it that a business has at the moment that they want to optimize or minimize? It could be they want to optimize the click rates, or it could be that they want to sell more product, or maybe with the insurance, they want to predict which of the customers are going to churn, stop being their customer, and try to do something about it. Or maybe they want to know which of the customers they should call. And when they call, they will most likely buy an extra package for them. So it usually starts with the business need. And then from there comes the data. Do we have any data that we could use to fulfill this business need? So in this case, when it comes to data science, we will have this data and we’ll say, okay, let’s just do some data exploration and see what can we find in this data. And for me, I usually say it’s even better to start with something called data mining and data mining is that you don’t even start testing models or anything. It’s that you say, let me look at this data and come with some assumption and try to affirm or disaffirm my assumption, just looking at the data and this can take you really far away.

00:03:40 [Prayson] But after that then comes to, okay, so we have this data. We want to optimize this, then we build some models algorithm, which in a simple term is just the representation of the data. So if you’re trying to predict which customer is going to stop being a customer, we look at the past data to see what was the behavior that led this group of customers stopping being a customer.

00:04:04 [Prayson] And given this feature, we can predict, oh, we can see Brian is starting to follow the pattern of the customers that actually stop being our customer. Let’s do something about it. So that’s how you look at the past data to try to predict the things that are happening now or things that are going to happen. And then, of course, in data science, there’s like a branch of different now we jump to machine learning, and this in machine learning is just different technique to answer these kind of questions.

00:04:35 [Prayson] Right. So if we create these kind of branches in machine learning, we have this different pool of something called supervised machine learning. So ideally, you will have a data set that both has some things that will help you predict a specific target, like saying, we are trying to predict the price of houses in New York. That means we will have a certain data set where we have certain features could be rooms and location and everything. And then we have price. So we both have kind of questions and answers. And then we teach the model to find what are the rules that takes us from this question to the answers? So in this case, it’s called supervised learning because we both know what the answers need to be. So in this case, when we are predicting something, a good model is the model that comes close to the answer that we already know.

00:05:35 [Prayson] If we take a very simple approach is that you take your data set, you split into training and then validation.

00:05:46 [Prayson] And then in the training, you teach the machine both the answers and the question and the expected answer. And once your machine learns something, then you take the pool which you haven’t shown what answer should be. And then you just send the question in and the machine will give you the answer. And then you compare the machine answer with the answer that you know should be true. And then you compare these two and say, okay, how different are they? Right? And then the best model is the one that comes very close to what you’re trying to predict. And then that’s the model. You will say, oh, this is a better model that we need to put in production or do whatever things that. But this comes in a field of supervised. It’s called supervised learning. And then there’s something called unsupervised learning where you don’t have the answers. So you have the questions, you don’t have the answers. So you have to take the question and figure out what answers can we get from these questions? And then there you, of course, need to have a human who comes and says, okay, it seems this here is about this one. So this is about this one. So this is like clustering, et cetera.
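
The train-then-validate loop described here can be sketched in a few lines. This is only an illustration under stated assumptions: the house data is invented, the "model" is a single average price per room, and the helper names (`split`, `fit_price_per_room`, `mean_absolute_error`) are hypothetical, not anything from the episode.

```python
import random

# Toy data: (rooms, price) pairs, invented for illustration only.
random.seed(0)
data = [(rooms, 100_000 * rooms + random.randint(-5_000, 5_000))
        for rooms in range(1, 21)]


def split(rows, validation_fraction=0.25, seed=42):
    """Shuffle a copy of the rows, then split into training and validation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]


def fit_price_per_room(train):
    """'Train' a one-parameter model: the average price per room."""
    return sum(price / rooms for rooms, price in train) / len(train)


def mean_absolute_error(model, rows):
    """Compare the model's answers with the answers we already know."""
    return sum(abs(model * rooms - price) for rooms, price in rows) / len(rows)


train, validation = split(data)            # the model never sees `validation`
model = fit_price_per_room(train)
error = mean_absolute_error(model, validation)
```

The smaller `error` is on the held-out rows, the closer the model comes to the known answers, which is the comparison used to choose between candidate models.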

00:06:59 [Prayson] So in this case, it’s unsupervised. So you don’t know the answer. Prior, you ask the machine to look at this question and come up with answers, and you as a human being, will look at those answers and say, do they make sense? And the model that comes up with the answers that make sense is what somebody will put in production.

00:07:19 [Prayson] And then from there, of course, there are others. There’s like different kind of models beyond those ones, but the classical ones are this unsupervised and supervised. But then we haven’t go into the realm of deep learning and we haven’t even started going into reinforcement learning.

00:07:42 [Prayson] There are so many elements depending on what kind of business logic you’re trying to solve. But for me, I’m a classical what you call like a business type, a person who does not just get excited by model. This model perform awesome stuff. Or we can beat a chess player with this new algorithm. For me that things do not excite me. What really excites me is how do this algorithm help a certain company fulfill their business need? Or when it comes to helping bring social change, how are my algorithm helping bring a social change? Right. Like maybe we’re trying to fight racism in a justice system. So how do we create algorithms that are actually bypassing these biases? Right. So those are the things that excites me that I use algorithm to solve real world problems.

00:08:47 [Prayson] Right. And it’s not to create more problem, which we always do, but trying to move towards the other side of getting to the what we call the utopia world, the world we want, the world that we wish we have, which we don’t have because it’s a thing. Because I usually tell people most of our models are just mathematical representation of the data that we have. So if you train your data on, say, American social justice or anything, then it will become biased, because that’s what our data is. Right. Because data are just not like the models are simply a reflection of our data.

00:09:34 [Prayson] In that sense, when we come in, we want to start fine tuning the world that is to the world that we want to be. Right. That’s where we come in and say, okay, we can always say, okay, we know our data is biased. How do we unbias it when we train a model? So the model won’t learn the bias if left by itself.

00:09:55 [Brian] Oh, that’s true. Yeah. Okay.

00:09:56 [Prayson] And this also is a part where I will come to cover when we talk about testing because this is also very important.

00:10:04 [Brian] This is actually a great introduction and I’m now even more excited to learn more about data science and machine learning than I was before. I don’t think I’ll ever get to the point where I really want to get too into data science contests and stuff like that. But I like the idea of solving business problems and people problems. It’s too bad that they’re not the same thing actually though.

00:10:28 [Prayson] Yeah, I think there is where I usually have a disclaimer, because most people when they think of data science they will think something like Kaggle, you have these Kaggle Masters and everything, which is awesome, but I think it’s just wrong in many ways. I know I’m going to be in a bad light when I say something like this, because when you do something like Kaggle we are actually training data scientists who actually care about model performance. So they are competing on which model perform really well. Right. Which model is the best model that goes up in this ladder?

00:11:06 [Prayson] But when you come in the real world I usually say I really care so much less on how the model performs, because the other thing that is in play is: is this model fair? Because you can have a highly performing model but absolutely biased.

00:11:25 [Prayson] Yeah, right. So we are creating a culture of data scientists that care much about model performance and they are usually somehow blinded to other equally important things, like what features are we feeding this model?

00:11:45 [Prayson] How is this model going to be used? Is it going to be used for greater good or for greater evil? Right.

00:11:53 [Prayson] So for me I want a Kaggle competition to give me the best model which is also the best ethical model. Right. So they should rank first with the ethics of the model and then the accuracy. But now it’s simply just the best performing model, which I just think is training a new wave of data scientists that are less likely to think of deeper things and only focus on building these super cool powerful accurate models instead of knowing that, oh no, there are other equally powerful parameters that need to be in place, and this is why we are usually in the news. See, Google has released this thing that is now segregating women. Oh, a movie on Netflix, I think it’s called Coded Bias, where the narrator discovered that some of the facial recognition could not find her because she’s black, and if she put on a white mask then suddenly she appears, being detected. And if we data scientists were trained in this ethical way we would be able to catch such outliers very fast because we will do tests, and we will come to this one. We will discover that we cannot catch every edge case, but then we can make ourselves vulnerable and say, if you can see something that we haven’t caught, then do write an issue so that we can run an iteration and correct our blunder.

00:13:42 [Prayson] Because that’s how it is when it comes to test. You run your code, you discover it fail somewhere, and then you write an issue. Hey, this is what I did and this is what failed. And then I go like, Oops, okay, I can see that our models were trained in white male and there was no black African women. So let’s just run new test, new training with black African women, and voila, next iteration you can be found too. So things like that.

00:14:09 [Brian] And that’s kind of a deep end of this topic.

00:14:15 [Brian] How do you look at the data and say, we don’t have the right data or we don’t have the right data set or maybe too much data. We’re collecting the wrong things.

00:14:26 [Prayson] Yeah. Now we can start diving into the machine learning and testing and tell you, like, how we started this journey in the place where I am at the moment.

00:14:40 [Prayson] So I was not introduced to testing in a good way.

00:14:45 [Prayson] And I began by hating testing.

00:14:48 [Prayson] Yes, I was introduced with unittest, you know, the Python one, where I had to write a lot of classes. And my simple test for small thing just took half of the page.

00:15:01 [Prayson] I have to know how it should enter and how it should shut down. And all these steps, which I go like, I don’t need this. And then all this assert that. And then my test code was gigantic. But I was happy because in this company where I was working, they were trying to remove silos. So they wanted the data scientists to work with software engineers and DevOps. So they make sure that they will take a data scientist and put in a DevOps team, and they will take a developer to put in a data science team. So in this case, you get to learn what is the other side doing.

00:15:45 [Prayson] Yeah. And it was really great because then I was introduced to Node.js and I had to code a lot of JavaScript there. And of course, they were real developers.

00:15:55 [Prayson] Like, most of the data scientists are not software developers. We just do NumPy, pandas. And then we think we can code Python. Right. And then we discover that, oh, no, there’s a lot of deeper things. Like, there are the SOLID principles that we need to start thinking about. There’s all these huge ways of designing our code base that we have not even thought about. Right. So when you are thrown into the other pool, then, whoa, they go like, no, you start by writing your test, then you write your code and they go like, what? This is how you guys do it. Yeah. But of course I will not be able to do it, but I can see why you guys are doing the same. But then I think the Node.js people, they did not know so much Python. So they introduced me to unittest in Python. I don’t know why it’s called unittest in Python anyway, but it was not that easy. And I think it was one podcast you were doing and you talked about pytest.

00:16:59 [Prayson] Right.

00:17:00 [Prayson] And that was the first time I actually went and looked at how pytest does what I was doing in unittest.

00:17:07 [Prayson] And then I discovered that my code base suddenly become very tiny and it was readable and I could understand and I could parameterize and I could do all this awesome stuff with it. And then the end is that the world changed for me. So everything I write had to have a test. And I was bashing everyone who’s not having a test, taking my time with the software engineer into the data science world and says, hey guys, we need to apply like building ML pipeline should be exactly as doing software engineering, just like the way they’re doing their stuff. We need to do the same so we don’t have shortcuts.
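
To make the size difference concrete, here is the same check written both ways. It is a small sketch, and `add_vat` is a hypothetical function invented for the comparison. The second style needs no import because pytest collects plain functions whose names start with `test_`.

```python
import unittest


def add_vat(price, rate=0.25):
    """Hypothetical function under test."""
    return round(price * (1 + rate), 2)


# unittest style: a TestCase class and inherited assertion methods.
class TestAddVat(unittest.TestCase):
    def test_default_rate(self):
        self.assertEqual(add_vat(100), 125.0)


# pytest style: a plain function and a plain assert.
def test_add_vat_default_rate():
    assert add_vat(100) == 125.0
```

pytest will also run the `unittest.TestCase` class unchanged, so a code base can be migrated one file at a time. For the parameterizing mentioned above, pytest provides the `@pytest.mark.parametrize` decorator, which runs one test function over a list of inputs.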

00:17:54 [Prayson] So in most cases, when you hear about data scientists, they will start like with the Jupyter notebook, pull in some data with Jupyter notebook, do some exploratory data analysis, do some prototyping of different models, and come with something awesome. And some people think that’s the end. And they go like, oh, that’s just the beginning. So that’s just a prototype of what you can do. So from there it’s like, then how do you move from the Jupyter notebook to the real world? Like having your source code, having your tests, having your documentation, and everything in a more structured way, right? Yeah.

00:18:32 [Prayson] So in my team, I don’t want any Jupyter notebook in production for some reason.

00:18:38 [Prayson] So we always have this folder called notebook, which is usually git-ignored, where they can do a lot of experiments and only some experiment that we want to keep track of. Then we will keep those ones because maybe we made big decisions and we want also some people to do quick prototyping to see why the things that we do. But otherwise, I should say Jupyter notebooks are simply for prototyping. I know, I think it’s Netflix that has taken Jupyter notebook to another level.

00:19:13 [Prayson] I respect that.

00:19:15 [Prayson] But I’m still in a classical way where, no, we still have to go the old way of having just like the way you create a normal package which we could pip install. That’s how we should build all our pipelines.

00:19:30 [Prayson] Yeah.

00:19:30 [Brian] Well, okay, a couple of things I want to poke at before we move forward.

00:19:39 [Brian] Netflix is a big company, and I think you’re right about some of that from Netflix. But they operate on a model where each individual team gets to make their decisions.

00:19:51 [Brian] So I’m sure that there’s some people at Netflix that totally agree with you and there’s some people that use notebooks in production. I think we probably can’t do a blanket statement on them.

00:20:03 [Brian] The other thing is I wanted to ask you about this. Okay. So when you said you were using unittest as a framework, it is a lot different than pytest, but you can do similar things. But I guess I had a question about that. Using unittest isn’t the same as writing unit tests.

00:20:27 [Prayson] That’s correct.

00:20:29 [Brian] So when you switch to pytest, were you writing the same kinds of tests or changing the kinds of tests you’re writing?

00:20:39 [Prayson] I think we were doing similar tests. Right.

00:20:42 [Prayson] But they just became easier and they became more like we could write more tests since the code base was smaller if you compare just so we were doing similar tests. So we had one file here which we used the unittest library from Python, and another one next to it was in pytest.

00:21:04 [Prayson] And one was just taking almost a third number of our code base. Right. And in this case, when you reach the point where I am, for me, writing less code is better because, as they say, it’s a snake, it will bite back.

00:21:35 [Prayson] It’s a Python. Watch out, he bites back. So in this case, you have to write as minimal as possible. Then the bite will not be as painful as when you have this huge chunk of code.

00:21:51 [Prayson] I can just take it to what, like what we’re doing and how this test comes into play, right?

00:21:57 [Brian] Yes. That’d be great.

00:21:58 [Prayson] Yeah. So we perform some kind of unit test and this, of course, is testing different things. So we will test like the input features. So we know, okay, in this project we’re expecting this kind of inputs. Right. And then we know this input should be integers, this should be strings, it should be this. And now, thanks to pydantic, we can do all this validation and everything and this becomes way, way easier. But when we started this, pydantic was not there. So we were not having all this, what you call like free lunch.
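
A hand-rolled version of that input check might look like the sketch below. It uses only a standard-library dataclass as a stand-in for the kind of validation pydantic provides. The `HouseFeatures` schema and its fields are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class HouseFeatures:
    """Hypothetical input schema; a pydantic model could replace this class."""
    rooms: int
    location: str

    def __post_init__(self):
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(self.rooms, int) or isinstance(self.rooms, bool):
            raise TypeError("rooms must be an integer")
        if not isinstance(self.location, str):
            raise TypeError("location must be a string")
        if self.rooms < 1:
            raise ValueError("rooms must be at least 1")


def test_valid_input_is_accepted():
    features = HouseFeatures(rooms=3, location="Copenhagen")
    assert features.rooms == 3


def test_wrong_type_is_rejected():
    try:
        HouseFeatures(rooms="three", location="Copenhagen")
    except TypeError:
        return  # the bad input was refused, as intended
    raise AssertionError("expected a TypeError for a non-integer room count")
```

In a real pytest suite the second test would more likely use the `pytest.raises` context manager; the try/except form is used here only to keep the sketch free of third-party imports.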

00:22:40 [Prayson] We had to do everything ourselves. But now we are replacing so much with pydantic, then it becomes quite easier. But we’re also doing some kind of another part. We’re doing configuration test. So when you create models, some things when you change just a little can create a huge impact.

00:23:00 [Prayson] Right.

00:23:01 [Prayson] So an example is we are creating a cluster and we say that this cluster should contain 55 topics. So if somebody just goes and changed to 54, then you can get completely another picture.

00:23:16 [Prayson] Right.

00:23:17 [UNK] Okay.

00:23:18 [Prayson] And then maybe we are training a model and we want to reproduce this, then we will usually set the random seed to be able to reproduce the same numbers. Right. So if somebody goes and changed this random seed to something else, then you can get a totally complete different model.

00:23:37 [UNK] Interesting.

00:23:37 [Prayson] Altogether. Yeah.

00:23:39 [Prayson] Because when you train a model basically what usually it starts is that the model would generate some.

00:23:46 [Prayson] So imagine we are trying to predict the house price. So we know maybe we need the number of rooms and we need the location. So maybe the location plus the number of rooms will tell us about the sales. I’m just making it very simple, but you can add different features. So when the model starts, it’s usually just create a random weight. So it says model times 0.01.

00:24:12 [Prayson] So it says like room, whatever the size of the room times 0.01 plus maybe the location, which is maybe 100 meters from something times 0.2 equals to the thing we’re trying to predict. So this initial weights here are usually randomly selected.

00:24:31 [Prayson] So it depends on how you start them, how your model can converge when you’re trying to train it. So taking the same initial conditions are very, very important.
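
The role of the random seed can be shown with a tiny sketch. The weighted sum mirrors the rooms-and-distance example above. The function names and the weight range are assumptions for illustration, not any particular library's initializer.

```python
import random


def initial_weights(n_features, seed):
    """Draw starting weights from a seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n_features)]


def predict(weights, features):
    """Weighted sum, e.g. rooms * 0.01 + distance * 0.2."""
    return sum(w * x for w, x in zip(weights, features))


def test_same_seed_gives_same_start():
    assert initial_weights(2, seed=7) == initial_weights(2, seed=7)


def test_different_seed_gives_different_start():
    assert initial_weights(2, seed=7) != initial_weights(2, seed=8)
```

Because training starts from these weights, two runs with different seeds can converge to different models even on identical data.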

00:24:45 [Prayson] So when we’re performing this kind of test, we usually do this configuration testing that we make sure that the same configuration we use in our model to produce in our development, to produce this model, we have to test it, that it stays the same. Because if somebody changed it, then we changed an entire pipeline.
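
A configuration test of this kind can be as small as pinning the values in a test. In this sketch the 55 topics come from the conversation, while the seed value 42, the dictionary layout, and the `find_drift` helper are assumptions for illustration.

```python
# Hypothetical pipeline configuration; in practice it might be read from a file.
CONFIG = {"n_topics": 55, "random_seed": 42}

# Values the approved model was trained with, pinned inside the test module.
EXPECTED = {"n_topics": 55, "random_seed": 42}


def find_drift(config, expected):
    """Return {key: (expected, actual)} for every pinned value that changed."""
    return {
        key: (value, config.get(key))
        for key, value in expected.items()
        if config.get(key) != value
    }


def test_configuration_is_unchanged():
    assert find_drift(CONFIG, EXPECTED) == {}
```

If someone edits `n_topics` to 54, the test fails and the returned dictionary names the key that drifted, which makes the failure message easy to act on.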

00:25:08 [Brian] Okay. So you have to make sure you use those configuration on the final thing also.

00:25:14 [Prayson] Yes. In order to make sure that we keep it consistent. And we can also reproduce the model, because sometimes a customer comes and asks, okay, why was I scored like this? Then you need to rewind time back to play along. How did this happen?

00:25:32 [Prayson] So if something went wrong there, then you cannot reshow how the model reached to that conclusion. So you need to backtrack everything back to say, okay, you were scored this way because of ABCD, right?

00:25:49 [Prayson] So we have this unit test where if I take it from start, where we test these different functionality. So the input is the same. The configuration we used to test the model is the same. Okay. Given this feature, if you put them in the model will produce this outcome, right? So we have to test that one. So if you put this, it will produce this. If you put this, it will produce this. So we can repeat this again and again. When we put it in development, when we put it in staging and when we put it in production, so we can see, okay, it’s a consistent behavior throughout, right? Yeah. And if anything happens, then there’s a failure somewhere. And then we go and figure out, okay, what has changed.
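
One way to express "if you put this in, it will produce this" is a table of recorded input and output pairs that runs in every environment. The frozen model and the recorded values below are invented for the sketch.

```python
def predict_price(rooms, distance_m):
    """Hypothetical frozen model; fixed weights stand in for a trained artifact."""
    return 1000 * rooms - 2 * distance_m


# Input/output pairs recorded when the model was approved (invented values).
GOLDEN_CASES = [
    ((3, 100), 2800),
    ((1, 0), 1000),
    ((5, 250), 4500),
]


def test_model_reproduces_recorded_outputs():
    for features, expected in GOLDEN_CASES:
        assert predict_price(*features) == expected
```

Running the same file in development, staging, and production gives the consistency check described above: any mismatch means something in the model or its inputs has changed.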

00:26:29 [Prayson] And then another biggest part, we have something called the fairness test.

00:26:33 [Prayson] So these are something that is brand new due to GDPR and due to ethical AI.

00:26:42 [Prayson] So these fairness tests, they are more like things that would have saved a lot of companies from getting into bad press. So ideally is that we test if the model segregates a certain group, right. So we look at different things. So we have something like protected attributes. So with different cases, there are different protected attributes. So some people can say age, gender, sex are protected attributes. So once we know these protected attributes, we come to the conclusion these protected attributes need to be removed. Or if they’re there, how do we ensure that they do not cause trouble?

00:27:28 [Prayson] For example, I think there was one company in New Zealand or something where they were offering insurance premium. And then if you actually were born in Saturday, you got a different premium than if you were born in. I mean, if your birth date was on Saturday, you got a different premium than the person who actually had it on a different day of the week, which is hilarious.

00:27:54 [Brian] Doesn’t make any sense.

00:27:55 [Prayson] It doesn’t make sense at all. I might have mixed those dates, but it was just your price was given according to which day of the week, which is just hilarious. So if these tests were put in place, they would have flagged that already, right? They would not have allowed this model to go in production in the first place. So these are like attributes on which we run the fairness test, something we call like a counterfactual test. So you send different kinds of inputs which are very similar, but you only change those attributes that you think might cause issues. Right. So it could be sex or gender, right. Or it could be sexual orientation. And then you see whether the model returns similar results. Right. So if someone says, okay, we want the same model to predict people who live in Copenhagen area and people who live in suburb, they should get the same result. So I will send two different latitudes, one in Copenhagen, another one in another place, and the model should give me the same results. So if it doesn’t give me the same result, we say, okay, our model is segregating on the geography, where you are. Right.
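
A counterfactual test of this kind can be sketched as follows: copy one record, vary only the protected attribute, and require an identical prediction. Both scoring functions and the field names are hypothetical. The second model is deliberately biased to show what the check catches.

```python
def fair_score(applicant):
    """Hypothetical model that ignores the protected attribute."""
    return applicant["income"] // 1000 + 2 * applicant["years_as_customer"]


def biased_score(applicant):
    """Hypothetical model that (wrongly) reacts to gender."""
    penalty = 5 if applicant["gender"] == "female" else 0
    return fair_score(applicant) - penalty


def is_invariant(model, record, attribute, values):
    """True if changing only `attribute` never changes the prediction."""
    baseline = model(record)
    return all(model({**record, attribute: value}) == baseline
               for value in values)


APPLICANT = {"income": 50_000, "years_as_customer": 3, "gender": "female"}
GENDERS = ["female", "male", "other"]


def test_fair_model_passes_counterfactual_check():
    assert is_invariant(fair_score, APPLICANT, "gender", GENDERS)


def test_biased_model_is_flagged():
    assert not is_invariant(biased_score, APPLICANT, "gender", GENDERS)
```

For a model with continuous outputs, exact equality may be too strict; comparing within a small tolerance is a common variation.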

00:29:10 [Brian] Interesting.

00:29:13 [Brian] My thoughts would be that you shouldn’t even have that data like male or female.

00:29:21 [Brian] You could be predicted if you just don’t have that data there. However, it might be some other piece of data that ends up inadvertently segregating male and female populations that even if you take that out. And so having that data there allows you to make that test to say, are we regardless of whether or not the model is using that piece of data, there may be some other piece of data that’s correlated that’s causing the bias or something.

00:29:52 [Prayson] Yeah. So for me, I am usually torn apart between removing data or dropping data or anything. I usually more proposing or putting into place an awareness of something.

00:30:11 [Prayson] It’s good just to be aware what you have. Should someone drop something because it’s sensitive and maybe yes, maybe no. But we should just be very transparent about it. So we should make sure people are aware about these features. What we are trying to do with this counterfactual testing is trying to overcome this part of problems by saying, okay, given this feature, when we send to the model, how do we see the model behaving? I think the problem comes when we don’t know which features we have. And once we know it, we’re trying to hide the truth. For me, I go like, no, we should try to know which features we have and when these features are controversial, let’s become very loud about it. When we’re using them, like, we become really aware we are using them and we are very careful using them. And that means we have to run a lot of counterfactual tests to make sure that if it goes against any things that is not right, then we shut it down.

00:31:16 [Prayson] Right.

00:31:18 [Brian] What does counterfactual mean?

00:31:20 [Prayson] So a counterfactual is what we call states of affairs that could be, but are not.

00:31:27 [Prayson] So a counterfactual will be that we are having this conversation. So there’s another possible scenario where we’re not having this conversation.

00:31:36 [Prayson] So it’s just like it’s a counterfactual.

00:31:40 [Prayson] So it’s a possible state of affair.

00:31:43 [Prayson] So they don’t need to be there, but they can be there so that we can test something.

00:31:47 [Brian] Okay. So we’ve been talking about several types of tests that look at the behavior of the model. Right?

00:31:55 [Prayson] Yeah.

00:31:56 [Brian] Are you writing these in pytest?

00:31:59 [Prayson] Yes, we are writing. Oh, this is pytest. Isn’t it crazy?

00:32:02 [Brian] Yeah.

00:32:05 [Brian] So how does that work? You’re treating the model as a high level thing where you have certain data sets already, use pre-canned data sets, or you make them up yourself?

00:32:17 [Prayson] Yeah.

00:32:18 [Prayson] We start with the data set that we receive. So when we train our model, we put some kind of model in production and before the model goes into production, it goes through these different tests. So we test different parameters.

00:32:34 [Prayson] Does our data set contain this way? When we do text preprocessing, did we remove some words that needs to be removed? Like, people’s name needs not to be there. So we have to check is there any someone’s name is Jacob in the data set? It needs not to be there. If Jacob is there, then the test will fail. So it will take different names and then try to say, are they in these pre process steps? And then there’s something called like stop words, like words which we don’t think has any meaning or anything.
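
The preprocessing checks described here translate directly into small tests. This is a minimal sketch: the name list, the stop-word list, and the whitespace tokenizer are all simplifying assumptions.

```python
STOP_WORDS = {"the", "a", "is", "and"}   # tiny illustrative list
NAMES = {"jacob", "brian", "prayson"}     # hypothetical name list


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words and known names."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and t not in NAMES]


def test_names_are_removed():
    tokens = preprocess("Jacob called and the claim is open")
    assert "jacob" not in tokens


def test_stop_words_are_removed():
    assert preprocess("the claim is open") == ["claim", "open"]
```

A fuller version would also strip punctuation before matching, since this sketch would let "Jacob," with a trailing comma slip through.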

00:33:08 [Prayson] And then after that, when we have a model, when we do this test, we can pull this model down and run this counterfactual test.

00:33:16 [Prayson] The model is not affected by these protected attributes.

00:33:23 [UNK] Yeah.

00:33:24 [Prayson] And then if someone else comes in, oh, I think you should add this one more kind of test. Then we change this one single parameter and see, will the model still behave the same? And if it fails, then we say, thank you. You have found an edge case that we didn’t consider, then we go retrain and fix that issue. But the cool thing is it’s all in pytest. Isn’t it crazy you can write all these things, all these tests in a really systematic way. Another part we do a lot of tests is we do benchmark testing. So whenever we deploy a new model, we have to make sure that we do something.

00:34:05 [Prayson] I think in Google they call it the dark mode or something like that. They call it the dark launch or something at Google, but we just call it the shadow deployment. So ideally is that whenever we introduce a new model, it has passed everything. We do the test where we run our newest model and the current model almost parallel. And then it performs all these other tests too, like stress testing, and see how much can it be hit? Does it get the same result? Is it better than the model that we have? And here we are using pytest to check the performance of the model that is running, which becomes what we are looking for. And then compared to the model that we’re trying to put in deployment. Do you know this concept of shadow deployment?

00:35:02 [Brian] It sounds like you’re just testing two different systems to compare them.

00:35:06 [Prayson] Yes.

00:35:07 [Prayson] So ideally is that you send the traffic which goes to the model that is in production, also to the model that is in the shadow mode, and then you can see how are they performing. But the model in production is the one that still returned the result.

00:35:25 [Brian] Okay.

00:35:25 [Prayson] Model in shadow mode actually return the result in a place where we can do the testing.

00:35:31 [Brian] Okay.

00:35:32 [Brian] So it’s kind of like segmented A/B testing where you get a segment of the population.

00:35:40 [Brian] But instead of just using the new model for the new data, you’re taking a portion of the traffic and sending it to two different models.

00:35:51 [Prayson] Yes.

00:35:51 [Brian] And comparing the two. Okay.

00:35:53 [Prayson] Yes. And then there we’re running the test to say, okay, is the new model outperforming the model that is in production?

00:36:01 [Prayson] Because if it’s not outperforming it, then there’s no need to change, right?

00:36:05 [Prayson] Yeah. So in this case, you have to say that the model we are trying to put is better than the model that is currently in production. And we can say, well, it takes the same traffic without causing any issues. And, well, if you guys are happy with it, then we can deploy the new model and then we just switch to the other one and the end user does not know then what happened.
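
The shadow setup can be sketched in miniature: every request is answered by the production model, while the candidate's answer is only recorded, and a test later compares the two against known outcomes. The models, the traffic, and the ground-truth values are invented for illustration.

```python
shadow_log = []  # where the candidate's answers are recorded


def production_model(x):
    return 2 * x          # hypothetical current model


def candidate_model(x):
    return 2 * x + 0.1    # hypothetical new model


def handle_request(x):
    """Serve the production answer; record the shadow answer for later tests."""
    shadow_log.append((x, candidate_model(x)))
    return production_model(x)


def mean_abs_error(predictions, truths):
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)


def test_candidate_outperforms_production():
    traffic = [1, 2, 3]
    truths = [2.2, 4.1, 6.0]  # invented ground truth
    served = [handle_request(x) for x in traffic]
    shadowed = [prediction for _, prediction in shadow_log[-len(traffic):]]
    # The end user only ever sees the production answers.
    assert served == [production_model(x) for x in traffic]
    # Promote the candidate only if it is closer to the truth.
    assert mean_abs_error(shadowed, truths) < mean_abs_error(served, truths)
```

In a real deployment the log would live in a database or message queue rather than a module-level list; the list just keeps the sketch self-contained.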

00:36:31 [Brian] Similar approach. It happens a lot with refactoring, where somebody says that we don’t intend to change the behavior at all, but we want to make sure that the output that we come up with is the same and that we haven’t degraded performance.

00:36:50 [Prayson] So those are the cool things. I think they are just the things that pytest has made cool to do as a data scientist.

00:37:01 [Brian] Those are some fairly high level things to test for around these models and stuff.

00:37:08 [Brian] But then also some people are also using software tests for little tiny things. Like I write a function, I want to make sure that the function works right.

00:37:18 [Brian] How I expect it. Are your software developers or your data scientists using that as well?

00:37:24 [Prayson] Yes, that is actually part of the unit testing, where it’s all about the input features. So what I talked about is that we usually have a function that loads data. Right. So we actually test this function. Did it load the data the way we think it should load the data, like a sample set? Okay. And then we have different pandas pipelines. So maybe we are removing rows with missing values or we are trying to remove duplicates. We also test those functions: is this function doing the thing that we think it’s supposed to do?
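
Unit tests of this kind might look like the sketch below. To keep it dependency-free, plain lists of dictionaries stand in for a pandas DataFrame; the function names and the sample data are assumptions made for illustration, and a pandas pipeline built on `dropna` and `drop_duplicates` would be tested the same way.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical sample set: one row with a missing age, one exact duplicate.
SAMPLE_CSV = """customer_id,age,plan
1,34,basic
2,,premium
1,34,basic
3,51,basic
"""


def load_rows(csv_text: str) -> List[Row]:
    """Load CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def drop_missing(rows: List[Row]) -> List[Row]:
    """Remove rows that contain an empty value."""
    return [row for row in rows if all(value != "" for value in row.values())]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def test_load_rows_reads_every_sample_row():
    rows = load_rows(SAMPLE_CSV)
    assert len(rows) == 4, f"expected 4 rows from the sample set, got {len(rows)}"
    assert set(rows[0]) == {"customer_id", "age", "plan"}, f"unexpected columns: {set(rows[0])}"


def test_drop_missing_removes_row_without_age():
    cleaned = drop_missing(load_rows(SAMPLE_CSV))
    customer_ids = [row["customer_id"] for row in cleaned]
    assert customer_ids == ["1", "1", "3"], f"expected customer 2 to be dropped, got {customer_ids}"


def test_drop_duplicates_keeps_first_occurrence():
    cleaned = drop_duplicates(load_rows(SAMPLE_CSV))
    customer_ids = [row["customer_id"] for row in cleaned]
    assert customer_ids == ["1", "2", "3"], f"expected one row per customer, got {customer_ids}"
```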

00:38:05 [Brian] Okay.

00:38:06 [Prayson] So those are all part of the unit testing. And then when we’re actually doing the API test, we also do integration testing, because then you are testing both that the function that loads the model did load the right model, and that the model has not changed: it’s still the same model we think it should be. And then when I put in the data, did the pipeline that cleans the data clean it the way we say it should? So we also test all those ingredients as a whole, right?
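
One way to picture such an integration test is sketched below. Everything in it is an assumption made for illustration: the model is a tiny JSON file, "the model has not changed" is checked with a SHA-256 fingerprint, and `load_model`, `clean`, and `predict` are hypothetical stand-ins for the real loading, cleaning, and prediction steps.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of the model file, used to detect an unexpected change."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_model(path: Path, expected_fingerprint: str) -> dict:
    """Load the model only if the file still matches the recorded fingerprint."""
    actual = fingerprint(path)
    if actual != expected_fingerprint:
        raise ValueError(
            f"model file {path.name} changed: expected fingerprint "
            f"{expected_fingerprint[:12]}, got {actual[:12]}"
        )
    return json.loads(path.read_text())


def clean(record: dict) -> dict:
    # Stand-in cleaning step: a missing or empty age becomes 0.0.
    return {"age": float(record.get("age") or 0.0)}


def predict(model: dict, record: dict) -> int:
    cleaned = clean(record)
    return int(cleaned["age"] * model["weight"] > model["threshold"])


def write_toy_model(directory: str) -> Path:
    model_file = Path(directory) / "model.json"
    model_file.write_text(json.dumps({"weight": 0.1, "threshold": 3.0}))
    return model_file


def test_pipeline_loads_expected_model_and_predicts():
    with tempfile.TemporaryDirectory() as directory:
        model_file = write_toy_model(directory)
        released_fingerprint = fingerprint(model_file)  # normally recorded at release time
        model = load_model(model_file, released_fingerprint)
        prediction = predict(model, {"age": "40"})
    assert prediction == 1, f"expected prediction 1 for age 40, got {prediction}"


def test_changed_model_file_is_rejected():
    with tempfile.TemporaryDirectory() as directory:
        model_file = write_toy_model(directory)
        released_fingerprint = fingerprint(model_file)
        model_file.write_text(json.dumps({"weight": 0.9, "threshold": 3.0}))
        try:
            load_model(model_file, released_fingerprint)
        except ValueError as error:
            assert "changed" in str(error), f"unexpected error message: {error}"
        else:
            raise AssertionError("load_model accepted a modified model file")
```

The sketch uses `tempfile` so it runs on its own; under pytest, the built-in `tmp_path` fixture would do the same job.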

00:38:40 [Brian] Yeah.

00:38:41 [Brian] And that’s totally cool.

00:38:43 [Brian] But I’m glad that you came at this from the high-level perspective, being able to do counterfactual testing, benchmark testing, all these different things with pytest, because that’s one of the things I’m trying to tell people about: you can use these software testing tools to test really anything. If you can get at the information in Python, you can use software testing tools like pytest to check whatever you want.

00:39:17 [Brian] And then you get like red light, green light. Is it good or bad?

00:39:21 [Brian] And I love that.

00:39:24 [Prayson] We discovered something quite extraordinary. The moment we started doing a lot of tests, we cut down the debugging cycle big time.

00:39:38 [Prayson] We discovered that we could refactor our code without any fear, because there’s nothing as fearful as refactoring code knowing that it will break so many things. Right. So whenever we do anything, I just say, okay, we’re going to refactor this part.

00:39:55 [Prayson] We do some changes, and then we run the tests and we see, hey, the tests are passing, so we didn’t break something. Oh, they are not passing, I think we broke something. Right. And then from there we have to go quickly and figure it out. And the good thing is that I usually force my data scientists to write good assertion failure messages. So it goes like, oh, we expected this, but we got this. Then it tells me quickly, and I can just go see where it failed. Right. Because sometimes you see an assertion that says, “we didn’t expect that value.”

00:40:31 [Prayson] What value? Tell me which value I gave in and what was the expected outcome. Right. Use the f-string to tell me: I put in this, but actually this was the one that was expected. Then my debugging and my changing of the code will go really, really fast.

00:40:48 [Prayson] So whenever we do that, I usually emphasize, oh, I really want this assertion message, when things fail, to be not too verbose, but to tell me exactly what went wrong and what was expected. Right. Because I’ve seen very funny ones: “something went wrong.” And I was like, how does that help me? Something went wrong. Right.

00:41:13 [Prayson] I want you to tell me exactly: okay, we expected this dictionary with these keys, but these keys are missing. Right. So I know, okay, it seems my change does not send this, so I need to change my code this way. Right. So whenever we see, oh, the model failed, I can see, okay, it’s because the model’s accuracy has degraded.

00:41:38 [Prayson] We expected it to be, say, 98%, and now it’s at 92%, so something failed. Then I know, okay, I need to retrain the model. I need to do something about the model to push the accuracy up. Right.
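
Assertion messages like the ones described here might look as follows. This is a sketch under assumptions: the required keys, the 98% threshold (taken from the example figure above), and the helper names are illustrative, not the team's actual code.

```python
from typing import List

# Hypothetical contract for an API response and a hypothetical accuracy floor.
REQUIRED_KEYS = {"customer_id", "prediction", "probability"}
MINIMUM_ACCURACY = 0.98


def accuracy(predictions: List[int], labels: List[int]) -> float:
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def check_response_keys(response: dict) -> None:
    missing = REQUIRED_KEYS - set(response)
    # Say what was expected and what is actually missing, not "something went wrong".
    assert not missing, (
        f"expected keys {sorted(REQUIRED_KEYS)}, "
        f"but {sorted(missing)} are missing from {response}"
    )


def check_accuracy(predictions: List[int], labels: List[int]) -> None:
    observed = accuracy(predictions, labels)
    assert observed >= MINIMUM_ACCURACY, (
        f"expected accuracy of at least {MINIMUM_ACCURACY:.0%}, got {observed:.0%} "
        f"on {len(labels)} samples; the model may need retraining"
    )


def test_response_has_required_keys():
    check_response_keys({"customer_id": 7, "prediction": 1, "probability": 0.91})


def test_model_accuracy_has_not_degraded():
    labels = [1] * 100
    predictions = [1] * 99 + [0]  # 99% correct, above the 98% floor
    check_accuracy(predictions, labels)
```

With 92 correct predictions out of 100, `check_accuracy` would fail with a message stating that at least 98% was expected and 92% was observed, which points straight at the model rather than at the code.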

00:41:53 [Brian] I like the strings that you can add to asserts. I also try to tell everybody that I work with to use descriptive variable names within the test so that when I have a test failure, we can turn on show locals and dump all the locals with the test failure. And then it helps to describe what’s going on.
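
As a small illustration of the descriptive-names idea, consider the test below. The `min_max_scale` function is a hypothetical example (it assumes the values are not all equal); the point is the naming of the locals inside the test.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


def test_min_max_scale_maps_ages_to_unit_interval():
    # Descriptive names make the dumped locals read like an explanation.
    customer_ages = [20.0, 30.0, 40.0]
    expected_scaled_ages = [0.0, 0.5, 1.0]
    actual_scaled_ages = min_max_scale(customer_ages)
    assert actual_scaled_ages == expected_scaled_ages
```

Running `pytest --showlocals` (or `-l`) includes the local variables of a failing test in the report, so `customer_ages`, `expected_scaled_ages`, and `actual_scaled_ages` would appear next to the failure.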

00:42:15 [Prayson] Yeah, but I’ve also started another culture in my team. I usually say, when I come to your code base, I would just like to go directly to your tests.

00:42:25 [Prayson] And by looking at your tests, I should know what this package is doing, right? Because I don’t need to start going from this to that, to this, to that, to figure out what’s going on. I couldn’t care less. I want to go to the tests and know exactly what the package does, which function interacts with which function, and what the expected outputs are. So by looking at your tests, I should be able to tell exactly what’s going on. Yeah, this actually forces them to write good tests, because they know the first place I go look is not anywhere else. I just go directly to the test folder.

00:43:11 [Brian] I like it.

00:43:13 [Brian] And hopefully this will enforce testing at least early, if not test-first.

00:43:20 [Brian] Because when you have to write the tests in that mode, to describe a package, that means you’re going to write tests that utilize the API for the package. And then if the tests are hard to write, you’re going to change the API so that it’s easier to write the tests, and doing that early is the right time to do it.

00:43:43 [Prayson] Yeah, well, I think when you have mostly junior developers, they usually say, why do I need to write a test? It seems very unnecessary that I’m just testing this. And then I go like, yeah, it’s unnecessary when our code base is very small, but we know that our code base usually grows with the demand, and then the thing that you thought was unnecessary becomes the most pivotal part. And I also say, when we onboard new people into our project, it’s really easy to just take them through the tests to explain what it is that we’re trying to achieve.

00:44:29 [Brian] Thank you so much. I want to learn more about machine learning and pipelines and stuff and then maybe come back and talk to you more sometime.

00:44:37 [Prayson] Yeah, definitely. Thank you so much for having me.

00:44:39 [Brian] Thank you.

00:44:45 [Brian] Thank you, Prayson. That was a really interesting talk. Thank you, PyCharm, for sponsoring this episode. Visit them at testandcode.com/pycharm. Thank you, Patreon supporters. Become a supporter yourself by going to testandcode.com/support. That’s all for now. Now go out and test something.