A discussion with Katharine Jarmul, aka kjam, about some of the challenges of data science with respect to testing.


Transcript for episode 33 of the Test & Code Podcast

This transcript started as an auto-generated transcript.
PRs welcome if you want to help fix any errors.


Brian Okken: Hello and welcome to Test & Code. How about some data science? I am excited to share this interview with Katharine Jarmul, also known as Kjam.  Before we get started, I want to thank all of my wonderful Patreon supporters for sticking with me— you’re awesome, thank you. Special thanks to Oliver, Andrew, Evan and Jordan for contributing at the Super Hacker level. If you’d like to join these super hackers, go to testandcode.com and click donate. Now let’s learn about testing in data science with Katharine Jarmul. Welcome to Test & Code, a podcast about software development and software testing. Could you introduce yourself to my listeners?

Katharine Jarmul: I’m Katharine Jarmul, most people on the internet know me by KJam. That’s been my nickname for a while now, and I’ve been working with Python for nearly 10 years. I work primarily in data science and machine learning, and I run my own data consulting company here in Berlin called Kjamistan. That’s most of what I would say.

Brian Okken: There are a couple of reasons why I really wanted to have you on the show. One of them is that you did a EuroPython training about data unit testing, but also, on Kjamistan, your website, testing is at the top of the list, so it seems kind of important. Why do you care about this so much?

Katharine Jarmul: I have more of a computer science background than a lot of my data science contemporaries, I would say. And because of that, one of the things that I usually notice when I’m working with data science teams is a lot of experimentation and a lot of playing around and figuring something out, and perhaps a little bit less of what we would consider normal computer science principles of having unit tests and making sure that our code is covered and so forth. So when I started thinking about how to talk about testing to folks who maybe just have a statistics background and haven’t necessarily worked in a lot of languages where they need to write structured tests, this conversation came up: how do we test data science code, if it is just experimentation should it be tested, how do we better test pipelines and other automation workflows that we might be working with? And it became kind of a pet topic of mine, I guess you could say, where I was curious to hear how people were utilizing testing in their approaches and to talk more about how we can do that in a better, more streamlined and more automated way.

Brian Okken: How’s that been received so far? I mean, is your company a consulting company?

Katharine Jarmul: Yeah, yeah so I work in consulting primarily and then I also do teaching and speaking.

Brian Okken: Okay, so for instance, is anybody needing help with validating their data science code? Is that part of it, or is that just something that you’re thinking about more now?

Katharine Jarmul: So I definitely work with some clients where that is a big piece, in terms of figuring out how to create data validation in an automated way. And I work with some clients where I work with a QA and testing team, and I provide maybe some data science insights to the QA and testing, as well as figuring out ways that we can do things like property based testing with whatever it is that they’re using.

Brian Okken: Okay, I don’t know where to get started, there are a lot of topics here. How do you get started with testing data science stuff? Do you focus on the data, focus on the pipeline, or is it everything that really needs to be addressed?

Katharine Jarmul: When thinking about this problem and looking at a new project, usually I would start with more of a holistic overview of what’s happening with the system. So for example, if we’re talking about an ETL pipeline, it might have a series of different sources as inputs, those might have particular properties or standard behavior, and we might first think about how we allow for integration testing, essentially. So what happens if one of those ingestion services goes down, does it break the entire pipeline?
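
As a rough illustration of that kind of integration-style check, here is a minimal sketch in Python; the `run_pipeline` function, the source classes, and their behavior are all invented for the example, not taken from any particular framework:

```python
class GoodSource:
    """Stand-in for a healthy ingestion service."""

    def fetch(self):
        return [{"id": 1}, {"id": 2}]


class FailingSource:
    """Stand-in for an ingestion service that is currently down."""

    def fetch(self):
        raise ConnectionError("source unavailable")


def run_pipeline(sources):
    """Toy pipeline: collect rows from each source, recording failures instead of crashing."""
    rows, errors = [], []
    for source in sources:
        try:
            rows.extend(source.fetch())
        except ConnectionError as exc:
            errors.append(str(exc))  # the failure is recorded, not fatal
    return rows, errors


def test_pipeline_survives_a_down_source():
    rows, errors = run_pipeline([GoodSource(), FailingSource()])
    assert rows == [{"id": 1}, {"id": 2}]  # healthy data still flows through
    assert len(errors) == 1                # and the outage is visible downstream
```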

Brian Okken: Okay, remind me what ETL is?

Katharine Jarmul: Yeah, it’s Extract, Transform, Load. It’s a common practice within, let’s say, Hadoop, or now even stream processing if you use Apache Spark or something like that. You’re essentially taking incoming data sources, perhaps doing some transformation and some cleaning in the middle, and then perhaps exporting to some sort of tabular format, or some sort of regularized format, that goes into a so-called data lake, if you will, or whatever your larger storage is that you’re going to use for actually running analysis later.

Brian Okken: Okay, and when you use the term integration test, you mean that more as opposed to unit tests? It sounds like that.

Katharine Jarmul: Yeah, yeah, so for this a lot of times you would want to say, okay, if one of our data sources goes down, or if our connection to the end result tables goes down, what should happen to the job? A lot of times these are managed by graphs, directed acyclic graphs, and that type of behavior is built into them, so that we can determine, okay, we have lost connection at this state, mark it as failed, and perhaps set up a retry at some point in time. And for a lot of the frameworks that people use, that’s somewhat built in.
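
Most workflow frameworks handle this for you, but the retry-and-mark-failed idea can be sketched in plain Python; the helper name and the state dictionary below are hypothetical, purely for illustration:

```python
import time


def run_with_retries(task, max_retries=3, delay_seconds=0.0):
    """Run a task; on failure record the error and retry up to max_retries times."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"state": "success", "result": task(), "attempts": attempt}
        except Exception as exc:
            last_error = str(exc)
            time.sleep(delay_seconds)  # back off a little before the next attempt
    return {"state": "failed", "error": last_error, "attempts": max_retries}


def test_task_is_marked_failed_after_retries():
    def always_down():
        raise ConnectionError("no connection to source")

    outcome = run_with_retries(always_down, max_retries=2)
    assert outcome["state"] == "failed"
    assert outcome["attempts"] == 2
```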

Brian Okken: So that’s for fault tolerance and testing for failures of a data source or something. I think some of this stuff is used to make some sort of decision. We transform data for a reason, and some decision or something at the end is affected by the input data. Is part of the testing to use a fake data source with known qualities to see if those qualities show up in the output?

Katharine Jarmul: Yes, so this is part of what I’ve been pushing more towards. Usually you would have some sort of schema validation or data quality validation within this pipeline, but sometimes that’s just not set up, because the system assumes that the data will be of an expected quality. And the problem with this, if you’re using a lot of outside APIs, or even consuming internal APIs, is that with things like schema changes these validations can throw false positives or false negatives, or alert us of issues with the data that are not actually issues, just a schema change. So essentially you should have the schema validation, but you also need a process for determining when the schema might have changed and reacting to that in an appropriate way. We see this a lot when you have a larger organization and perhaps the engineers working on the API that is being consumed are not aware that that data is being used by another team for, say, analytics or something.
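
A minimal sketch of the kind of schema validation being described, assuming pandas and invented column names; a real pipeline would also need a process for deciding whether a mismatch is bad data or an intentional schema change upstream:

```python
import pandas as pd

EXPECTED_SCHEMA = {"user_id": "int64", "signup_date": "datetime64[ns]", "country": "object"}


def validate_schema(df, expected=EXPECTED_SCHEMA):
    """Return a list of problems: missing columns, wrong dtypes, unexpected columns."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    for column in df.columns:
        if column not in expected:
            problems.append(f"unexpected column: {column}")  # possibly a schema change upstream
    return problems


def test_schema_change_is_reported():
    df = pd.DataFrame({
        "user_id": [1, 2],
        "signup_date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
        "region": ["de", "us"],  # 'country' was renamed to 'region' upstream
    })
    problems = validate_schema(df)
    assert "missing column: country" in problems
    assert "unexpected column: region" in problems
```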

Brian Okken: Yeah, it’s a tough problem. I was talking with a data scientist about a year ago, and we talked about data and pipelines a little bit, but for the code within the transform part, trying to get training for people writing that code to do more unit tests or component tests as well. Is that something you see lacking, and that people need to work on?

Katharine Jarmul: Yeah, I agree completely with that analysis. I think that a lot of times unit tests are not written for that code, or that code is also not modularized in a way that makes it easy to test. And this creates an issue where one says, “Okay, well this is just one chunk of a long pipeline, how do I test just this step?” Depending on what framework you’re using, this can sometimes be easy or difficult, because if you can’t actually run that piece as its own solitary piece of code, then it becomes really difficult to apply these unit testing principles to the different pieces of the pipeline that we may or may not have actual access to test in the way that they would be presented within the framework.
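
The modularity point is easiest to see when a transform step is a plain function that can be run and tested on its own; a hedged sketch with invented names:

```python
def normalize_prices(rows, currency_rate=1.0):
    """One pipeline step: convert raw price values to floats in a single currency."""
    cleaned = []
    for row in rows:
        price = float(str(row["price"]).replace(",", "."))  # tolerate '9,99' style inputs
        cleaned.append({**row, "price": round(price * currency_rate, 2)})
    return cleaned


def test_normalize_prices_handles_comma_decimals():
    rows = [{"sku": "a", "price": "9,99"}, {"sku": "b", "price": 4}]
    result = normalize_prices(rows, currency_rate=2.0)
    assert result[0]["price"] == 19.98
    assert result[1]["price"] == 8.0
```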

Brian Okken: Yeah, you mentioned testing these pieces with unit test techniques, and I like that terminology, even if it goes against some of the CS people. I mean, in the test driven development community there are some who like to think of a unit test as just testing individual functions in isolation. And of course, if that helps people, that’s fine. But when you’re talking about a stage, in other types of software development that might just be the interfaces between components, and those are a great place to test, to chop things off between interfaces. And it sounds like a pipeline stage is a similar sort of interface layer that it’s good to be able to isolate and test around. Is that correct?

Katharine Jarmul: Yes, so these might be, like when we were talking about the graphs, particular nodes or transitions between nodes where we’re passing off between what could be a function or what could be a step, depending on what framework we’re using. And because of that, we need to be able to test what the possible inputs are and what the expected behavior is, whether that’s a particular output or a state change. We need to at least observe it and determine, okay, this has happened and it’s happened within what we would expect, and also determine what happens when something unexpected is seen. Because the last thing that we would want is for all of our data to become corrupted because we’re letting in a few bad seeds, if you will.
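
One way to make “what happens when something unexpected is seen” concrete is to route bad records to a quarantine rather than letting them reach downstream tables; a small illustrative sketch with made-up field names:

```python
def split_good_and_bad(records, required_keys=("user_id", "amount")):
    """Separate records that satisfy basic expectations from ones that don't."""
    good, quarantined = [], []
    for record in records:
        if all(key in record and record[key] is not None for key in required_keys):
            good.append(record)
        else:
            quarantined.append(record)  # keep the bad seeds out of downstream tables
    return good, quarantined


def test_bad_records_are_quarantined_not_dropped_silently():
    records = [{"user_id": 1, "amount": 10.0}, {"user_id": None, "amount": 3.0}]
    good, quarantined = split_good_and_bad(records)
    assert len(good) == 1
    assert len(quarantined) == 1  # nothing disappears without a trace
```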

Brian Okken: So how do you deal with that? I mean there is a lot of statistics in the middle there, right? Do outliers have an effect on the rest of the system if there’s like averaging going on?

Katharine Jarmul: Yeah, if you’re using heavy regularization or other types of normalization, then perhaps you don’t see the problems. That could potentially lead to bigger problems down the line, though, because if you also have access to, say, the raw data, or if a change in the statistical distribution is changing the way that your regularization is behaving, then what can happen is that this masks those problems. And when I, as a data scientist on your team, go and take a look at, perhaps, singling out a particular region, and I want to go back to the non-normalized data, if there are errors in it, or if perhaps we have a unit problem or something was mislabeled or mistyped, which is common, then if those have been swept under the rug, so to speak, I’m going to have a really big problem with whatever particular analysis I’m running. And that might mean, okay, we’ve actually had some bad data or bad labels for the past six months that nobody has noticed, because this is the first time somebody is taking a look at it.

Brian Okken: Are there cases where valid data just has missing data points?

Katharine Jarmul: Nulls are a big part of data cleaning, how do you determine or how do you interact with them. And yeah, that’s a really big problem within data science, obviously: how do you deal with missing data and also what we would call non-signals. So perhaps a signal is a click, right, but a non-signal is a non-click. Now, what does a non-click actually mean? That’s up for debate, right; some people make that a very strong data point and other folks might say that it doesn’t necessarily hold a lot of meaning. And so a lot of the metrics that we use when we think about data science are things that people have determined how to measure. And sometimes we’re incorrect, so sometimes these measurements, these metrics, and also this missing data can skew our analyses in a way that we should be aware of, which is why, in my opinion, things like testing for the behavior when we have a lot of missing values, or testing for the behavior when we see corrupt data, is something we should be aware of way before our data science code hits production.
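
A small sketch of testing behavior under heavy missingness before code hits production, assuming pandas and a hypothetical `daily_average` helper that refuses to report a number when too much data is missing:

```python
import pandas as pd
import pytest


def daily_average(series, min_fraction_present=0.5):
    """Hypothetical metric: refuse to report an average if too much data is missing."""
    if series.notna().mean() < min_fraction_present:
        raise ValueError("too many missing values to report a trustworthy average")
    return series.mean()  # pandas skips NaN by default


def test_average_refuses_to_run_on_mostly_missing_data():
    mostly_missing = pd.Series([1.0, None, None, None], dtype="float64")
    with pytest.raises(ValueError):
        daily_average(mostly_missing)
```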

Brian Okken: So how do we fix it? Do you have any thoughts on that?

Katharine Jarmul: Yeah, I actually just released a tool today on PyPI that I’ve been working on through some of the projects where I’ve seen these types of problems, and it’s called datafuzz. It’s very new and it’s only been used by me, so it may or may not be relevant to everybody’s work, but it essentially tries to take some of the ideas of fuzz testing, or adding noise, and lets you apply them to a data set so that your data set is somewhat corrupted, with a few tweaks that you can control. Then you can run that through, let’s say, pipeline code or something similar and see what happens: does this behave how I would expect, are crashes handled, are warnings sent, does my model just completely fall on its face with respect to machine learning, and so forth. So this is something that I’ve been working on on the side, and I think approaches like this can help us have tools where we can determine what we should do if we see truly poor data. Should we ignore it, should we give a default response, should we throw it away to make sure that it never ever hits the database, should we flag it? These are all things where each team is probably going to make a slightly different decision.
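
The idea behind a tool like datafuzz can be sketched without the library itself: deliberately corrupt a clean dataset and check that the pipeline notices. The column names and corruption choices below are illustrative and are not datafuzz’s actual API:

```python
import random

import pandas as pd


def add_noise(df, fraction=0.1, seed=42):
    """Corrupt a fraction of rows: impossible negative ages and missing revenue values."""
    rng = random.Random(seed)
    noisy = df.copy()
    corrupt_rows = rng.sample(range(len(noisy)), max(1, int(len(noisy) * fraction)))
    for i in corrupt_rows:
        noisy.loc[i, "age"] = -abs(noisy.loc[i, "age"])  # impossible negative age
        noisy.loc[i, "revenue"] = None                   # missing value
    return noisy


def test_pipeline_flags_corrupted_rows():
    clean = pd.DataFrame({"age": [30, 41, 29, 55], "revenue": [10.0, 12.5, 8.0, 20.0]})
    noisy = add_noise(clean, fraction=0.25)
    flagged = noisy[(noisy["age"] < 0) | (noisy["revenue"].isna())]
    assert len(flagged) >= 1  # the validation step should catch what we injected
```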

Brian Okken: And I know in, like, the scientific paper realm, there’s a lot of need for reproducibility of everything. Is reproducibility a part of data science now as well?

Katharine Jarmul: Yeah, I definitely think that that would be a problem if you, let’s say, had a model that you were using and you’re finding it difficult to reproduce. I would say that a lot of the reproducibility issues within the scientific community come down to a lack of open source principles, essentially a lack of sharing the data used to do the training or the experimentation, and a lack of sharing the code, right. So a lot of the folks that I know who are working on expanding reproducibility of, let’s say, natural language processing are working on how to bring these algorithms into the open, allow them to be shared, allow code to be shared with them, and then allow other people to reproduce the results. But I would say that, yeah, because it’s heavily entwined with kind of where machine learning is right now, reproducibility for data science teams, as far as model robustness and model health if you will, is something that has to be determined. Another library that is really great for this is Hypothesis, and I don’t know if you’ve used it before, but it’s property based testing for Python, and I think that for data science purposes, if you do have access to the smaller units of code, it can end up really helping, because it allows you to essentially use some series of static types to test for outcomes. So you can essentially say, “Yeah, I’m expecting integer input and float output,” and so forth, and that will allow you to test in a property based testing format. And David MacIver is the primary author of that, and he’s put a lot of time into having really thoughtful trees so that it can actually find edge cases that you perhaps wouldn’t think of when you were just writing tests yourself.
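
A small property-based example with Hypothesis along the lines described: assert that a summary statistic of any generated list of floats stays within the input range. The `robust_mean` function is invented for illustration:

```python
from hypothesis import given, strategies as st


def robust_mean(values):
    """Hypothetical helper under test: a plain average of a list of floats."""
    return sum(values) / len(values)


@given(st.lists(st.floats(min_value=-1e6, max_value=1e6,
                          allow_nan=False, allow_infinity=False), min_size=1))
def test_mean_stays_within_input_range(values):
    result = robust_mean(values)
    assert isinstance(result, float)
    tolerance = 1e-3  # floating-point rounding can nudge the sum slightly
    assert min(values) - tolerance <= result <= max(values) + tolerance
```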

Brian Okken: I was curious as to what scale you use it at. Is it like a pipeline stage, can you use it at that level, or do you use it at the entire system level, or is it everywhere? Where’s the right place to put Hypothesis when testing data science problems?

Katharine Jarmul: So for the most part I use it when I have a discrete piece of code, as a property test for a particular unit of code. So this particular method should take these numbers and somehow correlate them, or determine if there’s a correlation or something, and it should return this other data type. The difficult thing with Hypothesis, and kind of why it inspired me to write datafuzz, is that outside of the normal properties of the types, you cannot define things down to the nitty-gritty level of saying that perhaps I need to have this type of distribution, or I need to have some other property that I’m looking to test. Where I was using Hypothesis a lot I was finding a few limitations, and that was my inspiration for creating something that can create synthetic data within a particular series of properties and then add some noise to it. But yeah, I think Hypothesis is really great. I haven’t played around with it yet in terms of feeding data into a pipeline, but I have been following the development, and I think there have recently been some changes that would make that a lot easier. I think they even added an example on the site for using it for web fuzz testing, so essentially using it as a generator to create realistic-looking API requests with a series of different static types, and then you can say, okay, we expect these things, or these ranges of things, for these inputs, and use it to fuzz test your API interface.

Brian Okken: I’d just assume the shapes and properties of the distribution are kind of important, especially with these sorts of problem domains. I am often thinking in terms of communication systems and, like, RF data, because I work in the communication industry. A lot of times we have to test systems with RF signals that don’t actually ever exist; we have to be able to pick up something that we can’t get normally. So you said you started out with more of a CS background. How did you get into doing what you’re doing now?

Katharine Jarmul: I originally studied computer science in university and I ended up switching degrees, mainly because there was a really big gender dearth in my program and that led to a few negative experiences, so I switched and mainly focused on political science and economics, which was nice because I still got to use my math. I did quite a lot of statistical analysis on a few projects and was able to keep up with the econ studies that I did. Then I left and didn’t do anything related to technology for several years, and eventually found my way back. When I found my way back, I was working at The Washington Post on one of the reporting teams. And I started getting involved with the apps that were built there, and I don’t know if you know, but at one point in time that was, I believe, the largest Django install in the world.

Brian Okken: Wow, cool. 

Katharine Jarmul: And so I got introduced to Python via building data driven Django apps for The Washington Post. And then I evolved from there, so I started working with a series of data journalists at USA Today after that, and then I left and started working at an aggregation start-up where I got to work on Hadoop and a few other things and started working with NLP, and I slowly found my way from there into data science, as you can probably see from the progression.

Brian Okken: Not only does that sound like a blast and a fun career path, but quite a resume there, wow, very impressive. So what do you really care about now? Actually, before I ask you about that, I want to ask you about your speaking. You’ve spoken at a lot of conferences; why is speaking at conferences important to you?

Katharine Jarmul: I started to first notice this being important when I was part of the founding committee of PyLadies. We founded PyLadies in Los Angeles in late 2010, early 2011. And when I would get in front of a group and speak, I would notice that a disproportionate number of women would come speak to me afterwards and just say that they were really grateful for my talk being there and that it really made them feel welcome. And I think that these kinds of events became important for me in the sense of giving technical talks on stage and being another voice saying, “Gender and your gender identity don’t necessarily have anything to do with your technical ability.” And that has kind of inspired giving talks. That, and I love the challenge of talking with folks afterwards and hearing good questions and ideas. I think they make me feel like I’m giving back a little bit to the community, and that perhaps I’m inspiring some women to give more talks, alongside having really great discussions. I don’t see as much of a downside there, I see only positives.

Brian Okken: I’d like to encourage anybody that hasn’t watched one of your talks to go watch one, because right from the first second you start talking to the audience, it doesn’t sound like a lecture or like you’re instilling knowledge from the top of the mountain. It just feels like a conversation among friends.

Katharine Jarmul: Oh, thanks, yes. I’m glad you liked it. I think the purpose of a talk should perhaps be more conversational than one person being an “expert” and teaching us all how to do things. I mean, there’s a time and a place for that, but I’m not a professor and therefore I don’t need a lecture podium.

Brian Okken: Well it’s cool, but then back to the question that I skipped over— what in your field or out of your field is exciting you right now?

Katharine Jarmul: On the topic of testing, I’ve been pretty passionate about trying to keep up with the literature on how we automate some of this data quality testing in a way that makes sense. And this is, of course, really difficult; we talked about this a bit earlier in terms of, I have this schema, I’m doing schema validation, and the schema changes. Now I need something, perhaps machine learning, to tell me that the schema is likely wrong. That way I don’t have to sit there and filter through a bunch of bug reports or failed jobs, where a machine could perhaps recognize that the patterns have changed. And there’s quite a lot of really interesting research happening right now within this area, particularly around databases. Let’s say you have a large distributed database system and a schema change is affecting and blocking a bunch of writes, or whatever it may be doing; you would probably want there to be some intelligent solution that doesn’t necessarily wake your engineer up at 2 a.m. with an emergency.
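
A toy version of the “tell me the schema has probably changed” idea: compare incoming data against a stored profile and raise an alert instead of silently failing jobs. The profile, thresholds, and column names are invented for illustration:

```python
import pandas as pd

KNOWN_PROFILE = {
    "columns": {"price": "float64", "quantity": "int64"},
    "price_mean_range": (5.0, 50.0),  # what "normal" has looked like historically
}


def detect_drift(df, profile=KNOWN_PROFILE):
    """Return human-readable alerts when structure or basic statistics drift."""
    alerts = []
    for column, dtype in profile["columns"].items():
        if column not in df.columns:
            alerts.append(f"column disappeared: {column}")
        elif str(df[column].dtype) != dtype:
            alerts.append(f"dtype changed for {column}: now {df[column].dtype}")
    if "price" in df.columns and pd.api.types.is_numeric_dtype(df["price"]):
        low, high = profile["price_mean_range"]
        mean = df["price"].mean()
        if not (low <= mean <= high):
            alerts.append(f"price mean {mean:.2f} outside expected range")  # e.g. a decimal-point shift
    return alerts


def test_decimal_shift_triggers_an_alert():
    df = pd.DataFrame({"price": [1999.0, 2499.0], "quantity": [1, 2]})  # cents sent as euros
    assert any("outside expected range" in alert for alert in detect_drift(df))
```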

Brian Okken: Definitely. And also just even more insidious are the ones that don’t cause any known problems, they just make the data different in a wrong way.

Katharine Jarmul: Yeah, exactly. I would definitely want an alert if all of a sudden there was, say, a decimal point change or something; I would need to know if perhaps the API that we’re using has changed the format.

Brian Okken: And I can’t remember where I saw it, whether it was on your website or somewhere, but there was an amusing image of somebody reporting that a person had died and their age was printed as a negative number, just because somebody reversed the birth and death years. That was an amusing error.

Katharine Jarmul: We see these data validation issues in the wild, so to speak, all the time, because as humans we have common sense. And even though computers and machine learning and AI, if you want to call it that, have been making advances, there are still quite a lot of things that you or I would look at and laugh at or think are silly, but somebody has to think to teach that computer, of course, that there is no such thing as a negative age. And I shared that in one of my slides because it’s a really difficult problem. I think that particular one was a Google search result that has since been fixed, but these mistakes happen even at the Google engineering level, probably even more so, right, because of the amount of data they have. And when we start to think about these problems, they can touch any part of our company or our software that relies on data to make some sort of informed decision. And that’s why I think having this data quality, these common-sense checks, is something that we’re going to have to figure out on some level: how do they get automated, how do they get incorporated, and how can we then make sure that they’re valid 10 years from now.
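
These common-sense rules are cheap to express once someone thinks to write them down; a hedged sketch of the kind of check being described, with invented field names:

```python
def check_person_record(record):
    """Encode the 'obvious to a human' rules: no negative ages, death after birth."""
    problems = []
    if record.get("birth_year") and record.get("death_year"):
        if record["death_year"] < record["birth_year"]:
            problems.append("death year before birth year (fields probably swapped)")
        elif record["death_year"] - record["birth_year"] > 130:
            problems.append("implausible lifespan")
    if record.get("age") is not None and record["age"] < 0:
        problems.append("negative age")
    return problems


def test_swapped_years_are_caught():
    assert check_person_record({"birth_year": 2016, "death_year": 1931}) == [
        "death year before birth year (fields probably swapped)"
    ]
```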

Brian Okken: And also the pieces along the way. What I’ve seen before is that if somebody understands the physical characteristics or the real-world characteristics that the data is describing, they can look at the process flow and the result and have, like you said, a rule of thumb as to whether that’s reasonable or not. But the internal stages may be so far removed from the end output and the real physical situation that it’s hard to have a mental model of whether some number is reasonable or not. And I don’t have a solution for that, but that’s something to be aware of. It’s good to be more careful on the inside, where it’s hard for some developers to write a test for what a reasonable number is, when it’s not obvious what a reasonable number would be in certain situations.

Katharine Jarmul: Exactly, and I think this is particularly difficult when you’re dealing with a small amount of data at the beginning and you don’t necessarily have a good idea of your constraints, or when you’re dealing with something predictive, like a forecasting model. Is that a reasonable number? Especially when you have a small amount of important data, or when you have, as we say, a cold start, these are really hard problems to solve, and I think they require a lot of documenting, team discussion, and determining over time whether your initial heuristics were actually correct or not.

Brian Okken: I encourage people to write obvious tests as well, especially on input data, because of things like that example. Make sure that the death is in the year or day after the birth; that seems like a silly thing to check in reality, but when we’re checking the data it might be reversed. I see it all the time with datasets going through, like, DSPs up to higher levels: it’s an ordering thing, and the assumption of where the max value and where the min value are sometimes gets flipped, and we need to make sure at each stage that we’re using the data in the right direction. I’ve seen things that measure the power of a cell phone where, when we do the math and explain it to some of the engineers, that amount of energy is more than the Sun, so that’s not possible. But when people aren’t familiar with the units and it’s just some weird number, they’re like, “I don’t know, it looks like an okay number,” and we’re like, “No, that’s way off.” So we’re trying to teach people what that mental model is. Especially when the developers and everybody in the stage are working on a domain that they’re not really that familiar with, that can be a problem.
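
Both examples here reduce to inexpensive assertions at each stage boundary; a sketch with invented names and limits (the power threshold is illustrative, not a real RF specification):

```python
def check_stage_output(samples, max_power_watts=10.0):
    """Cheap sanity checks between pipeline stages: ordering and physical plausibility."""
    problems = []
    if samples["min_value"] > samples["max_value"]:
        problems.append("min/max appear swapped")  # ordering assumption flipped upstream
    if samples["tx_power_watts"] > max_power_watts:
        problems.append("transmit power exceeds any plausible handset")  # 'more than the Sun'
    return problems


def test_flipped_min_max_is_flagged():
    stage_output = {"min_value": 9.0, "max_value": 2.0, "tx_power_watts": 0.2}
    assert "min/max appear swapped" in check_stage_output(stage_output)
```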

Katharine Jarmul: Yeah, I think that domain expertise is something that is perhaps getting further and further removed from the people working on the data engineering or the data science sometimes. And that is a really important conversation, figuring out that “hey, perhaps the cell phone is not emitting the power of the Sun” is the likely conclusion. I think about how we talk with teams so that domain expertise is shared enough, or at least has been shared once and been applied in some sort of intelligent test design, or at least alerting, right, to minimally say, “Hey, these numbers seem outside the range that we would expect, some human should take a look at them.”

Brian Okken: I’m trying to keep these a little bit shorter but I’m enjoying this conversation. Is there anything that you wanted to talk about that we haven’t hit upon?

Katharine Jarmul: I think the conversation moved past it, but I did really like the point about intermediary data values. This is particularly important when we start thinking about things like neural networks and deep learning, where we perhaps don’t have any interpretability of how that’s operating within the network. I think that there’s been a lot of good research coming out, and even tools coming out, that are starting to allow us to interpret layers of these neural networks, and therefore perhaps to start applying some of these principles to that area, where I can start to say, okay, I can interpret this layer as being activated by zip code, or something like that, it’s being activated by this cluster of zip codes, and for that reason I think it’s perhaps behaving unfairly. This is when we start to think about having fair models; we can start thinking about how we test and introspect these models in ways where we can hold them accountable, where they don’t become these black boxes. And I think it very much relates to your idea of: what is this intermediary stage, does it make sense or not? If we can’t answer whether it makes sense or not, then we probably can’t determine whether it’s doing what we intended or not.

Brian Okken: There’s been a thing that came up, I think it was in the EU or somewhere, like a request that in the future, decisions made by neural networks and artificial intelligence might have to be explained, why they came up with that decision. Is that even possible?

Katharine Jarmul: Yes, it depends on what models you’re using and it depends on how you’ve architected and designed it. There are some that behave much more like black boxes than others, and so there’s been a lot of really cool research happening on how we allow for this level of introspection, so that people can start to say, oh, this is why the algorithm is suggesting X, Y or Z, or this is why it is saying that this customer is going to churn. There was a recent series of research by Fast Forward Labs around this, which is run by Hilary Mason, and there’s also a bunch of research happening with H2O.ai around creating interpretable models within the H2O.ai framework. I think there’s a really big push on quite a lot of continents for this, particularly because, if you think about it from a legal standpoint, if you say that you can’t insure me, or if you say that you can’t give me a bank loan, and then I take you to court and say that you’re discriminating against me, then how do we know whether the algorithm is actually discriminating or not? So we need to be able to have these types of inferences for court cases and legal discrimination cases around the globe.

Brian Okken: Wow, I didn’t think about the legal aspect. I’ve been thinking about, like, the medical part, like if we’ve got something examining X-rays or CAT scans or MRIs that tells somebody, “Yeah, you probably need to start chemotherapy because we think you have cancer.” You’d want to know why, why this thing thought you had cancer, but at the same time it could also teach people things about the data that we didn’t even know. Like maybe non-machine predictors could show up if we understand the data better.

Katharine Jarmul: Exactly, and I think, yeah, you hit it right on the head. A lot of the medical research that’s happening right now is trying to figure out, let’s say, the low hanging fruit, the cases that require a doctor right now but don’t necessarily require a doctor’s expertise, and then maybe you could just look at the borderline cases and say, “Okay, the algorithm only had a 60% certainty here,” or something like that, and throw those to a human to look at. That makes a doctor’s time more valuable, having them do something that a machine can’t do and kind of step in and help when perhaps the accuracy is off, rather than having them spend a bunch of raw time doing something that perhaps even a machine can do better sometimes.

Brian Okken: This has been a fascinating conversation. If people want to know more about what you’re up to, how do they get ahold of you and find out more information?

Katharine Jarmul: I’m @kjam on Twitter, and you can always reach out at kjamistan.com. I have a contact form there as well as my email, so I’m fairly easy to get ahold of.

Brian Okken: Okay, it’s super cool that you agreed to come on the show, so thank you.

Katharine Jarmul: Yeah, thank you so much for having me. 

Brian Okken: I’ll have some links in the show notes, which are at testandcode.com. Ask me questions on Twitter at either @testandcode or @brianokken. Thanks again to Patreon supporters, you rock and you keep me doing this. Until next time, go test some code.