Software developers have long known that automated testing is essential for managing complex codebases. Great Expectations brings the same discipline, confidence, and acceleration to data science and engineering teams.


Transcript for episode 95 of the Test & Code Podcast

This transcript started as an auto-generated transcript.
PRs welcome if you want to help fix any errors.


00:00:00 Data science and machine learning are affecting more of our lives every day. Decisions based on data science and machine learning are heavily dependent on the quality of the data and the quality of the data pipeline. Some of the software in the pipeline can be tested to some extent with traditional testing tools like pytest. But what about the data? The data entering the pipeline, and at various stages along the pipeline, should also be validated. That’s where pipeline tests come in. Pipeline tests are applied to the data. Pipeline tests help you guard against upstream data changes and monitor data quality. Abe Gong and Superconductive are building an open source project called Great Expectations. It’s a tool to help you build pipeline tests. This is quite an interesting idea, and I hope it gains traction and takes off. Thank you to Raygun for sponsoring this episode. Also, thank you to Patreon supporters.

00:00:58 Welcome to Test and Code, the podcast about software development, software testing, and Python.

00:01:06 On today’s Test and Code, I am super excited to have Abe Gong on the show. He’s got a company called Superconductive. Abe, welcome. We’re going to talk about Great Expectations and your company and what you’re up to. But first, before we get into that, can you introduce who you are?

00:01:23 Yeah. So I’m Abe Gong. I have been a data scientist slash data engineer for the last, call it, eight years. Before that, I was in grad school and doing kind of the same thing on the side.

00:01:36 It was fun in 2011 when we started calling it data science, because there was this name for this thing that I did. So I love data. I’m especially interested in human-centric data: healthcare, education, places where rows in your database are people, and by understanding them better, you can actually help people.

00:01:56 Okay. And now Superconductive is your company, right?

00:02:00 That’s right. We’ve been doing healthcare data consulting for the last couple of years, and our goal was always to come up with a product out of that.

00:02:08 We think Great Expectations might be that product. The twist is, I had never really thought about doing an open source go-to-market. And here we are, using Great Expectations as an open source project to try and kind of find where a future paid product could come from.

00:02:25 Okay, well, let’s just jump right in. Great Expectations is a super cool tool, but what is it and why do people need it?

00:02:33 Good question. So the problem we’re solving is pipeline debt and that’s technical debt that grows in data pipelines.

00:02:41 I think the simplest way to say it is: if you were working on an API and you released it out in public and you said, hey, I’ve got this great API, but there’s no documentation and it’s not tested and it’s not stable, like the API is going to give different results at some point in the future, but I don’t know when or how or why. We would all look at that and say that’s just unacceptable.

00:03:02 Right?

00:03:03 Like that kind of API, you can’t build anything stable on top of it. But if you go and talk to people who are working in the middle of companies that have data warehouses and these flows of data moving around, almost none of that is tested or documented or stable. And yet somehow we all think that’s normal today. So I call that collective set of problems pipeline debt. It’s just not doing the things you do to pay down technical debt in the rest of the software ecosystem.

00:03:32 I love that term. That just nails it. Pipeline debt. There’s a lot of cleanup that you do, but you just do that for your data and then forget that that cleanup was necessary.

00:03:43 Yeah, there’s going to be a lot of debt: people forgetting what they did, what order they did things in.

00:03:48 It’s kind of the wild west. There are a lot of things, just good practices in the software world that haven’t yet made their way all the way into the data world.

00:03:56 Right.

00:03:58 Okay, so this pipeline debt. So how does Great Expectations help with that?

00:04:04 The same way you pay down technical debt in a lot of other places: by bringing the system under test. So Great Expectations makes it easy to write rules to validate your data. And those are things like, okay, you’ve got a table.

00:04:20 I would expect it to have columns A, B, and C. And then maybe I’d expect column A to be of type int and B to be of type float and C to be a string. And then beyond that you might say, oh, I also expect that the range of values in A is going to be between ten and 90. And in column C it’s going to match some regex 90% of the time. And for all of them, maybe 2% missing values is the most you could have.

00:04:47 Okay, it’s like asserts on your code, but it’s on the data, right?

00:04:51 Yeah, that’s exactly right. And I guess I should have said this: the library is called Great Expectations because each of these assertions, we call it an expectation, right? It’s an assertion about data.
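
For readers following along, here is a minimal sketch of the expectations Abe just described, assuming the classic pandas-flavored Great Expectations API (method names and type strings can vary between versions); the DataFrame contents and the regex are made up.

```python
import pandas as pd
import great_expectations as ge

# Hypothetical data shaped like the example in the conversation.
df = ge.from_pandas(pd.DataFrame({
    "A": [12, 45, 88],
    "B": [0.1, 2.5, 3.7],
    "C": ["ab-1", "cd-2", "ef-3"],
}))

# The table should have exactly columns A, B, and C, in this order.
df.expect_table_columns_to_match_ordered_list(["A", "B", "C"])

# Column types: A is an int, B is a float, C is a string.
df.expect_column_values_to_be_of_type("A", "int64")
df.expect_column_values_to_be_of_type("B", "float64")
df.expect_column_values_to_be_of_type("C", "str")

# Ranges, a regex that should hold 90% of the time, and a cap on missing values.
df.expect_column_values_to_be_between("A", min_value=10, max_value=90)
df.expect_column_values_to_match_regex("C", r"^[a-z]{2}-\d$", mostly=0.9)
df.expect_column_values_to_not_be_null("A", mostly=0.98)
```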

00:05:01 Do I have to write down all these expectations for Great Expectations or can it infer them from like a set of data?

00:05:07 So we have what we call data profilers in Great Expectations. And the idea is exactly what you’re saying: you give it a set of trusted data and it can go through and come up with a pretty good first slate of expectations. So if you want to get hyped about it, you can say, oh, with data, your tests will write themselves.

00:05:27 I think that’s overselling it a little bit because there’s always going to be specialized domain knowledge and kind of just understanding of what’s important and what’s not that you have to add on, but yes, we have made it so that you can infer a lot of the expectations from raw data itself.
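
To make the profiling idea concrete, here is a hypothetical, hand-rolled illustration, not the actual Great Expectations profiler: scan a trusted pandas DataFrame and propose a first slate of expectations for a human to prune and tighten.

```python
import pandas as pd

def propose_expectations(df: pd.DataFrame) -> list:
    """Hypothetical profiler sketch: infer candidate expectations from trusted data."""
    proposals = [{
        "expectation_type": "expect_table_columns_to_match_ordered_list",
        "kwargs": {"column_list": list(df.columns)},
    }]
    for col in df.columns:
        series = df[col]
        # Propose a type expectation from the observed dtype.
        proposals.append({
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": col, "type_": str(series.dtype)},
        })
        # For numeric columns, propose the observed range.
        if pd.api.types.is_numeric_dtype(series):
            proposals.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": col,
                           "min_value": float(series.min()),
                           "max_value": float(series.max())},
            })
        # Allow roughly the observed share of nulls, with a little slack.
        non_null = 1.0 - float(series.isna().mean())
        proposals.append({
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": col, "mostly": round(max(non_null - 0.02, 0.0), 2)},
        })
    return proposals
```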

00:05:44 And there’s a whole bunch of cool things that you’ve put in place within this that I wouldn’t have expected, things like being able to say: I expect all of the data, all the elements in this column, to be within this range; however, I know that some of them are going to be outliers, so I want 99% to be in this range. Things like that. That’s pretty cool.

00:06:13 And then you said, also you can allow missing data.

00:06:16 The open source library has about 50 of these expectations, and we built these up partly just from personal experience and partly from just gathering suggestions from the community over the last couple of years.

00:06:28 Usually when we kind of present on Great Expectations, people look at a few of them and then they start nodding and they say like, oh yeah, I remember all the hassle that I had with date time formats or strings that have leading or trailing white space. They’re just these problems that show up over and over again. And it’s not that we’ve gotten every single one of them, but we definitely collected a lot of the common ones and made it easy to write good rules for them.

00:06:52 This episode of Test and Code is sponsored by Raygun. Save time with Raygun’s crash reporting. Detect, diagnose, and destroy Python errors that are affecting your customers. With smart Python error monitoring software from Raygun, you can be alerted to issues affecting your users the second they happen. Raygun takes you to the exact line of code where the error has occurred and tells you how many times it has happened and exactly who has been affected. Have complete visibility of your app or website so you discover and fix errors and performance issues before your customers experience them. Raygun supports all major programming languages and platforms and can monitor both back end and front end code. Raygun also lets you fine-tune the filtering and notification controls so you can focus on fixing important issues and problems affecting the most users, not be bombarded with redundant notifications. It takes only minutes to set up. Try it yourself by going to raygun.com. That’s raygun.com.

00:07:54 Do you know if there’s a normal use case for people using it with notebooks, or are they using it mostly as command line tools, or do you know?

00:08:00 That’s actually changing right now.

00:08:04 What I mean is, at the beginning of the summer, Great Expectations was batteries not included. So we had this notion of an expectation: you could deploy that in Python Pandas, you could deploy it against SQL, and you could deploy it against Spark, but we didn’t give any guidance on how you deploy it. Like, how do you store your expectations? If something breaks in the pipeline and expectations start returning false, what do you do in that case? All of that was just an exercise for the reader. Okay. And part of what we’ve done, as we’ve started kind of morphing over and focusing on this more full time, is we’ve talked to a whole bunch of teams that have deployed data validation tools, including Great Expectations, and they all build kind of the same thing in their first couple of months. And that’s now what we’re building out in Great Expectations. So just to give a little bit more color on that, there are two main modes. One mode that you see is teams that are using DAGs. Is the word DAG jargon that most Python users know? I’m guessing it’s not, when you just say DAG.

00:09:15 I think directed acyclic graph.

00:09:17 Yeah. Most data pipelines are now orchestrated as DAGs of one kind or another, so you have this upstream-to-downstream flow of data through the system.

00:09:26 Airflow is kind of the most famous tool for orchestrating this, and there are starting to be some others popping up as well.

00:09:35 So anyway, one way of deploying Great Expectations is in a DAG manager like that: you can put in nodes that will check data. So after you do some transformation or run a machine learning model or ingest data, after a step like that, you can then verify the correctness of the data by inserting a couple of lines of code from Great Expectations. So that’s one mode. The other mode that we see is some teams don’t have a good DAG orchestration tool like that. Or sometimes the team that cares about data quality isn’t the one that does the orchestration, so they have a hard time deploying a tool like that. And in that case, often what people will do is they’ll point Great Expectations directly at a database and say query tables X, Y, and Z every night and generate a report based on that.
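
As a rough sketch of that first mode, here is what a validation node in an Airflow DAG might look like, assuming Airflow 2.x and the classic pandas-flavored Great Expectations API; the file paths, DAG name, and task names are all hypothetical, and the exact validate() signature varies by version.

```python
import json
from datetime import datetime

import great_expectations as ge
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_orders_batch():
    # Load the batch produced by the upstream transformation step.
    batch = ge.read_csv("/data/staging/orders_latest.csv")
    # Load the expectation suite saved earlier (a plain JSON file).
    with open("/pipelines/expectations/orders_suite.json") as f:
        suite = json.load(f)
    result = batch.validate(expectation_suite=suite)
    if not result["success"]:
        # Failing the task stops downstream tasks, so bad data never propagates.
        raise ValueError("Data validation failed: %s" % result)

with DAG("orders_pipeline",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    validate = PythonOperator(task_id="validate_orders",
                              python_callable=validate_orders_batch)
    # transform_orders >> validate >> load_orders
```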

00:10:22 Is the interactive notebook thing something you’re going to keep, or is that something that you’re going to let go of?

00:10:27 So that’s totally there still and a lot of people, when they’re creating their expectations, they’ll use notebooks to dive in and kind of explore their data and write expectations at the same time.

00:10:38 But the place where Great Expectations tends to be most useful in the long term is almost like a monitoring tool where you’re deploying it against production data and saying, look, last week I used this notebook to check and verify that this data followed such and such rules. Now every week when new data flows through my pipeline, I want to verify that it still looks the same and it hasn’t changed from what I thought before.

00:11:02 Okay, I get the report one, the one that says I’m going to look at a table in a database and tell you whether or not it still passes. But in the DAG model, where it’s part of the pipeline:

00:11:16 What does it do when it comes up with failures? Does it send it to a reporting system, or what happens?

00:11:23 Yeah, same thing, actually. It’s the same, but you can do more, in the sense that the data hasn’t yet been stored anywhere or flowed downstream. Sometimes you want to generate that report and be able to dig in and see what went wrong.

00:11:40 But often if you’re in Airflow or something like it, you also want to stop propagation of bad data, because rolling back corrupted data from a system is usually ten times harder than putting it in in the first place.

00:11:53 So people are using this as a stop gate to say, don’t go further if it doesn’t pass here.

00:11:59 Yes. Okay.

00:12:00 That’s interesting. Wow.

00:12:02 I imagine this is useful everywhere. Do you know what kind of people are using this?

00:12:08 So if you jump in our public Slack channel, you will see all kinds of data people. We’ve got people from startups, we’ve got people from big insurance companies, from data consultants, hedge funds.

00:12:20 It’s all over the place. I think the common thing is almost everybody is a Python/SQL/Git user, like some mix of those three things. And everybody is in a place where they’re not the only data person on their team. So collaborating and kind of sharing knowledge about how it all works is starting to be important.

00:12:40 One of the things that surprised me is, with the interactive model, the ability to accumulate the configuration: you can just kind of play around with your data frame and run expectations against it, and the ones that pass get collected into a set of passing expectations, and then you can save that and use it later. The other thing I really like is this ability to say some percentage of the data should pass; in the interactive mode, it still tells you what data doesn’t. That’s pretty cool.

00:13:09 Yeah.

00:13:11 One of the things that we’re trying to do is not full automation of any of this, but sort of automated assists for the processes that people are going through already. So when you’re trying to figure this out, it’s actually kind of hard to sit down and write a good test; you have to turn your brain inside out a little bit.

00:13:32 What we’re trying to do with Great Expectations, and especially that iterative notebook mode that you’re talking about, is: just put something on paper. It will respond and tell you if it’s true or false. And if it’s false, then it’ll give you some hints about how you might want to change it.
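
A small sketch of that iterative notebook workflow, again assuming the classic pandas API (method names may differ across versions); the file names and the age column are made up.

```python
import great_expectations as ge

# A trusted sample to explore interactively (hypothetical file).
df = ge.read_csv("trusted_sample.csv")

# Each call responds immediately with success or failure, and for "mostly"
# expectations the result includes a sample of the values that did not pass.
df.expect_column_values_to_be_between("age", min_value=0, max_value=120, mostly=0.99)

# Collect the expectations that passed and save them for later validation runs.
suite = df.get_expectation_suite(discard_failed_expectations=True)
df.save_expectation_suite("age_suite.json")
```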

00:13:45 Yeah.

00:13:45 It’s probably worth calling out one big difference between testing in the data world and testing in kind of normal software engineering. Okay. And that is, when you’re testing in software engineering, you usually want to test when things change, right? In software, it’s usually when you commit new code, when you compile and deploy new code. So you’ll see CI tools where you run tests on commit or on push.

00:14:13 In the data world, there’s this interesting hybrid where, yes, you change your pipeline code and you want to test then. But actually the thing that changes a lot more frequently is the flow of data through the system.

00:14:27 In order to test when things are changing, you have to test at what we call batch time. So, like, when new data arrives, not just when new code arrives.

00:14:36 Okay. So are there two different modes that you’re working in? Do you use the Great Expectations tool within a pipeline where you’re developing the pipeline or making changes to it? Do you put dummy data in then, or something?

00:14:53 You’ve gone exactly to the thing that makes it a little bit tricky.

00:14:56 The thing that we see most people do first is they deploy it as kind of like a pipeline testing tool, kind of a monitoring tool that observes when data changes. And then as teams get more sophisticated about it, sometimes they go through and eventually they’ll start to build out tests for when the pipeline itself changes. But just like you said, in order to do that, you need dummy data or snapshot data to do that. And there’s actually a whole separate set of questions around like, okay, how do I build up a good library of positive and negative samples of snapshot data?

00:15:28 You’ve brought up things like Airflow and other tools. Are these cloud-based things?

00:15:33 Or are they usually local, or is it all over the place?

00:15:37 Most of them are moving to the cloud now. And in the open source world, most people have pretty good control of the flow of their data from place to place. If you get into kind of more enterprise world, there are some closed ecosystems, like tools that don’t talk very well to other tools that make it hard to plug in other things.

00:15:55 Great Expectations, you can use this even just in closed environments, right?

00:15:59 Yeah.

00:15:59 Anywhere you can pip install something, you can run Great Expectations, right? So if you want to run it on your laptop, totally. If you want to deploy it to production, you dockerize it and put it on a box in the cloud.

00:16:11 There must be a lot of excitement around it. There’s a whole bunch of features that I didn’t expect. So that’s going to be my next question: is there some cool feature that you know of that people don’t expect when they start learning it?

00:16:21 So one of the things that we built over the summer, and it’s still, I’d say, in beta, it’s still definitely got rough edges.

00:16:28 But one that we’re getting a really good response on is a feature that we call compile to docs.

00:16:34 So here’s the set up.

00:16:36 We talked earlier about how we’re trying to solve this problem with pipeline debt.

00:16:41 Some of that debt is because of a lack of tests, but you also have a lot of debt because of a lack of documentation, and people throughout the organization are always coming to the data team and asking them, hey, can you explain how this data works? Can you, like, maybe even produce some docs on it? And I’m guilty of this, too. Very few data teams have good data documentation, because it is just so hard to produce and keep up to date, because the data is always changing. Different stakeholders in the org want different versions of the documentation. And so keeping on top of that is a huge pain.

00:17:16 So anyway, the cool thing that we can do with this notion of an expectation is, it’s not just a JSON blob that can be run as a test. Because they’re all declarative, because they describe exactly what they’re doing, we can compile those into human-readable documentation.

00:17:30 And so that’s kind of the promise there. And when I say human-readable documentation, it’s typically a static website, or you can embed it in something else, and it’ll describe in bullet points or graphs or tables: here is exactly what your data should look like.
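
To illustrate why declarative expectations can be compiled this way, here is a toy renderer, not the real Great Expectations docs builder: because each expectation is structured data, a simple template can turn it into an English sentence.

```python
# Toy example only: the real compile-to-docs feature renders full static sites.
TEMPLATES = {
    "expect_column_values_to_be_between":
        "Values in column '{column}' should be between {min_value} and {max_value}.",
    "expect_column_values_to_match_regex":
        "At least {mostly:.0%} of values in column '{column}' should match the regex {regex}.",
    "expect_column_values_to_not_be_null":
        "At least {mostly:.0%} of values in column '{column}' should be non-null.",
}

def render(expectation: dict) -> str:
    template = TEMPLATES[expectation["expectation_type"]]
    return template.format(**expectation["kwargs"])

print(render({
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "A", "min_value": 10, "max_value": 90},
}))
# Values in column 'A' should be between 10 and 90.
```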

00:17:45 That’s actually pretty awesome.

00:17:46 Yeah, it’s really fun.

00:17:49 The thing is, if you could keep your data docs up to date for free, you would totally do it. But the maintenance cost is usually such a burden. And in this world, if you’re auto-generating your docs from your tests, then you know that as long as you’re running tests, your docs are up to date, because they were actually validated against the data last night.

00:18:08 So the way we’re saying it is, your tests are your docs and your docs are your tests. And as long as you’re running tests, your docs will never go stale.

00:18:15 Yeah, that’s pretty cool. More eyes looking at it to say, that shouldn’t be a requirement for the data at this level, or something like that.

00:18:23 Yeah, especially because you can put it in English. There are an awful lot of people in most companies who need to understand how the data works, but they’re never going to read a JSON blob. And they’re going to have a really hard time parsing Python or SQL.

00:18:37 And so if you can have a thing that is code, that compiles to real tests and can be executed, but also is human-readable for those non-engineer data stakeholders, it really opens up a lot of doors.

00:18:49 Okay. Now, we’re working on it. I think everybody should be able to read JSON and Python and SQL, but we’re not going to win that battle.

00:19:00 Maybe in a generation or two.

00:19:02 I know that there’s some fun things in store for this tool. So tell me what’s on the horizon.

00:19:10 So I think the main thing that we’re working on now is getting the first hour on the first day to just really be smooth and clean.

00:19:19 Like I said, we built out a whole bunch of new stuff over the summer, and we’ve seen enough teams doing similar things that we’re confident that those are the right things to build. But if you go and download it today and dive into our docs today, I don’t feel like our onboarding experience is super clean or intuitive yet.

00:19:40 So that’s stage one: make it easier to learn how to use the tool.

00:19:43 Okay.

00:19:44 Beyond that, there’s a lot we can do to kind of flesh out this notion of compiling to documentation.

00:19:50 And also, going back even a little earlier, this notion of automatically profiling data: there’s a lot we can do to make those smarter. Like right now, they do a decent first cut, but we know that there’s more to do, and we’re definitely interested in pushing harder on those fronts.

00:20:07 The expectations are saved as JSON files. Is that correct?

00:20:11 That’s right, yeah. One of our core requirements is they all have to be serializable to something that embodies the code, but doesn’t...

00:20:18 But it’s a serializable format. It isn’t like a pickled object or something like that.

00:20:22 Yeah. So that people can review it and edit it.

00:20:26 And you can build viewers on top that know how to consume those objects and things like that.
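
For the curious, a saved suite is plain JSON, so it can be diffed, code-reviewed, and consumed by other tools. The sketch below shows roughly what it looks like; the exact top-level fields vary between Great Expectations versions, and the file name is the hypothetical one from the earlier example.

```python
import json

# Load the hypothetical suite saved in the earlier notebook sketch.
with open("age_suite.json") as f:
    suite = json.load(f)

# The content is roughly of this shape (field names may differ by version):
# {
#   "expectation_suite_name": "default",
#   "expectations": [
#     {"expectation_type": "expect_column_values_to_be_between",
#      "kwargs": {"column": "age", "min_value": 0, "max_value": 120, "mostly": 0.99}}
#   ]
# }
print(json.dumps(suite, indent=2))
```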

00:20:30 Well, this is exciting. I think that you’re going in the right direction. I think the idea of making the onboarding experience great. I think that’ll be really cool.

00:20:38 Yeah. It’s coming together. We’ve got a lot of really good people in the community giving us good feedback and telling us where we’re right and where we’re wrong. So just give us a little bit of time and it will polish up nicely.

00:20:48 Data science and decisions based on these data pipelines are becoming more and more a part of all of our lives, whether we work with it or not. So making sure that quality decisions are being made, because we’re making sure that the data pipelines are clean, is a really great thing for you guys to be passionate about, and I’m grateful to have you here.

00:21:08 Yeah, thank you. I totally agree. It’s a fun problem to be working on.

00:21:12 Do you use other traditional software testing projects also to test the rest of the software?

00:21:18 So we use pytest pretty extensively within Great Expectations.

00:21:22 So we’re testing our tests, and they’re actually tests of the tests of the tests.

00:21:28 Doing all the Russian nesting dolls there.

00:21:30 Good job on picking pytest. It’s my favorite test framework, so that’s good.

00:21:34 We were on unittest for a while, and pytest was just a little bit more flexible in some ways that we wanted.

00:21:40 Yeah, I think it’s a good choice. Well, thanks a ton for coming on. If people want to learn more, how do we find out?

00:21:46 Go to our Slack channel. So if you go to greatexpectations.io, that’s actually a good place to start, and from greatexpectations.io you can get to the Slack. We’ve got a public channel, and we try and be super responsive there.

00:21:58 I encourage everybody that’s working with data to check this out. It’s a really cool tool. And thank you, Abe, for coming on the show.

00:22:04 Yeah, thanks so much, Brian.

00:22:08 Thanks, Abe, for working this problem space and for this great interview. Thank you to Patreon supporters for continuing to support the show. Join them by going to testandcode.com/support.

00:22:18 Thank you to Raygun for sponsoring this episode. Take charge of your app and web monitoring with Raygun. Find out more at raygun.com. That link is also in the show notes at testandcode.com/95, as well as a link to Great Expectations. That’s all for now. Now go ahead and test something.