Apache Spark is a unified analytics engine for large-scale data processing. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task.


Transcript for episode 108 of the Test & Code Podcast

This transcript started as an auto-generated transcript.
PRs welcome if you want to help fix any errors.


00:00:01 Apache Spark is a unified analytics engine for large-scale data processing. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. Jonathan Rioux, author of PySpark in Action, joins the show and gives us a great introduction to Spark and PySpark to help us decide how to get started and whether or not Spark and PySpark are right for your project. Thank you, Patreon supporters, for continued support of the show. And thank you, PyCharm, for sponsoring this episode. They’ve made some cool new changes that I’ll talk about later in the show.

00:00:51 Welcome to Test and Code, a podcast about software development, software testing, and Python.

00:01:00 Today on Test and Code, I’m really excited to talk with Jonathan Rioux. I just practiced his name. It’s a French name. It looks cool written out. You’re in the process of writing a book, or wrote a book?

00:01:12 I’m in the process. According to my editor, I’m nowhere near fast enough, but it’s scheduled to be delivered by the end of the year.

00:01:20 Okay. Books are just like software. You set a deadline and miss the deadline.

00:01:25 It’s PySpark in Action. We want to talk about what PySpark is, because I actually don’t know, and who you are and all that sort of stuff. So let’s start there. Who are you and what do you do?

00:01:35 I’m Jonathan. I’m a data scientist for a company called EPAM. We’re a software development consultancy that’s global. I’ve been with them for a little over a year and a half, working on data science engagements for different clients through North America and Eastern Europe. Parallel to that, I work on how we grow talent. I have a small team of data scientists who are coming from different horizons, and I’m trying to scale them up to learn PySpark and be a little bit more big data aware, which is the reason behind the book.

00:02:12 Okay, so you’re helping other people learn how to use it and to learn data science.

00:02:16 My story with PySpark: I basically learned it when, well, my undergrad is in actuarial science. I worked in insurance for many years, started the Society of Actuaries exam process, and realized it wasn’t for me. Then I moved to data science and got thrown directly into an environment where they were migrating from SAS to Python.

00:02:40 The standard big data stack, Hadoop with Cloudera, and nobody knew how to use that software. So fixing big data issues, fixing a broken Spark installation, was pretty much my first two years when I became a data scientist. Through searching that material and collecting notes, I realized that there’s a lot of appetite for people to learn data science, to learn big data, but where to start in a practical fashion is not really covered, especially from the perspective that your data is never going to be clean. And practicing to work with dirty data is something that’s almost impossible to explain. You can’t really prepare someone in any other way than through exposure. So by teaching the fundamentals, I got a better chance at becoming a good PySpark developer, and I wanted to use that to be able to pass it on to some other people.

00:03:45 Okay. PySpark is, before we started recording, you’d mentioned that a lot of mainstream people in Python and data science are using Pandas, and somehow PySpark is not.

00:04:01 Is it an alternate path or are they used together?

00:04:03 Well, I mean, they play very well together. PySpark is very geared towards big data, so the best way for me to summarize it is:

00:04:15 PySpark makes big data processing and machine learning as easy as it can be without making it too easy. You’re able to code on a cluster as if it was just a single node. The only difference is that you have a little bit more control over where the data is, but for the most part, it looks very streamlined and easy to read compared to writing your own MapReduce jobs.

00:04:44 Okay, what does PySpark provide that you don’t get just with Python?

00:04:49 I think historically, I know now that Python has some distributed data processing libraries; Dask is one of the most popular. One of the things that I think Spark got for itself, and then PySpark was able to get, is that their big data model is really, really easy to understand. You have just a handful of data structures that are extremely powerful. It’s very easy to map them in your mind, understand what you’re performing as operations, and it makes your code extremely clean, I would say, compared to Pandas. So if you’re able to use Pandas on a very powerful machine, I think that you should definitely do it. It’s the right tool for the job. But when you’re working with a massive data collection and you can’t afford hundreds of thousands of dollars of RAM in a computer, then Spark becomes interesting, because it bridges the gap to allow migrating to a big data world without leaving too much of the Python flexibility.

00:05:56 Thank you, PyCharm, for sponsoring this episode. PyCharm 2021 is out, and oh, what a treat. Git integration was already amazing, but now it’s even better. You can do interactive rebasing. That’s cool. You can choose to have the commit window appear as a tool window next to your code instead of a pop-up. I really like that. Branch searching is now easier, and you can even install Git right from PyCharm. You can now even install Python right from PyCharm. There are improvements to the virtual environment support and improved support for adding packages to requirements.txt files, if you use those. There’s even a new charm command-line tool that opens a LightEdit version for quick edit jobs that don’t require all the bells and whistles. You can split the terminal window now to see the output of two commands at once. You can configure the status bar. There’s improved database support with SQL script run configurations. They’ve even improved the font with the JetBrains Mono typeface. Now that’s attention to detail. You’ve got to check this out, and you may as well start with the Pro version. It’s free for four months, but only if you go to testandcode.com/pycharm. Save time, use PyCharm.

00:07:01 Okay, so it appears that PySpark is a Python binding to Spark, which must mean that Spark is not written in Python. Is Spark written in something else?

00:07:13 Spark is written in Scala. So one of the cool things is that the operating model is consistent, for the most part, between Scala, Java, Python, and R. I know that Microsoft is working on a .NET binding to Spark, so you would be able to use it with F#, C#, et cetera, which is probably beyond the scope of this podcast. But for data scientists who learn Python, or for a data developer who knows Java and Scala, it creates kind of a good middle ground where you’re able to use Python to do some steps, and then you’re able to take over and use Java, and you know that the data is going to be portable and you don’t have to learn a second library; the vocabulary is the same. It makes a very happy medium between multiple programming languages, which we don’t see a lot in open source projects.

00:08:03 That’s actually pretty cool.

00:08:05 Yeah. I mean, pragmatically, in real-world applications, you can have a mix of languages; it works fine.

00:08:13 So that’s kind of neat. There might be some benefits to having, like, if you wanted to have, I’m just pulling things out of my hat, there are user interface solutions in Python, but there’s also user interface and web interface stuff in .NET. So depending on where the company’s expertise is, that might be a good thing to do. And there’s a speed and size difference as well, do you think?

00:08:37 Well, speed is always something that’s a contentious subject.

00:08:42 A lot of the people that I see using Spark, especially as you’re learning it, you’re working with relatively small data collections, and people get disappointed because all of a sudden it’s not as fast as advertised. And Pandas can be faster on a single node because it uses NumPy and highly optimized code. Spark is really amazing the moment that you scale beyond a computer. So we were working with terabytes, and even though terabytes of data is still not necessarily considered big data, you write your code in PySpark, you make sure that it runs fast enough for you, and then you know that as your collection of data grows, Spark will never become a bottleneck. Okay, interesting. Where Pandas can potentially become a little bit of a problem is the moment that you need to scale beyond a single node.

00:09:36 Well, keeping that in mind then, is it still something approachable for an individual who wants to just start learning data science, and start learning data science with Python and stuff? Does it make sense for them to use PySpark even on their own computer?

00:09:52 I had an easier time learning PySpark than I had learning Pandas. PySpark has a couple of modules, and the one for data manipulation is called pyspark.sql, and it uses the same vocabulary as SQL queries. So for somebody who has had the opportunity to work a little bit with SQL, and understands what a select statement is, where, filter, group by, et cetera, basically the PySpark API for data manipulation is SQL with methods.

00:10:26 So you do df.select() and then you put in the columns that you want, where() and then you can filter, groupBy() and then you group by the columns. So I think the term for that is either method chaining or a fluent API. You’re able to read the instructions one by one, and it becomes extremely easy to write them. And then you let the Scala optimizer take all of those instructions, reorganize them, and optimize them to make sure that your code runs as fast as possible. So it’s extremely descriptive. Most of the time you see people having monster chains on a df for a couple of lines, but it’s still extremely readable, and it removes the need for a lot of documentation because your code is very transparent.
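
For readers following along, here is a minimal sketch of the chained, SQL-flavored style Jonathan describes. The table and column names (sales, region, amount) are made up for illustration; they are not from the book.

```python
# Chained PySpark data manipulation: select -> where -> groupBy -> agg
# reads much like the equivalent SQL query.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("fluent-example").getOrCreate()

sales = spark.createDataFrame(
    [("east", 120.0), ("west", 80.0), ("east", 45.5)],
    ["region", "amount"],
)

summary = (
    sales.select("region", "amount")
    .where(F.col("amount") > 50)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

summary.show()
```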

00:11:15 That’s pretty exciting. Anybody that’s not familiar with SQL might be put off by that and a little scared. I personally am of the opinion that we’re doing everybody a disservice by trying to hide SQL from people, because it’s something that will help you in your career. If you just spend a weekend doing some tutorials, just going through some play-example tutorials with SQL, it will help you for the rest of your career. Even when you normally use libraries that kind of hide SQL from you, there are times where you have to get in there and write it anyway. You’ll be grateful for that weekend you spent playing with it. So I think it’s actually pretty cool that it’s an API that’s mapped similarly to SQL statements. That’s neat. I think you said this while we were recording: you’re training some people on your team. Are they direct reports, or are they just people you work with?

00:12:09 The way the team is organized, everybody under big data, quote unquote, is in a single large team. So we have data scientists, data engineers, data strategists, data quality analysts, et cetera. And then people get to pick and choose depending on what the assignment is. I would say PySpark is usually front and center in a lot of the assignments that we’re getting. Whether you work with a bank or a telco or even retail companies, the moment that you start digging a little bit, there’s a lot of data in the making, and it helps to have an easy-to-use tool. I mean, going from complete beginner to comfortable writing a little bit of code to do some exploration is something that you can pretty much do in a day, and then you’re able to scale up as you go along. There’s a lot to discover. It takes me more time to do research than to write, because there are so many details and intricacies and it is such a wide subject. But to get started, it’s really not that long.

00:13:17 How much of this is targeted towards exploratory stuff? Is there a difference between working pipelines and exploratory work?

00:13:24 This is kind of the dirty secret of PySpark when it comes to doing exploration of data. If you take an example, showing data on the screen is something that any data processing library needs to be doing. But when it comes to charting, distributing a chart is really not an operation that a person would do.

00:13:48 So PySpark doesn’t provide charting by default. What it does is provide a very handy method called toPandas(), which is going to return a Pandas data frame that you can then chart.

00:14:03 So I think it’s important to draw the line that it is not replacing Pandas in any way. As a matter of fact, it works super well with it. When it comes to large-scale data manipulation, my workflow is usually: I stay in PySpark for as little as I can, because the moment that the data becomes single-node large, I’m able to scale down to Pandas and start using Plotly or Bokeh or Matplotlib to do some graphical exploratory data analysis, and it becomes that kind of rapid fire. I lift back up to PySpark once I’m done with it and I’m comfortable with the data. Then I write my pipeline, and it goes from manipulation to preparation of data to machine learning. And then if I need to present the results, I usually use a blend of PySpark and then scikit-learn or statsmodels, depending on what model I’m using.
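
A rough sketch of that hand-off, assuming `summary` is the small aggregated Spark DataFrame from the earlier sketch (any Spark DataFrame that comfortably fits on one machine would do):

```python
# Hand a small Spark result over to Pandas and Matplotlib for charting.
import matplotlib.pyplot as plt

small_df = summary.toPandas()               # Spark DataFrame -> Pandas DataFrame
small_df.plot.bar(x="region", y="total_amount")
plt.tight_layout()
plt.show()
```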

00:15:00 I saw a talk recently, I can’t remember where it was, where somebody said, maybe it was you, I can’t remember, that 80% of data science is cleaning the data. So I’m guessing PySpark is helping you with that beginning part, where you’re cleaning everything up and selecting and pulling things out and whatever.

00:15:24 Well, realistically, when you’re cleaning big data, you don’t take the same approach as if you have an Excel file or a couple of thousand rows. There are still some data scientists out there who are looking at their data line by line to try to make sense of some patterns. But when you’re in PySpark, you change your approach by sampling intelligently, and it becomes more of an exercise in data manipulation: using distinct records, showing the top 20, the bottom 20, to be able to explore your data in another fashion, because you will never be able to review everything. And there’s a little bit of, you’re building evidence that your data is good rather than assessing that the data is perfect.
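
A few illustrative inspection moves along those lines, for data you can never read end to end. Here `df` stands in for any large PySpark DataFrame, and the column names are invented:

```python
import pyspark.sql.functions as F

df.printSchema()                              # what columns and types do we have?
df.select("status").distinct().show()         # which distinct values actually occur?
df.orderBy(F.col("amount").asc()).show(20)    # bottom 20 by amount
df.orderBy(F.col("amount").desc()).show(20)   # top 20 by amount
df.sample(fraction=0.001, seed=42).show()     # a small random sample
```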

00:16:08 If somebody was going to try to learn both, you’d want to learn both the Pandas model and this, and learn Spark also, I’m guessing? Or can you use one or the other?

00:16:20 Start with Pandas, learn the ecosystem, and the moment that you feel like Pandas is holding you back, it is a good idea to look into PySpark. For those people who really like the Pandas API (I tend to be in the opposite camp; I wish Pandas looked more like PySpark), Databricks developed a very nifty library called Koalas, which basically provides a Pandas-style interface to PySpark. So you’re able to use PySpark as if it was Pandas, and for the few times that I’ve used it, it’s actually pretty good. You don’t have to worry: the moment that the data is scaling up, PySpark is getting used, and then all of the collections are Pandas collections.
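
A hedged sketch of what Koalas usage looks like (in newer Spark versions the same idea ships as pyspark.pandas); the data is invented for illustration:

```python
# Pandas-style front end that runs on Spark under the hood.
import databricks.koalas as ks

kdf = ks.DataFrame(
    {"region": ["east", "west", "east"], "amount": [120.0, 80.0, 45.5]}
)

# Familiar Pandas idioms, executed by Spark:
totals = kdf.groupby("region")["amount"].sum()
print(totals)

# Moving between the worlds is explicit:
pdf = kdf.to_pandas()   # Koalas -> Pandas (collects to the driver)
sdf = kdf.to_spark()    # Koalas -> native Spark DataFrame
```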

00:17:09 There’s really a great will from the creators of Spark to integrate PySpark and Pandas and make them play very nicely with one another. So my suggestion is always to learn both, but people have limited bandwidth, and sometimes you don’t want an N-plus-first library.

00:17:29 Visualization is a good example. How many visualization libraries does Python have? I think it’s an infinite amount. But I really believe that PySpark has its place in a data scientist’s toolkit, and it is just a very nice tool to use.

00:17:46 You gave a talk on it, right? That was last year.

00:17:49 Yes, last November at PyCon Canada.

00:17:52 Okay, we’ll drop a link in the show notes. I’ll admit I haven’t watched it yet, but you mentioned that in the talk you covered some cool things and some not-so-cool things about PySpark, so I thought it would be fun to touch on one of each. Can you tell me one cool thing, and then we’ll talk about one not-so-cool thing?

00:18:10 Oh, sure.

00:18:11 My favorite cool thing about PySpark is what I call the Mario analogy. Because in PySpark, every method, or at least a lot of the functions that operate on the data frame, returns a data frame, you’re able to dot, dot, dot into your data processing. So data frame, select, where, filter, group by, aggregate, et cetera. I use the analogy that it’s basically like being a plumber: you’re just assembling pipes together, and it makes for a very easy way to understand how you’re doing your data manipulation. One of the things with Pandas is that even to this day, although I’ve been using it for a couple of years, when I need to do multiple selection, I still get confused about loc and iloc, although I should know that one is for integer positions and the other one is for the labels. In PySpark it’s really easy to use that plumbing analogy, and it makes for very readable code, which is so important in my book. If somebody reads your code, there’s a difference between “I don’t understand exactly what the code is doing, but I see the idea” and looking at it and having no idea what this is about, and I think it’s really hard to build a pipeline program where people are going to be completely clueless.
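
As an aside for readers, here is the Pandas distinction he is referring to, on a tiny made-up frame:

```python
# .loc selects by label, .iloc selects by integer position.
import pandas as pd

pdf = pd.DataFrame({"amount": [120.0, 80.0, 45.5]}, index=["a", "b", "c"])

print(pdf.loc["b", "amount"])   # by label    -> 80.0
print(pdf.iloc[1, 0])           # by position -> 80.0
```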

00:19:37 Yeah, I’ve seen the multiple dots. That means that each of these operations returns the resulting data frame, so that you can keep operating on it.

00:19:46 Exactly.

00:19:47 Is it a good idea? I’ve seen examples of it. Mostly I see people using a line continuation character and then writing the new operation on the second line, or continuing on the next line. Do you recommend that, or do you not like that?

00:20:02 I have offloaded my code style; basically I rely on Black to decide for me. Okay, I’m not a huge fan of the backslash to continue on the next line, but one of the things that’s pretty cool is that in Python, if you do, let’s say, data frame equals, and then you start your operation and you have statements on multiple lines: if you start your statement with a parenthesis, basically everything inside the parentheses is not subject to the same whitespace and line-continuation rules. So you can just wrap it in parentheses, in kind of a poor man’s list, and that allows you to have your code well aligned and everything very readable. It seems to be the preferred way that Black does it, so because it agrees with me, I tend to say that it is the way to go.
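
Here are the two layouts side by side, reusing the `sales` DataFrame and the `F` alias from the earlier sketch. Black prefers the second, parenthesis-wrapped form:

```python
# Backslash continuations work, but stray whitespace after a backslash breaks them:
totals = sales.select("region", "amount") \
    .where(F.col("amount") > 50) \
    .groupBy("region") \
    .count()

# Wrapping the whole expression in parentheses avoids the backslashes entirely:
totals = (
    sales.select("region", "amount")
    .where(F.col("amount") > 50)
    .groupBy("region")
    .count()
)
```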

00:20:58 It looks better too, I’ll admit, because the line continuation character is something I’m used to seeing in rare C examples, in C code with statements that can’t be split up. But it is a little surprising to see it in Python code. It’s not something we see very much.

00:21:15 Well, it has to do with the fact that the API is inspired by Scala, and some of the time, when you have a lot of methods chained one after the other, they can get pretty long. I would say PySpark code, all things considered, will have a tendency to eat more horizontal space than regular Python code. That has been my experience, and it can be a problem. I haven’t taken the time to configure my editor to stop annoying me when I go over 80 characters.

00:21:46 So Black formatting will do the right thing and reformat these things onto multiple lines? That’s pretty cool.

00:21:53 You know, when you’re working in a team, you don’t have to bother about personal code preferences. Everybody runs Black on their code, everybody’s consistent. We have Black as a pre-commit hook in Git to make sure that everybody is consistent. I haven’t made a lot of friends because of this, but I think it’s for the better.

00:22:12 Do you change the line length or the string normalization, or do you leave them as default?

00:22:17 I think Black allows up to 88 by default.

00:22:21 For my personal project, I go up to 100, but my company’s code style is to keep it to the default.

00:22:28 That’s something, I guess. Good for you guys.

00:22:32 I don’t remember what we have. I think we have like 125 or something like that. But we all got big screens. So that was something cool, the dot notation. How about something not so cool?

00:22:44 The stack traces. They’re Java stack traces, so you basically need a double degree in forensics and a huge dose of patience to be able to understand what happened. And it’s especially prevalent because Spark is a pretty complicated library, so you see methods upon methods being called, and the stack is very long, and that becomes a huge annoyance. My trick is usually to go straight to the bottom of it, and then if that doesn’t work, go straight to the top, and if that doesn’t work, Google the first few characters and see if somebody already answered my question. But this is definitely the most frustrating thing. The duality between Java and Python can also be annoying, so sometimes you have to be mindful of your types. Java has a notion, like, you can have a data frame that has shorts in it, and in Python, numbers are numbers; they’re very well behaved. So sometimes, especially when people are importing data straight from databases, they do an addition and they see the number rolling over. So this is something to be mindful of, I would say. Those are the biggest annoyances that a beginner encounters in PySpark, and they make up approximately 90% of the complaints that I’ve heard so far.
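
A hedged illustration of that type gotcha: a column read in from a database as a short or int can silently wrap around during arithmetic on the JVM side, something Python's arbitrary-precision ints would never do, so casting up front is a cheap guard. The `orders` DataFrame and its columns are invented for the example:

```python
from pyspark.sql import functions as F

# Widen the column before doing arithmetic on it to avoid overflow surprises.
orders = orders.withColumn("quantity", F.col("quantity").cast("long"))
orders = orders.withColumn("double_quantity", F.col("quantity") * 2)
```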

00:24:05 Okay, yeah. So test your code, put it in version control and be willing to back it up and start over.

00:24:14 That’s my solution to confusing stack traces.

00:24:18 Any issue with writing test code with PySpark?

00:24:23 The testing strategy when you’re working with a library like PySpark, and I’m pretty sure it’s similar for a lot of data-driven applications.

00:24:31 I’m huge on testing, mostly because I don’t trust myself. Some of the time, especially if you’re using PySpark in the context of big data processing or in the cloud, you’re talking about a cluster of a couple of machines, which can be pretty pricey, and you want to spend as little time running your tests as possible. I use a different strategy depending on what my code targets. For my overall data pipeline, what I’m usually going to do is use some kind of reasonability check as I go along, testing a bunch of select statements and some where clauses. Those are more related to either business rules or what you want to accomplish, and they become quite trivial to test very fast. But anything that is a little bit more complicated, usually I will carve it into a function on the side, and I will use either a local installation of Spark, or we’re going to add it to our CI/CD pipeline, to test some minimal use cases that we have a high level of confidence are representative of the problem. A good example is if you’re writing an email parser: you’re not going to test over your whole data frame, you’re going to select a bunch of use cases, and once you’re done with that, you get relatively confident that it’s going to work for everything. And then as you go along, you try to remove as much as possible the tests that target the whole data, just because they’re very long to run. Setting up a cluster with PySpark takes time; I work mostly with Google Cloud Platform, and it can take a couple of minutes to set up your cluster, which sometimes can be just too much when you’re coding and you’re testing something. So it’s always a good idea to have a local installation in which you can sanity check what you have, and it removes a lot of the pressure on the Git repository as well, because you don’t want to spin up a cluster every single time you do a commit.
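
A minimal local-testing sketch along those lines: a pytest fixture that builds a local SparkSession, and a unit test for a small, carved-out transformation on hand-picked rows. The function and column names are hypothetical, not from the book:

```python
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


def keep_valid_emails(df):
    """The kind of small, isolated transformation worth unit testing."""
    return df.where(F.col("email").contains("@"))


@pytest.fixture(scope="session")
def spark():
    # A single-threaded local Spark; no cluster needed to run the test suite.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_keep_valid_emails(spark):
    df = spark.createDataFrame([("a@example.com",), ("not-an-email",)], ["email"])
    result = keep_valid_emails(df)
    assert result.count() == 1
```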

00:26:34 I think that’s a good idea. Do you have local representative data, or do you usually pull live data and use a subset? How do you deal with that?

00:26:44 Usually what I’m going to do is select some relevant samples of the data. We also have kind of a synthetic data generator that is going to generate some data according to multiple factors. Sometimes, if the data is in the cloud, you don’t necessarily want to get it out of it, so you’re going to test on fake data. You’re still going to do your integration testing using all of the real data, and you might find some other mistakes that are going to drive your unit testing or your smoke testing. But yeah, we tend to avoid using the real data, mostly because, as a consultant, it’s not ours.
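
A rough sketch of that fake-data idea: generate plausible-looking rows locally instead of pulling real client data out of the cloud. The schema and value ranges here are invented for illustration:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("fake-data").getOrCreate()

# Synthetic rows: an id, a categorical field, and a numeric amount.
rows = [
    (i, random.choice(["east", "west", "north"]), round(random.uniform(0, 500), 2))
    for i in range(1_000)
]
fake_sales = spark.createDataFrame(rows, ["order_id", "region", "amount"])
fake_sales.show(5)
```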

00:27:23 Yeah, that’s true.

00:27:26 Okay, so I have to apologize to you. I realized that testing is in Chapter 13, and you haven’t written Chapter 13 yet.

00:27:34 That’s okay. I would say this was one of the hardest chapters to do research on. You know, at the time when I started writing the book, testing was not a subject that was really fashionable; there was not a lot of information. So I had to come up with what I thought made sense and then talk to people who do testing for other libraries. Joel Grus from the AllenNLP library was a big inspiration. Seeing him test machine learning code is something to be seen, and I think a lot of data scientists should watch him and take as much learning as they can from it. Untested code is dangerous code, and machine learning is code that makes a lot of decisions on behalf of people. With a lot of the models, you’re trying to automate human reasoning or replace human processes. If your code is not tested, who cares about how good your model is?

00:28:33 Well, yeah, except we’re still getting to the point where we’re trying to convince everybody of that: testing belongs everywhere. If there’s software involved, testing needs to be involved. So I’m glad you’re on board with that. Cool. Joel, what did you say his last name was?

00:28:49 Grus.

00:28:50 Oh, yeah.

00:28:52 Does he have a video out or something that does this?

00:28:57 He has a bunch of videos that are incredible.

00:29:00 Yeah. So you don’t have to come up with it on the spot right now, but in the next couple of days, if you could send me a link to something I could include for people to take a look at, that’d be great. Plus your talk, of course. So thank you so much for joining me today. Any closing thoughts or anything that we forgot to cover that you’d like to talk about?

00:29:18 PySpark is fun. I really recommend everybody give it a try. Try it on your next data science project, even if it’s not using big data. Have a look at the API. I think it’s going to expand people’s minds a little bit about what a data processing library should look like, and I’m very happy about their dedication. I know it’s natively a Scala library, but their Python API is extremely good.

00:29:44 Cool. And I will put a link to your book up as well. There are the first three or four chapters, I guess the first three chapters and then another chapter, available for people to take a look at.

00:29:55 One, two, three, and six. That’s because the one in between is on my plate for review; I need to split it into chapters.

00:30:04 Manning is extremely demanding in terms of what they expect the book to be. It’s incredible working with them, but it makes you realize that writing in school and writing a book are totally different ball games. It is very hard.

00:30:22 Yes. Writing a book is something I wish on all of my enemies.

00:30:29 It’s an incredible learning experience, as well as teaching. And that’s kind of one of those things, that old saying that the best way to learn something is to try to teach it. And writing a book is eye-opening as to how much stuff you thought you knew, and then you go to write it down and realize you don’t know how to write it down.

00:30:48 Yeah, in your mind it’s a beautiful picture and you’re like, this is going to be amazing, and then two hours later you have three paragraphs written, and your editor is telling you that you’re late. Special shout-out to Marina. I’m sorry.

00:31:05 Well, I wish you well in the rest of writing this book. I’ll keep an eye out for the final version, and especially for Chapter 13 to come out. Thanks a lot for coming on the show. I’ll talk to you later.

00:31:17 I mean it’s a real pleasure. Very pleased to have been able to be here.

00:31:22 Thanks.

00:31:26 Thank you, Jonathan. PySpark sounds super interesting, and I love that you’ve included testing in your book. Thank you so much to Patreon supporters for supporting the show. Join them by going to testandcode.com/support. And thank you, PyCharm, for sponsoring this episode. The link to the extended trial is at testandcode.com/pycharm. That link is also in our show notes at testandcode.com/108. That’s all for now. Now go out and test something.