Researches and others using data science and software need to follow solid software engineering practices. This is a message that Joel Grus has been promoting for some time.
Joel joins the show this week to talk about data science, software engineering, and even Fizz Buzz.
This transcript starts as an auto generated transcript.
PRs welcome if you want to help fix any errors.
00:00:00 Researchers and others using data science and software need to follow solid software engineering practices. This is the message that Joel Grues has been promoting for some time. Joel joins the show this week to talk about data science, software engineering, and even Fizz Buzz. This episode of Test and Code is brought to you by PyCharm, Savettime, use PyCharm and by listeners like you who support the show through Patreon.
00:00:37 Welcome to Testing Code because software engineering should include more testing.
00:00:43 I’m really excited today on a Test And Code to have Joel Bruce and I’ve been following his work for a while, although you do quite a bit. Can you introduce yourself?
00:00:53 Yeah, totally. So my name is Joel. I work in software engineering and Machine Learning Capital Group, a large investment company where I lead a small team that basically builds machine learning and data products for people who make investments. Before that, I was at the Allen Institute for Artificial Intelligence, which is an AI research nonprofit in Seattle. There I worked on a team called Allen. Nlp built a library called Allen LP, which was basically a deep learning library for NLP. Researchers helped them write their experiments, do good science. And so I helped develop the library. I helped support the library. I partnered with researchers to write their experiments. What else? Before that, I was at Google for a couple of years doing software engineering, and then before that, I was doing data science at a bunch of startups in Seattle. So my background is really a mix of pure data science, pure software engineering, and more recently, things kind of sitting at the intersection of both. I might be known for my book Data Science From Scratch, which is in a rally book about data science. The second edition came out last year. I think it’s pretty good. I may be known because I’m the person who went to Jupiter, Con. And gave a talk called I Don’t Like Notebooks because I Don’t Like notebooks. And so people shared that online. I wrote a blog post that people enjoyed that was called Fiz Buzz in TensorFlow about a guy who gets asked to solve Fizz buzz at an interview and use TensorFlow to show up as interviewer. And recently I sort of expanded on that and wrote a book called Ten Essays on FizzBuzz, where I took basically ten different solutions of FizzBuzz, each of which was interesting in its own way and used each one as an excuse to sort of write an essay or document about some aspect of Python or math or coding or software design or engineering or something like that. So that’s kind of my varied background.
00:02:36 Wow. That is quite a varied but a lot around the data science, machine learning and data part of it, right? Yes. And you might confusingly with somebody else. Do you also podcast?
00:02:46 I do. I have a podcast with Andrew Musselman called Adversary Learning.
00:02:50 I can’t say we’re as regular as you are about putting out episodes, but we do every once in a while.
00:02:57 And was it just for fun or why did you guys start that?
00:03:01 Honestly, it’s mostly just for fun. Also, it gives us an excuse to hang out and talk for an hour every once in a while, which is sometimes hard to do with your friends. And we’ll invite our other people on. And it’s sort of a fine line between talking about data science and software and then being the two old guys in the balcony on the Muppet show who sort of heckle things.
00:03:21 I love that. I love that visual. I mean, I remember we started Python Bites partly as an excuse to keep up on all the Python news and newsletters and all that stuff. And I was trying to do that anyway and realizing that I really wasn’t keeping up on it. But having to for a weekly podcast makes it so I do that more often. Yeah, but I also gained a lot from doing that as well. I started your FizzBuzz book. So you’ve also put these blog posts into a book, and that’s available if people want to buy it as a book. Right. Or an ebook, at least it’s an ebook.
00:03:52 Yeah. And it’s not so much the blog post. The blog post was just a single extended joke about solving in TensorFlow. Because of that, I sort of became somewhat known as a thought leader in the Fizba space. And because of that, I started just kind of collecting other interesting solutions or started thinking about other interesting solutions. And I never really blogged about them because I don’t like beating a joke to death. But then eventually someone on Twitter was like, you should write a book about 100 ways to solve Fizzbuth. And I don’t know, 100 is a lot, but I came up with ten that were pretty good, and I thought I could get a good book out of that.
00:04:24 So I did. So I’ve just kind of broke the virtual spine and actually fascinated with the first solution that you’re talking about. Or maybe it’s the second. I’m not sure where you kind of talk about, I guess, the traditional method of Modulo math, but that there are so many other ways that you can tell whether or not something is divisible by three or five. Was it two or five?
00:04:46 Three or five? Yeah. So sometimes there’s a very standard. So for people who don’t know, Fizz buzz is the problem. You want to print the numbers from one to 100, except if the number is divisible by three, and steady print Fizz if the number is divisible by five, and steady print buzz if the number is divisible by 15, and steady print fizz buzz. And so this is sort of a canonical lowest common denominator programming task that you get people in a software interview to kind of weed out people who can’t write a look of code. But occasionally people will look at this and say, you know what the solution? Checking for divisibility uses the modulus operator. And you could conceivably have someone who’s an accomplished software engineer, just doesn’t know about the modulus operator and maybe doesn’t know how to check for divisibility. And so in that section or in that essay, I went off and sort of pretty extended diversion on what other ways could you check for invisibility if you did not know the modulus operator?
00:05:38 Yeah, and it’s pretty fun. Like the brute force method is something that I hadn’t considered just subtract to find something divisible by 15. Just subtract 15 from it until you get a number between zero and 15. And that’s clever. There’s a whole bunch of clever stuff. One of the things I don’t know if maybe I just skimmed past it or I haven’t gotten to it yet. Do you talk about whether or not it’s a reasonable thing to ask in an interview?
00:06:02 Not directly. I think implicit in the book is sort of a discussion of this. And I think there’s another side of the book where I say that I think it’s probably not a very good measure of much of anything, but I don’t really go into that subject at length.
00:06:17 I would say, okay, I have to admit, I have occasionally asked people to write Fsbas, but it doesn’t really tell you much. Like, for instance, if you get the right answer, you actually learn nothing other than there is like you pointed out, there are multiple right answers, but the canonical Modulo answer, if somebody knows that, then they’ve either just know Modulo or they’ve already looked up this problem and know it’s going to be asked during interview.
00:06:41 You know, one thing that comes up later in the book is that I sort of explore a couple of variants of the problem. And so here’s the variant that’s actually slightly more interesting, which is instead of doing Divisible by three and Divisible by five indivisible by 15, we do power. So we say if a number is a perfect square print fizz, if a number is a perfect hue, print buzz, and if a number is a perfect six hour print fizz buzz. And it turns out that this one is a little bit more subtle than you’d think, because if you try and sort of check, it’s not that easy to check whether something is a square. Actually the naive way, you probably run into floating point issues and so on. There are a bunch of variations that are actually can be interesting. And then the other thing is that given any problem, like given the dumbest problem, you can always add extra layers to it and say, okay, you’ve solved that easy version. Now what if instead of having divisible by three, five, and 15, we want to add another prime factor and we have seven in there? And so you can get in there and say what if we have arbitrary many prime factors and we want to look at arbitrary products of them? So there are a lot of ways you can take these basic problems and actually make some kind of interesting and add layers to them. But I don’t think most people do that.
00:07:50 So I’m looking forward to reading the rest because it sounds fascinating. So have you been involved with hiring people then in the past?
00:07:57 Yeah, especially at AI, too. I was in a lot of interviews, so I have a lot of strong feelings about software engineering interviews.
00:08:03 We won’t get down the path too much. I’ll just play Devil’s advocate and say coming up with good questions to ask people is difficult. It is an art form and not a science to determine whether or not somebody knows how to write software in a 50 minutes interview. I’m not really defending some of the horrible things people have done or I have done, but just like people don’t really have I didn’t learn in school on how to do a tech be in a technical interview. I also didn’t learn how to do a technical interview. It’s not something that was covered.
00:08:33 Yeah. And I think people approach it with different goals. Right. So I think a lot of people who do tech interviews, they’re sort of leading the unexamined life in some sense, and they haven’t really thought about what they’re looking for. They thought, here’s a program I like someone to solve, and I’ll see if they can solve it, but they don’t think about why do I want the person to solve it? So, for instance, there are a lot of problems that you might find in like a cracking interview book where the problem has the following flavor. Here’s a problem that has an obvious end squared solution and that has a very subtle, like linear time solution. Can you find the linear time solution? And so these kind of problems to me, they don’t tell me a lot about a candidate. They tell me either this person memorized this problem or maybe this person is like an algorithmic genius and redirect this on the fly. But they don’t tell me a lot about what kind of code is this person actually going to write. So I actually tend to favor and this is something that I’ve sort of gone to overtime, but I tend to favor problems that have the flavor here’s some kind of algorithm written in terms that maybe a bright ten year old could understand.
00:09:35 Can you implement this in code in sort of a clean, maintainable way so that it’s not so much trying to trick you and it’s not so much trying to delve deep into how much humor from your algorithms class. But it’s like here’s words, can you turn them into code and does your code makes sense?
00:09:52 That’s a good way to do it. Yeah. And what’s the music to me is in most of the programming I do in scored algorithms will be fine because my data sets are small.
00:10:01 One of the things that I’m looking for, actually, I didn’t intend to talk to you too much about interviewing, but being conscious of why you’re asking questions and what you’re looking for I think is important. I have interviewed for both Python people and for C people. And so for when I’m interviewing for Python, one of the things I want to look for is do they actually know how to write stuff in Python or do they just know the Python syntax and they still think and see? And that’s something that we can sort of tease out by making sure they know how contact managers work or something like that. They just don’t want to have somebody show up and start doing for eye and length of that sort of stuff in Python. And on the C side lately, just trying to figure out kind of which model of C Plus Plus do they know? Is that the new stuff or the old stuff? Because it depends on what we need somebody to do.
00:10:50 I work at Google for a couple of years. I mostly wrote C Plus Plus there, but the brand of C Plus Plus that Google wrote at the time and probably still writes was very idiosyncratic to Google, and so it used Google libraries instead of standard libraries and things like that. And I used to joke that even though I’d been writing C Plus Plus for like two years straight, I could never pass a C Plus Plus interview unless it was at Google, because everything else I just wouldn’t know half the stuff.
00:11:19 Quick, what’s your favorite dip tool? Get client, database, front end code profiler, code coverage, visualization tool editor and debugger. For me, it’s PyCharm for all of these. It’s not just that PyCharm can do all of this, it’s that I grab PyCharm for these tasks because it is the best for me that I have found. I’m serious. I also save time by using the same tool on tons of my workflow. And you can too use the linktestingcloud.com PyCharm and try it out yourself. Try a multi change multifile commit with PyCharm and you’ll never want to use anything else. Attach PyCharm to a remote or local database. Same effect. It’s fast and intuitive. Makes working with databases fun again. Dev Tool Same twice. Last week I used it to compare directory structures with subtle differences. Test Runner Same best in class and the list goes on. The Link Testingcode.com PyCharm lets you try out the full power of PyCharm Pro for four months. Savetime use PyCharm. That’s Test And Code. Compaijar’s is actually kind of a rabbit hole.
00:12:24 But a fun one.
00:12:25 And so is interviewing questions. But we also wanted to talk about data science and kind of this melding of software engineering practices and data science. So do you have thoughts about that too as well?
00:12:37 Yes, this is one of my pet causes, which I call Why Data Scientists Should Care about Software Engineering Best Practices. So I gave a talk a couple of years ago at the Southern Data Science Conference on this theme vaguely. And then I mentioned that I gave this talk about Jupiter notebooks, Jupiter Con in 2018, and I had a lot of complaints. But a small part of that talk was around the theme of notebooks actually make reproducible science more difficult. People push back a lot of this because they found it counterintuitive. They seem to think of notebooks as being these documents are reproducible computations. But it turns out that a lot of the things that when you put your work in a notebook, it can actually be really difficult for someone else to run it because paths are hard coded everywhere and there’s no list of requirements and things like that. And so anyway, because of that, I sort of ended up going down this path of talking to a lot of academic researchers, actually about the same topic and why researchers should care about software and best practices. And so I went to a couple of conferences and spoke about this. But also I partnered with a lot of researchers, and I really tried to hammer on them. Look, you need to be doing code reviews on your experiment code to make sure it’s right. You need to be writing unit tests on your experiment code to make sure it’s right.
00:13:51 You want to consider using Docker images so that anyone can run your code anywhere you want to give clear instructions on how to run this and so on.
00:14:00 It’s funny, about maybe two years ago, I spent like three months embedded with a team of researchers helping them do some research, and we ended up getting a paper out of it. And that’s my publication that I have my name on about reading paragraphs and understanding what processes they’re describing. But if you ask me, what was your contribution to the paper? Probably the biggest contribution was making the researchers write unit tests. That’s what I did every day, making them do code reviews, making them do unit test and code finding bugs before they ran experiments. So, yes, it’s a huge pet cost of mine.
00:14:31 So the unit test part I see. I totally once people get the hang of writing, I’m guessing some of the data manipulation processes that data science might use are things that are fairly not too complicated. To unit test, you get something that takes some input and produces some output. You can have expected input and output and test for that. The larger level stuff seems like it would be more difficult like to try to do system level testing.
00:14:55 Well, once you’re talking about system level testing, then you’re possibly no longer talking about unit tests. But the thing is, especially in terms of research or think about model development, I want to develop a model. There’s nothing necessarily system level about that. I can write tests for my data pipeline, and that makes sure that my data coming in looks like I expect it to. I can write tests for my training pipeline to make sure that model and I can write tests for my model itself. And how do you write unit tests for, like, a deep learning model? Well, a couple of things. One, you feed it sample inputs and check that it runs a forward path without crashing. That in itself is actually a really valuable test, because if you’ve messed things up with dimensions and inside it’s going to crash. So you’ll catch that, then you check that the output has the right shape, has the right format, and then what you can actually do is you can say if you’ve parametrized your model and maybe you have some thousand dimensional transformer, you instantiate a tiny five dimensional version of it, and you train it and you make sure that the loss is going down over $100 or whatever. And so if you have a model that forward pass runs correctly, it has the right shape, and you can actually train it with very low capacity to, say, memorize two training instances, then those give you a good idea that things are going right. They’re not a perfect test, obviously, because there’s fuzziness and randomness here, but they rule out a lot of really common failure cases.
00:16:10 Yeah, that’s excellent. I had a question of how do you test pipelines, and I don’t know enough about pipelines to tell them how do you test anything?
00:16:18 Well, so my answer to how do you test anything is you got to know what does it look like if it’s working, and what does it look like if it’s not working?
00:16:25 And throw stuff in for that. But often the real question is that people come to me for is because they don’t know how to answer those. They don’t really know what it looks like if it’s working or what to look for if it’s not working. And I can’t really help with that with different fields. You got to figure that out on your own, I guess.
00:16:41 Yeah. And once you’ve spent enough time in the field, you should hopefully have a good sense of what things look like when they’re not working. Once you’ve tried to train hundreds of deep learning models, you know a lot about the failure cases of deep learning models, and you know a lot about the mistakes that you’re likely to make.
00:16:56 You worked with. Some researchers were trying to get things like unit tests and code reviews into the research field. How was that received? Were they resistant or welcoming of those practices?
00:17:08 It was a little bit of a carrot in the stick. They wanted my help writing their experiment, and I said that I couldn’t help them unless they agreed to work the way that I wanted to work. So they decided that having my help far outweighed the inconvenience of having the right tests and having to do contributed, which I think it probably did. I would say that there’s a real division. Right? So you have some researchers who have a pretty solid background in software engineering, and they get it instantly. They say, yeah, of course, that makes sense. And then you have some who are really used to sort of doing things a little bit fast and loose. And there you have I think you have more of a sales job, but I find it can be a slightly time intensive sell, but it always works because at some point you catch something right, you make the right tests and they find a bug and make them write more tests and they find another bug. And after a few of these, they’re like, wow, these are bugs that I would have wasted a day running experiments on didn’t work, or I would have wasted a week launching thousands of dollars worth of jobs that were broken. And so I think once people really experience this and say, wow, look at all these bugs that I’ve caught, either by writing tests or by having extra sets of eyes on my code, then they’re pretty enthusiastic about it. Now, do they keep doing it after I’m not in the room? I’m not entirely sure. But at least while I’m there, they seem to be okay with it.
00:18:23 Now somebody out there is listening and they’re thinking, well, I do data science, and I don’t know what that means do data science, really, but whatever that means for them, and they want to try to pick this sort of stuff up, but they don’t have a joy to come work with him. What sort of stuff people need to learn about and where do you go for that information?
00:18:39 So one thing I mentioned, the second edition of Data Science From Scratch came out last year in 2019. And one of the big changes that I made for the second edition was a really strong emphasis on testing throughout the book. And so this is not like writing fancy test harnesses or test cases or anything like that. But it was we’ve just written a function. Let’s write some assert statements to check that it’s behaving the way that we think it should. And when you’re writing a book, it’s a little bit easier because you can just use the search statements as the actual documentation of what the function does. I mean, obviously the function should be somewhat self explanatory as to what it does. But instead of saying, here’s an example of how to use the function, you just say, let’s assert that if you call the function on for the results and so on. So if you read that book, you will get a sense of how I think about testing functions in a data science context or testing data science context. Now it’s a very simplistic way and that it’s really just like a search statement, a search statement, a search statement. And so it doesn’t tell you like here’s how to use Pi test or anything like that. But it does give you the flavor. And similarly for the FizzBuzz book too, again, throughout that book there is a big emphasis on testing the things that we do. And so there we do make a tiny test harness for the physical problem itself. But also as we implement other things along the way, we test all those too. And so I hear from people and they say, oh, I kind of get it because just the simple assert based testing is such a big step above zero that even if people don’t go from that to using pytest or whatever, it’s still a huge win. And obviously you hope that they actually use Pi test or something and run their test cases programmatically. But just even getting into that mindset, if I’ve written something, let’s make sure it’s right. I think that helps.
00:20:28 Okay, I’m actually confused right now, so I write a lot of Pipest code. It’s got a search all over the place. So what am I missing? Why is it not a surprise testing? I don’t know what a surprise testing means.
00:20:39 Then I guess what I mean is just basically instead of writing test files running with pytest, just sticking some asserts in your module.
00:20:49 You know what I mean?
00:20:50 So like kind of bulletproofing your input and output or.
00:20:53 No, what I mean is more of the following. Let’s say I have a math module. Right. And so I define a math square root function, or I define a square root function. And right under that function I just put a couple of searches. That function is working the way that I expect it to, so that if you were to import the module, I guess those search would run, but they’re not isolated. And like here are my test cases and there’s not an easy way to programmatically run them like with Python. That makes sense.
00:21:18 They’re simpler tests, there’s no classes, there’s no test functions. It’s really just like here’s a bare assert, it’s asserting something about that function.
00:21:26 Okay, yeah, some people throw those in like if name equals main as well.
00:21:31 Yeah, that’s a reasonable way to do it as well. Except that again, pytest won’t catch those automatically, right?
00:21:36 Yeah, right. And like why? Because Pipes is so easy. You’re using pytest for the fiscal, right?
00:21:43 I don’t know that I do. In the book, I think the book is mostly just a search statement. There’s one chapter that uses hypothesis kind of in a one off way, but for the most part tests are just kind of again, bare search statements and not pytest.
00:21:55 Okay, in your own work, do you use hypothesis for either Pi test or hypothesis for testing?
00:22:01 So in my own work I use Pi test quite extensively. And in my own work I don’t really use hypothesis. It’s one of those things where I love the idea of it, and then I haven’t gotten my brain into the space where I think about the code I write in terms of generative tests.
00:22:16 Yeah, that makes sense. With Python, I’m still at the point where my commercial code that I’m working with functionally understanding that things are working with Python, and if I expanding them into different sets of data with parameterization is usually enough for me. And then when I think about maybe I could expand some of the test to use hypothesis, I get on to some other problem that I need to solve.
00:22:39 Yeah, and I think some of the data science and machine learning instance examples, it’s harder to where those kind of tests shine is where it’s easy to define invariance about your functions. And a lot of times, if you want to check that, here’s a model I want to train correctly, and I want it to have the right fields in its output. Those are not so much like invariants that are easy to capture across a range of inputs and outputs. But I haven’t spent a lot of time thinking about this problem, so it’s possible that there are some good answers that are not occurring to me.
00:23:10 Well, I’m excited, actually, if I did know about data Science from scratch, I’d forgotten about it, and I’m excited that the second edition has a testing focus, and that’s a neat thing, or at least there’s some test emphasis on it, and I’m excited to take a look at that. Now, I realize that FizzBuzz was sort of a fun project for you, but it does put you in a set of people that have both written for a publisher and self published. Do you have any compare and contrast what you prefer, or are they different for different purposes?
00:23:40 They’re different for different purposes. So when I wrote Data Science from scratch, so a couple of things. One, I had not accomplished much of note in the industry, and I did not have much of a name for myself, and I did not know anything about selling books. And so if I had written it and put it out as a self published book, it probably would have sold very few copies. And so it worked out really well for me to work with O’Reilly, and one you get some cachet of. O’reilly is a well regarded publisher in the textbook space, and then they help promote it also lends me some credibility. Not only did I write a data science book, I wrote a Riley book on data science, and people like, oh, that sounds awesome. So anyway, with this book, why did I self publish it? A few reasons. One is, it’s an extremely strange book.
00:24:30 It is, right? If you write a book about data science, people who think I want to learn data science will buy that book and read it. Or if you write a book about Pi test. People who want to learn about Pi test will learn about that book and read it. Now there’s no one who wants to I mean, there’s not no one, but there are not a lot of people out there who are thinking, you know what? I would love to read a book about FizzBuzz. Let me go find one. So it’s a weird book in that sense. And then it’s also a weird book in the sense that it’s not really about anything in particular other than the way that my mind works. Like, here’s ten crazy, some are crazy, some are not crazy. But here’s ten interesting ways to solve this problem. And I’m going to use each one to just basically dump a bunch of thoughts and information on you. And so I’m really happy with how it turned out. I think it’s a really neat book, but I think it’s also a really unusual book. So when I try and think, can I imagine that as an O’Reilly book? Why would O’Reilly publish a book that’s ten weird essays about Fizbas and other publishers as well? So because of that, and I didn’t really engage them or anyone else on it, but because of that, I thought, you know what? I’ll just throw it on Lincoln or whatever. And then the other consideration was selling and marketing a book is a skill, and it’s a skill that I like to think that I’ve learned some things about over the past five years or so. And so in other ways, it’s kind of like a challenge for myself.
00:25:50 Can I get out there and sell this on my own?
00:25:54 And it will be interesting to find out what one option is. You can get on podcasts and start talking about it.
00:25:59 That’s a good idea.
00:26:00 Yeah. So just putting that in suggestion out there now, I’m really enjoying it. One of the things I don’t know if you’ve done other problem sets in many ways, but I’ve heard this from other people. I’ve heard people, for instance, if they’re learning a new language, will solve a small problem that they’ve solved a hundred times already and try to solve it in the new language, and then maybe even a handful of problems that they normally do like that just so that they don’t have to come up with a problem. They just have one that they always resolve or just try new methods like physics. I haven’t gotten completed with it, but it’s fun to try to find the least efficient way to do it. And it’s also fun to do the like, can I do the entire thing in one line of code? And I think I got that one, but it’s not readable.
00:26:49 I think it was a Python version of fiscal, but it’s not really helpful other than it’s a constraint. This is the problem you’re going to try to solve. And I think it’d be cool to have more people write essays like little books or even self published books on essays on different ways to solve one problem.
00:27:05 Yeah. I mean, I think probably the biggest challenge in doing this was how can I come up with ten ways that are interesting but also different from each other and that allow me to kind of explore things in different directions. So, like, one of the more unusual solutions basically involves random choice. And so you can solve FizzBuzz using random choice. But in order to sort of explain that and explore that solution, you really need to understand quite a bit about, okay, what is the Python random module and what is it actually doing, and how do random numbers in Python actually work? And so it sort of gave a good platform for sort of going off in that direction and really thinking about those kinds of things.
00:27:53 Yeah. That kind of reminds me of somebody who was telling me about a Sort algorithm. Can’t remember the name of it that just randomly selects elements and then checks to see if the result is sorted and if it’s not, redo it. And it just does that forever until it gets a sorted list. I can’t remember the name of it, but that’s an amusing sort of algorithm, not an efficient one. So what’s next on your what are you thinking about now for next? Things to write about.
00:28:17 Oh, Jeez.
00:28:18 Like I said, right now I’m thinking about selling and marketing.
00:28:22 Oh, yeah. But those are different parts of your brain, though. These are interesting things to learn about.
00:28:27 Exactly. So right now I’m thinking about how do I get this book in front of people? I’m thinking, should I make a print version and sell that or would that not be valuable?
00:28:37 Because I’m very excited about the book. I’m very proud of it. I think it turned out really well, but that’s only like half the equation. And the other half of the equation is how do I get other people excited about the book and how do I get them to read it? So that’s kind of what I’m focused on and then also work.
00:28:52 I’m excited about the data science stuff, and I think I’m probably going to do is get the data science from scratch book, the second edition, and start reading that. And then I’d love to have you on again to ask you more questions about data science because it is something that’s kind of everywhere now. And I like that from scratch part. So is it okay for me to read this? I don’t know how many data science background other than interviewing data scientists so far.
00:29:19 Do you know some math?
00:29:21 Okay. Then you’ll be fine to read it. It’s basically the target audience is someone who knows some programming and some math well defined, some. Well, I’m sure you know way more than enough Python to read the book.
00:29:35 And then for math, it helps if you understand algebra, linear algebra, a little bit of calculus, things like that. Okay.
00:29:41 I might be in trouble.
00:29:42 Yeah. Lots of people have read it successfully, so I think you probably can, too.
00:29:46 Yeah. Okay. I guess a random sort of tangent. People that feel like they got through math in high school or College or especially through the College, like CS program or physics or whatever, but they didn’t feel great about how they did. Is that sort of a litmus test of maybe data science isn’t for them, or do you have any opinions on that?
00:30:06 I wouldn’t say that’s a litmus test for anything. I mean, a lot of people are like that. I went to math grad school, and I wouldn’t say I really understood calculus until I taught it. And then I started really understanding it. I took it in high school and there I didn’t get it at all. I was basically manipulating symbols according to the rules in the textbook. And then I used it in College. But again, I didn’t really understand it.
00:30:28 Again, it wasn’t until I was in grad school and I was teaching calculus. I was like, oh, now I really deeply get this. So I don’t think that having taken it and not really dropped it, I don’t know if that means much of anything other than maybe it’s time to revisit it for data science, the more dangerous part is probably less the math and more the statistics, because you have a lot of data scientists who are doing statistical analyses and statistical tests. If you don’t understand what those tests are doing and how they work, it’s easy to get them wrong. And so that’s I think you’re right that most data scientists are not doing their own calculus, and they might not even be doing well more than algebra, but most data scientists are not doing their own calculus. So it’s true that not knowing calculus is really not going to be that much of a barrier, just statistics.
00:31:10 I wish we had more emphasis on that. I think we should be teaching statistics way earlier in school and having it not just once, but like several statistics classes throughout before somebody graduates high school so that everybody gets an understanding of that.
00:31:23 Well, those are two different things. Making everybody take a class or making everyone understand it.
00:31:27 Well, more people understand it.
00:31:29 At least you got to teach it well then and then you need people who can teach statistics.
00:31:32 Well, that’s true. I guess that’s very valid. Well, I’m keeping you a long time, and I had fun with this conversation, but I want to say thank you for coming on the show.
00:31:42 Thanks for having me.
00:31:45 Thank you, Joel, for that fun discussion. Thank you, Pie Charm, for sponsoring the show. Try Pycharmourself at testandcode.com PyCharm savetime use PyCharm. That link is also in the show Notes at testandcode.com 126. Thank you, Patriot supporters, for continuing to fund the show. Join them by going to test and Go.com slash support that’s all for now. Now go out and test something.