Data science, data engineering, data analysis, and machine learning are part of the recent massive growth of Python.

But really what is data science?

Vicki Boykis works on projects in machine learning and data engineering across a variety of industries, and joins this episode to help us understand really what is data science.


Transcript for episode 57 of the Test & Code Podcast

This transcript starts as an auto generated transcript.
PRs welcome if you want to help fix any errors.


Welcome to Test and Code, a podcast about software development and software testing.

What is data science? What does the data pipeline look like? What is it like to do data science, data analysis else’s, data engineering, things like that? Can you do it on your laptop, or do you have to do it in the cloud? How big does data have to be to be considered big? We cover all these questions and more with Vicky Boykis. She’s been doing this for a while and answers all of my questions.

I learned a lot, and I know you will, too. Thank you to Digital Ocean for sponsoring this episode.

Today on Test and Code, we have Vicky Boykis. So, Vicky, thank you so much for coming on the episode.

My pleasure. Thanks for having me.

And could you introduce yourself who you are and what you do?

Yeah. So I’m a senior manager at Cap Tech Consulting in data science and engineering, which is a mouthful. And it basically just means that I do any type of projects that involve both data science and the back end machine learning engineering work to make all of that data work happen. And I live in Philadelphia.

Okay. The opportunity to go to Philadelphia once, but I’ve never been. Do you work on projects for this company, or is it a consulting thing where you work for other different projects?

We have clients in a bunch of different sectors. Our biggest clients are in finance, healthcare and retail.

Okay. And I read I think it was just a joke, but a comment that data science is just a new word for statistician.

Is that just a simplification and a joke, or is there some truth to that?

Are you trying to start a Twitter war?

That’s hard to say. I think it’s both true and not true, which is like a very consulting answer, because one of the things consultants always say is it depends on the situation.

I’m looking from the outside in. And I still don’t quite get what data science is. So I’m sure other people are kind of confused, too.

Yeah. So I see data science is having a couple of different parts. The first one is the part that you described, which is the statistics part, which is you have some data perform analysis on it, and you’re trying to figure out what’s happening. For example, if you’re like a big e commerce retailer, you want to figure out why people are buying your product or more importantly, why people aren’t buying your product. Right. That’s kind of the data science side. And so that’s how data science started from statistics in the late 2000s, which now seems like a century ago. But now what’s happening is that we now have a lot more data. So the era of big data, which actually no one says anymore, which I thought was pretty interesting, too. Nobody says what nobody says big data anymore. I don’t know if you’ve noticed that. But big data has fallen out of Vogue, too, even though the term is only like four years old at this point.

Okay. Is it just replaced with data then?

Yeah, I guess people just talk about people talk about data streaming pipelines now.

Okay.

That was my extreme consulting voice.

So going back to that. So that’s the data science part. And then what’s happened since then? So people started working with that data, and then they realized that they needed to reproduce those models over and over and over again. So you don’t want to just figure out why people are buying more extra why you want to figure out why they’re doing that in season. You want to present these results real time to executives on their phones. So you want them to be able to refresh the model very quickly. You want to be able to pull that data together quickly. And so what my theory is what started happening is these data pipelines started being built out on the back of data science. And there was already obviously, there was both statistics and ETL. Etl. Now the sexy word for ETL is data engineering. Right. So we now have data engineering and data science and machine learning engineering, which is kind of bridging the gap between the two. And so data science is no longer just statistics, but I see it more as a moving towards kind of like almost a software engineering paradigm where you have to understand how all these parts work together.

Yeah. So what does the data pipeline look like?

Yeah. So it could be. So again, if you’re a retailer, an online retailer, you probably have browsing data of people who come to your website and you keep that data in logs. So those logs go to some back end server, some sort of queuing system. And then you have those logs either in a queue or you probably don’t store them in the database, but you pipe them through a new queue and you pipe them into what is known as a data Lake. And so a lot of these data Lakes right now are popping up in the cloud in S three specifically. So you dump this unstructured data out into S three, and then you figure out what you want to do with it. And so what you usually want to do with it when you try to build predictive models, if you want to put it in some sort of structured system, either a really large database or you want to put it through some sort of analysis system, such as Spark, for example, that will run the algorithms.

Okay. This does sound very like software engineering to me at this point.

Yeah. So it’s interesting, I think that most of the work so if you talk to data science, they’ll still tell you that 20% of their work is actually doing the models, 80% is cleaning the data and making sure that the data makes sense. So I think most of the work in machine learning in general is getting the data to where it needs to be. Especially in large companies.

This can get really, really complicated really quick. So if you think about your log data coming back from your ecommerce site, it’s possible that it’s going into three different places. And none of those three places have a complete view of the event stream. So you have to go to a fourth place to reconcile that stream. And then you have to spin up your data warehousing solution or your data Lake in the cloud, and then you have to migrate all of that in a single place. So a lot of the work is spent getting the data into shape for you to be able to see the end result, which is a predictive model that everyone always talks about.

Okay.

And so in the jobs that you take on, what is the end result look like? Is it something that is continually running when you leave and people can just maintain it?

Yeah. So that’s the end goal for a lot of these things. Right. So sometimes it can be that people just genuinely want to know what the answer to this is. So sometimes the end result will be a report or proposal saying, this is what this run of the model looks like. It’s up to you to decide what you want to do with it next. And a lot of times we do try to leave functioning systems in place saying, this is how you run this model, and then this is the maintenance of the model itself. So the goal ultimately is to get into production, which I think can be harder than you would think in data science, as in all software systems. Right.

Yeah. And just kind of for perspective, how long, calendar wise, are you working on a particular problem?

Usually it can be anywhere from small, short projects where we come in and we do assessments on what needs to be done. So those are usually can be eight to ten weeks, or we have projects that are 18 months, 24 months, depending on the scope and the size. And usually with larger companies, they’ll be a little bit longer.

Wow. Those are big projects then.

Yeah.

And I don’t know if anybody else cares about this, but how many people are working on this together? Are you using solo on this or is it a team of people?

So I’ve always worked with a team. Sometimes it’s been that I’ve been the only analyst and process has been the analytics report. But usually I work within an engineering team, and the teams can be anywhere from like three to seven to eight people.

Okay. This is fun.

Yeah. I love it.

I probably wouldn’t be doing it if it wasn’t.

It’s something I used to ask a lot, and I’ve been asking more and more now is what the fun parts are people’s jobs. So which part do you like the best?

Yeah. So I think for me, I tweeted this a couple of days ago, but I basically had this tweet from 2012 or 2013 where I said I understood everything about Hadoop and then I quote tweeted myself and basically said I know nothing. So the fun part for me is really digging into distributed systems and seeing how all parts of the system work together. And what I’ve learned about myself is that I like doing both the data science part and the delivery data engineering part. So I like seeing a project from start to finish, like from where the data ingest hits whatever part I’m responsible on getting that data clean, getting it to the state where it needs to be tuning that system. So it’s a production running system, and then finally visualizing the data from that system and being able to give people, business people reports from that system. So I really enjoy visualizing how that whole process needs to be architected and come together and packaged into a single piece of software or product.

Yeah, that does sound fun.

Thank you to Digital Ocean for sponsoring this episode. Digital Ocean is the preferred cloud platform of hundreds of thousands of innovative companies. Digital Ocean makes it easy to deploy, manage and scale applications with an intuitive control panel and API designed for developers. Get started with a free $100 credit towards your first project on Digital Ocean and experience everything the platform has to offer such as cloud firewalls, real time monitoring and alerts, global data centers, object storage, and the best support anywhere. Join over 150,000 businesses already creating amazing things on Digital Ocean. Claim your credit today at testandcode.com. Digitalocia.

One of the things we’d mentioned we were talking beforehand was that there’s a couple of different activities that’s often involved with this, and you had brought up a couple of axes that maybe they’re related the models versus cleaning and then also development versus experimentation. Are those related concepts?

Yes. I think this is something that we really started to see crystallized and maybe in the industry as a whole over the past couple of years is the idea that there was a blog post a long time ago and again a long time ago, data Land was like maybe four years ago saying that there are two types of data scientists. There’s an A type and a B type, and an A type does analysis and a B type does building. And so if you’re an A type, you’ll do stuff like analyze data and Jupiter notebooks, present those results, do stuff with R, present those results, and do it in a self contained application. And the B type would strictly be confined to building out ETL pipelines that power all that data delivery. I think what we’ve started to see, or at least what I’ve started to see in my personal experience, and I think this has been backed up by some talks that I’ve heard recently is that data scientists now need to understand kind of the full breadth of that work and be able to do both pieces of that work and what that means for data scientists in particular.

And just speaking from my background, I have a background as a data analyst. And so the way I got into data science was I got tired of waiting for ETL jobs to finish, and I kind of started digging around in Hadoop on my own.

Okay.

And I started running my own query results. So that’s how I got into software development.

What I’m starting to see is a lot of people kind of moving towards that middle ground where they’re moving more from analytics per se to also having to know the basics and the fundamentals of software development in order to build out these pipelines and work on these open source tools.

Like a stage of a data pipeline.

Is that going to be a bit of Python code or a bit of R code or something like that?

Yeah. More notebook type development where you have some data, you maybe do some cleaning of that data in a notebook. You start to see what it looks like, and then you fit a model to that data, and then you say, okay, maybe this model isn’t giving me the results I want. I’m going to try to fit it again, or, oh, maybe this field has nulls. Maybe I’ll drop it, or maybe I’ll drop this variable. So a lot of a type work I see as really experimental in nature and really essential to the first part of building out data product. And then after you get all that a work solidified, that’s kind of where the B word starts of. Okay, well, now we want this data. We want it to be streamed in on a daily basis so that this model that the data scientists built can actually run.

Okay, interesting. So this is following a similar path that I’ve been watching. Some of the software development, the extreme programming first gone into TDD and stuff.

There’s a flow towards thinking about development in three stages. There’s the learning stage where you’re learning how it all works and how all the pieces fit together and doing the first implementation and then building something that is maintainable. And then the third phase being maintaining and growing. It looks like data science is sort of following the same path.

Yeah, I think that’s true. So data science is a field by self. Statistics is a very old field, but data science kind of this combination of statistics and software is still relatively new. So we’re still trying to figure out what it means to be a data scientist. And it’s probably like the question that fuels 99% of data science tweets and blog posts, my own included.

But I feel like we are moving to that phase where we are more software inclined or we realize that we have to be more software development literate as opposed to statistically literate to get our models out there into the world.

When did you start picking up software?

So this was actually maybe two or three years into my career. So I have kind of a nontraditional background. So I was an undergrad in economics at Penn State University. And the reason I picked that is because I didn’t want to be just an English major and I didn’t want to be a math major. I wanted to do both. And so you can kind of see this play out in my later career, where data science is also the combination of having to explain stuff to people and then analyze that stuff. So I did Econ, and then I started working as an analyst for an international consultancy down in DC.

That was just a fancy title for doing a lot of work with spreadsheets, just like everyday spreadsheet work. And then I heard about this thing. And so this was about ten years ago now. I heard about this tool called R that was open source, and people were doing a lot of stuff with it. It was much less expensive than SAS, which we had. And so I started to slowly creep into that world. I started using Access, which I look back now, and that was like a terrible time in my life.

And then the first time I really saw the power of doing software development as opposed to straight analysis, is when I was able to have a script that scraped the database, like an open source database that wasn’t well organized for some data that I needed. And I was like, wow, okay, this thing has legs. Like, I really should learn how to more apply software development to my workflow.

That’s great.

And then a couple of jobs later is when I was an analyst and I got exposed to Hadoop from the ETL side. And I said, okay, I should really learn about this because it looks like it’s going to be big. So it was kind of accidental. And I consider it probably like the luckiest accident that I’ve ever had that I was exposed to this stuff early in my career and I could then move into software development and data science.

Okay. Now I was looking at some of your blog posts, and I don’t remember how long ago this one was. But you did a blog post article on doing analysis on your laptop versus running things in Hadoop.

Yeah. So I think that was maybe like, what, two years ago? Three years ago.

Is there a Cliff Notes version? Is a laptop better on some problems than pushing things to Hadoop?

Yeah. So the closest version is for data that’s of a smaller size.

Let me pull up that slide because I have a really good slide of like one big data is there used to be a really good site called your Datafitsandram.com, which doesn’t exist anymore, which I’m really sad about. But you could go there and it had the specs of the beefiest intel single server machine, and you could put in your data size, and it would tell you if it fit on that machine or not, and then you don’t need a Hadoop cluster, basically. So I kind of used that as my inspiration. And the blog post talks about how when you start getting to, like 10GB of data, you probably want to start using something like Spark and relational databases like Postgres. And then once you get to 100GB of data, you probably definitely want Hadoop or Spark or something else to process that data. But for anything less than that, actually setting up the system creates more overhead than running the system.

Okay, so you’re saying that people don’t really say big data anymore, but what is big? And if big is greater than 10GB, that’s actually very big. Yeah. As far as I can tell. Is the workflow different? I mean, do you work differently on Hadoop than you do when you’re working locally?

Yeah. Well, for sure. The thing you have to understand about if you want to process data in a distributed manner is you start to have all sorts of trade offs. So the biggest one is probably the cascade.

Yeah. Okay.

The issue is now you have multiple servers where you need to coordinate the data, and so the data might not be consistent from one server to another, or it might not be available, or you might have servers that don’t communicate from one server to another, and then they might and then you have replicas of data. So now you have to start taking all these things into consideration.

Okay. Interesting. What are the challenges going forward, do you think?

An interesting trend that I’ve seen over maybe the past three or four years is that now too much data is being collected or maybe too much of the wrong data. So I’m thinking specifically of the liabilities that large companies like Google and Facebook are experiencing where collecting too much user data is actually a liability for them. And so I think a challenge will be learning to work smarter instead of working harder and learning about what kinds of data you actually want to keep.

Okay.

The other issue I can think of is a lot of companies will now put their data in cloud services like AWS or Google Cloud, but then the settings can be sold to sometimes that they leave their data open and vulnerable. So AWS has done a lot in addressing this lately. It used to be that you could put your data in an S three bucket, which is just like the best way to store data in S three, and you put your data in there, file, object, store, and then it would be open to the world unless you said otherwise. And so what people do is they test to see what comes out like a bunch of different organizations, including political organizations, e commerce organizations, advertising companies. They just had these buckets open to the world. So that turned out to be a bigger challenge than actually analyzing the data.

Turned out that that’s not a good idea.

Right? I’m pretty wise to that and set up a bunch of strict rules. And now the default is your bucket is closed and left otherwise. But I think now that we’ve collected all this data, managing and figuring out what to do with it, especially with GDPR for European companies, all that stuff on the horizon, seeing what kind of back pressure that has on US legislation, all that stuff will be what I’m watching over the next couple of years.

A lot of the stages, actually, I’m just guessing here at least some of the stages of all of this is reducing the ten to 100 or more gigabytes of data down to a single answer or a few bits of data. Is some of that reduction getting pushed earlier into the pipeline, even in like maybe collecting less?

I think it’s not. So the thing is, I think we’re actually now collecting more and more, and one of the reasons is because we have such good tooling. So now that we have things like Costco, for example, or Kinesis on the AWS side, we’re able to collect a lot more and we don’t notice it as much and it’s much cheaper to store. So I agree with you, that is a challenge to then get from that whole sea of data to a single answer that an executive wants.

Yeah. I think it’s interesting that I’d never heard the term data Lake before.

That’s a very good description for just a lot.

Ironically, it’s also become I’m based on this sometimes on Twitter, but it’s also become known as a data swamp lately because people and there’s a lot of different, funny water analogies for data pipelines, whatever. So people will throw data in the Lake but then not manage it or curate it because it’s so cheap to store now. But then the problem is who manages the Lake? And how do you enforce that? You only put this kind of data in the Lake or whatever.

Okay. Why a swamp?

Because it’s a dirty data or it’s dirty and it stagnates and there’s no one really curating it. So if you have a data, if you really need someone or some process that comes in and curates all this data tag so you can actually see what’s going on in there.

Okay. So I assume that it eventually just gets stale and gets thrown away.

That happens sometimes, too, something that I heard early on in my day career. So I was doing a lot of data reporting. So just manually putting together reports of stuff that happened overnight, copying and pasting it and sending it out as an email. And I told my boss, I’m like, I’m tired of sending these out, what should I do? And he’s like, stop sending them and see who emails you being annoyed.

He was kind of tongue in cheek about it. But that actually does work because a lot of the time people will ask for a lot of data, but they don’t actually know what they need or know what they want. So that’s kind of up to the data person or the data scientist to help define that as well.

Interesting. So you came from economics and got into data science.

Do you refer to yourself as a data scientist or something else?

Yeah, actually, I don’t know what I am anymore. It’s pretty interesting because I do both sides of the spectrum. So I do both data science and data engineering. So a term that I’ve heard coming up over the past couple of years, which I think is pretty accurate, is machine learning engineer. And so I’ve started to see this term surface at a lot of larger companies, especially in the tech space in Silicon Valley. But I think it’s more accurate because you’re both doing software engineering and you’re doing machine learning and data science. And I think for me, if I had to describe myself, that’s probably the role I would be most accurate. Right now.

I’ve seen heard stories of people coming into data science from lots of different fields. They just get faced with large amounts of data and having to learn how to deal with it. On the other side, people that are with the background and software. Does it make sense for somebody with the background and software faced with data problems to start learning about some of these techniques and data pipelines and whatnot I firmly believe that the interchange should go both ways.

And in my fantasy world, all software engineers would understand data and all data people would understand software engineering. So I’m kind of trying to encourage everyone to move towards the middle if it makes sense for their business case. So, for example, if you’re a software developer and your company does a lot of stuff with data and you’re interested in getting into it. Yeah, learn a bit about it. Do some reading. My favorite probably book that doesn’t have anything to do with technology is how to Lie with statistics. So that would be my first recommended read. And then I can add some more. It basically talks about how to work with data, why data is misleading, and how to think about data, how we can misread data. So that doesn’t have anything to do with any specific tech stack. It’s just how to be data literate. So, yeah, I would highly encourage it.

That’s a decent good start. Now, if I want to start playing with actual some tech stack or something, where do I start? There? Do you have any ideas?

I would say the first thing is to find a problem that you’re interested in exploring. So all data work usually starts with a question. If you’re in a professional setting, it usually starts with a business question of some business stakeholder looking at a graph and saying, why is this graph going down? Or Why is this graph going up? Or we want to spend more money here, should we spend more money here? Or we want to open this new product? We should investigate here for you personally. That can be almost anything. So just today I was listening to this podcast called 99% Invisible that talks about like really interesting, just weird stuff that you don’t think about. And the podcast was about emojis, like how they get started and how they get approved by the Unicode Group.

Interesting.

And I was thinking, and so one of the things that happens in the process is emojis are approved, but nobody says how they have to be drawn. So each company, Apple, Google, Facebook has their own team of artists that draw and propagate these emojis. And so I was thinking, could you see which emojis are the most similar between the biggest companies that produce emoji sets? Like, which emojis do people come to most consensus on? And so you could probably do something like that with neural networks. That’s a pretty complicated problem.

But some other things you could do, if you have a Fitbit tracker or a step tracker, you could see how many which days you’re tracking steps on. So there’s tons of data based problems in your day to day life that you think about. Like, can I quantify this and can I study it and do an analysis? And if you can’t think of any of that, you can just go to Kaggle.com. They have a lot of data competitions, but they also have a lot of really interesting data sets that you can perform analysis on. And what I like about them is there’s a ton of people also working on these data sets. So if you don’t necessarily come from a data background, it can give you kind of like a community and a guide on how to go through stuff step by step and what to look for in analysis. And as for Tech Stack, which I didn’t really answer, but I would just start with a Python podcast.

I would start with Jupyter Notebooks, which are a fantastic tool for importing data and Jupiter notebooks, Pandas Map, Plot, Sci-fi and NumPy. And you can just Google for Jupiter Notebook examples. And there’s tons of examples out there for how to go through and import your data, do an analysis and how to reason through the analysis.

Okay. When you’re doing setting up Jupyter Notebook, are you then using it as a running job to generate data over and over again?

Usually what I’ll do is I’ll pull in the data set so it’ll be hosted at a URL or it’ll be on my desktop or something. So the data sets that I work with on my own personal basis aren’t that big, so I just pull them in as a CSV file. Don’t do anything super fancy with them. I have done things before. For example, I have not strictly a data project, but I have this project called Soviet Artbot, which just tweets socialist realist pictures.

And for that I scraped a website or no, I didn’t scrape it. There was an API. I hit the API for the website, and I just downloaded all the pictures.

A lot of other sites have APIs. You could hit an API through a Jupiter notebook, but I wouldn’t use it for something that you’re going to put into production, but I would absolutely use it for exploration. Okay.

This gives me lots of places to go on. And oddly enough, I was forgetting that I’m recording this, so I was writing down notes, and I don’t need to do that. I can just listen to this again.

Yeah.

Yeah. That’s a lot to think about. And that’s one of the things I find amusing is this notion of what is a data scientist. And it’s probably a lot of different things depending on what kind of data you’re looking at. That shouldn’t be too surprising that’s the same as software engineer. A software engineer running websites is a lot different than somebody building, like, microchip controllers for heart monitors and stuff like that.

Yeah. And I think that’s a good analogy because I think we debate this a lot in the data science community, and we don’t really see we kind of see software engineering as like this monolithic thing. Maybe it’s closest to web development, which is absolutely not. And probably software engineers see, data science is this monolithic thing. So that’s a good myth to dispel.

Yeah, it depends a lot about in your field.

One of the discussions I often have with people is how much math do you need to be a software engineer? And it’s obviously a trick question. It depends on what field you’re doing software development for. If you’re developing for communication systems, you might need a lot more and different math in economics. It’s different math.

Yeah.

But you probably had a ton of math to get an economics degree, right?

So, yeah, we had statistics. Econometrics was probably the one that killed me the most, which is ironic. So I hated that one the most. But it’s probably the closest to what data science is today because we did a lot of regression analysis in Stata, which is even more proprietary software than staff.

Okay.

Interesting.

Well, I sure had a lot of fun talking with you. Do you have any closing remarks or anything you’d like anybody to do?

No, I think I’ll just leave everything. I dispute a lot of links, so I’m going to leave that all up to you to include in the show notes so people can go to the sites. I had a great time.

Well, I had a good time talking with you, too. So thanks a lot.

Thank you so much.

Thank you to DigitalOcean for sponsoring this episode visit Test And Code. Comdigitalocean to get your $100 free credit. That link is also on the show notes at testandcode.com 57 I’ve also included a whole bunch of links to think is mentioned by Vicky I probably missed a few so if you notice any that I’ve missed, let me know and I’ll add them. Thanks again for listening. Now go out and test something or do a little data science.