53 - Seven Databases in Seven Weeks - Luc Perkins

A discussion with Luc Perkins, co-author of “Seven Databases in Seven Weeks, Second Edition: A Guide to Modern Databases and the NoSQL Movement”

Transcript for episode 53 of the Test & Code Podcast

This transcript starts as an auto generated transcript.
PRs welcome if you want to help fix any errors.

Welcome to Test & Code, a podcast about software development and software testing.

On today’s episode, we talk with Luc Perkins about the book Seven Databases in Seven Weeks. The subtitle of the book is A Guide to Modern Databases and the NoSQL Movement. We discuss a bit about each database covered, which includes Postgres, Redis, Neo4j, CouchDB, MongoDB, HBase, and DynamoDB. This episode is sponsored by my good friends at PyCharm.

Welcome, Luc Perkins, and thank you for coming on to Test & Code.

Yeah, thanks for having me, Brian.

So you wrote the second edition of Seven Databases in Seven Weeks?

Yeah, that’s correct.

Before we jump into the book stuff, tell me about you. Who are you?

Who am I? Yeah, well, I am Luc. I have worked in a variety of roles in the industry, largely in a technical writing capacity, but I’ve also done some other things as well. I’ve worked on the customer success side of things. I’ve worked as a developer evangelist and developer advocate.

Currently I work for a nonprofit organization called the Cloud Native Computing Foundation, or CNCF, which is essentially like a version of the Apache Foundation, but with a strong focus on containers and especially Kubernetes. So basically, if the Apache Foundation tends to be largely centered on projects around Hadoop and Java and things of that sort, the CNCF tends to be more focused on the Kubernetes ecosystem, with a lot of projects written in languages like Go. Similar concept, but with a different focus. And thus far my work at the CNCF has been largely focused on things like tech writing and web design, but I’ve also been writing blog posts for some of our constituent projects, and I’ve been going to conferences and giving meetup talks and all kinds of stuff like that.

So that’s what my current role comprises, essentially.

Initially I got into databases, although I actually haven’t been spending a lot of time on databases in the last couple of months. My initial big foray into databases came when I worked at a company called Basho a couple of years ago. Basho was the company that gave the world Riak, which is a NoSQL, more or less pure key-value database.

And it was in that role that I really began to wrap my head around a lot of these deep database concepts: the crucial differences between SQL and NoSQL databases, and the technical trade-offs involved in making choices around data modeling and data consistency and those kinds of things.

Between my current role and my role at Basho, I’ve kind of jumped around a little bit as we do nowadays in the tech industry, but yeah, that should be enough overview for the podcast, I suppose.

Yeah. And I may be getting you mixed up with somebody else, but I have a memory of meeting you in person at a dinner a couple of years ago with Casey Rosenthal.

Yes, that’s me.

That’s when we started following each other on Twitter and interacting, sending tweets back and forth now and then.

I do recall. I believe it was at the Bye and Bye, up on Northeast Alberta, with some pretty tasty vegan fare and some really stiff pours.

Yeah, I keep meaning to head back there. I haven’t been there since. You’re a Portland person as well, right?

Yeah.

I mostly grew up here, spent my formative years close by, went to college in Southeast Portland at Reed College, and went away for grad school for a couple of years, but then moved back about six years ago. So really the bulk of my adult life has been here in town.

And then I’m curious how somebody gets into technical writing. How did that happen?

In a word, sheer happenstance. I don’t have a degree in it or anything of that sort. And it’s something that I came upon, I guess, a little bit later in life. In my early 30s, I took a path that, when I was starting out on it, I thought was extremely unique. But I think it’s actually a fairly well-worn path at this point, which is: I went to school for many years, got a political science PhD, and confronted a catastrophic academic job market where I tried really hard to find some kind of tenure-track position in academia that was going to enable me to live a quiet and fulfilling life. But unfortunately, like a lot of people in my cohort and my generation, I wasn’t able to do so.

At the age of 29, going on 30, I decided that I was just going to do something different in life, started learning a bunch of programming languages and decided that I was going to throw myself headlong into working in the industry.

I got my first tech job back in 2012, doing some kind of technical marketing type stuff and writing blog posts and doing more introductory material for developers. And in 2013, I was working at a company in Portland called Janrain, and they decided their tech writing team was a little light on technical content. So they sent me over to those folks, and that was my first foray into tech writing.

Okay.

And from there, yeah, I worked there for about a year and got the job at Basho.

I would say lead tech writer, but I was the sole tech writer, so by definition lead tech writer there. And that’s where I think that I started to really hone my craft as a writer and as a tech writer and really find my voice and find my niche within the industry.

That’s cool. I think that technical writing, and good technical writing especially, is one of the unsung heroes. I know there are a lot of unsung heroes in the tech world, but good writing is definitely one of those.

Yeah, for sure.

I think that right now there is a growing awareness in the industry that tech writers are extremely important, and that documentation is absolutely essential to the success of pretty much any software project you can conceivably imagine. Whether that awareness has really translated into the kind of institutional clout that tech writers, I think, do deserve within engineering orgs, that’s a different question. I think that some companies really do properly value it, and others definitely do not.

Yeah.

So jumping over to the book a bit, you got involved in a lot of the different NoSQL databases from your work at Basho.

Yes.

How did that evolve into you writing the second edition of this book?

Yeah.

One of the authors of the first edition of Seven Databases in Seven Weeks was a guy named Eric Redmond, who was a colleague of mine at Basho and an extremely gifted engineer, extremely gifted writer and technical educator.

Really just, I think, perhaps the closest thing to a pure polymath and Renaissance man that I’ve found in the tech industry thus far. And I worked alongside him on a lot of the documentation. I think that he had built the original documentation site and set up the static site generator and was really heavily involved in that from the very beginning. And he actually authored a book about the database Riak called A Little Riak Book.

And I think that Basho used to hand out copies of A Little Riak Book to people at conferences and meetups and things like that.

Eric is an amazing guy.

The first edition came out in 2011, and in 2016, Eric got in touch with me because he basically said, hey, Luc, the book is getting a little crusty. It’s showing signs of age. Of course, in a field like NoSQL, things change so quickly that a book becomes pretty quickly outdated, let alone after four or five years. And he’s like, hey, do you want to come on as a co-author on the second edition and spruce things up a bit? And I said, sure. So we talked back and forth for a while, and ironically enough, we both agreed that the Riak chapter in the book should probably go and get replaced by a different chapter. We ended up deciding on DynamoDB for that, which I think was a good choice in retrospect. I guess that I came upon the book just through... what’s a diplomatic word for nepotism?

In my experience of writing a book, it was more work than I thought it would be in the beginning. And I’m guessing your experience is similar.

Oh, absolutely. Yeah. It did end up being more work than I anticipated.

And actually, on the one hand, it was nice to have a bunch of existing material to work with, no doubt about it.

To be frank, there are sections of this book that I modified and tweaked a little bit, but ultimately mostly left alone in terms of the basic content. And some of these sections, especially once you get into the nitty-gritty details of columnar storage formats in HBase and things like that, I wouldn’t have been able to write with anything resembling the clarity and depth that Eric and Jim Wilson, the other co-author, managed to do on their own. So that was a really nice part of the writing process. What was tricky about it, though, was that whenever I was writing, I was always triangulating in my mind between a variety of concerns, one of which was infusing the book with my own voice and trying to get my own personal Luc Perkins stamp on it, while also trying to remain true to the intentions of the original authors.

That process is really tricky, because there are times when you think, gosh, you know, maybe I should just rewrite this whole section or get rid of it and start from scratch. Or maybe I don’t like this chapter as much, et cetera. And I just found myself constantly making those kinds of decisions. So it was both a blessing and a curse in terms of the amount of work, that I was working within the strictures of an already existing product, so to speak.

This episode of Test & Code is brought to you by PyCharm. I started using PyCharm because of the amazing automated test support, especially for pytest, but the more I use it, the more I realize how much time it saves me in the rest of my day. I can click commit and walk through all of my changes and make sure everything is really something I want to keep. If I see a print statement that I didn’t mean to leave in the code, I just uncheck that part of the file and it isn’t committed.

Want to try out a snippet of Python? There are tabs at the bottom for quick access to the Python console, as well as to the command line console, version control, and even a TODO list that’s populated by TODO comments in the project.

I find myself opening other tools less and less. That time saving on context switching adds up. If you value your time like I do, try PyCharm. Head to testandcode.com/pycharm and try the Pro version for free for four months.

I came to the book because I am interested in NoSQL databases, and I don’t really actually have a lot of experience with hardcore database development.

I know I need long-term storage for some applications. Where do you think a lot of the readers are coming to this book from, or why are they showing up here?

Gosh, that’s a good question.

I don’t really have any meaningful figures on percentage of people coming to the book from different walks of life. I would certainly love to have that information because I’d be able to tailor it to various audiences and probably make a lot of extra money that way.

But yeah, that’s a great question. Hypothetically, I can certainly think of a handful of character types or institutional positions or types of folks, to put it simply, who would be interested.

I think developers are obviously a big one, developers that are building out business specific application logic, and they want to choose the right tool for that particular job.

They want it to be easy to use. They want it to have a readily comprehensible data model. They want to understand the kind of guarantees that it provides, and so on and so forth.

I think that’s a big one. I think another one is some people, I think, like yourself, who are both developers and also just general technological enthusiasts.

I think that there’s probably a solid market for those folks as well.

I’m kind of like this, too. Even if I don’t work in a particular domain, if I keep seeing a term or a specific vocabulary or acronym or something popping up over and over again, at some point I’m going to say, dang, it’s time for me to figure out what this blank thing is that everybody is talking about, maybe blockchain or something like that.

And I think that a lot of people are going to be curious about what this NoSQL is, this NoSQL thing, this paradigm, this way of doing things. So I imagine that’s probably a large demographic.

A third one is probably people in technology who may not be performing the nitty-gritty of building applications, but who are in a position to make important technological decisions. So people like CTOs, lead engineers, software engineering architects, and people of that sort would probably be interested as well. Maybe they’re not necessarily looking for an exact fit for a specific use case, but they’re going to want to familiarize themselves, if not with the NoSQL domain specifically, then at least with some of the trade-offs they’re going to have to be aware of when making technological decisions about data.

So I guess just off the top of my head, I can think of those three types of folks who might be really drawn to and would benefit from a book like this.

Okay, so another reason why I asked you on is because I’m an impatient person and I don’t know if I want to take seven weeks. I was hoping you could teach me seven databases in seven minutes.

It was mostly a joke that I couldn’t resist.

But are you okay with us just sort of walking through some of the databases?

Yeah, sure. Absolutely.

You start out with Postgres, and that’s not a NoSQL database. Is that for comparison, or...?

It looks like a fairly good introduction to Postgres, actually.

Yeah. That chapter really, I think, is ultimately mostly for the sake of comparison, because if we’re going to spend all this time focusing on something that’s “no blank,” then I think that the blank could probably provide a lot of necessary context and conceptual scaffolding.

The Postgres chapter is mostly there for that reason.

And like you say, I think the chapter is pretty well done. It’s pretty thorough, because we are implicitly arguing in favor of NoSQL, but we’re not trying to denigrate or devalue SQL databases. And so if we have one chapter to spend on SQL databases, we really want to give those systems their due. And we think of Postgres as a particularly recommendable, exemplary database in the SQL paradigm.

And we wanted to show people what it can do, because I guess that amongst the three authors, none of us see SQL versus NoSQL as a zero-sum game where one paradigm is going to win.

There are so many ways to use data, so many use cases, so many challenges, so many sets of trade-offs that people are going to have to navigate, that you need to really keep both paradigms in mind. And to that end, we thought we would give Postgres its due.

Okay.

I guess I’ve got a broad question I’ll save for later.

So next up is HBase, and I actually had never even heard of HBase before I picked up your book.

Yeah.

And what is HBase?

Well, HBase is the one database in the book that’s actually a columnar data store, which is interesting. It’s a little bit like relational databases, but with the crucial difference that data is stored in columns instead of rows. And yeah, that gets conceptually really tricky really fast. But basically HBase tends to be used for really big use cases. I mean, it’s often cited as a kind of archetypal big database. It’s definitely not something that you’re going to use if you’re a hobbyist developer who’s just building a basic CRUD application and wants to do SELECT * FROM table.

Okay.

Yeah, it’s definitely not really well suited.

I think that HBase is something that you’d want to use if you have a team of people who know how to manage it and use it at big scale, and you have a use case that really fits the columnar way of doing things. Then HBase is fantastic. But it’s possible that maybe you haven’t heard of it just because it doesn’t tend to be as popular with hobbyist developers or even a small team doing a web application.

They’re probably not going to run into those cases right away, right?

Exactly.

It’s because HBase, I think, tends to be used in environments where you’re doing really heavy duty data processing. So it tends to function very well for like a data warehouse, for example.

I’m trying to think of a good use case for this. So let’s say that you are monitoring things that people are doing on your web application.

You’re just collecting tons and tons and tons of information from lots and lots of different users and you’re never really deleting any of that information. You’re just keeping it around you’re warehousing it.

That data is just growing and growing and growing. And every once in a while you need to perform some kind of analytical something with it. You need to figure out what percentage of people are coming to our site and clicking this particular button.

And you’re going to have billions and billions and billions of clicks on your site to go through, and you’re going to perform some big and bold analytical processing something.

HBase is the kind of database that’s really well suited for that.

So it does tend to be more of an analytical database and less of what you’d call a transactional database. So if you’re familiar with the acronyms OLAP and OLTP, the A is for analytical and the T is for transaction processing. HBase is very much in the OLAP camp.
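A minimal sketch of the row-versus-column idea, in plain Python rather than HBase’s actual API. The event fields and function names are made up for illustration, and the pivot assumes every record has the same fields. It shows why an analytical question like “what share of clicks hit this button” only needs to read one column:

```python
# Toy click events stored row by row; the field names are hypothetical.
rows = [
    {"user": "a", "button": "signup", "ms": 120},
    {"user": "b", "button": "pricing", "ms": 340},
    {"user": "c", "button": "signup", "ms": 95},
    {"user": "d", "button": "docs", "ms": 210},
]


def to_columns(records):
    """Pivot row dicts into one list per field (assumes uniform fields)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def share_of_clicks(cols, button):
    """Fraction of events on `button`, touching only the 'button' column."""
    values = cols["button"]
    return sum(1 for v in values if v == button) / len(values)


columns = to_columns(rows)
signup_share = share_of_clicks(columns, "signup")
print(signup_share)
```

In the row layout, answering the same question means walking every full record, including the fields the query never looks at.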

Okay. Now, with a lot of the remote monitoring and stuff, there’s actually probably a growing number of applications where people don’t know the questions they’re going to ask yet, but they know what data they can collect.

Yes, exactly.

For use cases where the core imperative is just don’t drop any data, take anything you can get, throw it in the warehouse and figure it out later, HBase is definitely well suited. So for listeners who are familiar with AWS Redshift, for example, I think that’s a really commonly used analytics database or data warehouse, whatever you want to call it. There are some others. I mean, Google has a couple of offerings like Google Bigtable, which I think is actually largely compatible with the HBase API. Okay. That’s another big one like that.

Well, Mongo is an easier one to describe, I think.

So next up, you’ve got MongoDB.

Yeah, MongoDB is interesting because I guess let’s start with the definition of the category. So Mongo is one of two document oriented databases in the book, the other one being CouchDB, which we’ll talk about in a second. Okay.

But MongoDB is for unstructured data, and it’s different from HBase. Well, one thing that it shares in common with HBase is that it’s a kind of database where you can just start throwing stuff in there and all data in MongoDB is JSON, and you even query it with JSON. It’s just JSON all the way down.

You can just start throwing JSON objects into MongoDB without any regard for structure, although it’s important to think about structure. But if you wanted to, you could just start throwing any kind of JSON in there and you could find a way to query later.

So if you’re working with data objects that don’t all look the same, where some of these objects might have fields that other objects don’t have, or maybe you’ll have 999,999 objects that look one way and you’ll have one object that looks completely different, you could throw that one object in there, no problem. And MongoDB would be able to handle that. No big deal.
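To make the “throw anything in and query it later” idea concrete, here is a small sketch in plain Python. It is not the MongoDB driver or shell; the `find` helper and the sample documents are assumptions for illustration. It only supports exact-match queries written as a dict, loosely in the spirit of querying JSON with JSON:

```python
# Documents with different shapes living in the same collection.
collection = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "languages": ["COBOL"], "rank": "Rear Admiral"},
    {"sensor_id": 7, "reading": 21.5},  # looks nothing like the others
]


def find(docs, query):
    """Return the docs whose fields equal every key/value pair in `query`."""
    missing = object()  # sentinel so absent fields never match
    return [
        doc
        for doc in docs
        if all(doc.get(key, missing) == value for key, value in query.items())
    ]


matches = find(collection, {"name": "Grace"})
print(matches)
```

Nothing had to be declared up front for the odd sensor document to be stored; the cost is that queries have to cope with fields that may simply not be there.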

MongoDB is a document database, and it’s built to.

I find it interesting. I would think that for anybody using a document database (I’ve not tried CouchDB, though), it’s a common one. If I don’t want to use a SQL database, and I want an easier API to learn, but I’m not quite sure yet what information I really need out of the database, then Mongo might be one of the first ones to grab.

Yeah, definitely.

I think it’s precisely because it doesn’t really ask you to do any modeling or specifying before you start putting data in it.

HBase, like I said, is a columnar database. It’s different from a SQL database, but you still have to define your columns.

Even though HBase is less strict in terms of making your data conform to a relational pattern, MongoDB really makes virtually no requirements whatsoever. Now, whether it’s smart to use MongoDB that way is a completely different question.

I think that most CTOs and technical decision makers would be pretty aghast at the prospect of somebody using MongoDB in that way.

But one of the advantages is that, yes, at the end of the day, it doesn’t require a lot out of you. In that respect.

I have not used CouchDB.

How is it different from Mongo?

Well, CouchDB is different because I think that the killer feature of Couch, the killer app, so to speak, is that, like MongoDB, you can throw unstructured data in it as JSON. So it’s also in NoSQL land. When you say document oriented, when you say document, that pretty much always means JSON.

I’m glad you bring that up, because my first reaction to those is, why would somebody need an entire database to store Word documents?

Yeah, exactly.

I think we’d probably all benefit a lot if they just came out with it and said JSON-oriented databases. It would clear up a lot of confusion. But yeah. So the killer app in CouchDB is that you can write super complex queries and actually store them, kind of like stored procedures. In MongoDB, you can basically hop into the MongoDB shell and write queries as JSON that go through the data essentially in real time and find the documents that match your query. With CouchDB, you can kind of do the same thing, but you can upload those queries as documents, and CouchDB will sort of perform the heavy lifting in a very MapReduce kind of fashion, basically saying, okay, I know in advance what this query is. And in terms of what happens when you trigger that query, Couch is very smart about very efficiently finding an answer for you.

Okay, yeah, it’s a document oriented database. You can throw any kind of JSON at it that you want to, but I think that there is an interesting queryability advantage to using CouchDB.
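The “store the query ahead of time” idea can be sketched in plain Python. This is a toy, not CouchDB’s API: real CouchDB views are typically JavaScript map and reduce functions stored in design documents, and the `ViewStore` class, its methods, and the sample documents here are all hypothetical. The point is only that the map step runs when data is written, so answering the query later is a cheap lookup:

```python
from collections import defaultdict


class ViewStore:
    """Toy document store with one map function registered up front."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.docs = {}
        self.index = defaultdict(list)  # view key -> emitted values

    def put(self, doc_id, doc):
        """Store a document and refresh the precomputed index."""
        self.docs[doc_id] = doc
        # Simplification: rebuild everything on each write.
        self.index.clear()
        for stored in self.docs.values():
            for key, value in self.map_fn(stored):
                self.index[key].append(value)

    def query(self, key, reduce_fn=sum):
        """Answer from the index without scanning the documents."""
        return reduce_fn(self.index.get(key, []))


def clicks_by_button(doc):
    """Map step: emit (button, 1) for every click document."""
    if doc.get("type") == "click":
        yield doc["button"], 1


store = ViewStore(clicks_by_button)
store.put("1", {"type": "click", "button": "signup"})
store.put("2", {"type": "click", "button": "docs"})
store.put("3", {"type": "click", "button": "signup"})
store.put("4", {"type": "pageview"})
signup_clicks = store.query("signup")
print(signup_clicks)
```

A real system would update the index incrementally rather than rebuilding it, but the trade is the same: more work at write time in exchange for fast, predictable reads.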

That’s interesting. Yeah. Cool.

And then the next one up is something I also have never heard of: Neo4j.

Yeah, Neo4j is from a paradigm called graph databases.

And graph databases are really just their own thing.

They have elements and aspects that make them a little bit like other databases.

But yeah, it’s almost like apples and oranges.

So a graph database is basically a database in which what you care about most is the relationship between things.

So if you look at, let’s say a family tree is a great example of a graph database where the most important thing is not the list of names. I mean, imagine if I was like, hey, here’s my family tree and I handed you a list of all the people in my family tree and maybe their age and like where they were born or something like that. Well, that’s interesting enough as a compendium of people that are somehow related to me.

But what you want to know is the relationships between these people.

So for this person, who are their parents, who are their children, and so on and so forth. A graph database would be well suited for a use case like that, where the most important thing is the relationships. So you can start by putting different nodes in the database. That would be personal information about people in your family. And then you would go through and define the relationships between all these people. And from there, once you have your family tree, you would be able to perform really interesting queries. So you’d be able to say how many people in your family tree...

I’m trying to think of a good query for a family tree, or what an interesting factoid one would want to know would be. Like, how many people are within three nodes of me, or three relationships of me, on the graph, for example.

And the example that we use in the book is actually the six degrees of Kevin Bacon example. Okay. We actually pull in a ton of data from IMDb, the Internet Movie Database.

This is a practical example. So this is something that you can run at home and do on your own in conjunction with the chapter. But we basically put a bunch of information about movies and actors and actresses into Neo4j, into this graph database. And from there we actually walk you through constructing the query that would enable you to figure out within how many degrees of Kevin Bacon a particular actor or actress is.
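The degrees-of-separation question is a shortest-path search over a graph. Here is a minimal sketch using breadth-first search in plain Python. It is not the book’s Neo4j example or its query language, and the movie titles and most of the actor names are placeholders rather than real IMDb data:

```python
from collections import deque

# Hypothetical cast lists; two actors are linked if they share a movie.
movies = {
    "Movie A": ["Kevin Bacon", "Actor One"],
    "Movie B": ["Actor One", "Actor Two"],
    "Movie C": ["Actor Two", "Actor Three"],
    "Movie D": ["Actor Four"],
}


def build_graph(cast_lists):
    """Map each actor to the set of actors they have appeared with."""
    graph = {}
    for cast in cast_lists.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph


def degrees(graph, start, target):
    """Fewest co-star hops from start to target, or None if unconnected."""
    if start not in graph or target not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, distance = queue.popleft()
        if actor == target:
            return distance
        for costar in graph[actor]:
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, distance + 1))
    return None


graph = build_graph(movies)
bacon_number = degrees(graph, "Actor Three", "Kevin Bacon")
print(bacon_number)
```

A graph database keeps the relationships as first-class data, so a traversal like this is something you ask the database to do instead of writing by hand.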

Thing.

Yeah, it’s fantastic at dinner parties. So if you want to impress the in-laws, I highly recommend the Neo4j chapter of the book. And graph databases are super cool, because just the thought of doing six degrees of Kevin Bacon in the other database paradigms, doing that in a KV store or columnar or SQL or whatever, is just truly horrifying. I wouldn’t want to be the person called upon to implement that. So it is a very particular set of use cases that it serves. And I think Facebook and the social media companies actually use graph databases pretty extensively, for reasons that are pretty easy to discern. I’m sure Tinder and the dating apps do as well.

Yeah. But I would think that, I mean, it’s definitely a computer science type topic, graphs and graph theory, but there are a lot of problem sets that you can solve with something like that. Yes, you can solve them other ways, but it’s hard. And so it’s cool to have that around.

Yeah, it would definitely be very hard.

And then, okay, so we have a couple more. DynamoDB, also a new one to me. Where’s that from?

So DynamoDB is from Amazon Web Services, AWS.

Okay.

And it is a managed cloud offering. So DynamoDB, it’s unlike the other databases in the book, because it’s not open source and you could not run and manage it yourself even if you wanted to.

Okay.

So basically, with the other databases in the book, I could put together a crack team of sysadmins and run Mongo or run Redis or run Postgres or whatever.

With Dynamo, you are using Amazon’s services and expertise and you’re paying money for it. Well, I’m sure they have a pretty generous free tier. But yeah, along the commercial versus open source axis, this is the one non-open-source database we use in the book. In terms of what kind of database it is, it is pretty interesting. And, you know, I actually had to coin a term. Hold on a second, let me look in the book really quick here, because I came up with a way of describing the data model. I think I call it, like, NoSQL plus, or I forget. I haven’t read this chapter in a while. Sorry.

No worries.

Yeah. So DynamoDB, it’s definitely a NoSQL database in terms of scalability and in terms of offering a lot of flexibility in the data model. But it is a little bit like relational databases because you do have to work with tables of data, first of all, and you do have to provide some information about those tables. So you have to define things like keys, how particular rows are identified by key, and there’s all kinds of interesting ways to do that.

And you also do have to define some of your columns.

It’s interesting, though, because with DynamoDB, you can do things like you can define some of the columns and then also have unstructured data stored in the same rows. So there is a lot of flexibility.

It is a pretty interesting synthesis of the kinds of queries that you’d want to run on a relational database, the equivalent of SELECT * FROM table WHERE x = y. And they actually do have a querying language, a very simple one, that enables you to do that, but they can also accommodate unstructured data as well. So DynamoDB is, yeah, it’s very interesting.
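The “define the keys, leave the rest flexible” model can be sketched in plain Python. This is not the AWS SDK or DynamoDB’s query language; the `KeyedTable` class, the attribute names, and the sample orders are assumptions for illustration. Only the two key attributes are required; anything else on an item is free-form:

```python
class KeyedTable:
    """Toy table: key attributes are mandatory, all others are optional."""

    def __init__(self, partition_key, sort_key):
        self.partition_key = partition_key
        self.sort_key = sort_key
        self.items = {}  # (partition value, sort value) -> item

    def put(self, item):
        """Insert or overwrite an item, rejecting it if a key is absent."""
        missing = [k for k in (self.partition_key, self.sort_key) if k not in item]
        if missing:
            raise ValueError(f"item is missing key attributes: {missing}")
        key = (item[self.partition_key], item[self.sort_key])
        self.items[key] = dict(item)

    def query(self, partition_value):
        """All items sharing a partition value, ordered by sort key."""
        keys = sorted(k for k in self.items if k[0] == partition_value)
        return [self.items[k] for k in keys]


orders = KeyedTable("customer_id", "order_date")
orders.put({"customer_id": "c1", "order_date": "2018-02-01", "total": 30})
orders.put({"customer_id": "c1", "order_date": "2018-01-15", "total": 12,
            "gift_note": "Happy birthday"})  # extra attribute, no schema change
orders.put({"customer_id": "c2", "order_date": "2018-01-20",
            "coupon": {"code": "X", "pct": 10}})
c1_orders = orders.query("c1")
print([order["order_date"] for order in c1_orders])
```

The query is the rough equivalent of selecting every row where the key equals a value, which is the relational-flavored part; the unstructured extras ride along untouched.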

It’s actually based on an academic paper that came out of Amazon in the mid-aughts. I’m going to say 2005, 2006, but it may have been a little bit earlier. It’s known throughout the industry as the Dynamo paper, and the Dynamo paper has actually been extremely influential. So the database Riak is explicitly based on the Dynamo paper and a couple of others. For a shopping cart application, any transaction that you lose, any data that you lose, translates directly into lost money. Amazon had to come up with a super scalable way to handle this particular set of use cases, wrote the Dynamo paper, and unleashed it upon the world. And it could well be argued that the NoSQL paradigm...

There could have been a NoSQL paradigm without it, but it was so heavily influential that it would have ended up looking completely different, I think.

Interesting.

Yeah. So DynamoDB, on the other hand, is the commercially available version of the Dynamo system that they built internally.

Okay.

Now I guess the last one, Redis, is probably, I’m guessing, arguably the most common database used in conjunction with other databases.

Yeah, I would say so, yeah.

Because Redis is typically seen as more of a caching database. So for use cases like session storage and things of that sort, I think Redis and Memcached are typically seen as kind of the two highest stars of the caching galaxy, I guess you could say. And you definitely would not want to store username and password data or something like that in Redis. You definitely would not use Redis as an OLAP analytics database or anything of that sort.

But for more, I guess you could say, niche use cases, although I hate to say niche, because things like caching and session storage are extremely important.

But Redis does tend to be very adept at filling in some of the gaps, so to speak. If you’re writing a standard web application or a SaaS application, it utilizes a bunch of long-term storage for data. If you’re building a social media application, that’s long-term data about users and about the relationships between people and so on and so forth. But Redis is going to fill in a lot of the gaps in terms of the user experience on the web page, for example. Redis is very adept at this type of thing. Now, is it...

Sorry.

For instance, I think there are a lot of tutorials around for how to shim Redis in between other databases. Is that what you mean by caching store?

Actually, I don’t even really know what that means when you compared it to Memcached. If I query a bunch of stuff out of my other database, but I want to keep it around for new page updates or something, is that what that’s for, or what?

Okay, yeah, I see what you mean. Well, caching is one of a couple of prominent use cases for Redis. It can also do things like basic pub/sub, much like systems like Kafka and Pulsar.

It handles that use case pretty well. I guess by caching, I mean more that it’s intended for short-lived data. Okay. So you can use Redis as a longer-term persistent data store, but its data model and way of doing things isn’t really primarily geared towards that. It’s primarily geared towards being incredibly fast. I mean, Redis is just amazingly, blazingly fast.

And it achieves that basically like this: when you write to Redis, it goes first into memory, and then every once in a while, at an interval that you can set in the configuration, Redis will store it on disk.

What that means, of course, is that if something is stored in memory and something goes wrong and the node becomes unavailable, for example, before the data gets written to disk, then it means that that data is essentially lost.

Let’s say the node comes back up and the memory cache gets cleared out or something, that data is essentially lost. And so for caching, that tends to be perfectly acceptable.

I mean, you don’t want that happening all the time, but it’s not the end of the world if you lose a user session store or something like that.
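The trade-off described above, where writes land in memory first and only reach disk at a later snapshot, can be illustrated with a small toy model. This is a sketch under stated assumptions, not Redis itself: the `SnapshottingStore` class and its `writes_per_snapshot` parameter are hypothetical names, and the snapshot is triggered by a write count rather than a timed interval to keep the example deterministic.

```python
class SnapshottingStore:
    """Toy in-memory store that only occasionally copies its data to 'disk'.

    Hypothetical stand-in for illustration; this is not the Redis API.
    """

    def __init__(self, writes_per_snapshot):
        self.writes_per_snapshot = writes_per_snapshot
        self.memory = {}  # fast but volatile
        self.disk = {}    # durable, updated only at snapshot time
        self._writes_since_snapshot = 0

    def set(self, key, value):
        # Every write goes to memory first.
        self.memory[key] = value
        self._writes_since_snapshot += 1
        if self._writes_since_snapshot >= self.writes_per_snapshot:
            self.snapshot()

    def get(self, key):
        return self.memory.get(key)

    def snapshot(self):
        # Copy the current in-memory state to the durable copy.
        self.disk = dict(self.memory)
        self._writes_since_snapshot = 0

    def crash_and_restart(self):
        # Memory is wiped; only what reached disk comes back.
        self.memory = dict(self.disk)
        self._writes_since_snapshot = 0


store = SnapshottingStore(writes_per_snapshot=2)
store.set("session:alice", "logged-in")
store.set("session:bob", "logged-in")    # second write triggers a snapshot
store.set("session:carol", "logged-in")  # only in memory so far
store.crash_and_restart()
survivors = sorted(store.memory)         # carol's session did not survive
```

In this toy run, the two sessions written before the snapshot survive the restart and the third is lost, which is the kind of loss that is usually tolerable for a cache or a session store.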

Okay.

Yeah, lightning fast, but with certain drawbacks, some caveats.
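The caching use case raised earlier, keeping results from a slower primary database around for later page loads, is commonly called the cache-aside pattern. A minimal sketch follows, assuming a plain in-process dictionary in place of a real cache server. `TTLCache`, `fetch_user`, and `slow_lookup` are hypothetical names chosen for illustration, and the 60 second lifetime is an arbitrary default.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after a time-to-live.

    Hypothetical stand-in for a cache server; not a real client library.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)


def fetch_user(user_id, cache, load_from_database, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_database(user_id)
    cache.set(key, value, ttl_seconds)  # keep it around for later requests
    return value


calls = []


def slow_lookup(user_id):
    # Stands in for a query against the long-term database.
    calls.append(user_id)
    return {"id": user_id, "name": "example"}


cache = TTLCache()
first = fetch_user(42, cache, slow_lookup)   # miss: goes to the database
second = fetch_user(42, cache, slow_lookup)  # hit: served from the cache
```

Because cached entries are short-lived and can always be rebuilt from the primary database, losing them is an inconvenience rather than a disaster.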

Okay, so there’s a big question then: if I’m going to start something and I need a store for my data, how do I decide what to use?

Yeah, that’s really tough. We actually do have a couple handy tables in the appendices to the book where we have tables that list prominent features and which databases support them. So if you need something like I’m not going to take a look right here.

So if you need some form of cross node replication scheme, for example, which databases support that? If you need sharding, which databases support that? If you need so-called ACID transactions, which databases support that and which don’t? We have a couple of places in the book where we tried to present that information in very concise form.

But of course, that raises the question of which features you’re looking for and which trade-offs you’re trying to make. And so I would say that a good place to start would be to ask: what problem am I trying to solve, what’s unacceptable for my use case, what’s the worst case scenario?

In some cases, the worst case scenario is that fetching data takes a long time.

In that case, it’s better to maybe have a value that’s not the most up to date value, as long as there’s something, right? So that’s a good thing to know in some cases.

In some cases, you need to run a complex query over lots and lots of data, and you need an exactly right answer every single time with no slippage and no fibbing and no eventual consistency, as we call it in NoSQL land.

That’s a good thing to know.

If you’re building a shopping cart application, you need to never lose anybody’s shopping cart data because that’s essentially leaving money on the floor.

That’s a good thing to know and a good place to start. Okay.

Step zero is basically: what’s the worst case scenario, and what are the databases that enable me to avoid that? And typically, once you know the answer to question zero, as I’m somewhat fancifully calling it, that will often not just knock several databases off of your list of options, but it will probably eliminate whole paradigms of databases.

It might eliminate the whole NoSQL paradigm. It could very well be that you answer question zero and none of the seven databases in the book that are SQL databases. Maybe at the end of the day, you just need MySQL or Postgres.

And I think that that’s a great place to start. And from there you can start digging in and refining your query, so to speak.
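The "question zero" approach, deciding what you cannot live without and then crossing off everything that lacks it, can be sketched as a simple filter over a feature table. The database names and feature entries below are placeholders invented for illustration. They are not the contents of the book's appendix tables and say nothing about any real database.

```python
# Hypothetical feature matrix: names and entries are placeholders,
# NOT the contents of the book's appendix tables.
FEATURES = {
    "db_a": {"replication", "sharding"},
    "db_b": {"replication", "acid_transactions"},
    "db_c": {"sharding"},
}


def candidates(required, feature_matrix):
    """Return the databases that support every required feature."""
    return sorted(
        name
        for name, supported in feature_matrix.items()
        if required <= supported  # set containment: all requirements met
    )


# Question zero for this made-up project: replication and ACID are a must.
shortlist = candidates({"replication", "acid_transactions"}, FEATURES)
```

With these made-up entries only `db_b` remains, which mirrors the point that one hard requirement can eliminate most of the options at once.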

Nice pun.

Yeah, always.

One of the things I like, especially is this land. I mean, this is a broad landscape with I guess, similar sized chunks.

I don’t know how to say this other than people jumping into stuff. This is a great book, for instance, for a college student or a grad student, because you don’t know what you’re going to deal with when you’re actually out in the industry. So having a good walk through a whole bunch of different types of models and actually having some examples that you can code up and watch how it works, it will also tell you not just theoretically what a database is like, but also what it’s like to work with it. And I think that’s a neat standpoint to give people a good broad picture of the landscape.

Yeah. Thanks, Brian. I appreciate that.

I think that’s very well said. I have nothing to add.

I definitely am just sort of curious about a lot of these. So actually, I’m even more curious after talking with you about these, of getting into some of these things.

I think it’d be fun to play with a bunch of them.

But the last thing in your table of contents is CAP theorem. I don’t even know what that is. What is the CAP theorem?

CAP theorem is very important, and it’s basically the idea that there are three things that you might want out of a database, and you can’t have all three. You can have at most two of the three things. So C is consistency, A is availability, P is partition tolerance. So consistency means that when you tell the database this is the current state of things, this is the current state of this table or of this value, then when you go to read the table or the value, it’s going to be completely up to date and the database is going to be able to present a coherent picture of things.

A, availability, means that the database is available all the time. So, you know, maybe it can tolerate a couple of nodes going down, but you’re never going to query the database and not get an answer.

Okay.

So that’s availability.

P, partition tolerance, is basically if you’re running multiple nodes of the database, which, I mean, nowadays if you’re running databases in production, you’re always running lots and lots of nodes. So partition means network partition, which is basically if some of your nodes can’t talk to other nodes or if one of the nodes gets cut off from the network. And of course, that happens all the time. Networks are notoriously fuzzy and brittle in this way. Can the database still function?

And so lots and lots of databases give you ways of having two of those three things. So CP databases.

Relational databases tend to be CP databases. So you can have consistency and you can have partition tolerance, but you can’t have availability.

So those databases, sometimes the database is just going to be down because you lose a node or something like that.

And it’s like, sorry, we can’t give you the exact right answer, so we’re not going to give you any answer.

Okay.

NoSQL databases tend more strongly to be AP databases, which is where they emphasize availability over consistency.

For example, for use cases like I talked about, where the most important thing is that the database is always available, you can’t have your shopping cart database ever go down.

Now in a shopping cart scenario, let’s say a user puts five things in their Amazon shopping cart and then they remove one, and something happens. The database forgets that you removed one and thinks you still have five items in the shopping cart. Well, that’s okay.

That’s an acceptable... it’s not consistent.

It’s not consistent.

It wasn’t able to correctly provide the new state of things that you wanted it to provide, but it’s available all the time.

The user is going to check it and delete it again if they want to.

Yeah, exactly. And for use cases that are analogous to that where having outdated data or an outdated state of things is more important than consistency, or rather where it just has to be available all the time, even if at the cost of consistency, then you want to go for an AP database.
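The CP versus AP distinction from the discussion above can be sketched with a toy two-replica store. This is a deliberately simplified model, not how any particular database is implemented: `TwoNodeStore`, `Replica`, and `Unavailable` are hypothetical names, the partition is a simple flag, and there is no reconciliation after the partition heals.

```python
class Unavailable(Exception):
    """Raised when a CP-style store refuses to answer."""


class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy store with two replicas and a switchable network partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise Unavailable("cannot reach the other replica")
            # Availability first: accept the write on the reachable node only.
            self.nodes[node_index].value = value
            return
        for node in self.nodes:
            node.value = value

    def read(self, node_index):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.nodes[node_index].value


# AP: the cart count goes stale on one node, but every read gets an answer.
ap = TwoNodeStore("AP")
ap.write(0, 5)        # five items, replicated to both nodes
ap.partitioned = True
ap.write(0, 4)        # the removal only reaches node 0
stale = ap.read(1)    # node 1 still answers with the old count

# CP: during the partition the store would rather give no answer at all.
cp = TwoNodeStore("CP")
cp.write(0, 5)
cp.partitioned = True
try:
    cp.read(1)
    cp_answered = True
except Unavailable:
    cp_answered = False
```

In AP mode the second node keeps reporting five items after one was removed, matching the shopping cart story, while in CP mode the same read raises an error instead of returning a possibly outdated count.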

Two people looking at an article on the Web seeing a different number of likes isn’t that big of a deal.

Yes, exactly. Whereas if somebody was sharing the most recent New York Times exposé on something-something in politics and you didn’t know how many likes it had because that database was just down, because consistency is more important than availability, that would be weird. Your Facebook users would say, “That’s weird, seems like a popular article.”

In that case, 1.5 million likes versus 1.47 million likes is not really the end of the world, right?

Yeah.

Well, I think this was a really fun, quick fly through of a whole bunch of different databases.

Thank you a lot.

Yeah, thanks a lot, Brian. Thank you for your kind words. I had a great time talking to you and yeah.

I love the show and maybe we’ll have to get you on some time to talk about containers because I don’t know about those either.

Okay. Yeah, I would certainly love to. Like I said, nowadays my head is much more in container, Kubernetes, monitoring, observability land than it is in database land, so I would be more than happy to come back on and chat.

Cool. Well, thanks for your time and we’ll talk to you later.

All right. See you, Brian. Thanks a lot.

Thanks again to PyCharm for sponsoring the episode. The offer they set up for the show’s listeners is only good until December 1, so to try PyCharm free for four months, go to testandcode.com/pycharm. That link is also in the show notes at testandcode.com/53, as well as links to the book we talked about, Seven Databases in Seven Weeks, and also links to all the databases we talked about. Thanks again to Luc for his effort on the book and for talking with me for this episode, and thank you for listening, for sharing the show with friends and colleagues, for supporting the show through Patreon, and for using the link in the show notes to try out PyCharm. That’s all for now. Now go test something.