Performance monitoring and error detection are just as important with services and microservices as with any system, but they come with added complexity. Omri Sass joins the show to explain telemetry and monitoring of services and of the systems that use them.


Transcript for episode 169 of the Test & Code Podcast

This transcript starts as an auto generated transcript.
PRs welcome if you want to help fix any errors.


00:00:00 Let's say you're adding or updating a service or a microservice in an existing system. You want to monitor the performance of the service and the performance impact on the whole system.

00:00:11 Omri Sass joins the show today to talk about monitoring of services.

00:00:18 Thank you, PyCharm, for sponsoring this episode. PyCharm saves me time. Error highlighting lets me know as soon as I type a mistake. Code completion and parameter hint pop-ups save me from having to look stuff up. The Git integration makes all of my Git workflows super easy, and then you get a whole bunch of bonus features, like the database front end, that are super cool, and a super fast debugger. I wonder how PyCharm will save you time. Find out for yourself at testandcode.com/pycharm.

00:01:01 Welcome to Test and Code.

00:01:13 On today's episode of Test and Code,

00:01:15 I'm super excited to finally have Omri on. You're from Datadog. I guess I should ask about your name first, but is it pronounced "SaaS"?

00:01:22 Yes. My brother and I are both in tech, and we get the software as a service question fairly often.

00:01:29 But it's spelled differently. It's like the programming language instead, right?

00:01:33 Yes.

00:01:33 We were going to talk about monitoring, because I don't really know much about monitoring, and also about how it's changed with some of the move to microservices and stuff. But before we jump into that, can you tell me a little bit about who you are?

00:01:50 Sure. So my name is Omri Sass, not spelled like Software as a Service. I've been with Datadog for about two and a half years. I'm a senior product manager in our APM department, that's application performance monitoring. It's traditionally been considered one of the three pillars of observability, as a kind of core principle of monitoring, and we'll get into that.

00:02:20 In this capacity, I do product management for all of our teams that revolve around service observability. So: how do we reason about services and applications in a microservice or hybrid environment?

00:02:36 How do we get health metrics, detection capabilities, auto-detection capabilities, remediation, et cetera,

00:02:44 for your applications, and how does that integrate with other types of monitoring: infrastructure monitoring, network monitoring, et cetera. And we'll kind of talk through these as we talk more broadly.

00:02:56 Prior to Datadog, I was a product manager for Wirecutter, one of the New York Times' smaller publications, and before that I was a product manager for a data product at First Media, so I kind of got a taste of a lot of worlds as a product manager.

00:03:12 Okay, what is a product manager?

00:03:18 So what is your day like? Do you talk with customers? Do you work with engineers? Are you planning things?

00:03:25 So that's the question I get asked the most. The best answer I can give you is via an analogy.

00:03:34 Maybe it's because I grew up with kind of the relevant folks at home, but a product manager is kind of like a music producer, but for software. So we don't do the actual software ourselves, right? Like, no one wants me to do any type of coding; that scares everyone.

00:03:54 But I make sure that the people who actually do the art,

00:03:57 Right.

00:03:58 the art and science of coding, that they're working on the right things, that they have the right context, that they understand what our users want. And in that, I spend about a third of my time meeting customers, potential partners, et cetera; about a third of my time with our engineers; and then about a third of my time making sure that everything else kind of goes smoothly: working with marketing, working with other internal groups within Datadog to make sure that our product is well understood and well supported across the board.

00:04:32 Okay, that sounds interesting.

00:04:34 Sounds like fun.

00:04:38 When I hear about monitoring, there's a whole bunch of things in that space.

00:04:43 I also hear about telemetry.

00:04:45 Is that the same thing, or is there a difference between monitoring and telemetry?

00:04:49 So I'd say there is a hierarchy, right? Telemetry is the actual data. We extract telemetry from applications, from infrastructure, from third parties, from integrations, et cetera. All of these provide the basis for monitoring, because monitoring isn't just the data itself. It's also the capability to make sense of it and, in a lot of cases, to act on it. So visualizing that telemetry, or creating what at Datadog we call monitors. Right. So, the ability to create an alert if a certain type of telemetry under certain conditions crosses a threshold: it's too high, or it's too low, or it doesn't exist anymore.

00:05:35 All of those sorts of things fall under the umbrella of monitoring. So telemetry is the data; monitoring is what we can do with that data.
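
To make the telemetry-versus-monitoring distinction concrete, here is a minimal sketch in Python. It is not Datadog's API; the metric values, window, and threshold are invented for illustration. It just evaluates a stream of telemetry points against a condition and raises an alert when the condition is met, or when data stops arriving.

```python
from statistics import mean

def evaluate_monitor(points, threshold=0.05, window=5):
    """Alert if the average of the last `window` telemetry points
    crosses `threshold`, or if no data is arriving at all."""
    if not points:
        return "ALERT: no data"  # the "it doesn't exist anymore" case
    value = mean(points[-window:])
    if value > threshold:
        return f"ALERT: value {value:.3f} is above {threshold}"
    return "OK"

# Telemetry: per-minute error rates reported by a hypothetical checkout service.
error_rates = [0.01, 0.02, 0.01, 0.09, 0.12, 0.15, 0.11, 0.13]
print(evaluate_monitor(error_rates))  # ALERT: value 0.120 is above 0.05
```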

00:05:45 Okay, this has always been a thing, but it seems like it's more important now.

00:05:50 Maybe I'm just stating the obvious, but as companies become more and more reliant on their web presence and on their applications served through the web, this is more and more important, because having any part of your services go down is maybe like having a store that can't open, or something.

00:06:11 And then there’s been changes.

00:06:14 I don't even remember when it was, a few years ago or something, but we were talking about monolithic architectures versus microservices, and I don't really hear that much anymore, because I think it's sort of a blend. You probably know more than me. Is that something you're seeing?

00:06:29 Definitely. As these things tend to happen, you talk about some secular shift in tech.

00:06:34 Right?

00:06:34 In the beginning, it was like a cloud migration, where you move off of your on-prem environments and you do a cloud migration. But the reality is that a lot of organizations that have on-prem environments don't fully migrate to the cloud. They start having a cloud presence, and they keep their on-premises environment, because migrating it is fairly hard.

00:06:57 The same thing applies to monoliths and microservices.

00:07:01 If you're brand new to the game, you're a tech startup or a tech company that was founded fairly recently, you might decide that you're going to build microservices from the get-go, in which case, kudos to you. But if you're a well-established tech firm, breaking up a monolith, so to speak, isn't so easy. And what ends up happening in a lot of cases is that some pieces of functionality get peeled off the monolith and built as microservices. New functionality gets built as microservices, but it all revolves around this existing monolith, which will take a very long time to actually be put away or completely broken apart. So, like you said, at the end of the day, what ends up happening is that we get a hybrid environment with at least one very large, kind of central monolith and a lot of microservices that surround it.

00:07:59 Did you say one of your focuses is with services?

00:08:03 Correct.

00:08:05 My role definition is service observability.

00:08:10 Okay, service observability.

00:08:16 Let's have an example. What's an example service?

00:08:18 We can talk about, honestly, almost anything. The easiest way to think about an individual service is a code repo.

00:08:26 Right.

00:08:27 Anything that you would store in a single code repo. I don't want to get too deep into the variance of, like, oh, I have a monorepo, or we do something else. But at the end of the day, it's a single unit of code that can run independently, takes inputs, and provides outputs.

00:08:46 Well, "runs independently" is interesting. I'm guessing here, lots of assumptions, but if it's a single repo, it's probably one team, and it can be tested in isolation as well.

00:09:03 And then, is it monitored in isolation? Do you monitor a service by itself, or as part of a collective, or whatever?

00:09:12 I love that question. So I'm going to kind of talk through the early part of your question before I get to the end, because it gets more and more interesting. By and large, unless we're talking about a monolith, so if we're talking about a microservice-based environment, microservices are generally owned by a single team. When there's cross-ownership, it also means that there tend to be issues in operations: who's on call, how do we determine if we need to roll back a version. When there's single ownership, things tend to be much easier in that respect. Monoliths do kind of require their own principles of operations and monitoring, et cetera; we can cover that separately. But then, going into how we monitor those, and whether we test and monitor in isolation: we can and should do both. And that's kind of our approach to services, where we say, okay, on the one hand, we want to make sure that you have golden signals.

00:10:11 Right.

00:10:12 So, like, the ability to know throughput, error rates, and latencies on a per-service basis. And those are independent; there's no reason to muddy those across multiple services. When a single service fails, for example, or slows down, we want to know that. We want to pinpoint that this particular service is the one that's failing. The same thing applies for testing, right? And testing could be anything here, from unit testing, to testing in the CI pipeline, to synthetic testing for APIs.

00:10:44 All of those, we want to make sure that they’re very localized to that service.

00:10:51 But the service can't really be taken out of context. And when you have a production failure, figuring out where it is requires the entire mesh of services, and it requires knowing which service actually failed. If I go back to the store analogy that you brought up earlier: if a service goes down, is that equivalent to a store not being able to open? It's exactly that. And the question is why. If a user is complaining, like, hey, I tried to click checkout on this e-commerce website and I'm getting an error, there could be 30 or 40 services on the back end that power that specific checkout request.

00:11:30 Figuring out which one of them is failing right now for this user requires much more nuanced tooling. And this is where things like distributed tracing come into play, where for a given request, we're able to see how it traverses all the microservices, or all the services, involved, and see where most of the time is spent and where errors are coming from. We're able to compile all of the error call stacks from all services into a single place, all of the log lines across multiple services into a single place. And that's where you look at it and you're like, oh, wait, I know exactly what's going on. Let me go to the service and fix it. Or, I know which service is failing; let me go figure out what's going on. Someone deployed bad code. We didn't do enough testing.

00:12:18 Something here is amiss. Let me go deeper into whatever's happening. Or, I need to upscale the infrastructure. Right?
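
As a rough illustration of what distributed tracing looks like from the instrumentation side, here is a small sketch using the open-source OpenTelemetry Python SDK. Omri doesn't name a specific library here, and the service, span, and attribute names are invented; nested spans like these are what a tracing backend stitches into the flame graph for a single request.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; a real setup would export to a tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")

def checkout(user_id: str) -> None:
    # Each `with` block becomes one span; nesting records parent/child timing.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("inventory.reserve"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("payment.charge"):
            pass  # call the payment service here

checkout("user-42")
```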

00:12:24 Yeah. That example was great for talking about how, even if you've tested the heck out of the service, something always goes wrong. So something is going to go wrong in the system, and you have to figure out where to fix it.

00:12:41 It might actually be multiple places. Interesting. And I was actually thinking that my analogy wasn't that great, because it isn't like one store goes down. It's like all of your stores suddenly can't print receipts or something like that.

00:12:57 That’s a bad thing.

00:13:00 And it's exactly that. It becomes much more nuanced, because it's not a single store, it's multiple, and it's not a single thing like the lock on the door so no one can enter. It's a whole set of systems that operate in tandem, printing receipts and scanning barcodes, and I'm not an expert on, like, brick-and-mortar stores.

00:13:21 I think we’re talking about the same thing here.

00:13:23 Yeah, this sounds fascinating and fun and also sort of terrifying.

00:13:28 I'm like, suddenly I don't want to do a startup anymore, because this sounds too scary. But it grows over time, right? People don't start with, like, tons of monitoring, tons of telemetry, and tons of services. They start small and build up, right?

00:13:43 Precisely. It's an ever-evolving practice. And I'll use a phrase I learned from one of my engineering counterparts, one of the smartest people I get to work with, and truly a privilege. He likes to talk about modes of failure: like, how well do we understand how our services fail? Right? Obviously, our services are perfect. We don't have any failures.

00:14:08 Okay. Usually people laugh when I say that.

00:14:11 I'll add in a laugh track later.

00:14:14 The more monitoring we add, and the more nuanced that monitoring gets, the better we get to know how our services operate and when they fail. For example, we can all start with alerting. Again, I mentioned them earlier, the golden signals. It's the RED metrics. RED.

00:14:33 Right.

00:14:33 Requests, errors, and duration. It's a very well-respected framework to say, okay, this is how I detect that something is off. They don't necessarily tell me what is wrong, but they help me figure out that something is wrong.

00:14:46 It’s very good for detection.
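
For reference, RED stands for Requests (rate), Errors, and Duration. A back-of-the-envelope, per-service version of those golden signals might look like the following sketch; the request records are made up, and real monitoring backends compute this continuously over sliding windows.

```python
import math

# Hypothetical one-minute sample for a single service: (status_code, seconds).
requests = [(200, 0.12), (200, 0.09), (500, 0.30), (200, 0.11), (503, 1.20)]

throughput = len(requests) / 60  # requests per second
error_rate = sum(1 for status, _ in requests if status >= 500) / len(requests)

durations = sorted(duration for _, duration in requests)
p95_index = math.ceil(0.95 * len(durations)) - 1  # nearest-rank p95
p95_latency = durations[p95_index]

print(f"throughput={throughput:.2f} req/s, "
      f"error_rate={error_rate:.0%}, p95_latency={p95_latency:.2f}s")
# throughput=0.08 req/s, error_rate=40%, p95_latency=1.20s
```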

00:14:49 But then, as we understand, okay, this particular microservice is constrained on CPU, right? So when throughput goes up, it uses more CPU, so we know that's its failure mode, and we need to alert on the CPU of the underlying machines. Or, alternatively, it uses a lot of memory, so we need to alert on memory, things like that. So we become more and more nuanced as we go, and we evolve our monitoring. We evolve the metrics that we have on it, or we evolve the logs that we kind of develop for it, and we have better and better visibility. And we learn this over time. And every time we deploy a new service, we kind of go through the same process. We get better at it. We learn faster from one to the next.

00:15:37 And with that, we also develop better and better practices.

00:15:40 Right.

00:15:40 So I don't know if you've ever considered SLOs, service level objectives, which are a way to reason about your production environment over time. We also support that, and we reason about the world in those terms. I'm happy to dive a little deeper into that. Essentially, it stems from the Google SRE handbook, right? Published a couple of years ago, and it's freely available online.

00:16:06 The idea there is that a company can have a service level agreement that defines what happens when things start to break. And then, under the service level agreement, we have service level objectives. Each objective says, okay, here's the actual definition of a metric and the threshold it shouldn't cross. Like, let's say I have uptime, and I have some way of assessing uptime for a service. If that goes under 99.9% for the last month, then I'm not in accordance with my service level agreement, and I need to go provide my customers with some explanation of what's going on, or some compensation, or anything like that. The agreement defines those. The service level objectives are the things that say, okay, here's how we measure that: 99.9% of what, how do we define that, and over what time spans. Those get refined over time as well. Right. So we can start with something very simple, zero errors. But then over time we'll learn, okay, every time we restart our application, we get some errors, but those don't impact users. So those definitions get more and more nuanced over time.
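
To put rough numbers on the kind of SLO Omri describes (all figures invented): a 99.9% target over a 30-day window leaves an error budget of about 0.1% of requests, or roughly 43 minutes of downtime if you measure in time instead of requests.

```python
# A 99.9% availability SLO over a 30-day window, measured on requests.
slo_target = 0.999
total_requests = 4_000_000  # hypothetical traffic for the month
failed_requests = 2_500     # hypothetical failures so far

availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests  # failures we can afford
budget_remaining = error_budget - failed_requests

print(f"availability so far: {availability:.4%}")             # 99.9375%
print(f"error budget: {error_budget:.0f} failed requests, "
      f"{budget_remaining:.0f} remaining")                    # 4000 total, 1500 left
print(f"time-based budget: {(1 - slo_target) * 30 * 24 * 60:.1f} minutes per month")
```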

00:17:27 What software doesn’t have errors?

00:17:33 One of the things I was thinking about while you were talking about this is just how things change as people scale.

00:17:40 You were talking about the three: requests, errors, and duration.

00:17:45 And then I was thinking about how some of this stuff is just going to happen as people get more popular. If they get more customers, they're going to get more traffic, and things are under more stress. Is there some case where you're just kind of watching things and going, well, none of these are a problem yet, but something is going to break soon?

00:18:05 How do you deal with that?

00:18:07 Does monitoring help there?

00:18:09 Oh, yes.

00:18:12 If we kind of continue with the e-commerce example, it's something that almost every e-commerce provider that I've worked with worries about, and even in my past life at Wirecutter, because everyone worries about Black Friday.

00:18:27 Right.

00:18:27 And the brick-and-mortar analogy is just as good here, because you have all these people who are waiting to get in. You want to make sure that all of your systems are stable enough and that they operate correctly.

00:18:41 There are a few aspects of this. One of them is preparatory: how do we make sure, how do we know beforehand, that we're ready? And for that we do a lot of stress testing and load testing. All of the RED metrics that we have, and the more nuanced understandings that we develop over time, are very, very relevant to knowing that. Right. So we do stress testing. We see, for a given level of throughput, what's our error rate, what's our latency, is it acceptable, is it not acceptable? And then during the actual event, at these peak moments, if something goes awry, monitoring is our entryway. It's the first thing that happens; it's how we detect.

00:19:23 The first sign of a problem comes from monitoring tools. So a lot of our e-commerce customers, or streaming services, or media companies, they all have these events that they look forward to, and they spend months in preparation. Even if it's a day, even if it's a few hours, they spend months preparing for it, and they rely very heavily on monitoring.
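
On the preparation side, the stress and load testing Omri mentions is often scripted. One popular open-source option in the Python world is Locust; this minimal sketch is illustrative only, and the host, endpoints, and task weights are placeholders rather than anything discussed in the episode.

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://staging.example.com
from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    wait_time = between(1, 3)  # each simulated user pauses 1-3 seconds between actions

    @task(3)  # browsing is weighted 3x heavier than checkout
    def browse(self):
        self.client.get("/products/1")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "abc123"})
```

Watching the RED metrics while ramping up the simulated user count is how you find the throughput level at which error rate or latency stops being acceptable.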

00:19:50 Yeah. Well, I mean, one of the things you brought up earlier was that when you get more familiar with a particular service, or your system altogether, you're going to be paying attention to things like CPU load and memory usage and stuff. And those are things that, I imagine, on different services, probably all the services, eventually you're going to care about, because we all know with computers, when either of those things starts getting pegged, things go bad. Bad things happen. And it's not a software problem. It's basically just thrashing and junk like that happening.

00:20:31 So interesting.

00:20:33 Now I’m not so scared.

00:20:35 It sounds pretty good. Now, one of the things I was always curious about is logging. I rely heavily on logging to debug systems, but I’ve got a distributed system.

00:20:49 These logs are all over the place.

00:20:52 Does retrieving logs in any of these error cases happen through APM systems?

00:21:00 Let me start by saying yes. Traditionally, log management solutions and APM solutions have been distinct offerings. You can look at the markets for both.

00:21:12 Historically, it used to be different companies that did it, different services, where log management solutions centralize all of your logs into one place and make it easier for you to query, which becomes helpful in a distributed environment.

00:21:24 Right.

00:21:24 If you have only, like, one service, accessing the logs isn't that hard.

00:21:29 APM solutions help you with the pinpointing, right? I mentioned distributed tracing, things like that. But as time evolves, they become closer and closer.

00:21:39 Datadog was one of the first companies to offer what we call trace ID injection. This is an easy way to associate a trace,

00:21:47 Right.

00:21:47 a flame graph, that visual representation of a single request, with the logs that are generated for that particular request. It's a really powerful capability, because what it gives you is the ability to identify: okay, this one user complained about this one thing. I found their session, I found the bad click, I can see the trace, and I get all of the log lines from all of the distributed systems that were part of that request. And I get all of the messages, and I'm like, oh, this is what's going on.
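
Mechanically, trace ID injection comes down to stamping every log line with the ID of the trace that was active when the line was emitted, so the backend can join logs and traces on that field. Here is a library-agnostic sketch using Python's standard logging module; the hard-coded trace ID is a stand-in for what a real tracer (ddtrace, OpenTelemetry, and so on) would supply from the current request context.

```python
import logging

# Stand-in for "whatever trace is currently active"; a real APM tracer
# exposes this for the current request context.
CURRENT_TRACE_ID = "7f3a9c2d41e85b60"

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID  # attach the ID to every log record
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace_id=%(trace_id)s] %(name)s: %(message)s"))
handler.addFilter(TraceIdFilter())

log = logging.getLogger("checkout-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charging card")               # ... [trace_id=7f3a9c2d41e85b60] ...
log.error("payment gateway timed out")  # searchable by the same trace_id
```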

00:22:18 That’s really cool.

00:22:20 Yeah. Because that's often really hard, even on a couple of systems. I mean, I'm often dealing with just two systems talking to each other, and that's hard enough to line up the logs for, because even if I have the logs, it's hard to align them. The clocks are a little different, so the timing is off a little bit, and things like that. Interesting.

00:22:43 There's always something like that. So what ends up happening is that APM solutions and log management solutions come closer and closer, and you can easily see that in the market. Here at Datadog, we offer both, and more and more companies are starting to offer similar solutions, where they either merge the two or they have distinct offerings under the same roof.

00:23:04 Let's go back to our service that has a team and a repo. We've got a team handling failures for it and stuff. Is it the same team that developed it or is maintaining it, or is that often two teams?

00:23:17 Now, that really varies from organization to organization. It can change dramatically based on a company's approach to software development and to releasing to production. So you could have a dedicated SRE group that is heavily involved and is very distinct from the development team, and a separate testing team in some cases. Or you can have an organization that fully adopts a DevOps mentality, where developers do all the operations and they own everything, including testing. It really varies from one organization to another.

00:23:56 Okay, so it's similar to testing, because I've seen that a lot in testing: there are some teams that do the development and testing all at once, and then others where some other team does the testing. And there are good arguments on both sides. It really depends.

00:24:13 Interesting. Now, since I’m not an expert in this area, is there something that we haven’t covered that’s interesting that we should talk about?

00:24:24 There are two things that I kind of want to mention.

00:24:26 Maybe it's actually one thing. You said earlier that, despite the best testing,

00:24:33 Right.

00:24:34 something will always break in production, whether it's because of higher load, or because of this weird edge case that no one thought about, or because users are just weird and they'll always push your systems to the edge.

00:24:49 One of the most common types of failures that we’ve detected is releasing a new version of code.

00:24:58 Right.

00:24:59 So when someone deploys a new version of code and it looks great and it passes tests and everything, they roll it out and something starts to fail.
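00:25:05
A crude version of the check being described here, whether the newly deployed version is clearly worse than the previous one, can be sketched by comparing error rates per version tag over a short window. The version tags, request counts, and the "twice as bad" rule below are all invented for illustration.

```python
def error_rate(statuses):
    """statuses: HTTP status codes seen for one version in the last few minutes."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status >= 500) / len(statuses)

# Hypothetical recent traffic, keyed by deployed version.
recent = {
    "v41": [200] * 990 + [500] * 10,  # previous version: 1% errors
    "v42": [200] * 480 + [500] * 20,  # new rollout: 4% errors
}

old_rate = error_rate(recent["v41"])
new_rate = error_rate(recent["v42"])

# Flag the deploy if the new version is at least twice as error-prone (and above a floor).
if new_rate > max(2 * old_rate, 0.01):
    print(f"possible bad deploy: v42 at {new_rate:.1%} vs v41 at {old_rate:.1%}")
```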

00:25:09 To that end, we want to make sure that we provide the best workflows around that. Like, how do we detect as fast as we can, within a minute or within two minutes, that this new version of code is broken? And then that leads me to the second topic that I want to mention, which is automatic detection. Right.

00:25:29 We talked about telemetry. Right. It's the raw data, and then monitoring is all of these conditions and tools that we can add on top of it. One level above that is AIOps.

00:25:40 Right.

00:25:41 So, artificial intelligence for operations. In that world, we do automatic detection, we do automatic alerting, and in the future, auto-remediation is something that we hear demand for. It's definitely an exciting part of what we work on: the ability to go to our users and tell them, hey, you might want to keep your eye out for this particular thing that's happening right now.

00:26:08 You haven't set up alerts for it, because it's brand new, or it's different, or someone hasn't thought about it yet, and we detected this anomaly, and you want to keep your eyes on it.

00:26:19 Yeah, that'd actually be cool. So does that happen?

00:26:23 I'm imagining I've got a recommender system or something like that. It's an extra thing, that's cool, we make a little more money on it. So we roll out a tweak to it, and suddenly everybody's latency doubles or something.

00:26:39 It'd be cool to have a system just see that and say, now we're going to roll back the code, that's just bad. Is that something that people have?

00:26:48 That's exactly where we're going. So we already have tools that help you detect spikes in error rates, spikes in latency, the introduction of a new type of error, all these things.

00:27:02 Automatic rollback is something that we want to be very careful with.

00:27:07 I don't know a lot of engineers who will yield the power to do a rollback to a fully automated system, but they would put, like, a big red button in the alert, where they say, okay, here's how you do it. When the alert happens, have a human involved to do that press. That is something that we support, and some of our users have implemented it.

00:27:31 Now, you also mentioned anomaly detection.

00:27:34 That is interesting, just to see. I'm guessing you're collecting a lot of data.

00:27:43 Something weird is going on that is different than it was before, and people might not even know to check for it. Are there tools for that, then?

00:27:55 Yes, definitely.

00:27:57 The field of AIOps that I mentioned is developing fairly quickly right now. There are more and more tools popping up in the market, and more companies that are dedicated to that sort of thing. Everyone looks at it slightly differently, but at the end of the day, it boils down to various types of anomaly detection, reading more and more types of telemetry, and making sure that we reduce noise to a minimum. Because if you have a cascading failure, for example, I have 30 services and they're all breaking together,

00:28:36 either I'm going to need to spend, like, the first five minutes of an incident just clicking through all of these alerts, or something helps me and says, okay, all of these 30 alerts are actually one issue, and here is the root cause for it. Automatic root cause analysis like that is something that we do.

00:28:55 It's a very popular type of feature, because it saves a lot of time decluttering. Like, how do I deal with this mess of data that's thrown at me? I look at just this little golden nugget of a signal that says, this is what I need to focus on.
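
Under the hood, the simplest forms of anomaly detection are not far from a rolling z-score: learn what normal looks like from recent history and flag points that sit far outside it. This toy version uses an invented latency series; production systems also handle seasonality, trends, and noise suppression, as Omri notes.

```python
from statistics import mean, stdev

def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it sits more than `threshold` standard deviations
    away from the mean of the recent history."""
    if len(history) < 10 or stdev(history) == 0:
        return False  # not enough data to know what "normal" is
    z_score = abs(value - mean(history)) / stdev(history)
    return z_score > threshold

# Hypothetical per-minute p95 latency (ms) for a service.
latencies = [120, 115, 132, 124, 118, 129, 121, 135, 117, 126, 122, 128]

print(is_anomaly(latencies, 131))  # False: within normal variation
print(is_anomaly(latencies, 260))  # True: latency roughly doubled
```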

00:29:12 Neat. That actually sounds exciting.

00:29:14 There are a lot of cool things coming in this field, then.

00:29:17 Definitely.

00:29:19 It’s a lot more than I realized.

00:29:22 Cool. This has been fascinating. I'm more and more excited about monitoring. It's a lot more than just, can you monitor and make sure my website doesn't go down? There's a lot more going on there.

00:29:35 Definitely so.

00:29:36 Thanks for your time.

00:29:37 Thank you.

00:29:45 Thank you, Omri, for that great interview. Thank you, PyCharm, for sponsoring; go to testandcode.com/pycharm and save time with PyCharm. Thank you, Patreon supporters. Become a supporter at testandcode.com/support to get notified as soon as a new episode is out.

00:30:03 That’s all for now.

00:30:04 Now go out and test something.