Technical debt has to be dealt with on a regular basis to keep a healthy product and development team. Its impacts include emotional drain on engineers and slower development, and it can adversely affect your hiring and retention. But really, what is technical debt? Can we measure it? How do we reduce it, and when?


Transcript for episode 113 of the Test & Code Podcast

This transcript starts as an auto-generated transcript.
PRs welcome if you want to help fix any errors.


00:00:00 Technical debt has to be dealt with on a regular basis to have a healthy product and development team. The impacts of technical debt include emotional drain on engineers and slower development, and it can adversely affect your hiring ability and retention. But really, what is technical debt? Can we measure it? How do we reduce it, and when? James Smith, the CEO of Bugsnag, joins the show to talk about technical debt and all of these questions. This episode of Test & Code is brought to you by ConfigCat.com; ConfigCat's feature flag service lets you release features faster with less risk. And by Reuven Lerner's AcePythonInterviews.com: get all the confidence you need to ace your next interview. And by listeners like you that support the show through Patreon. Thank you.

00:01:01 Welcome to Test & Code, because software engineering should include more testing.

00:01:08 You’re the CEO of Bugsnag, is that right?

00:01:11 Yeah. So I’m the CEO and co-founder. In my previous company, I was the CTO, and so I kind of picked up the top-of-the-tree title this time around, based on my product background, product expertise, and the fact that our product sells to a technical audience. So it’s kind of a blessing and a curse. I don’t get to code much anymore. I don’t get to crack open my text editor and make pull requests anymore. But I do spend a lot of time talking with VPs of Engineering and CPOs, and working on the product side of things as well.

00:01:42 The main problem space that Bugsnag solves is something we’ve kind of refined recently.

00:01:49 But fundamentally, the reason we started the company was that my co-founder and I felt like we were flying blind when we were trying to build software. You deploy your software out to your customer base, whether that’s a web app or mobile app, desktop app, whatever it is, and you’d kind of cross your fingers. Maybe you’d have some log files. If you weren’t running in a client-side environment, you could look in your centralized logging, but really, that was a chore. You’d have to go and dig through gigabytes of text to find out what you wanted to find out. So, yeah, it was scratching an itch. We built desktop apps, mobile apps, web apps, and we were like, how do we know this is working once we give it to our customers? And we didn’t want to rely on the historical technique of waiting until your customers kick up a fuss and complain. We started off saying, let’s figure out if our software is working as expected in user-facing environments. But more recently, I think the story that we’re telling is about striking a balance between fixing bugs and working on features, because there’s an old guard that says, oh, you don’t ship with any bugs; once it’s shipped, it’s shipped. But the new guard, and the default these days, is maybe a few bugs are okay as long as we understand the impact of those bugs, because then you can really set that slider between fast roadmap delivery and customer impact of bugs. So that’s really what we end up talking the most about these days. I guess before my professional career started, once you shipped software, it was printed on a CD and shipped out to Best Buy, and that was it. There was no other way to fix it. But these days, pretty much every piece of consumer and B2B software is somehow internet connected or can be patched or fixed on a regular basis. And pretty much every software company has moved to a subscription model to make sure that you’re getting the latest and greatest features and bug fixes. So yeah, the default has changed from make sure it’s 100% finished and complete to let’s make sure the things that our customers really want right now are finished. But I do think those topics are directly related, because trying to quantify, or set a data-driven approach to, striking that balance between new features and bugs actually plays in really nicely with how you can talk about technical debt between the product and engineering teams, because customer-impacting bugs versus shipping that roadmap feature is the language product teams speak.

00:04:11 It’s one way of talking about technical debt. I think there are multiple other things that make up technical debt, but knowing when to slow down roadmap delivery to fix bugs is, I think, a key lever in that.

00:04:25 Thank you, ConfigCat, for sponsoring this episode. ConfigCat is a feature flag service. It has a central dashboard where you can toggle your feature flags visually. You can hide or expose features in your application without redeploying. You can set targeting rules to control who has access to new features. Easily use flags in your code with ConfigCat libraries for Python and nine other platforms. Get builds out faster, test in production, and do easy rollbacks. Release new features with less risk, and release more often. With ConfigCat’s simple API and clear documentation, you’ll have your initial proof of concept up and running in minutes. Train new team members in minutes also, and you don’t have to pay extra for team size. With the simple UI, even product managers can use it effectively. Whether you are an individual or a team, you can try it out with their forever-free plan, or get 35% off any paid plan with the special code TESTANDCODE, all one word. Release features faster with less risk with ConfigCat. Check them out today at configcat.com.
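
If you’re curious what that looks like in code, here’s a minimal sketch using the configcat-client package for Python. The SDK key and flag name are placeholders, and the exact API can differ between SDK versions, so treat this as illustrative rather than definitive.

```python
# Illustrative sketch of gating a code path behind a ConfigCat feature flag.
# pip install configcat-client
# The SDK key and flag name below are placeholders, not real values.
import configcatclient

client = configcatclient.create_client('YOUR-CONFIGCAT-SDK-KEY')

# Read the flag from the central dashboard; the second argument is the
# default value used if the flag or the service can't be reached.
if client.get_value('enableNewCheckout', False):
    print('new checkout flow enabled')
else:
    print('old checkout flow')
```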

00:05:29 Yeah. So let’s jump into technical debt. It’s something we haven’t really discussed on the podcast yet, and I think it’s something everybody deals with. So what are we talking about with technical debt?

00:05:38 So the way I see it, technical debt is intentional or unintentional decisions that your engineering team makes in order to get something to market quicker. Normally it’s trade-offs to ship stuff quicker, and it breaks down in my head into three different categories. There’s intentional tech debt, where you’re like: look, we’re launching a new feature, and we don’t even know if anyone is going to use this yet, so let’s put a half-assed version out, the lean approach, right? Let’s put the half-assed version out, see if anyone likes it, and then if we do get adoption, we can spend some time shoring it up and making it better. Then I think there’s design evolution technical debt, which tends to be: look, when you built and designed it in the first place it was pretty good, and there wasn’t any technical debt, but we’ve started bolting things onto the side of the car to make it work. It’s an old reference, but have you ever watched The Jetsons? They would bolt lots of things onto their cars, and they would all fall over. That’s what I think this design evolution technical debt is. And then the last one, I think, is bit rot technical debt, as I call it, where you’re adding methods and functions and code without taking time to clean up the stuff that should have been removed, and then you’re wading through the mud to make any changes in that code base. But all of these have the same impact, and that is: the more tech debt you have, the more it slows down future development. And of course, there’s a huge emotional impact on the development team. Your development team is going to be moaning about, hey, we need to refactor this, there’s tech debt here, but a lot of the time that’s a message that falls on deaf ears when talking to the business side or the product side.

00:07:16 Yeah. One of the things I run into occasionally, it used to not happen at all, but it happens more now, is that the language moves on as well. Sometimes you have features of the language, or features of libraries that you’re using, that get deprecated, or the API changes or something. Or even just the old way still works, but there’s a new way that is easier to maintain, and you can move on to cleaner code.

00:07:44 And that has an impact on things like hiring and morale as well. On an engineering team, everyone wants to be using the latest and greatest technology if it allows you to do your job better and quicker and easier. I remember in a previous company we were stuck on Rails 2 forever, and I think Rails 4 was out at the time, so that was years of upgrades in between. We just put down tools at one point and said, let’s get this upgraded, and we had to upgrade via Rails 3. It took a really long time, but we did it, because people were like, we can’t use these new features, and we were worried about hiring people with skills: everyone’s using Rails 4, and we’d have to retrain people on an obsolete technology. It kind of reminds me of when I started my career, when people wanted us to learn Fortran 77 in the banking industry.

00:08:31 Oh, wow. Yeah.

00:08:33 So did you learn Fortran? Just enough to get by. I was building trading platforms and working on software for foreign exchange, and yeah, a lot of that code was written in Fortran 77 just because it was fast and it worked really well on the systems we were running on. That’s a whole different world now. This was a long time ago, but it wasn’t that long ago, either. It wasn’t something I was prepared for from college, that’s for sure.

00:08:56 It’s something that I’ve successfully avoided so far.

00:08:59 COBOL?

00:09:01 Yeah, no COBOL either. But my co-host on Python Bytes, Michael Kennedy, says that when he was going to college, he did have to learn Fortran at university. So I didn’t do that, but I had to learn Lisp, or rather Scheme, which was a Lisp offshoot.

00:09:20 But I haven’t used that in production ever either, except for maybe an Emacs plugin or something like that. So, technical debt: why do you care about this so much?

00:09:29 Well, I’ve been on both sides of the coin. I’ve been an individual contributor and a software developer, where I’ve been the one moaning about technical debt and trying to plead with people to refactor. I’ve been the one running a tech team at my previous company, where I was the CTO. And these days I’m more on the business and product side of things, so now I’m the pain in the ass who’s saying we need to get this shipped, we’ve got a press release coming out in a few days. What I realized, once you’ve seen it from both sides, is how much of a lack of communication there is between the engineering and the product sides of the business. You communicate really well when you’re planning something and you’re giving the specs and talking to the engineering team, but what we don’t really talk about is when we should slow down. And what ends up happening, by not having that conversation up front, is that it creates a lot of tension near the end of a project, when everyone’s already stressed out anyway.

00:10:24 And that’s the absolute wrong time to start talking about technical debt. I’ve probably been part of the problem a lot of the time, but I’ve seen my teams, at my previous company and even at Bugsnag, get a bit stressed out about knowing when is the right time to bring up refactoring, when is the right time to bring up technical debt, without sounding like that pain-in-the-ass person who’s just doing it because they want the new shiny thing or they want to spend some time cleaning things up. And so when we built Bugsnag, it turned out that our product did a pretty good job of highlighting some areas of technical debt with something we’ve called the stability score, which is effectively a metric for what percentage of your user sessions were crash free, that is, did not end in an exception that crashed the session. It turns out that a metric like that is a metric that both product and engineering care about, and it may be a way to set that slider I was talking about earlier: you can choose what number is an acceptable number. So it was something that we kind of built the business around, because we cared about it. I’ve seen both the product and engineering sides of the business, and it just so happens that it resonates a lot when we’re talking about how to set these goals and these numbers. It seems like a very emotional problem that teams are facing around the world.
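
To make the stability score concrete, here’s a rough sketch of the calculation as described above, crash-free sessions as a percentage of all sessions. The numbers and names are made up for illustration, not Bugsnag’s actual implementation.

```python
# A rough sketch of the "stability score" idea: the percentage of user
# sessions that ended without a crash.
def stability_score(total_sessions: int, crashed_sessions: int) -> float:
    """Return the percentage of sessions that were crash free."""
    if total_sessions == 0:
        return 100.0
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions

# e.g. 1,000,000 sessions with 900 crashes -> 99.91% stability
print(f"{stability_score(1_000_000, 900):.2f}%")
```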

00:11:38 Yeah, there’s a lot of communication and there’s a lot of interpersonal and even emotional stuff around technical debt, because there is the “I want to be proud of the code that I’m working on, and so I want to be able to clean things up a bit.” But there’s also shipping features fast, and whatever. And then as well, I don’t know how willing I am to talk to people about schedule and stuff, because the technical debt might not show up, obviously, as technical debt to a developer. They might just think they aren’t smart enough to understand the code. It might not be you, it might be the code.

00:12:13 Exactly. When you’ve been developing software for a long time, you start to get a gut feel for when things are getting out of control in the code base. You’re right, it’s emotional. Having a way to communicate that matters, without just saying “my gut says this is wrong,” because that’s never something that other people are going to get on board with.

00:12:29 We talk about stability score and measuring it in production and deployed applications, because that’s what our business does. But there’s actually a lot of other ways to think about it as well.

00:12:38 Even when I started my career, static analysis was big, and now it’s the hot new thing again. Static analysis can do a good job of this as well. There are products like Code Climate that give you a ranking from A to F on particular code smells in your code base, among other things. So you can start to say: this isn’t just me saying I’ve got a gut feeling here; we’ve run these heuristics over the code base, and we’ve seen that this type of problem typically causes these outcomes. So once you’ve been developing for a while, you get these gut feels, but gut feels don’t change the way people operate, and they’re not very good for communicating. So you still need something there to measure and nudge it forward.

00:13:17 There’s the measurement side, then. Do you have any recommendations for solutions on the measurement side?

00:13:22 I think there are a couple of ways to do it. You do static analysis and dynamic analysis, right? You set up something like Code Climate, you make sure that you’re measuring code smells as you commit code, and you set some rules around that. I’ll talk more about rules in a second, because I think that’s key to this. And then for dynamic analysis, measure the customer impact of bugs. If you measure using something like Bugsnag, for example, it will tell you what percentage of sessions saw a crash, and then you can drill into the impact of those crashes, so you get something meaty out of it. But there’s no real point in having those measurements in place unless you have a plan and you’ve discussed with your product team what you’re going to do when you don’t hit some kind of goal or target. So on the static side, again using Code Climate as an example here, you might say we don’t want to check in any code unless it’s a B or above on the Code Climate ratings, or maybe you set up your linting rules for similar effectiveness. And the good thing there is that you can literally block a PR from being merged unless it passes those rules. On the production side, naturally, this is happening after code has been merged and deployed, so you kind of have to look at it after the fact. But the perfect time to look at it is when you’ve deployed, or pushed a release or a build of your application. So you can say: hey, our Bugsnag stability goal was that 99.9% of sessions should be crash free; actually, we’ve landed at 99.1%. Based on that, we’ve agreed as a product and engineering team ahead of time, because you can’t do it after the fact, that’s when those arguments occur, that if we go below a certain number, an SLA, if you will, now is the time to pause the roadmap and actually go and get that number under control. So on the metric side, get the tools in place to measure things pre- and post-production, and then once you’re measuring them, make sure you have a plan for what you’re going to do if you don’t meet your targets. There’s a human aspect to all of this. You can’t just put the tools in place and ignore the outcome.
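
As a sketch of that “agree the threshold ahead of time” plan on the production side, a post-deploy check might compare measured stability against the agreed SLA. Here fetch_stability is a hypothetical stand-in for querying whatever monitoring service you use, not a real API.

```python
# Hypothetical post-deploy gate: compare measured stability to the SLA the
# product and engineering teams agreed on ahead of time.
AGREED_SLA = 99.9  # percent of sessions that must be crash free

def fetch_stability(release: str) -> float:
    # Stand-in for querying your error-monitoring service for the
    # release's crash-free session percentage.
    return 99.1

def check_release(release: str) -> None:
    measured = fetch_stability(release)
    if measured < AGREED_SLA:
        # The pre-agreed plan: pause roadmap work and stabilize.
        print(f"{release}: {measured}% < SLA {AGREED_SLA}%: pause features, fix bugs")
    else:
        print(f"{release}: {measured}% meets the SLA: keep shipping")

check_release("v2.3.1")
```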

00:15:19 It has to have teeth.

00:15:21 Yeah, I like it. Pausing new features and letting fixes go in, so that at that point everybody is still working on stuff, but they’re cleaning things up.

00:15:32 Are you looking for a new Python job? If so, then you’re probably already dreading the interview where you’ll be asked lots of coding questions. That’s where Reuven Lerner’s free Ace Python Interviews course comes in. In 6 hours of live coding screencasts, Reuven answers 50 questions you’re likely to encounter. Questions are divided into beginner, intermediate, and advanced levels, so you’ll definitely learn something new. From function arguments to decorators, bytecode to the GIL, this course covers it all, giving you the confidence you need to get that job. Check out Ace Python Interviews for free at acepythoninterviews.com.

00:16:13 If we’ve decided there are minor bugs in the system that we’re just not going to fix right away, maybe we’ll fix them someday, there’s kind of a depressing factor to having a big bug list, even if nothing in there is important. How do you deal with that? Do you have any thoughts?

00:16:33 Yeah, I’ve got a lot of opinions on this. My co-founder calls it the Jira graveyard. We’d go to Jira sometimes, this was before we changed our processes, and you’d go in there and see just a list of stuff where you’re like, we’re never going to get through this backlog. And, I’m obviously very biased here, our company is designed to help you triage things before you decide to work on them.

00:16:55 The stance that we take, and that we roll out on our engineering team, is rather than just creating Jira tickets for everything and having an infinite backlog, which is very depressing to look at, use a pre-Jira triage. That way you can make Jira the source of truth for what you’re actually going to work on. This works for us and for most companies; in a safety-critical environment like NASA, they’re not going to use the “let’s ship bugs to production and fix the ones that matter” approach, and medical devices probably won’t either. But for 99% of companies, they’re probably leaning towards the approach that we’re taking, which is: for the bugs that are in production, is there a way that we can actually measure the impact of those bugs, so that it’s not just working through a backlog of equally ranked bugs, or arbitrarily assigning severities in Jira? Is there a way we can demonstrate that this bug is worse than that one? By having some kind of production stability monitoring like Bugsnag in place, you can say this bug impacted a million unique customers and therefore it’s bad. Or this bug happened in the payment code base, because we can see the stack trace, so we know it happened when someone was trying to purchase something. Or even more precise than that, if you want to get the scalpel out: we have what we call customer segmentation capabilities in our product, so you can say, let me know about bugs that are affecting the customers that are spending the most money with us. So you can set these rules up, and set these views up, in a production triage environment, and it makes it pretty straightforward to decide what to work on. And then once you’ve set up those rules, everything else is kind of below the fold, and you wouldn’t send it into Jira; you wouldn’t link a ticket into Jira or plan work in your sprint. That’s the dream, obviously, and I’m biased, but I hope everyone is moving towards it.
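
As an illustration of that pre-Jira triage, here’s a toy version of ranking production errors by customer impact and only promoting the worst ones to tickets. The data shapes and threshold are invented for the example.

```python
# Toy impact-based triage: rank errors by unique customers affected and
# only promote the worst ones to Jira. Data shapes are invented.
errors = [
    {"id": "E-1", "message": "NoneType in checkout", "unique_users": 10_400},
    {"id": "E-2", "message": "timeout on avatar upload", "unique_users": 37},
    {"id": "E-3", "message": "KeyError in payment webhook", "unique_users": 2_150},
]

PROMOTE_THRESHOLD = 1_000  # hypothetical cutoff for filing a ticket

for err in sorted(errors, key=lambda e: e["unique_users"], reverse=True):
    if err["unique_users"] >= PROMOTE_THRESHOLD:
        action = "file a Jira ticket"
    else:
        action = "leave below the fold"
    print(f'{err["id"]}: {err["unique_users"]:>6} users -> {action}')
```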

00:18:38 Yeah, I think with triaging issues, if they’re in your tracking system, that means they’re going to be worked on. Don’t put things in there that you’re not going to work on. I mean, I’ve heard the argument of, well, if there’s something that shows up there and you decide to not work on it, you can delete it. But then how are people going to know if it comes up again? I personally think if you’ve decided to not work on it and you remember that, you’ll know the next time it comes in. And if you keep on deleting similar tickets, maybe it’s more important than you think it is.

00:19:08 That’s right. Yeah. The squeaky wheel gets the oil, I guess, is the analogy there. But you do want the history associated with it. There are some bugs, obviously, that aren’t due to crashes or exceptions. Maybe you forgot to hook up a click handler on a button, and the customers are going to report it, and for that you do want the history in Jira. You do want to make sure that you know who’s asking for it. But where there are actual production signals that you can reference, the history is going to be in your triage system, like Bugsnag, because it’s going to tell you when it first happened, a list of customers that are impacted, and all that kind of stuff, automatically. So what we do in Bugsnag, our tech basically keeps that error grouping around if the error is still happening, so it gets a keep-alive, almost. And if that error stops happening in production, it just kind of ages off, just disappears. So you get the history if it still matters, and then once you’ve done that, you can link it through into a Jira ticket.
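
A minimal sketch of that keep-alive idea (not Bugsnag’s actual implementation): an error group stays live while events keep arriving, and ages off once production goes quiet. The 30-day window is an assumption for the example.

```python
# Illustrative keep-alive: an error group stays visible while it keeps
# happening in production and ages off after a quiet period.
from datetime import datetime, timedelta

MAX_QUIET = timedelta(days=30)  # hypothetical aging window

def is_live(last_seen: datetime, now: datetime) -> bool:
    """True while events for this error group keep arriving."""
    return now - last_seen <= MAX_QUIET

now = datetime(2020, 6, 1)
print(is_live(datetime(2020, 5, 20), now))  # True: still happening
print(is_live(datetime(2020, 1, 2), now))   # False: aged off
```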

00:20:00 Okay, cool. I like the idea of agreeing ahead of time, if you’ve got some sort of measure, that if we hit below some number, then we need to slow down and focus either the entire development team or some of the development team on fixing things, to get it back up to more stable. If we want to try to avoid even having to hit that trigger, do you have a rule of thumb for how much time people should be spending fixing technical debt versus other stuff?

00:20:29 It’s really difficult, because every team is different, and even every development environment is different, and the type of business you are will impact how much time you’ll spend on it. Like I said earlier, if you’re making medical devices, you probably want to spend a lot of time making sure they’re incredibly stable and have very low technical debt, whereas if you’re making a mobile game, maybe you don’t care as much. But in terms of benchmarking, the main approach that we recommend and talk about a lot is, rather than setting one target for, let’s say, your software stability, set two targets. This is pretty common in the SRE world; Google’s SRE book talks about this as well. Set an SLA and set an SLO. For the SLA, your service level agreement, say: look, if we drop below this level of stability, we’re going to stop everything and we’re going to get things fixed.

00:21:18 Everything’s on fire, we need to fix this. But because that’s only handling the worst-case scenario, you need an objective as well, an SLO, which is more of a long-term target: we are aiming to have a much higher level of stability. In the uptime and availability world, as an SRE, you’ll talk about the number of nines, right, the five nines of uptime and availability. In stability and bugs, we talk about the same thing. We say, look, maybe shoot for two nines or three nines for an SLA, but then four nines for an SLO. And in fact, that’s what our data shows as well.

00:21:55 This is just one set of applications, but we looked at consumer mobile applications as a segment, and of all the apps on the Bugsnag platform that have more than 4.5 stars in the App Store, we found that 80% of them had better than three nines of stability. So that’s kind of a benchmark to set. And I would say adjust the amount of time you’re spending based on the goal that you’re setting, rather than earmarking an amount of time and then adjusting it afterwards. Because again, maybe in gaming you don’t care about having more bugs, because you want to launch that new feature pack or that new set of clothes for your character in the game or whatever it is, and the code just ages off by itself. Whereas if you’re in banking and fintech, you might set a much higher stability goal and therefore spend more time cleaning things up, like you said before.
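
For a feel of what those nines mean in practice, here’s the arithmetic per one million user sessions:

```python
# What "the nines" allow per one million user sessions.
for nines in (2, 3, 4):
    target = 100.0 - 10.0 ** (2 - nines)  # 2 nines -> 99%, 3 -> 99.9%, 4 -> 99.99%
    allowed = round(1_000_000 * (100.0 - target) / 100.0)
    print(f"{nines} nines = {target}% crash free -> "
          f"up to {allowed:,} crashed sessions per million")
```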

00:22:42 Also thinking about what the core value of your product is as well. Even in very important applications, like medical or financial ones, where the main reason for the application’s being has to be rock solid, there might be other extra things, like saving to different file formats, that aren’t going to hurt anybody if they’re glitchy occasionally.

00:23:06 Yeah. And I think that the maturity of the product matters as well. If you’re early in software development, you’re going to take more risks, you’re going to gamble more, you’re going to have lower standards. But as you get more of a customer base and people come to rely on your product, then you’re going to take fewer risks. So, yeah, there are a lot of variables involved in picking that number. But as long as it’s a conversation between the product and engineering sides when you’re not at crunch time, earlier rather than later, maybe at the beginning of a sprint rather than the end of the sprint, then I think you can have a pretty straightforward, non-emotional conversation about it.

00:23:38 So does technical debt happen also in your test code, or is it just something in production code?

00:23:43 I think there are kind of three ways to measure technical debt. Like I said, there’s design debt, there are bugs that are measurable, mostly in production, and then I think there’s the developer emotional side of things, which I’ve jokingly called “groans per minute”: when you sit in a dev pod and you hear people saying, I have to work on this file again, or you open that one file that’s got an ASCII art dragon at the top that says “Here be dragons,” and you have to figure out where all the weird kludges are in the code base. The one we’ve just been talking about, I think, is the measurable-bugs technical debt, but it happens even before then. There are the measurable code smells that Code Climate and similar tools can measure, which we talked about before. But even then, there’s the bolting-on product design; sometimes you can look at product designs and say, we should have refactored this, we shouldn’t keep bolting things onto the side of this design, because it’s just not sustainable. The emotional side is interesting as well. A lot of the time you have to address that by making sure that you’re checking in, again, not at the end of a sprint or when it’s crunchy, but in one-on-ones, or retrospectives after you’ve launched a project, or even sending out a developer survey, figuring out if your team feels like you’re striking a good balance on tech debt. It’s not necessarily always perfectly measurable, but you can take a pulse of your development team, and it has a huge impact on hiring, and a huge impact on retention of your team as well. So it’s something, again, you can use to convince management: hey, we need to do this, because our survey told us, based on talking to our developers, that people are fed up with working in this crusty code base, and if our surveys say that, that’s going to put us at risk of not being able to hire new devs, or losing the good devs that we have.

00:25:27 Yeah. One of my favorite manager moments, I guess, was when somebody came up to me and said: hey, I’ve been developing this feature, working back and forth with the DSP engineer and everything, and we’ve got it to the point where it works and we’re ready to ship. The tests are clean. However, in the course of developing all of this, I’m really not happy with passing this code on to anybody else. I’d like to schedule some time, it’s probably going to take two or three weeks, to just clean it up so that I can be proud of it. And being able to work with the developer and say, yes, that’s important, we can’t do it right now, but are you okay with doing it after we ship, maybe four weeks from now, and scheduling it? It’s really important to be able to have those communications go back and forth.

00:26:20 It’s interesting when you talk about it in that way, and I’ve certainly seen the same thing. It really shows how much software development is a craft; it’s something that you’re creating and you want to be proud of. I think a lot of people who aren’t in the space think of it as a very logical, technical, scientific thing. But actually, when you’re building a product or a feature or a piece of software, you really want to be proud of the way you built it, and you want to be able to hand it over to someone else. And even if you’re not the kind of artisan coder that I’ve certainly met in my career, you want to make sure that no other developer that’s inheriting this code is going to think you’re an idiot.

00:26:57 That’s the bottom end and the top end of the spectrum, I guess. But it’s not just some Lego-brick-building thing. It’s artistic; it’s a craft, and you want to feel pride in your work. So, yeah, 100%. And you need to carve out time to make sure that people do feel that feeling of pride, and they’re not just on an assembly line.

00:27:13 Yeah. And we’re shifting more and more towards a place where you really can pass off code and have other developers look at it. I know that we aren’t cogs, everybody has their own skill set, but we also don’t want to be the only person that can touch a piece of software. I don’t want to be that person.

00:27:34 If a problem comes up and I’m on vacation, then everybody just has to wait for me to get back. That’d be terrible.

00:27:41 Yeah. We’re in an era now where tribal knowledge about software is kind of not really acceptable anymore. We have such a good array of collaborative products and technologies to work with. I forget what it was called now, but the source control system we were using at the beginning of my career was just garbage; it was some branded layer on top of RCS, and it was awful. And now we’re able to make code changes, have peers review them, have your manager review them, have them merged, put them through an automated testing pipeline, get the results back from that, and get immediate feedback. It’s a whole different world.

00:28:16 There’s no reason that you can’t take advantage of this collaboration for sure.

00:28:21 Yeah. Even though I’m a little overloaded with all of them. We don’t tend to get rid of collaboration mechanisms; we just keep adding to them. So now we’ve got wikis and Confluence and stuff.

00:28:33 Yeah, it’s the same as any communication thing. Sometimes people think, oh, a new tool for the stack will solve all my problems, but in reality it’s never the tool that solves the problem; it’s the process that solves problems. If you come up with good communication and a good process, then, oh look, this tool actually fits into that process. That’s how you get a successful rollout. It’s true, people like to think of new products as a magic wand sometimes.

00:28:54 Yeah. I really appreciate you talking with me about technical debt and what people can do about it. This has been a lot of fun, so thanks for coming on.

00:29:01 Thanks for having me. It’s been a wide-ranging conversation and I’ve enjoyed it. And yeah, if you want to say hi, I’m @loopj on Twitter, loopj.

00:29:09 Okay, cool. Well, thanks a lot.

00:29:11 Thanks a lot, Brian.

00:29:15 Thank you, James. That was an interesting discussion about technical debt. I hope all projects and teams take it seriously and have a plan to keep it in check. Thank you, ConfigCat, for sponsoring; ConfigCat.com’s feature flag service lets you release features faster with less risk. And thank you, Reuven Lerner’s acepythoninterviews.com, for sponsoring; get all the confidence you need for your next interview. Thank you to the listeners that support the show through Patreon; join them by going to testandcode.com/support. All of those links are in the show notes at testandcode.com/113. That’s all for now. Now go out and test something, or maybe put in place a plan to measure and control technical debt.