Use a Technical Debt Budget

How to buy roadmap insurance from your engineering team

Oct 13, 2021

TL;DR - To continue shipping features at a fast pace, a startup must pay down technical debt. But many startups struggle to do this, because engineers’ warnings are often overridden by customers and CEOs demanding more features. This post outlines a simple strategy to manage tech debt by using a “Tech Budget”: a fixed percentage of capacity that follows the same rigorous process and prioritization that’s used for feature work, with one important exception: engineers, not PMs, should manage and prioritize the tech backlog.

Around year 4-5 of most startups, a funny thing usually starts happening: the engineering team can’t ship new features as fast as it used to.

At first, the problems feel like isolated issues: a release might get delayed a week; a feature might take 2x as long as the estimate; a buggy release might have to be rolled back; or the team might get pulled off feature work to handle a critical performance bottleneck. But the underlying pattern is unmistakable: after years of shipping features at a fast pace, the company has built up so much technical debt that new feature development slows to a crawl while reliability, performance, and security problems proliferate.

Companies generally take one of three approaches to tech debt:

Ignore it until it’s critical, then stop everything for weeks or months to fix it. For example, at Microsoft during the early 2000s security problems forced the entire Windows team to pause feature development for about a year to ship a bugs-and-security-focused update: Windows XP Service Pack 2. This free update cost the company billions of dollars in upgrade revenue.
Proactively schedule “tech sprints” where engineers do refactoring, bug fixing, and tool upgrades in big few-week bursts each year, and then go back to feature development and debt accumulation for a few months.
Consistently budget for tech debt in every sprint or iteration. Instead of big-bang tech sprints, this approach pays down debt in small chunks over time.

Most startups start with “ignore-until-critical.” This is usually the right business decision for early-stage startups. Why polish your underlying tech if you don’t even know if the market will buy your product, or if you’re planning to rewrite your code when you can afford to hire experienced engineers? Shipping as quickly as possible usually trumps all other concerns.

But as a startup matures, good companies recognize the risk of ignore-until-critical, if not proactively then often after the Nth major outage or performance issue.

My last startup Cantaloupe, the largest SaaS provider in the global vending industry, was no exception. One of the first meetings I attended as VP Product was an all-day “war-room” meeting to deal with a major outage. It was a symptom of a mountain of technical debt.

I joined the company soon after our CTO and Engineering Manager did, and together we agreed that tech debt was a problem and retiring it should be a priority.

We started with the Tech Sprint model, scheduling 3-4 per year. These helped us avoid more major outages, but it was clear after a while that this model wasn’t perfect:

Engineers were mostly paying down recently-acquired debt while not addressing larger, longer-term challenges.
Upcoming tech sprints were used to justify delaying test automation, performance work, and other critical parts of new features.
It was always tempting for business concerns to delay a tech sprint, so “3-4 tech sprints per year” inevitably meant 2.5-3.

A few years later, a new Engineering Manager was hired and he convinced us that the “Tech Budget” model would be better. Spoiler: he was right! But what’s interesting to me (and maybe to you too!) was *why* it was so successful, and what we learned about how to make it work even better. That’s what the rest of this post is about.

Tech Budget: How it Works

After a few years of experimentation and refinement, we learned that there were three core ingredients to a successful tech budget:

Allocate a fixed % of engineering capacity in every sprint to technical tasks.
Use the same process and prioritization that’s applied to feature work.
Engineers, not PMs, should manage and prioritize the tech backlog.

I’ll discuss each of these in turn, explaining how they work as well as tips for making them work well and pitfalls to avoid.

Allocate a fixed % of engineering capacity to technical tasks

The key to a successful tech budget was that it never varied. Making it always the same yielded many benefits:

Forecasting was easier because capacity (for both feature work and tech work) was predictable over time.
The inertia of a number that never changed helped to discourage PMs and executives from “borrowing” tech time. A fixed number made it easier for everyone to think of changes as rare exceptions rather than a constant negotiation.
It was simple, so everyone at the company could understand the plan.
It made it easier for engineers to work on long-running projects that required pauses in the middle, for example adding performance instrumentation in one release and evaluating and acting on the results after a few weeks of data was gathered.
It let engineers align tech tasks with related feature work. For example, if a new feature required reworking the schema of a database table, the team could also spend a few days optimizing database indexes and query performance.
It made it less painful when a tech task didn’t fit into a sprint. In the old Tech Sprint model, engineers were tempted to shove more work into a tech sprint than they could safely finish, because they knew it’d be months before the next tech sprint. Knowing that postponed tech work could be finished in just a few weeks avoided these risks.
Some engineers really liked doing tech tasks while others preferred to work mostly on features. Some engineers liked to do a little of both. Others only liked doing tech tasks about specific topics; for example, we had one engineer who was passionate about improving our UI test automation framework. Mixing tech tasks and feature work in the same sprint allowed engineering managers to tailor tech work assignments to each team member’s preference. This helped morale and productivity.

The “right” percentage is probably between 15% and 30%, depending on the company’s needs. We had a mountain of debt and a relatively junior team, so we settled on 30%. But I think we also could have been successful at 25% or even 20%. Honestly, the number mattered less than the habits encouraged by having a number and sticking to it.
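To make the arithmetic concrete, here is a minimal sketch of splitting a sprint's capacity by a fixed percentage. The names (`SprintPlan`, `split_capacity`) and the choice to round the tech share down are assumptions for illustration, not a description of the tooling we actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintPlan:
    total_points: int
    tech_points: int
    feature_points: int


def split_capacity(total_points: int, tech_budget_pct: int = 30) -> SprintPlan:
    """Split a sprint's capacity into tech and feature points.

    tech_budget_pct is the fixed share reserved for tech tasks. The default
    of 30 matches the number mentioned in the post; it is only a default.
    """
    if total_points < 0:
        raise ValueError("total_points must be non-negative")
    if not 0 <= tech_budget_pct <= 100:
        raise ValueError("tech_budget_pct must be between 0 and 100")
    # Integer math avoids float rounding surprises. The tech share is rounded
    # down (an assumption) and the remainder goes to feature work.
    tech = total_points * tech_budget_pct // 100
    return SprintPlan(total_points, tech, total_points - tech)


if __name__ == "__main__":
    print(split_capacity(40))      # 12 tech points, 28 feature points
    print(split_capacity(50, 20))  # 10 tech points, 40 feature points
```

Because the percentage never changes, the same call works for every sprint, which is where the forecasting benefit comes from.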

Use the same process and prioritization that’s applied to feature work

The second tenet of the Tech Budget is that technical tasks should, wherever possible, follow the same rigorous process that the team used for feature work. This means formal estimation, written specs with enough detail for engineers to estimate the work, putting tech tasks in the issue tracker, scheduling them in a sprint or kanban lane, maintaining and grooming a “tech backlog”, etc.

The benefits of doing this included:

The team was familiar with the process, so they didn’t need to invent something new and untested.
The same reports and metrics could be used for both feature work and tech tasks.
It made company-wide forecasting and metrics (e.g. for auditors) easier because there was only one bucket of things to measure.
It made it easier to spot disparities, e.g. estimates being more accurate for feature work.
By creating a Darwinian competition between tasks in every sprint, the quality of tech tasks went up because it's usually easier to find the top handful of tasks for each sprint than to ensure that 30+ tasks in a tech sprint are all “highest priority”.
It avoided risky megatasks by forcing large changes to be chopped up into manageable, testable, “checkpointable” chunks that could fit into one sprint.

We were already doing all these things for feature work, and doing them for essentially the same reasons. So it was an easy adjustment for the team to do the same thing for tech tasks.

But there was one really, really important process difference…

Engineers, not PMs, should manage and prioritize the tech backlog

We quickly realized that PM involvement was not needed in tech tasks. This saved a lot of Eng and PM time. For example, sprint planning meetings (our largest and therefore most “expensive” meetings) were shorter because they could ignore tech tasks. To prioritize tech tasks for each sprint, engineers met in a smaller group after the sprint planning meeting and decided which tech tasks would make it into the sprint. Because this smaller meeting could be done at a time and place of the engineers’ choosing (e.g. in a different time zone, or with beer!) it made planning more efficient and more pleasant for them.

Also, PMs didn’t have to learn the technical details behind each tech task, and engineers didn’t have to waste time educating PMs about work that users would never see. This helped keep PMs focused on delivering features to customers instead of having to care about tooling upgrades and code refactoring!

Tech Budget: Tips, Tricks, and Problems

Like every good thing, the Tech Budget has some challenges and gotchas. Here are a few that we found, along with tips to work around them:

Taxes suck (but insurance is tolerable)

The hardest part about the Tech Budget was paying that big tax on every sprint. But I looked at it like this: the 30% tax was the price we paid to avoid outages and to continue to ship features at a consistent pace. I tried to think of it as “roadmap insurance”: a predictable downside in exchange for preventing long-tail risks that could endanger the company and prevent us from delivering on the product roadmap.

Having roadmap insurance also meant that I could be much more secure in schedules and forecasts, because doing continual tech investment vastly reduced the chance that we’d have to cancel half the roadmap for a year in order to rewrite a lot of rickety old code.

Being more secure in the roadmap had good ripple effects beyond Product. For example, it made it less risky to share our roadmap with trusted customers. This helped salespeople win larger deals. It also built our investors’ confidence that we knew what we were doing, which made the CEO’s job a little less hectic.

Did all those benefits mean I was happy paying that big tax every few weeks? No! I never got used to it. But like broccoli and motorcycle helmets and other things that are obviously good for you, I was OK with the tradeoff.

Borrowing from the Tech Budget should be rare and quickly repaid

Especially during crunch times where we really, really wanted to complete an important new feature, it was soooooo tempting to try to steal tech time.

When we occasionally needed extra feature work, we horse-traded with the engineering team for a few points, which we always repaid in the next sprint. Of course, both groups need to trust each other. Borrowing only works if Product actually returns the points!

Similarly, Eng may occasionally need more Tech Budget in a sprint, e.g. to complete a huge refactor. As long as this is rare, PMs will win a lot of gratitude by staying flexible.

But borrowing should be very rare, or you lose many of the predictability benefits of a Tech Budget. If you’re borrowing more than a few times per year, that’s probably too much.
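The borrow-and-repay habit can be sketched as a tiny ledger. This is an illustration under assumptions, not how we actually tracked it: the class name `TechBudgetLedger`, the rule that a loan must be repaid before another is taken, and the threshold of 3 borrows per year (my reading of “a few times per year”) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TechBudgetLedger:
    """Tracks points borrowed from the Tech Budget and their repayment."""

    max_borrows_per_year: int = 3  # assumption: "a few times per year"
    outstanding: int = 0           # points Product still owes the Tech Budget
    borrows_this_year: int = 0

    def borrow(self, points: int) -> None:
        """Move points from tech work to feature work for the current sprint."""
        if points <= 0:
            raise ValueError("points must be positive")
        if self.outstanding > 0:
            # Assumed rule: no stacking loans, so debts stay quickly repaid.
            raise RuntimeError("repay the previous loan before borrowing again")
        self.outstanding += points
        self.borrows_this_year += 1

    def repay(self, points: int) -> None:
        """Return borrowed points to the Tech Budget."""
        if points <= 0 or points > self.outstanding:
            raise ValueError("repayment must be between 1 and the outstanding amount")
        self.outstanding -= points

    def tech_points_next_sprint(self, base_tech_points: int) -> int:
        """Normal tech allocation plus whatever Product still owes."""
        return base_tech_points + self.outstanding

    @property
    def borrowing_too_often(self) -> bool:
        return self.borrows_this_year > self.max_borrows_per_year


if __name__ == "__main__":
    ledger = TechBudgetLedger()
    ledger.borrow(4)                           # crunch time: take 4 tech points
    print(ledger.tech_points_next_sprint(12))  # 16: the usual 12 plus 4 owed
    ledger.repay(4)
    print(ledger.borrowing_too_often)          # False after a single borrow
```

The point of writing the debt down is the same as the point of the fixed percentage: it turns borrowing into a visible exception instead of a quiet habit.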

Defining tech tasks vs. feature work can be tricky

Another challenge with the Tech Budget was the jockeying that went on between developers and PMs around “borderline” tasks that could be considered either feature work or a tech task.

Obviously, code refactoring and build-tool upgrades were tech tasks, while new mobile app screens were feature work. But what about addressing customer complaints that pages were slow to load? Is fixing page-load performance a tech task or feature work?

We solved this dilemma by using a simple heuristic:

Tech tasks were work required to maintain the *current* level of performance, stability, and security despite increased usage and server load. This includes regressions in functionality and performance.
Feature tasks were about *improving* customer experience beyond its current state, including performance and (especially) anything that made a change to user-facing UI.

Another shortcut was who wanted it. If PMs were asking for it to be in a sprint, it probably wasn’t a tech task. Conversely, if a bug really annoyed an engineer but PMs didn’t think it was important enough to fix, then the engineer could lobby her teammates to fix it as a tech task instead.
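Written as code, the heuristic and the “who wanted it” shortcut look roughly like the sketch below. The `Task` fields and the order of the checks are assumptions made for illustration; in real life these were judgment calls made in conversation, not a function anyone ran.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    TECH = "tech"
    FEATURE = "feature"


@dataclass(frozen=True)
class Task:
    title: str
    changes_user_facing_ui: bool = False         # any visible UI change
    improves_beyond_current_state: bool = False  # better than today, not just restored
    requested_by_pm: bool = False                # the "who wanted it" shortcut


def classify(task: Task) -> Bucket:
    # Improving the customer experience beyond its current state, or touching
    # user-facing UI, counts as feature work.
    if task.changes_user_facing_ui or task.improves_beyond_current_state:
        return Bucket.FEATURE
    # Shortcut: if PMs are asking for it, it probably isn't a tech task.
    if task.requested_by_pm:
        return Bucket.FEATURE
    # Everything else maintains the current level of performance, stability,
    # and security (including regressions), so it belongs to the Tech Budget.
    return Bucket.TECH


if __name__ == "__main__":
    print(classify(Task("Fix page-load regression from last release")))
    print(classify(Task("Make pages load faster than ever", improves_beyond_current_state=True)))
```

Under these assumptions, the page-load question from earlier gets two answers: restoring speed after a regression is a tech task, while making pages faster than they have ever been is feature work.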

One important related point: tech tasks aren’t just about fixing old code. If the engineering team wants to, they’re free to use the Tech Budget to introduce new testing tools, to upgrade frameworks or libraries, to migrate to a new platform, or any other task they think is needed to make their work easier or to improve the infrastructure or code. Or to add Easter eggs! 😄

Test automation for new features is part of feature work, not tech tasks

Following the new vs. existing heuristic described above, tests for new features should be considered part of feature work, not tech tasks. Releasing features without automated tests is a recipe for disaster a few months down the line, because manual testing will slow down each release until it takes weeks to ship every version. Don’t do this!

Adding test automation for existing legacy features, or adding new kinds of tests (e.g. UI automation, load testing, penetration testing) should be part of the Tech Budget.

Don’t start huge tech tasks without letting PM know

Engineers should be free to make day-to-day technical decisions and to manage the Tech Budget without some nosey PM micro-managing and distracting them.

But PMs should definitely be consulted for big architectural decisions, like:

Should we switch from Oracle to Postgres for all our customer data?
Should we move from AWS Lambda to Cloudflare Workers?
Should we kick off a year-long test automation push?
How are we going to scale to 10x current load?
How should we support our first non-English customers?
How much effort should we put into improving performance?

Note that “consulted” doesn’t mean that PM should drive or veto these decisions. Engineers are the owners of technical decisions. But PMs should be involved, where I think “involved” should mean at least:

PM understands the reason for doing the work, and can explain it to others in the company who question why it’s needed.
PM understands the costs and benefits of the work, and the tradeoffs.
PM understands the risks and fallback plan in case the work doesn’t go as expected.
If there are several options being considered, PM should be exposed to those options and should inject a customer/business perspective as options are being weighed against each other.
Engineers should try to convince PM that the work is justified. If you can’t convince the PM you work with every day, then do you think you’ll convince your CEO or head of Marketing who won’t have the patience or interest to learn the details? Convincing PM is a helpful dry run for other, more-skeptical folks.
Engineers should be willing to entertain concerns—especially about customer experience or risks to the roadmap—that PMs bring up. Code exists to serve the business, not the other way around. PM input, as long as it’s focused on customer and business outcomes, may help you avoid worst-case scenarios like the company going bankrupt before the big infrastructure project is done!

In practice, PMs serve in an advisory role for these kinds of big architectural decisions. They can be a helpful set of eyes to make sure that a big investment is made with buy-in from the business and awareness of what can go wrong. This will pay off later, especially if the work gets into trouble.

Don’t call them “Eng Tasks” and “PM Tasks”

A small but important learning: it’s easy to think of feature work vs. tech tasks as “stuff PMs want” and “stuff engineers want”. This sets up an us-vs-them vibe that isn’t good for team cohesion.

So I’d strongly suggest you avoid the urge to call them “Eng Tasks” and “PM Tasks”. I like “Tech Tasks” and “Feature Tasks” but any name is OK as long as it’s associated with a concept, not a group of people.

A corollary for PMs: don’t allow yourself to think that feature work is inherently more important than tech tasks. Both are needed for the company to survive for the long term. Don’t be arrogantly PM-centric!

PMs should adapt to engineering process, not vice versa

As a Product person, I intentionally try to be very flexible about engineering process. There are usually 5x-10x more engineers than PMs, so our process should be optimized for making the larger group (engineers!) happy and productive. PMs should defer to engineering leaders on decisions like:

Whether to use sprints/kanban/etc.
Whether to organize using a “big pool of people” or “pods”
Which bug-tracker to use
What kind of metadata should be in each issue
How PRDs/specs/user-stories should be formatted and how much detail should be in them
Which information should be written down vs. fleshed out in meetings
How many meetings are required for each issue/sprint/etc.

But the Tech Budget is the exceptional engineering-process topic that I *do* have a strong opinion about, because I’ve seen with my own eyes how much of a positive impact it had on our company.

Many thanks to the early readers of this post, and especially to Igor Schtein, who taught me the value of the Tech Budget and who put up with me always trying to steal his team’s points!

SaaS PM 101
