Last week my friends were telling me about planning season at their companies. They found themselves confronted with the eternal question of features vs. tech debt, the question of balance between product- and engineering-driven initiatives. When the product side wins easily, you can be sure there is a pile of technical debt ignored by everyone. When that happens year over year, you get systems with several migrations started and abandoned, spaghetti code all around, and developers afraid to touch the system with an eleven-foot pole. That is, if you can find developers willing to look at that system at all. Then a happy product manager comes to you and asks to add functionality to a cool feature your team created last year. You remember that feature: it was a strip of code threaded through several patches applied to a crutch developed five years ago by somebody who has since left the team. You give an estimate of the person-centuries required just to understand how the thing works, and the shit well and truly hits the fan.
When engineering takes over, the effort typically evolves into a super-flexible platform: full of very advanced features, capable of handling diverse workloads, and utterly devoid of content, purpose, or meaning. Only really careful consideration keeps the balance, moving the business forward while not letting an ever-growing mountain of technical debt slow progress to a crawl. The consensus we arrived at is simple: it doesn't matter who wins, because tech debt, just like real debt, has to be paid back continuously. Otherwise the collectors come and take away your ability to change.
Another common cause of concern is downtime and the methods used to count it. Sometimes an Ops team comes in with a clearly unrealistic proposal for site availability. Clearly unrealistic, that is, to the people who actually read the availability and incident emails. For example, the proposal is to keep availability at the same level as last year, and last year's goal of 99.9% was happily achieved. A simple calculation translates that into a budget of a little less than 9 hours of downtime per year. And that's when engineers begin to wonder: just three days ago the site was shut down for 3 hours straight for a database upgrade, and we still meet the goal? The proposal conveniently excluded planned maintenance from the calculation. Is the site any less down because it was brought down on purpose? Are the losses not losses because we planned to lose? Best practice calls on us to acknowledge that there is no good reason to exempt planned maintenance downtime: it takes away from the bottom line just the same.
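The arithmetic behind the "a little less than 9 hours" figure is worth spelling out. A minimal sketch, converting an availability target into an annual downtime budget:

```python
def allowed_downtime_hours(availability: float, hours_per_year: float = 365 * 24) -> float:
    """Annual downtime budget, in hours, implied by an availability target."""
    return (1.0 - availability) * hours_per_year

# 99.9% availability leaves 0.1% of 8760 hours as the budget:
print(allowed_downtime_hours(0.999))  # -> 8.76 hours per year
```

A single 3-hour planned maintenance window therefore consumes about a third of the whole year's budget, which is exactly why excluding it from the calculation makes the number look so much rosier than the experience.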
The complete lifecycle of a feature is another interesting theme of conversation. Features are developed and thoroughly vetted with A/B testing; only the features providing a clear gain in traffic, customer retention, and satisfaction are enabled in production. Then the economic situation changes, the company expands into a few more regions, and on top of that the company's priorities shift. Do we have a provision for monitoring the performance of older features, checking whether they continue to provide the gains we expected of them? Perhaps we should retire them if they no longer generate as much value in the new environment. Somehow only a few companies continuously collect metrics measuring the performance of existing features. Typically none of this is planned when a feature is developed and deployed. We just observe the target metrics for a while to see if the feature is successful, but since a target metric is a combination of several factors and not feature-specific, it can't be used to measure the performance of already deployed features individually. What we desperately need is an embedded mechanism that monitors feature performance over time and reports when it degrades.
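Such a mechanism doesn't have to be elaborate. A minimal sketch of the idea, with entirely hypothetical names and thresholds: record a per-feature baseline at launch time (from the A/B test), then periodically compare the current value of that same metric against the baseline and flag features that have degraded past a tolerance.

```python
from dataclasses import dataclass

@dataclass
class FeatureMetric:
    name: str
    baseline: float                       # metric value measured during the launch A/B test
    degradation_threshold: float = 0.10   # flag if the metric drops more than 10%

def still_healthy(metric: FeatureMetric, current_value: float) -> bool:
    """Return True if the feature's metric is within tolerance of its launch baseline."""
    drop = (metric.baseline - current_value) / metric.baseline
    return drop <= metric.degradation_threshold

# Hypothetical example: a checkout feature launched with a 4.2% conversion lift.
checkout = FeatureMetric("one_click_checkout", baseline=0.042)
print(still_healthy(checkout, 0.040))  # ~5% drop  -> True, still fine
print(still_healthy(checkout, 0.030))  # ~29% drop -> False, candidate for retirement
```

The key design choice is that the baseline is captured per feature at deployment time, so the check stays meaningful even after the company-wide target metrics have moved on.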