Upgrades: Where, when, and why
One of the things I loved about Amazon was that there was always a lot of space for solving problems and building new things. That space existed for a very simple reason: the average lifetime of a service at Amazon was around 3 years. Any implementation developers put in place was outgrown as traffic and complexity increased. Good designs aged gracefully and provided enough warning to start building a replacement; bad designs just failed. One way or another, for most services “the legacy” was at most 3 years old. That simplified things a lot: some of the original team was still there, and the languages and design principles of the “legacy” software were still understandable and recognizable. In short, we were lucky.
I was known at Amazon for building systems that held up much longer than the requisite 3 years. In retrospect I feel guilty: the teams that finally had to fully redesign my systems likely didn’t have an easy time, because everyone on the original team had left and the tech, thanks to aging well, hadn’t seen much updating. Now the shoe is on the other foot: at Upwork quite a few systems have survived since oDesk and Elance merged, and some even predate the merger. These systems still carry their load, though with quite a few issues. The problems we face are both business and technical: we need to get funding from the business to upgrade a system that the business regards as working perfectly well, we need to migrate the users off the old system, and we need to define the least expensive upgrade. At Amazon we were lucky: business and technical goals were constantly well aligned; if we didn’t upgrade the existing system to meet current business requirements with the best available technologies, it would just stop working. Normally the business impact of technology that is several generations out of date (most notably, a slowdown in the development of new features) is hard to quantify and explain.
Thankfully, we at Upwork have formulated several criteria that help determine whether a system needs an upgrade. These criteria are typically easy to justify to the business. One such criterion is the level of reliability and availability: if the system needs to support a primary website but was designed without high availability in mind, it clearly needs an upgrade. The ability to deploy the software using current deployment techniques (for example, containers and clouds) instead of what was used when the world was younger (for example, a copy command) is another criterion definitely worth considering. Security and privacy requirements change quickly, too: authentication and authorization models that were wholly acceptable long ago (in particular, models requiring no authentication on internal networks) can’t be used any longer.
We also established an almost requisite first step for a successful migration or upgrade: the team develops a proxy facade that sends requests to the existing system. Depending on the circumstances, the proxy can implement either a new API that the upgraded system will eventually provide directly, or the existing interface. In the latter case, the proxy will eventually transform the data and send requests to the new system once some functionality is built there. As the new system provides more and more functionality, more and more requests will be forwarded to it or served directly from it.
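To make the proxy facade idea a bit more concrete, here is a minimal sketch in Python of the variant that keeps the existing interface. It is an illustration under stated assumptions, not a description of what we actually run: the names `MigrationFacade`, `migrate`, `legacy_system`, and `new_system`, as well as the request shapes, are all hypothetical. Every operation goes to the legacy system until an adapter is registered for it; from then on the facade transforms the request and sends it to the new system.

```python
from typing import Callable, Dict

# Hypothetical request/response shapes: plain dictionaries.
Request = Dict[str, object]
Response = Dict[str, object]
Handler = Callable[[Request], Response]
Adapter = Callable[[Request], Request]


class MigrationFacade:
    """Exposes the existing interface and routes each operation to the
    legacy system until that operation has been migrated."""

    def __init__(self, legacy: Handler, new: Handler) -> None:
        self._legacy = legacy
        self._new = new
        # operation name -> function that converts a legacy-shaped
        # request into the shape the new system expects
        self._adapters: Dict[str, Adapter] = {}

    def migrate(self, operation: str, adapter: Adapter) -> None:
        """Declare that the new system now serves this operation."""
        self._adapters[operation] = adapter

    def handle(self, request: Request) -> Response:
        operation = str(request["operation"])
        adapter = self._adapters.get(operation)
        if adapter is None:
            # Not migrated yet: pass the request through unchanged.
            return self._legacy(request)
        # Migrated: transform the data, then call the new system.
        return self._new(adapter(request))


# Stand-ins for the real systems (assumptions for illustration only).
def legacy_system(request: Request) -> Response:
    return {"served_by": "legacy", "request": request}


def new_system(request: Request) -> Response:
    return {"served_by": "new", "request": request}


if __name__ == "__main__":
    facade = MigrationFacade(legacy_system, new_system)
    print(facade.handle({"operation": "get_profile", "user": "42"}))

    # Once the new system supports get_profile, flip just that operation.
    facade.migrate(
        "get_profile",
        lambda r: {"op": r["operation"], "user_id": r["user"]},
    )
    print(facade.handle({"operation": "get_profile", "user": "42"}))
    print(facade.handle({"operation": "list_jobs"}))
```

Callers only ever talk to the facade, so moving one operation at a time is invisible to them. In practice the routing decision would more likely come from configuration or a feature flag than from an in-memory dictionary.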
At the end of the day it might still turn out that a “cosmetic” upgrade (like layering authentication on top of old APIs) satisfies the current product requirements, and that the business has no problem with the slower time-to-market and larger development cost that the legacy dependency introduces into new feature development (much like in the Dilbert strip shown below). That just means the costs of keeping this legacy are not yet high enough to justify a fuller upgrade, which needs to wait until another planning cycle.