Metrics in Service-Oriented Architecture
No matter where I work, a significant part of my day is spent at yet another postmortem, discussing why an incident took 3 hours and 10 people paged in instead of being resolved in 3 minutes, or even automatically. It seems to me the major reason is that software developers love writing code but think far less about supporting it.
These days there is no need to write a service framework starting from a servlet (assuming you work in Java). A company can pick an open-source version of a service framework used by any of the big technology powerhouses, and for most companies it will work out of the box. In fact, any of these frameworks can handle 20 transactions per second (tps) per instance with default settings, which is enough to cover the majority of business use cases out there. All a software developer needs to do is code up the business logic and plug it into the right place in the framework. Yet every self-respecting company maintains a framework team whose main goal is “adjusting” the framework to the realities of a constantly adapting company.
Surprisingly enough, the one aspect that really needs evaluation is changed incredibly rarely: the metrics. Every framework generates some metrics; typically it can generate them for every step of request processing and beyond. In fact, frameworks can generate far more metrics than is practical to store and process for any company other than the one that originated the framework. That’s where the local team needs to choose which metrics to surface. Typically only a minimal set is chosen, and it frequently covers just latencies and counts for incoming and outgoing traffic (often averaged across all APIs).
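To make the cost of averaging across all APIs concrete, here is a minimal sketch that records latencies per API and compares the overall mean with per-API 99th percentiles. The class, the API names and the sample numbers are all hypothetical and are not taken from any particular framework.

```python
from collections import defaultdict
from statistics import mean, quantiles


class LatencyRecorder:
    """Keeps raw latency samples (in milliseconds) per API name."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, api, latency_ms):
        self._samples[api].append(latency_ms)

    def overall_mean(self):
        # The "averaged across all APIs" view: one number for everything.
        return mean(v for samples in self._samples.values() for v in samples)

    def p99(self, api):
        # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
        return quantiles(self._samples[api], n=100, method="inclusive")[-1]


recorder = LatencyRecorder()
for _ in range(990):
    recorder.record("getItem", 10.0)         # hypothetical fast, high-volume API
for _ in range(10):
    recorder.record("rebuildIndex", 2000.0)  # hypothetical slow, rarely called API

print("overall mean (ms):", round(recorder.overall_mean(), 1))
print("getItem p99 (ms):", recorder.p99("getItem"))
print("rebuildIndex p99 (ms):", recorder.p99("rebuildIndex"))
```

With these made-up samples the overall mean comes out at roughly 30 ms, which describes neither API: one sits at 10 ms and the other at 2 seconds. A per-API breakdown shows the difference immediately.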
Then this overly simplified set of statistics gets used by the actual service owners. They go through incidents, improve their own set of metrics, and hack the framework to generate more useful data, but more often than not this knowledge isn’t shared across the company and the service framework isn’t updated. Every service owner is forced to go through the discovery process alone, wasting time, money, and company resources and enduring unnecessary pain.
I want to go through metrics that are absolutely necessary to decrease resolution time for almost any incident in almost any service framework. I’ll also talk about basic interpretation and the incident resolution decision tree. Every framework typically produces these metrics, too, so no extra effort is necessary. First things first: let’s look at the basic request processing anatomy:
As to what to do with the metrics:
P1 is the metric that indicates what’s happening with marshalling and unmarshalling of requests. If errors appear, you have a problem with request definitions. If the errors narrow down to one caller, that team has a problem with its client library; if they are spread across all callers, it is a server-side issue. If latency is too high, it is either a network problem or a poison-pill request. Network problems are easy to check using the organization’s standard tools for that sort of thing.
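The P1 rules above can be written down as a small triage function. This is only a sketch: the function name, its inputs and the `min_errors` noise floor are hypothetical, and treating "more than one failing caller" as a server-side issue is a simplification of "spread across all callers".

```python
def triage_p1(errors_by_caller, latency_high, min_errors=1):
    """Return hints for the P1 (marshalling/unmarshalling) metric.

    errors_by_caller: dict of caller name -> P1 error count in the time window.
    latency_high: True if P1 latency is above whatever threshold you alert on.
    min_errors: hypothetical noise floor; callers below it count as healthy.
    """
    hints = []
    failing = [caller for caller, count in errors_by_caller.items()
               if count >= min_errors]
    if len(failing) == 1:
        # Errors confined to one caller point at that caller's client library.
        hints.append(f"client-side: check the client library used by {failing[0]}")
    elif len(failing) > 1:
        # Simplification: several failing callers are treated as server-side.
        hints.append("server-side: check request definitions on the server")
    if latency_high:
        hints.append("check the network, then look for a poison-pill request")
    return hints or ["P1 looks healthy"]


# Hypothetical caller names and counts.
print(triage_p1({"checkout": 12, "search": 0}, latency_high=False))
print(triage_p1({"checkout": 12, "search": 7}, latency_high=True))
```

Even a function this crude is useful as a runbook entry, because it records the order in which to check things before an incident starts.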
Latency too long (or too short) or errors while none of the underlying metrics are affected: a coder probably introduced a bug. If any of the other underlying metrics are affected you can narrow the problem down to the culprit.
Lets you differentiate between your service’s internal problems and downstream issues. Downstream issues can have several sources: the network can misbehave; your routing, service registry or name resolution can fail; or the downstream service itself may have a problem. This is why your operational dashboard should link to the dashboards of your downstream services, so you can check them with one click instead of searching for them at incident time.
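One way to get this internal-versus-downstream split is to time every downstream call separately and subtract the sum from the total request time. The sketch below assumes a hypothetical `RequestTimer` helper and a made-up dependency name; real frameworks usually expose their own hooks for this.

```python
import time
from contextlib import contextmanager


class RequestTimer:
    """Accumulates time spent in downstream calls during one request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.downstream_s = {}  # dependency name -> seconds spent waiting on it

    @contextmanager
    def downstream(self, name):
        began = self._clock()
        try:
            yield
        finally:
            # Record the time even if the downstream call raised.
            elapsed = self._clock() - began
            self.downstream_s[name] = self.downstream_s.get(name, 0.0) + elapsed

    def breakdown(self):
        total = self._clock() - self._start
        downstream = sum(self.downstream_s.values())
        return {"total_s": total,
                "downstream_s": downstream,
                "internal_s": total - downstream}


timer = RequestTimer()
with timer.downstream("inventory-service"):  # hypothetical dependency name
    time.sleep(0.05)                         # stand-in for the real remote call
print(timer.breakdown())
```

If `internal_s` grows while `downstream_s` stays flat, the problem is in your own code; if it is the other way around, start with the downstream dashboards.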
There are plenty of other metrics a service owner wants on his operational dashboard. There are instance metrics (CPU load, memory consumption, network in and out etc.). There are VM metrics if you are running on a VM (garbage collection frequency and latency, free memory, number of threads etc.). Then there are metrics specific to your service framework (thread counts in key pools, cache hit rates etc.). These metrics are typically provided and supported by the “framework adoption” team, as they are helpful in debugging the “framework adoption” itself. Business metrics and request processing metrics, however, are not and are left to the service owner(s) to discover and expose.
As a takeaway, I suggest looking at your service’s dashboard and, if any of the P1-P6 metrics are missing from it, adding them. I guarantee you will gain a much better understanding of your service’s behavior and health.