Incentives, Metrics, and Goodhart’s Law

I’ve been thinking about metrics a lot lately. Since I joined my new organization, I’ve been spending even more of my time on the psychological and sociological aspects of software development. One wall I find myself constantly running into is senior leadership’s comfort and familiarity with metrics, and the counterintuitive, negative effect that implementing a measure has on the desired outcome. To be crystal clear, the wall I keep running into has a name: Goodhart’s Law.

“When a measure becomes a target, it ceases to be a good measure.”

Marilyn Strathern

Jump to the tl;dr below if you want the short version.

Leadership loves measures; they just do. However, the moment you set a measure as a target (an OKR, to use a currently popular example), it ceases to be a good read on the pulse of the organization or product. The incentives are diametrically opposed: individuals are best served by taking whatever shortcuts are available to meet the numbers. The problem is that these values are only proxies for real progress toward the outcome leadership actually wants to measure, yet savvy individuals remain just as incentivized as ever to game the numbers for their own benefit.

Despite the best efforts of countless intelligent people, many highly valuable attributes of a system are still qualitative and thus immeasurable. With so many attributes immeasurable, metrics can only be set for some range of known attributes and conditions: the unknown knowns and the known unknowns. We have nothing in our toolbox that helps gauge or measure the impact of unknown unknowns on the organization, and those are the attributes which simultaneously represent the biggest opportunity and the biggest risk to any endeavor. Yet by their very nature we can’t plan for them, so it’s impossible to directly measure an attribute that can help.

So, why not measure things indirectly? What happens if we try to align the incentives?

Since I usually go on and on about APIs, as a change of pace let’s look at an API example.

A simplistic API OKR example

Leadership has decided we need to lower our error rates to increase the reliability of our services across the board for our internal and external consumers. They have set an OKR for each team: measure their error rates (non-auth-related 4xx and 5xx status codes) and cut them in half within two months.
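To make it concrete, here’s a rough sketch of what that key result actually measures. It’s Go-flavored and every name is hypothetical (leadership only specified the what, not the how), but the important part is that the metric only ever sees a status code:

package metrics

// countsAsError classifies a response purely by its HTTP status code,
// which is all this key result ever sees: non-auth 4xx and 5xx responses.
// Hypothetical sketch; the OKR doesn't prescribe an implementation.
func countsAsError(status int) bool {
	if status == 401 || status == 403 {
		return false // auth-related failures are excluded from the OKR
	}
	return status >= 400
}

// errorRate is the number each team has been asked to halve.
func errorRate(errorCount, totalCount int) float64 {
	if totalCount == 0 {
		return 0
	}
	return float64(errorCount) / float64(totalCount)
}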

Sounds like a great goal.

Narrator: Everything went wrong.

Team’s Solution

GET /widget/this-widget-does-not-exist
200 OK
{"code":"404","message":"Not Found"}

PUT /widget/this-widget-is-out-of-date
{"name":"foo","status":"bar"}
200 OK
{"code":"409","message":"Conflict"}

GET /widget/a-dependency-is-literally-on-fire
200 OK
{"code":"502","message":"Bad Gateway"}

Met our objective? Exceeds expectations. Atrocious design? You bet!
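If it’s not obvious just how cheap this “solution” is, here’s a rough sketch of a middleware that does it in one place. Again Go-flavored and entirely hypothetical, but nothing about it is hard:

package middleware

import (
	"bytes"
	"fmt"
	"net/http"
)

// maskingWriter buffers the real status code and body so the middleware
// below can rewrite any error response before it reaches the wire.
type maskingWriter struct {
	http.ResponseWriter
	status int
	body   bytes.Buffer
}

func (w *maskingWriter) WriteHeader(status int)      { w.status = status }
func (w *maskingWriter) Write(b []byte) (int, error) { return w.body.Write(b) }

// MaskErrors is the "solution" above in one place: every response leaves
// as 200 OK with the real status tucked into the body, which drives the
// status-code error-rate metric to zero without fixing anything.
func MaskErrors(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mw := &maskingWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(mw, r)

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		if mw.status >= 400 {
			fmt.Fprintf(w, `{"code":"%d","message":"%s"}`, mw.status, http.StatusText(mw.status))
			return
		}
		w.Write(mw.body.Bytes())
	})
}

One wrapper, zero bugs fixed, and the dashboard goes green.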

The outcome leadership wanted was to make our services more flexible and resilient, to provide service in partially degraded states, and to cut the unrecoverable error rate for consumers in half. We’ve completely eliminated the measured errors, but the measure is just a proxy for the real value. Despite our apparent success, we’ve made the situation objectively worse. What this sets up is a cat-and-mouse game where leadership sets a metric and teams find their way around it, because the real cost of the desired outcome isn’t something leadership will support. So we gamed the system, and leadership got exactly the opposite of their desired outcome.

Let’s look at this problem from another angle: can we phrase the goal in a way where the team’s incentives are aligned with leadership’s true objective? What’s the real thing we’re trying to improve? The real goal is to halve the rate of errors that prevent our consumers from accomplishing their goals.

This poses a few questions:

  • How does this align incentives?
  • How do you account for the unknown unknowns?
  • How can you measure the success of your consumer for goals you don’t know or understand?

Aligning Incentives

We’re starting with aligning incentives, as it will demonstrate the value of finding answers or approximations for the next two. Simply put, you can’t fake consumer success: if the consumer genuinely succeeds in their objective, then whatever you’ve done is objectively correct. If they don’t succeed, or the result is inaccurate, there’s no hiding the fact that your implementation is in some way objectively wrong. Success is a discrete boolean value; the consumer is either successful or not. The only way to game this metric is to make the implementation better for the consumer.
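To make the contrast with the status-code metric concrete, here’s a rough sketch of the measure we’re actually proposing. The names are hypothetical and what counts as a “goal” will vary wildly between products, but notice that nothing about an individual response appears in it:

package metrics

// GoalOutcome records the boolean discussed above for one end-to-end
// consumer attempt, e.g. "provision a widget". The names are hypothetical;
// what counts as a goal depends entirely on your consumers.
type GoalOutcome struct {
	ConsumerID string
	Goal       string
	Succeeded  bool
}

// successRate is the only number that matters once incentives are aligned.
// Individual status codes, retries, and partial degradation along the way
// don't appear here at all; the only way to move this metric is to make
// consumers genuinely more successful.
func successRate(outcomes []GoalOutcome) float64 {
	if len(outcomes) == 0 {
		return 0
	}
	succeeded := 0
	for _, o := range outcomes {
		if o.Succeeded {
			succeeded++
		}
	}
	return float64(succeeded) / float64(len(outcomes))
}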

On the Unknown Unknown

The risk the unknown unknown presents to our endeavor is not knowing our system is failing consumers despite all signs to the contrary. The unknown is uncomfortable; however, one thing we have accomplished is to contain and mitigate a sizable portion of the risk by measuring the only thing which can’t be gamed: consumer success. Clearly there are plenty of other things which could catch us unaware, but we’ve done all we can to prepare for them. The far more valuable outcome is in learning where your consumers are failing and making improvements or new offerings to remedy those failures. Every failure we see now has the potential to uncover an opportunity for improved consumer success and increased value delivery. Once our incentives are aligned and the basic resiliency goals are met, the failures we do see become valuable insight into consumer needs.

Measuring Consumer Success

So how do we measure consumer success for goals we don’t know or understand? Steps one and two are simply to ask and learn, then take that feedback and create metrics which can identify these occurrences. In some cases this yields little to no benefit, while in others this simple activity alone can produce huge improvements. Regardless, the insight gained is actionable and immediately useful. This due diligence lays the foundation for the next step, where we create hypothesis metrics to guess at consumer success in different scenarios. Services don’t exist in a vacuum, and in modern microservices architectures most organizations will have records for plenty of consumers in their own environment. You might look for cases where a saga is consistently rolled back in certain circumstances despite multiple successful mutations to a resource, or where one service consistently fails or appears unavailable in a certain context because your processing runs just longer than a downstream timeout.
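As a rough sketch of what hypothesis metrics for those two scenarios could look like (the event shape and every name here are entirely hypothetical; your telemetry will differ):

package hypotheses

import "time"

// SagaEvent is a deliberately simplified, hypothetical record of one step
// in a saga; real telemetry will look different in every organization.
type SagaEvent struct {
	SagaID     string
	Step       string
	Succeeded  bool
	RolledBack bool
	Duration   time.Duration
}

// RollbackAfterProgress flags sagas that were rolled back even though
// several mutations had already succeeded, a hint that consumers are
// failing somewhere our status codes never show. Events are assumed to
// arrive in order.
func RollbackAfterProgress(events []SagaEvent, minSuccesses int) map[string]bool {
	successes := map[string]int{}
	flagged := map[string]bool{}
	for _, e := range events {
		if e.Succeeded {
			successes[e.SagaID]++
		}
		if e.RolledBack && successes[e.SagaID] >= minSuccesses {
			flagged[e.SagaID] = true
		}
	}
	return flagged
}

// NearTimeout flags steps whose duration meets or exceeds a downstream
// consumer's timeout: the "our processing runs just longer than their
// timeout" case from above.
func NearTimeout(events []SagaEvent, downstreamTimeout time.Duration) []SagaEvent {
	var suspicious []SagaEvent
	for _, e := range events {
		if e.Duration >= downstreamTimeout {
			suspicious = append(suspicious, e)
		}
	}
	return suspicious
}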

If all we’re doing is creating metrics, how does this help? We can’t create true metrics for unknown conditions and states, just like we can’t prove a negative in logic. However, these metrics are created to mine and expose hints of an undesirable outcome, not to obtain a value directly. What we’re measuring is the effect of any arbitrary metric on consumer success, so while the value of a metric could seem nonsensical on its own, the knowledge of its effect may be of significant value. There’s no silver bullet or one-size-fits-all set of metrics or circumstances to investigate, but by looking at consumer success we’ve constrained a wide range of qualitative attributes into something we can indirectly measure.

tl;dr

By establishing consumer success as our only measure, we force organizational incentives into alignment with individual incentives. We also gain the ability to indirectly measure the effects of qualitative attributes and complex stateful conditions on consumer success. And because we are only looking at impacts to consumer success, we reduce our risk from, and increase the potential opportunity of, unknown unknowns.
