Incentives, Metrics, and Goodhart’s Law

I’ve been thinking about metrics a lot lately. Since I joined my new organization, I’ve been spending even more of my time on the psychological and sociological aspects of software development. One wall I find myself constantly running into is the comfort and familiarity senior leadership has with metrics, and the counterintuitively negative effect implementing a measure has on the desired outcome. To be crystal clear, the wall I keep running into has a name: Goodhart’s Law.

“When a measure becomes a target, it ceases to be a good measure.”

Marilyn Strathern

Jump to the tl;dr below:

Leadership loves measures; they just do. However, the moment you set a measure as a target (an OKR, to take a currently popular example), it ceases to be a good read on the pulse of the organization or product. The incentives are diametrically opposed: individuals are best served by taking whatever shortcuts are available to meet the numbers, while the numbers are only proxies for real progress towards the outcome leadership wants to measure. Savvy individuals remain just as incentivized as ever to game the numbers for their own benefit.

Despite the best efforts of countless intelligent people, many highly valuable attributes of a system are still qualitative and thus immeasurable. With so many attributes immeasurable, metrics can only be set for some range of known knowns and known unknowns. We have nothing in our toolbox that helps gauge or measure the impact of unknown unknowns on the organization, and those are the attributes which simultaneously represent the biggest opportunity and risk to any endeavor. By their very nature we can’t plan for them, so it’s impossible to directly measure an attribute which can help.

So, why not measure things indirectly? What happens if we try to align the incentives?

Since I usually go on and on about APIs, as a change of pace let’s look at an API example.

A simplistic API OKR example

Leadership has decided we need to lower our error rates to increase the reliability of our services across the board for our internal and external consumers. They have set an OKR for each team to measure their error rates (non-auth-related 4xx codes) and reduce them by half within two months.

Sounds like a great goal.

Narrator: Everything went wrong.

Team’s Solution

GET /widget/this-widget-does-not-exist
200 OK
{"code":"404","message":"Not Found"}

PUT /widget/this-widget-is-out-of-date
{"name":"foo","status":"bar"}
200 OK
{"code":"409","message":"Conflict"}

GET /widget/a-dependency-is-literally-on-fire
200 OK
{"code":"502","message":"Bad Gateway"}

Met our objective? Exceeds expectations. Atrocious design? You bet!

The outcome leadership wanted was to make our services more flexible and resilient, to provide service in partially degraded states, and to cut the unrecoverable error rate for consumers in half. We’ve completely eliminated the measured errors, but the measure is just a proxy for the real value. Despite our success, we’ve made the situation objectively worse: consumers now have to parse every response body to learn whether a request failed. What this sets up is a cat and mouse game where leadership sets a metric and teams find their way around it, because the real cost of the desired outcome isn’t something leadership will support. So we gamed the system. Leadership got exactly the opposite of their desired outcome.
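To make the gap between the proxy and the outcome concrete, here is a minimal Python sketch over the three exchanges above. The log shape, the function names, and the rule that a body-level `code` of 400 or above means failure are all assumptions made for illustration, and the proxy is widened to count any 4xx or 5xx status.

```python
import json

# Hypothetical log of the three exchanges above: (HTTP status, response body).
responses = [
    (200, '{"code":"404","message":"Not Found"}'),
    (200, '{"code":"409","message":"Conflict"}'),
    (200, '{"code":"502","message":"Bad Gateway"}'),
]


def status_error_rate(log):
    """Share of responses whose HTTP status is 4xx or 5xx (the OKR's proxy)."""
    errors = sum(1 for status, _ in log if status >= 400)
    return errors / len(log)


def consumer_failure_rate(log):
    """Share of responses where the consumer did not get what it asked for.

    Assumption: a body-level "code" of 400 or above means the request failed.
    """
    failures = 0
    for status, body in log:
        code = int(json.loads(body).get("code", status))
        if status >= 400 or code >= 400:
            failures += 1
    return failures / len(log)


print(status_error_rate(responses))      # 0.0
print(consumer_failure_rate(responses))  # 1.0
```

The proxy reports a perfect score while every single consumer request failed.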

Let’s look at this problem from another angle: can we phrase the goal in a way where the incentives for the team are aligned with leadership’s true objective? What’s the real thing we’re trying to improve? The real goal is to halve the rate of errors preventing our consumers from accomplishing their goals.

This poses a few questions:

  • How does this align incentives?
  • How do you account for the unknown unknowns?
  • How can you measure the success of your consumer for goals you don’t know or understand?

Aligning Incentives

We’re starting with the aligned incentives, as it will demonstrate the value of finding answers or approximations for the next two. Simply put, you can’t fake consumer success. If the consumer accurately succeeds in their objective, then whatever you’ve done is objectively correct. If they don’t succeed, or the result isn’t accurate, there’s no hiding the fact that your implementation is in some way objectively wrong. Success is a discrete boolean value: the consumer will be successful or not. The only way to game this metric is to make the implementation better for the consumer.

On the Unknown Unknown

The risk the unknown unknown presents to our endeavor is not knowing our system is failing consumers despite all signs to the contrary. The unknown is uncomfortable; however, one thing we have accomplished is to contain and mitigate a sizable portion of the risk by measuring the only thing which can’t be gamed: consumer success. Clearly there are plenty of other things which could catch us unaware, but we’ve done all we can to prepare for them. The far more valuable outcome is in learning where your consumers are failing and making improvements or new offerings to remedy these failures. Every failure we see now has the potential to uncover an opportunity for improved consumer success and increased value delivery. Once our incentives are aligned and the basic resiliency goals are met, the failures we see become valuable insight into consumer needs.

Measuring Consumer Success

So how do we measure consumer success for goals we don’t know or understand? Steps one and two are simply to ask and to learn. Take that feedback and create metrics which can identify these occurrences. In some cases this could be of little or no benefit, while others could see huge improvement from this simple activity. Regardless, the insight gained is actionable and immediately useful. This due diligence lays the foundation for the next steps, where we create hypothesis metrics to guess at consumer success in different scenarios. Services don’t exist in a vacuum, and in modern microservices architectures most organizations will have records for plenty of consumers in their own environment. You may look for cases where a saga is consistently rolled back in certain circumstances despite multiple successful mutations to a resource, or where one service is consistently unavailable or fails in certain contexts because your processing runs just longer than a downstream timeout.
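As one illustration of a hypothesis metric, the saga case could be sketched like this in Python. The record shape, the context labels, and the 0.8 threshold are all invented for the example; a real system would pull these from its own traces.

```python
from collections import defaultdict

# Hypothetical saga records: (saga_id, context, successful_mutations, rolled_back).
sagas = [
    ("s1", "bulk-import", 3, True),
    ("s2", "bulk-import", 2, True),
    ("s3", "bulk-import", 4, True),
    ("s4", "single-edit", 1, False),
    ("s5", "single-edit", 2, False),
    ("s6", "single-edit", 1, True),
]


def rollback_rates(records):
    """Rollback rate per context, counting only sagas with at least one successful mutation."""
    totals = defaultdict(int)
    rolled = defaultdict(int)
    for _saga_id, context, mutations, rolled_back in records:
        if mutations == 0:
            continue
        totals[context] += 1
        if rolled_back:
            rolled[context] += 1
    return {context: rolled[context] / totals[context] for context in totals}


def suspicious_contexts(records, threshold=0.8):
    """Contexts whose sagas are rolled back at or above the (assumed) threshold."""
    rates = rollback_rates(records)
    return sorted(context for context, rate in rates.items() if rate >= threshold)


print(suspicious_contexts(sagas))  # ['bulk-import']
```

The number itself matters less than the pointer it gives you: a context where consumers do real work and still consistently fail.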

If all we’re doing is creating metrics, how does this help? We can’t create true metrics for unknown conditions and state, just like we can’t prove a negative case in logic. However, these metrics are created to mine and expose hints of an undesirable outcome, not to obtain a value directly. What we’re measuring is the effect of any arbitrary metric on consumer success, so while the value of the metric could be seemingly nonsensical on its own, the knowledge of its effect may be of significant value. There’s no silver bullet or one-size-fits-all set of metrics or circumstances to investigate, but by looking at consumer success we’ve constrained a wide range of qualitative attributes into something we can indirectly measure.
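A rough sketch of what “measuring the effect of an arbitrary metric on consumer success” might look like: compare the success rate when some signal is present against the rate when it is absent. The observations and the signal are made up, and the sketch assumes both groups are non-empty.

```python
# Hypothetical observations: (signal_present, consumer_succeeded).
# The signal could be anything: a slow dependency, a large payload, a time of day.
observations = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, True), (False, True), (False, False),
]


def success_rate(rows):
    """Fraction of rows in which the consumer succeeded."""
    return sum(1 for _, succeeded in rows if succeeded) / len(rows)


def success_gap(rows):
    """Success rate without the signal minus success rate with it."""
    with_signal = [row for row in rows if row[0]]
    without_signal = [row for row in rows if not row[0]]
    return success_rate(without_signal) - success_rate(with_signal)


print(success_gap(observations))  # 0.5
```

A large gap doesn’t prove the signal causes the failures, but it is a hint worth investigating.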

tl;dr

By establishing consumer success as our only measure we force organizational incentives to be in alignment with individual incentives. We also gain the ability to indirectly measure the effects of qualitative attributes and complex stateful conditions on consumer success. As we are only looking at impacts to consumer success, we also reduce our risk from, and increase the potential opportunity of, unknown unknowns.

The Relationship Maturity Model

I gave a talk at RESTFest Midwest 2018 about this concept of a maturity model for relationships, and this post is intended to be the formalized version of those points. Through a lot of recent discussions on various Slacks, and over the course of our conversation in Grand Rapids, it has become pretty clear there is a wide range of definitions for relationship tags on web links. I believe it’s crucial we develop a shared understanding of relationships in general so we can move forward in determining how to create and enable affordance-driven APIs.

At the highest level a relationship is merely the meaning which connects two concepts or contexts. A graph is a simple example: the relationships are the edges between two nodes. In the context of Web APIs, relationships are the `rel` attribute of RFC 8288 web links. The role of `rel` is to convey the semantics which join two contexts. In most discussions of hypermedia APIs links take a prevalent role, but their utility is often assumed or discussed in a narrow scope.

There is one concern which is mostly unaddressed in both the Richardson and Amundsen maturity models: the inevitable question a team will ask when considering implementing hypermedia-driven APIs. WHY?
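For a feel of where `rel` lives on the wire, here is a deliberately small Python sketch that pulls (target, rel) pairs out of an RFC 8288 `Link` header value. It only handles the simple case where a quoted `rel` is the first parameter after the target; the full header grammar allows much more, and the example URLs are placeholders.

```python
import re

# Matches <target>; rel="..." and nothing more elaborate than that.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_links(header):
    """Return (target, rel) pairs; one rel value may hold several space-separated relation types."""
    pairs = []
    for target, rels in LINK_PATTERN.findall(header):
        for rel in rels.split():
            pairs.append((target, rel))
    return pairs


header = (
    '<https://example.org/widgets>; rel="collection", '
    '<https://example.org/widgets/1>; rel="item"'
)
for target, rel in parse_links(header):
    print(rel, "->", target)
```

Each pair is an edge in the graph: the current context, the meaning of the connection, and the target context.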

Ok, I have links, now what?

We have two models which help us begin the story of the relationship: the Richardson (RMM) and Amundsen (AMM) maturity models. Zdenek Nemec wrote a [fantastic post](https://blog.goodapi.co/api-maturity-fb25560151a3) comparing the two models; if you haven’t read it yet, you may want to before continuing. For our purposes, however, Amundsen himself provides a useful and succinct summary: the RMM focuses on response documents; the AMM focuses on API description documents. A link therefore provides the foundation which binds the description to the response. Without the link we are dependent on static interactions, but without the `rel` we are teetering on the edge of a cliff shrouded in fog; we know something is on the other side, but we have no idea how to get there or what we will find.

Why ‘do’ hypermedia, what does it matter?

The key to understanding the power of hypermedia is to understand the context of these links regardless of the serialization or mediaType. This clarity comes from the `rel` answering the question “How is the current context related to the target context?” There are many ways to answer this question, each one building upon the last and adding quite a bit of power to the humble link. Why do we do hypermedia? So we can reveal to consumer agents the relationships between two contexts without sharing knowledge ahead of time and, if we do share our vocabulary ahead of time, enable consumers to rapidly build very rich interactions.

The Relationship Maturity Model

Level 0 – Anonymous Relationships

Level 1 – Generic Relationships

Level 2 – Named Relationships

Level 3 – Stateful Relationships

Anonymous Relationships

The least helpful, and unfortunately the most commonly demonstrated, relationship is the anonymous relationship. This empty structure, string, or array element provides no context to the consumer. Most often this is the type of link shown in demonstrations meant to showcase hypermedia, and the frequent response is to question what exactly a link like this provides. Nearly nothing. A link of this level provides no real additional benefit; it adds to system chattiness and may even add risk to the use of the application.

Generic Relationships

This is the foundation of all link contexts. These generic relationships can be found in the [IANA Link Relation Registry](https://www.iana.org/assignments/link-relations/link-relations.xhtml). The use of these simple and expressive relation types provides a wealth of general context to the consumer: not only do they provide some understanding of the relationship, but they begin to demonstrate the capabilities of an affordance-centric API. A generic agent now has the capacity to understand with `rel="collection"` that the URL points to a resource collection root. If you receive a link with `rel="item"`, you know it addresses a single item within a resource collection. This won’t enable the richest interactions, but generic clients like the HAL-Browser use this level of detail to create GUIs for services the consumer has never uniquely integrated.
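The difference between Level 0 and Level 1 is easy to see from the client’s side. In this Python sketch the `links` document shape is an assumption for illustration, not any particular media type, and the URLs are placeholders.

```python
# Level 0: bare URLs with no relation type.
anonymous = {"links": ["https://example.org/widgets", "https://example.org/widgets/1"]}

# Level 1: the same targets, labelled with IANA-registered relation types.
generic = {
    "links": [
        {"rel": "collection", "href": "https://example.org/widgets"},
        {"rel": "item", "href": "https://example.org/widgets/1"},
    ]
}


def find_links(document, rel):
    """Return the hrefs of links carrying the given relation type."""
    hrefs = []
    for link in document.get("links", []):
        # A bare string has no rel, so a generic agent cannot select it by meaning.
        if isinstance(link, dict) and link.get("rel") == rel:
            hrefs.append(link["href"])
    return hrefs


print(find_links(anonymous, "collection"))  # []
print(find_links(generic, "collection"))    # ['https://example.org/widgets']
```

With anonymous links the only thing a client can do is guess; with generic ones it can at least ask for the collection and get an answer.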

Named Relationships

Building upon the foundation of generic relationships, a designer can use custom rel names to introduce new relationships which provide additional context to the links. User-created relationships are required to be URIs, which allows a number of strategies, from using the `tag` scheme through to referenceable URIs. Using referenceable URIs provides the opportunity to include and control unique application context in every response. This enables you to move from identifying a resource as an `item`, to being able to identify that _this_ resource is a person, and finally to stating that the relationship between the current context and the resource is `http://example.org/vocabulary/school/class/student`. By providing referenceable human- and machine-readable documentation as the `rel`, I have added a vast capacity for conveying meaning to a client. As a consumer I now have a very rich understanding of the application’s vocabulary, its resources, and how they might relate to one another. These named relationships can provide consumers with hints on the composition of complex resource representations and an out-of-band vocabulary to safely use in creating rich resource interactions.
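A small Python sketch of how a client might tell the two kinds of `rel` apart and look up what a named one means. The `VOCABULARY` dictionary is a stand-in for dereferencing the URI and reading its documentation, and its wording is invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the documentation found by dereferencing the rel URI.
VOCABULARY = {
    "http://example.org/vocabulary/school/class/student":
        "The target is a person enrolled in the class.",
}


def is_extension_rel(rel):
    """True when the rel has a URI scheme, which is how extension relation types are written."""
    return bool(urlparse(rel).scheme)


def describe(rel):
    """Explain a rel as well as this client can."""
    if not is_extension_rel(rel):
        return "registered relation type: " + rel
    return VOCABULARY.get(rel, "unknown extension relation type")


print(describe("item"))
print(describe("http://example.org/vocabulary/school/class/student"))
```

Because the name is a URI the designer controls, a client that has never seen it before still has somewhere to go to find out what it means.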

Stateful Relationships

Each previous relationship level is focused on conveying increasing detail about the current state of the resources; however, creating a fully functional affordance-centric API requires the ability to communicate resource affordances. The capability to understand the state of a system with a generic client is powerful, but the ability to alter the state of a system with a generic client is truly revolutionary. Revisiting the role of `rel` above, you’ll notice it doesn’t mention resource or affordance, because it simply adds the semantic context to determine how the two contexts are related. An affordance of `rel="http://example.org/vocabulary/school/class/addStudent"` can be discovered and bound just like a Named Relationship, and interpreted as “self context has the addStudent affordance, which is performed at the target context.” By adding affordances in a standard way outside of a specific mediaType, you have added the power of hypermedia to the flexibility and versatility of raw JSON or XML.
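Here is a sketch of that discover-and-bind step in Python. The document shape and URLs are placeholders, and the choice of POST is purely an assumption: the `rel` says what the affordance means, and in practice the method would come from the vocabulary’s documentation.

```python
ADD_STUDENT = "http://example.org/vocabulary/school/class/addStudent"

# Hypothetical representation of a class in a state where students can be added.
class_resource = {
    "name": "Biology 101",
    "links": [
        {"rel": "self", "href": "https://example.org/classes/42"},
        {"rel": ADD_STUDENT, "href": "https://example.org/classes/42/students"},
    ],
}


def affordances(document):
    """Map each relation type in the document to its target URL."""
    return {link["rel"]: link["href"] for link in document.get("links", [])}


def plan_request(document, rel, payload):
    """Describe the request a client would send; nothing is sent here."""
    target = affordances(document).get(rel)
    if target is None:
        return None  # the affordance is not offered in the current state
    # Assumption: this affordance is performed with POST.
    return {"method": "POST", "url": target, "json": payload}


print(plan_request(class_resource, ADD_STUDENT, {"name": "Ada"}))
```

If the class is full and the server stops advertising the link, `plan_request` returns `None` and the client knows the action isn’t available, without any hard-coded business rules.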

Bringing it together

The relationship maturity model is about understanding the nature of relationships between contexts, and removing preconceived notions of how they can relate. It is easiest to accept a link between two contexts as simply a link between two resources, but the real power lies in the intent of the standards: to create a bridge to communicate all types of relationships.

Additional reading

While you’re at it, check out Jason Desrosiers’ fantastic hypermedia maturity model to learn another way to understand your API designs.

A pragmatic review of OAS 3

Disclaimer

Before I go any further I want to address the elephant in the room. Obviously I consider myself a hypermedia evangelist, and I’m aware it is easy to make ivory tower arguments from this perspective. I am also an application architect, which requires frank pragmatism, where today’s OK solution is generally much preferred to next year’s better one. In most of my previous posts I’ve focused my discussions on the distance between where we are as an industry, where I think we should go, and why it’s important.

Getting started

As part of my process of preparing for my upcoming talks at APIStrat on API Documentation and Hypermedia Clients, I’ve been reviewing the specification in depth for highlights and talking points.

On one of my first forays into the new world of Twitter, I rather [tongue-in-cheekily](https://twitter.com/hibaymj/status/865054487119089665) pointed out, as a hypermedia evangelist, my issue with the specification. Going back, I probably would express the thought differently, but the crux of the issue is that OAS does not support late binding.

I’ll get back to this point later, because first I want to talk about the highlights of the specification to acknowledge and applaud the hard work put into such a large undertaking.  Looking back on the state of the art of APIs only 10 years ago, it’s easy to see the vast improvements our current standards and tooling provide.

At this point I’m going to assume most have googled for the changes to the format in OAS 3.  My aim with this post is not to focus on changes, but evaluate OAS as it exists in the current version.

The Great Stuff

Servers Object

This is a very powerful element for the API designer which allows design time orchestration constraints to be placed on the operation of the services. This can greatly enhance the utility of OAS for use in many scenarios, including but not limited to: API gateways, microservices orchestration, and enabling implicit support for CQRS designs on separate infrastructure without an intermediary.
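As a rough illustration, this Python sketch expands a Server Object’s templated `url` using its `variables`, honouring `default` and `enum`. Those field names come from OAS 3; the host, the variable names, and the function are invented for the example.

```python
# A Server Object written as a Python dict; host and variable names are placeholders.
server = {
    "url": "https://{environment}.example.org:{port}/v1",
    "variables": {
        "environment": {"default": "api", "enum": ["api", "staging"]},
        "port": {"default": "443"},
    },
}


def resolve_server_url(server, overrides=None):
    """Substitute each {variable} with an override or its declared default."""
    overrides = overrides or {}
    url = server["url"]
    for name, variable in server.get("variables", {}).items():
        value = overrides.get(name, variable["default"])
        allowed = variable.get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not an allowed value for {name!r}")
        url = url.replace("{" + name + "}", value)
    return url


print(resolve_server_url(server))                              # https://api.example.org:443/v1
print(resolve_server_url(server, {"environment": "staging"}))  # https://staging.example.org:443/v1
```

The `enum` is where the design time constraint shows up: a consumer can point at another environment, but only one the designer has sanctioned.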

Components

My previous experience with OAS 1.2 led to a lot of redundancy, which the components structure of the current version very elegantly eliminates. The elegance stems from the design choice of composition over definition, allowing for reuse without redundancy. It simplifies the definition of the bodies, headers, request, and response components, as reuse becomes a matter of composition. The examples section is a developer experience approval multiplier, which is welcome and should be strongly encouraged.
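The composition works through `$ref` pointers into `components`. A minimal Python sketch of resolving a local reference follows; the `Widget` schema and path are made up, and remote or relative-file references are deliberately left out.

```python
# A trimmed-down, hypothetical OAS 3 document as a Python dict.
spec = {
    "components": {
        "schemas": {
            "Widget": {"type": "object", "properties": {"name": {"type": "string"}}},
        }
    },
    "paths": {
        "/widget/{id}": {
            "get": {
                "responses": {
                    "200": {
                        "description": "A widget",
                        "content": {
                            "application/json": {
                                "schema": {"$ref": "#/components/schemas/Widget"}
                            }
                        },
                    }
                }
            }
        }
    },
}


def resolve_ref(document, ref):
    """Resolve a local reference such as '#/components/schemas/Widget'."""
    if not ref.startswith("#/"):
        raise ValueError("only local references are handled in this sketch")
    node = document
    for token in ref[2:].split("/"):
        # JSON Pointer escapes: ~1 stands for '/', ~0 stands for '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[token]
    return node


schema = spec["paths"]["/widget/{id}"]["get"]["responses"]["200"]["content"]["application/json"]["schema"]
print(resolve_ref(spec, schema["$ref"])["type"])  # object
```

Define `Widget` once, point at it from every request and response that needs it, and the redundancy goes away.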

Linking

As a hypermedia evangelist, my approval of this section should not come as a surprise. It mirrors in concept many of the beneficial aspects of an external profile definition like ALPS and is a welcome addition to the spec.

Callbacks

The standardization of the discovery or submission of webhook endpoints within the application contract itself is a very good step in supporting increased interoperability, internally and between organizations.

Runtime Expressions

With the inclusion of this well-defined runtime expression format, OAS removes a large amount of ambiguity for consumers and tool developers. This allows the API designer to add a lot of value enhancing the ease of use for consumers and integrators.
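To give a sense of how little guesswork is left, here is a Python sketch that evaluates a small subset of runtime expressions (`$url`, `$method`, `$statusCode`, and the `header.` and `body#/...` forms). The `exchange` dictionary shape is my own assumption, and query and path parameters are not handled.

```python
# Hypothetical record of one request/response exchange.
exchange = {
    "url": "https://example.org/widget",
    "method": "POST",
    "statusCode": 201,
    "request": {"headers": {"accept": "application/json"}, "body": {"name": "foo"}},
    "response": {
        "headers": {"location": "https://example.org/widget/7"},
        "body": {"id": 7, "tags": ["a", "b"]},
    },
}


def evaluate(expression, exchange):
    """Evaluate a small subset of runtime expressions against one exchange."""
    if expression in ("$url", "$method", "$statusCode"):
        return exchange[expression[1:]]
    source, _, rest = expression[1:].partition(".")
    message = exchange[source]  # "request" or "response"
    if rest.startswith("header."):
        # Assumption: header names are stored lower-cased in the exchange.
        return message["headers"][rest[len("header."):].lower()]
    if rest.startswith("body"):
        node = message["body"]
        pointer = rest[len("body"):]  # "" or "#/json/pointer"
        tokens = pointer[2:].split("/") if pointer else []
        for token in tokens:
            node = node[int(token)] if isinstance(node, list) else node[token]
        return node
    raise ValueError("unsupported expression: " + expression)


print(evaluate("$response.body#/id", exchange))          # 7
print(evaluate("$response.header.location", exchange))   # https://example.org/widget/7
```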

A Mixed Bag

These items are included simply because a tool’s utility isn’t determined when it is created. The optional nature of the definition or use cases of the responses object and the discriminator opens them up to the potential for unnecessary ambiguity and misuse.

Responses Object

All of the benefits I mentioned in the components section also apply to the responses object. My concern centers around the enumeration of the different expected responses. The authors deserve credit for immediately pointing out this shouldn’t be relied on as the full range of possible responses. My experience has shown that designers, tool developers, and end consumers are prone to missing the fine print or the assumption, and subsequently over-relying on these types of features.
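A defensive consumer can treat the enumeration as a hint rather than a guarantee. This Python sketch looks up a status in a Responses Object, falling back from the exact code to a range key such as `4XX` and then to `default`; the sample responses and the function name are illustrative.

```python
# Hypothetical Responses Objects, trimmed to descriptions only.
responses = {
    "200": {"description": "A widget"},
    "4XX": {"description": "Client error"},
    "default": {"description": "Unexpected error"},
}


def documented_response(responses, status):
    """Find the documented response for a status, or None if nothing applies."""
    code = str(status)
    # An exact code takes precedence over a range, which takes precedence over default.
    for key in (code, code[0] + "XX", "default"):
        if key in responses:
            return responses[key]
    return None


print(documented_response(responses, 404)["description"])    # Client error
print(documented_response({"200": {}}, 503))                  # None
```

The `None` branch is the important one: a consumer should have a plan for responses nobody wrote down.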

Discriminator

For the purpose it serves I think the discriminator as defined is a very elegant solution which helps to differentiate OAS from standard CRUD.  It allows for the use of hierarchical and non-hierarchical polymorphism alike, for more concise and reusable designs.  However, it still fundamentally ties the API to design time defined data formats.
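In practice the discriminator boils down to a lookup from a property value to a schema. A simplified Python sketch, using the `propertyName` and `mapping` fields from the spec with an invented pet example:

```python
# Hypothetical Discriminator Object.
discriminator = {
    "propertyName": "petType",
    "mapping": {"dog": "#/components/schemas/Dog"},
}


def schema_for(payload, discriminator):
    """Pick the schema reference a payload claims to conform to."""
    value = payload[discriminator["propertyName"]]
    mapping = discriminator.get("mapping", {})
    # Simplification: without an explicit mapping entry, treat the value as a schema name.
    return mapping.get(value, "#/components/schemas/" + value)


print(schema_for({"petType": "dog", "name": "Rex"}, discriminator))  # #/components/schemas/Dog
print(schema_for({"petType": "Cat", "name": "Tom"}, discriminator))  # #/components/schemas/Cat
```

Note that every possible answer still has to exist in the document before the first request is made, which is the design time coupling I mean.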

Room for Improvement

The Extension Mechanism

With its obvious resemblance to the now long deprecated format of custom HTTP headers, this section should follow the spec’s own well-designed components format. This upgrade could use the composition rules defined within the spec to allow much better support from tooling developers, and more consistent interoperability.

It’s All Static

While the authors have done an excellent job removing a lot of static portions out of the spec, it is still fundamentally static at its core.  Fortunately the static nature of the format is largely limited to a small section of the document thus allowing designers and developers much more room to innovate after design time.

Intertwined Protocol and Application Design

In computer science it is always immensely difficult to know precisely where to create boundaries for improved separation of concerns. The OAS specification was not created in an ivory tower bubble. It was created to solve real problems in real time. Unfortunately, it still bears scars from this period by mixing protocol design concerns with application design concerns. Each application design component is also able to declare protocol properties, in a mix which wouldn’t allow for protocol portability. If protocol concerns like HTTP headers and response codes were abstracted to external definitions or formats, then the reuse of OAS could bridge nearly all relevant protocols. However, there would be one thing left to prevent the specification’s portability: the path.

Path Is The Base Abstraction

This gets back to the point raised in my cheeky tweet. By using the URL path as the primary abstraction, the specification creates the possibility of many future operational, developmental, and maintenance issues. Recently even the quickly growing GraphQL community has joined voices with hypermedia proponents to point out how this subtle design flaw can develop into severe issues.
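The contrast between binding to a path and binding late can be sketched in a few lines of Python. Both functions, the document shape, and the URLs are hypothetical.

```python
def widget_url_static(widget_id):
    """Bound at design time: the client has the path structure baked in."""
    return "https://example.org/widget/" + widget_id


def widget_url_late_bound(document, rel="item"):
    """Bound at run time: follow whatever target the server currently advertises."""
    for link in document.get("links", []):
        if link["rel"] == rel:
            return link["href"]
    return None


# The server has moved widgets under a new path and says so in its responses.
document = {"links": [{"rel": "item", "href": "https://example.org/v2/widgets/7"}]}

print(widget_url_static("7"))           # https://example.org/widget/7 (now stale)
print(widget_url_late_bound(document))  # https://example.org/v2/widgets/7
```

The first client needs a new release when the path changes; the second just follows the link.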

Bringing It All Together

The purpose of this post isn’t pointing out all the flaws in OAS but to give a pragmatic review of the state of the specification.  If you want to see a more in depth analysis take a look at Swagger isn’t user friendly.

In the end, if you’re going to opt for an alternative to hypermedia then OAS is about as close as you can get at this point. The ecosystem fits extremely well in the wide berth between a single user service and massive scale where every byte counts. If your service design hasn’t been updated in the last 10 years or is nonstandard, it’s very likely OAS 3 would be a massive improvement, and it represents today’s best ‘good enough’ solution.

Some of these necessary improvements are easy to handle; others will require more finesse to mitigate, if they are addressed at all. One thing is clear: if your project is still using custom API designs, or you spend too much time managing older service designs, and you don’t have time to contribute to a hypermedia alternative, then OAS is worth your serious consideration.