Our Take on Observability

What Is Observability?

Before we take the idea of observability off the rails, we should probably start with the boring definition just so we're all on the same page. The original definition comes from control theory and states that

“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs”

Now, given that control theory started becoming formal in the late 1800s or so, we've adapted the definition a little over time to be more meaningful for software: instead of "external outputs", we say something more like "gathered telemetry". That telemetry typically takes the form of the "three pillars of observability": metrics, logs, and traces. All of this is pretty well agreed upon across the industry, and while we largely agree with this definition, we also think it's lacking. The only thing left to do now is to add our own definition into the world…

Observability is the practice of having a team culture that builds metrics, logs, and traces into its services from the beginning of development and can access that data in a reasonable manner in order to understand the state of the service at any given time.

Okay, so it’s more of a mouthful, but hopefully you can see the three areas that we believe are lacking across industry definitions:

  1. Your team should (and should want to) implement and utilize this data

  2. Observability needs to come first

  3. Data should be accessible in a reasonable manner

You (Should) Want and Need This Data

The first addition to the definition was around having a team that actually implements and wants to use this data. We have seen many instances of teams straight up rejecting the implementation of telemetry, not caring about its existence, not understanding its value, or holding plenty of other "negative" sentiments around observability data. These attitudes lead to a bad place where telemetry data either isn't present or is ignored, which of course leads to bigger problems. At the end of the day, your software is only as good as its regular operation. If your software does not operate, then it is not good. If you don't have data to tell you whether it is operating, then the software is still not good, because you have no guarantee that it is actually running. While this is a gross oversimplification, without the most basic telemetry you can find your business losing money pretty quickly without knowing why.

A developer once said to us, "Why do I need to care as long as it works when I push it to production?" This same developer was also pretty upset when the sole task of his next two sprints was performance improvements driven by user complaints. Shortly after completing those tasks, he realized the problem could have been addressed over time if he had just paid attention to the telemetry that was already implemented within the codebase.


When a team takes it upon itself to add and enhance the telemetry coming from its services, it can react faster when there are problems, has a better idea of where to start on improvements, and gains a myriad of other benefits. Having this data can also help you avoid the dreaded conversation with a V/C level about why revenue was lost over the weekend and nobody was aware. And if you're still not sold on why you need this data: if you ever see your boss coming while you're browsing Reddit, you can just pull up a dashboard and act deep in thought. Works every time.
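If "adding telemetry" sounds like a big lift, it usually isn't. As a rough sketch - using OpenTelemetry's JavaScript API, with a hypothetical service and metric names invented purely for illustration - this is roughly what it takes to get a trace span and a business-level counter out of a single handler:

```typescript
import { metrics, trace, SpanStatusCode } from '@opentelemetry/api';

// Hypothetical service and metric names, for illustration only.
const meter = metrics.getMeter('checkout-service');
const tracer = trace.getTracer('checkout-service');

// A business-level metric: how many orders we process, and how they end.
const ordersProcessed = meter.createCounter('orders_processed_total', {
  description: 'Orders processed, tagged by outcome',
});

export async function processOrder(orderId: string): Promise<void> {
  // Wrap the work in a span so slow or failing orders show up in traces.
  await tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // ... actual business logic goes here ...
      ordersProcessed.add(1, { outcome: 'success' });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      ordersProcessed.add(1, { outcome: 'failure' });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Note that this assumes an OpenTelemetry SDK is configured elsewhere in the app to actually export the data; without it, these calls are harmless no-ops, which is also exactly how the "telemetry exists but nobody looks at it" failure mode sneaks in.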

Observability Must Come First

You should be thinking about observability before you start thinking about the first line of code, not after an incident has occurred and you need to make sure it never happens again. When we say this, we don't mean thinking about specific metrics or log lines - we mean thinking along the lines of "if this app fails, how will I know?". While most off-the-shelf observability tools will plug right into an existing application and give you data almost immediately, that kind of implementation is often a knee-jerk reaction to the previously mentioned V/C level conversation. If you design your application with observability in mind, you can build dashboards and alerts that identify failure states as you go, and avoid that conversation in the first place. This process should also help you identify additional telemetry to fill any gaps you find, so you move into production with a rich understanding of the state of your service. There's not much more to say here, as this point ties directly into the previous one about your team wanting this data - hopefully observability coming first will influence the team, or vice versa, when it comes to future development.
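One way to make "if this app fails, how will I know?" concrete is to declare the service's key signals before any business logic exists, so dashboards and alerts can be built against known names from day one. Here's a minimal sketch, again using OpenTelemetry's JavaScript API with made-up names:

```typescript
import { metrics } from '@opentelemetry/api';

// Hypothetical service name, for illustration only.
const meter = metrics.getMeter('booking-service');

// "If this app fails, how will I know?" answered as concrete signals,
// declared before any handlers are written.
export const requestDuration = meter.createHistogram('http_request_duration_ms', {
  description: 'End-to-end request latency',
  unit: 'ms',
});

export const bookingFailures = meter.createCounter('booking_failures_total', {
  description: 'Bookings that could not be completed, tagged by reason',
});

// Handlers written later record against these well-known names, e.g.:
//   requestDuration.record(elapsedMs, { route: '/book' });
//   bookingFailures.add(1, { reason: 'payment_declined' });
```

The specific signals don't matter much; what matters is that the failure-state questions get answered with named telemetry early enough that the dashboards and alerts exist before the first incident does.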

Data Must Be Accessible

There’s a number of ways to define accessible, but what we’re talking about in this context is the proper spread of tooling as well as the proper access for people utilizing it. It becomes hard to be prescriptive in this area because one size does not fit all across the industry. This is a blog post, and if we were to explain what accessible and proper tool spread means for a bank, a retail site, and a site to book dog groomers, we’d be here all day. So in that vein, the smarter way to describe this would be to say that data should only be siloed where necessary (due to regulations or legal requirements) and you should utilize the tools that are needed, no more. The first part of that statement should be obvious - an app developer should have access to network metrics and so on, but maybe an intern shouldn’t have access to payment information or protected health information. Don’t restrict data that doesn’t need to be restricted.

The second part of the statement is a little more difficult. The primary driver behind it is having seen many organizations implement one tool per use case and still run into long mean times to identify and resolve incidents. Having many single-purpose tools means looking in many places when there's a novel problem, and while you might get lucky with the first place you look, you cannot count on that. Consolidating tools where you can (especially if they tie data together) is a huge benefit and gives you a clearer understanding of what may be happening, with additional context. There are many tools on the market that cover all the pillars of observability and more, and they're incredibly useful, but the flip side is that you also might want to avoid shoving everything into a single tool. If you're running a Node-based site in Kubernetes on AWS and all of that fits into a single tool out of the box, that's awesome. But if you find yourself saying "tool X does this, but tool Y doesn't…how do I make tool Y do it?", you might be going down the wrong path.

To Wrap Up…

Observability is more than the data that a service generates - it should also include your team being willing and eager to implement and utilize that data, as well as making smart decisions regarding where that data is stored and how it’s accessed. At the end of the day, what good is data that isn’t being used?
