Observability is Cultural

November 04, 2022

Table of Contents

Chaos Engineering as Inspiration
Do We Need Experts?
A Dashboard of Our Own
Growing with Observability
Finally

I’m guilty of applying the word debugging to practically anything.

My kids’ Legos won’t fit? Let’s debug that.

Observability is one of the few disciplines that actually warrant that moniker: it is debugging. But traditional debugging doesn’t really fit observability practices. I usually call it “precognitive debugging”. For observability troubleshooting to be effective, we need a rough idea, in advance, of what our debugging process will look like.

Note that this doesn’t apply to developer observability, which is a special case. That’s a more dynamic process that more closely resembles a typical debugging session. This post is about more traditional monitoring and observability, where we first need to instrument the system and add logs, metrics, etc. to cover the information we will need when we later investigate an issue.

I wrote before about the scourge of overlogging. The same applies to observability metrics: as we collect more and more data, the costs of retention and processing quickly outweigh the benefits of observability. We end up with a bigger problem altogether. We need to pick our battles, log the “right amount” and monitor the “right amount”. No more and no less than we need. For that, we need to understand the risks we’re dealing with and try to maximize overlap, so that each signal we collect serves as many investigations as possible.

Chaos Engineering as Inspiration

In the tradition of Chaos Engineering, we would organize a “game” orchestrated by the “master of disaster” to practice disaster readiness. This is a wonderful exercise and a great way to build that “muscle”. However, it isn’t the right fit for an observability architecture, since observability deals with nuance as opposed to “fire”.

Observability requires a similar game, but a deliberate one, where our team competes to find the ways in which our system can fail. Think of it as bingo. Once we have a spreadsheet full of potential failures, we need to map each failure to the observability we would like to have for it. For example, in case of a hack, we’d like to have the user ID logged whenever any restricted resource is accessed.
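As a minimal sketch of that last example, the following shows one way to log the user ID at the point where a restricted resource is served. It uses only java.util.logging from the JDK. The class name, logger name, message format, and sample values are assumptions made for illustration, not part of any particular framework.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: all names here are illustrative assumptions.
final class RestrictedAccessAudit {
    private static final Logger AUDIT = Logger.getLogger("audit.restricted");

    // Builds the audit line in a fixed key=value format so it is easy to search later.
    static String auditMessage(String userId, String resource) {
        return "restricted_access userId=" + userId + " resource=" + resource;
    }

    // Intended to be called from the single choke point that serves restricted resources.
    static void recordAccess(String userId, String resource) {
        AUDIT.log(Level.INFO, auditMessage(userId, resource));
    }

    public static void main(String[] args) {
        recordAccess("user-42", "/admin/reports");
    }
}
```

Keeping the message format in one method means the team agrees once on what an access record looks like, which makes it easier to query during an investigation.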

Once we chart all of those desires, we can review them and try to unify some metrics and logs. Then we implement them so our observability can answer everything we need to track down an issue.
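The review step can be sketched in code, assuming the spreadsheet is reduced to rows of “potential failure” and “signal that would expose it”. Grouping the rows by signal shows where one log or metric covers several failures, which is a candidate for unification. The class, the record, and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the brainstorming spreadsheet; names and rows are illustrative.
final class FailureSignalMap {
    // One spreadsheet row: a potential failure and the signal we would want for it.
    record Row(String failure, String signal) {}

    // Groups failures by signal so that shared signals stand out during review.
    static Map<String, List<String>> failuresBySignal(List<Row> rows) {
        Map<String, List<String>> bySignal = new TreeMap<>();
        for (Row row : rows) {
            bySignal.computeIfAbsent(row.signal(), k -> new ArrayList<>()).add(row.failure());
        }
        return bySignal;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("account takeover", "log: user id on restricted access"),
            new Row("privilege escalation bug", "log: user id on restricted access"),
            new Row("login outage", "metric: login error rate"));
        failuresBySignal(rows).forEach((signal, failures) ->
            System.out.println(signal + " covers " + failures));
    }
}
```

A signal that covers many failures is cheap coverage. A failure that needs a signal of its own is where the cost discussion should start.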

Will we miss some things?

Obviously. That’s part of the process. We will need to iterate and tune this. It will probably require a reduction of volume for some expensive data points to keep the costs reasonable. We will undoubtedly run into issues that aren’t covered by observability (or whose observability coverage isn’t obvious). In both cases we will need some help.
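One common way to reduce the volume of an expensive data point is sampling. The sketch below keeps one event out of every N and drops the rest. It is an illustration under assumed names, not a recommendation for a specific rate, and real observability tools usually offer their own sampling settings.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sampler: records the 1st, (N+1)th, (2N+1)th... event and skips the others.
final class SampledRecorder {
    private final int keepOneIn;
    private final AtomicLong seen = new AtomicLong();

    SampledRecorder(int keepOneIn) {
        if (keepOneIn < 1) {
            throw new IllegalArgumentException("keepOneIn must be >= 1");
        }
        this.keepOneIn = keepOneIn;
    }

    // Thread-safe: each call consumes exactly one position in the sequence.
    boolean shouldRecord() {
        return seen.getAndIncrement() % keepOneIn == 0;
    }

    public static void main(String[] args) {
        SampledRecorder sampler = new SampledRecorder(10);
        int kept = 0;
        for (int i = 0; i < 100; i++) {
            if (sampler.shouldRecord()) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of 100 events");
    }
}
```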

Do We Need Experts?

Some observability fans assume that we no longer need domain experience to debug a problem. Given a properly observable system we should be able to understand the problem without knowing anything about the system.

While I agree that an expert in debugging can probably solve a problem faster, possibly faster than a domain expert, I still have my doubts. For a decade I worked as a consultant, going to companies where I used profilers, debuggers, etc. As part of that job, I found issues that had escaped people who were greater domain experts than I was. So there’s some merit behind that claim.

But debugging requires some familiarity with the system that we’re trying to understand. It’s like diagnosing through Google. We might occasionally identify the cause better than our GP, but probably no better than an expert. Obviously there are exceptions to the rule, but in my experience, experience matters for any type of debugging.

A Dashboard of Our Own

One thing I often see is a universal “one size fits all” dashboard in a company. Grafana is a fantastic tool with remarkable flexibility, yet some companies expose its visualizations as a single company-wide dashboard. There should be at least three dashboards for the application:

High level - CTO/VP R&D level. Focuses on business metrics, users, reliability, and costs
DevOps - Low-level information about the environment
Developers - Application-specific metrics and platform information

There’s a lot of overlap there. But we need custom dashboards. The whole idea of the dashboard is to see everything that matters in one place. CPU utilization on the container might be interesting to me in general, but more likely than not it will just be a distraction. I want to know if there’s a problem with the authorization system because users are experiencing increased error rates logging in. These metrics should be front and center.
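To make the login example concrete, here is a minimal sketch of the number such a dashboard panel would show: the fraction of login attempts that failed. The class and method names are assumptions. In a real system the counters would normally be published through a metrics library and graphed in Grafana rather than kept in a hand-rolled class.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process tracker for the login error rate; names are illustrative.
final class LoginErrorRate {
    private final LongAdder attempts = new LongAdder();
    private final LongAdder failures = new LongAdder();

    // Called once per login attempt by the authorization code path.
    void record(boolean success) {
        attempts.increment();
        if (!success) {
            failures.increment();
        }
    }

    // Fraction of attempts that failed; 0.0 when there has been no traffic yet.
    double errorRate() {
        long total = attempts.sum();
        return total == 0 ? 0.0 : (double) failures.sum() / total;
    }

    public static void main(String[] args) {
        LoginErrorRate rate = new LoginErrorRate();
        rate.record(true);
        rate.record(true);
        rate.record(true);
        rate.record(false);
        System.out.println("login error rate: " + rate.errorRate());
    }
}
```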

When I open a new tab in my browser, I see Grafana. This should be the home page for every team member. The “healthy” view of our system should be etched into our minds so we can instantly notice small deviations in the environment and act accordingly.

Growing with Observability

As our system grows, we need to include observability and metrics in the pull request that introduces a feature. Nothing can launch without observability on day one. It should be etched into the code review process and should be on par with test coverage requirements.

Unlike test coverage, we have no metric we can rely on to verify that observability is sufficient for our rapidly evolving needs, so at this time this is a heavy load on the shoulders of the reviewers. But there’s an even bigger load: cost. As we grow, these changes can affect cost, which can suddenly spike to bankruptcy-inducing heights. Cost isn’t always easy to monitor, but it’s a gauge we should look at on a daily basis. By keeping track of that metric and catching spikes in cost early on, we can keep our systems stable and manageable without giving up cost effectiveness.
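A daily cost check can be very simple. The sketch below flags a spike when today’s cost exceeds the average of the previous days by a chosen factor. The class name, the factor, and the sample figures are assumptions for illustration, and the right threshold depends on how noisy our own bill is.

```java
import java.util.List;

// Hypothetical daily check; the threshold factor is an assumption to tune per system.
final class CostSpikeCheck {
    // True when today's cost is more than `factor` times the average of the previous days.
    static boolean isSpike(List<Double> previousDailyCosts, double todayCost, double factor) {
        if (previousDailyCosts.isEmpty()) {
            return false; // no baseline yet, so nothing to compare against
        }
        double average = previousDailyCosts.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(0.0);
        return todayCost > average * factor;
    }

    public static void main(String[] args) {
        List<Double> lastDays = List.of(100.0, 110.0, 90.0);
        System.out.println("spike today? " + isSpike(lastDays, 180.0, 1.5));
    }
}
```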

Finally

Some engineers are overly infatuated with metrics. I’m not one of them. Some things can’t be measured. The value of personal relationships. The value of a team. A community. Because of this obsession, observability is gaining in popularity. That’s good and bad. With this obsession, we sometimes overlog and over-observe, which results in poor performance and cost overruns.

We should apply observability with a scalpel, not with a shovel. This shouldn’t be something we delegate to the DevOps team as an afterthought. It should be a group effort that we constantly refine as we move along. We should keep our finger on the pulse of our metrics and have domain-specific dashboards that keep the things that matter constantly in our peripheral vision. Observability doesn’t matter if we don’t bother looking.

Shai Almog

Author

Author, DevRel, Blogger, Open Source Hacker, Java Rockstar, Conference Speaker, Instructor and Entrepreneur.
