Observability

It’s 2AM Do you Know What Your Code is Doing?

September 01, 2023
1883 Unique Views
6 min read

Table of Contents

Observability isn’t PillarsTracing and OpenTelemetryBackground Developer ObservabilityEnough ExpositionHow Does This Help at 2AM?Final Word

Observability isn’t Pillars
Tracing and OpenTelemetry
Background Developer Observability
Enough Exposition
How Does This Help at 2AM?
Final Word

Once we press the merge button, that code is no longer our responsibility. If it performs sub-optimally or has a bug, it is now the problem of the DevOps team, the SRE, etc. Unfortunately, those teams work with a different toolset. If my code uses up too much RAM, they will increase RAM. When the code runs slower, they will increase CPU. In case the code crashes, they will increase concurrent instances.

If none of that helps they will call you up at 2AM. A lot of these problems are visible before they become a disastrous middle of the night call. Yes. DevOps should control production, but the information they gather from production is useful for all of us. This is at the core of developer observability which is a subject I’m quite passionate about. I’m so excited about it I dedicated a chapter to it in my debugging book.

Back when I wrote that chapter I dedicated most of it to active developer observability tools like Lightrun, Rookout, et al. These tools work like production debuggers, they are fantastic in that regard. When I have a bug and know where to look I can sometimes reach for one of these tools (I used to work at Lightrun so I always use it).

But there are other ways. Tools like Lightrun are active in their observability, we add a snapshot similarly to a breakpoint and get the type of data we expect. I recently started playing with Digma which takes a radically different approach to developer observability. To understand that we might need to revisit some concepts of observability first.

Observability isn’t Pillars

I’ve been guilty of listing the pillars of observability just as much as the next guy. They’re even in my book (sorry). To be fair, I also discussed what observability really means…

Observability means we can ask questions about our system and get answers or at least have a clearly defined path to get those answers. Sounds simple when running locally, but when you have a sophisticated production environment and someone asks you: is anyone even using that block of code?

How do you know?

You might have lucked out and had a log in that code and it might still be lucky that the log is in the right level and piped properly so you can check. The problem is that if you added too many logs or too much observability data, you might have created a disease worse than the cure: over-logging or over-observing.

Both can bring down your performance and significantly impact the bank account, so ideally we don’t want too many logs (I discuss over-logging here) and we don’t want too much observability.

Existing developer observability tools work actively. To answer the question if someone is using the code I can place a counter on the line and wait for results. I can give it a week's timeout and find out in a week. Not a terrible situation but not ideal either, I don’t have that much patience.

Tracing and OpenTelemetry

It’s a sad state of affairs that most developers don’t use tracing in their day-to-day job. For those of you who don’t know it, it is like a call stack for the cloud. It lets us see the stack across servers and through processes. No, not method calls. More at the entry point level, but this often contains details like the database queries that were made and similarly deep insights.

There’s a lot of history with OpenTelemetry which I don’t want to get into, if you’re an observability geek you already know it and if not then it’s boring. What matters is that OpenTelemetry is taking over the world of tracing. It’s a runtime agent which means you just add it to the server and you get tracing information almost seamlessly. It’s magic.

It also doesn’t have a standard server which makes it very confusing. That means multiple vendors can use a single agent and display the information it collects to various demographics:

A vendor focused on performance can show the timing of various parts in the system.
A vendor focused on troubleshooting can detect potential bugs and issues.
A vendor focused on security can detect potential risky access.

Background Developer Observability

I’m going to coin a term here since there isn’t one: Background Developer Observability. What if the data you need was already here and a system already collected it for you in the background?

That’s what Digma is doing. In Digma's terms, it's called Continuous Feedback. Essentially, they’re collecting OpenTelemetry data, analyzing it and displaying it as information that’s useful for developers. If Lightrun is like a debugger, then Digma is like SonarQube based on actual runtime and production information.

The cool thing is that you probably already use OpenTelemetry without even knowing it. DevOps probably installed that agent already, and the data is already there!

Going back to my question, is anyone using this API?

If you use Digma you can see that right away. OpenTelelbery already collected the information in the background and the DevOps team already paid the price of collection. We can benefit from that too.

Enough Exposition

I know, I go on… Let’s get to the meat and potatoes of why this rocks. Notice that this is a demo, when running locally the benefits are limited. The true value of these tools is in understanding production, still they can provide a lot of insight even when running locally and even when running tests.

Digma has a simple and well-integrated setup wizard for IntelliJ/IDEA. You need to have Docker Desktop running for setup to succeed. Note that you don’t need to run your application using Docker, this is simply for the Digma server process where they collect the execution details.

Once it is installed, we can run our application, in my case I just ran the JPA unit test from my latest book and it produced standard traces which are already pretty cool, we can see them listed below:

Image description

When we click a trace for one of these, we get the standard trace view, this is nothing new, but it’s really nice to see this information directly in the IDE and readily accessible. I can imagine the immense value this will have for figuring out CI execution issues:

Image description

But the real value and where Digma becomes a “Developer Observability” tool instead of an Observability tool, is with the tool window here:

Image description

There is a strong connection to the code directly from the observability data and deeper analysis which doesn’t show in my particular overly simplistic hello world. This Toolwindow highlights problematic traces, errors and helps understand real-world issues.

How Does This Help at 2AM?

Disasters happen because we aren’t looking. I’d like to say I open my observability dashboard regularly but I don’t. Then when there’s a failure I take a while to get my bearings within it. The locality of the applicable data is important, it helps us notice issues when they happen. Detect regressions before they turn to failures and understand the impact of the code we just merged.

Prevention starts with awareness and as developers, we handed our situational awareness to the DevOps team.

When the failure actually happens, the locality and accessibility of the data makes a big difference. Since we use tools that integrate in the IDE daily this reduces the meantime to a fix. No, a background developer observability tool might not include the information we need to fix a problem. But if it does, then the information is already there and we need nothing else. That is fantastic.

Final Word

With all the discussion about observability and open telemetry, you would think everyone is using them. Unfortunately, the reality is far from that. Yes, there’s some saturation and familiarity in the DevOps crowd. This is not the case for developers.

This is a form of environmental blindness. How can our teams who are so driven by data and facts proceed with secondhand and often outdated data from OPS?

Should I spend time further optimizing this method or will I waste the effort, since few people use it?

We can benchmark things locally just fine, but real-world usage and impact are things that we all need to improve.

Don’t Forget to Share This Post!

Observability

Shai Almog

Author

Author, DevRel, Blogger, Open Source Hacker, Java Rockstar, Conference Speaker, Instructor and Entrepreneur.

Testing an OpenRewrite Recipe

Foojay Podcast #75: JCON Report, Part 4 – Tips and Tricks for Java Devs

Data Modeling for Java Developers: Structuring With PostgreSQL and MongoDB

Creating Scalable OpenAI GPT Applications in Java

Clean and Modular Java: A Hexagonal Architecture Approach

Building a Real-Time AI Fraud Detection System with Spring Kafka and MongoDB

Dissection of Joeffice: Open Source Office Suite in Java

Prime Time: The High Performance Java Event

Project Panama for Newbies (Part 1)

How I Improved Zero-Shot Classification in Deep Java Library (DJL) OSS

foojay: A Place for Friends of OpenJDK

Dashboard for OpenJDK Update Release Details

JDK14: New Features and Enhancements

Fun with Flags: My Top 10 Resources for JVM Flags

Performance of Modern Java on Data-Heavy Workloads: Real-Time Streaming

Performance of Modern Java on Data-Heavy Workloads: Batch Processing

How does Java handle different Images and ColorSpaces – Part 1

How does Java handle different Images and ColorSpaces – Part 2

How does Java handle different Images and ColorSpaces – Part 3

How does Java handle different Images and ColorSpaces – Part 4

Indexing all of Wikipedia, on a laptop

Working with Multiple Carets in IntelliJ IDEA

Clean Shutdown of Spring Boot Applications

Java 17 on the Raspberry Pi

How to Create Mobile Apps with JavaFX (Part 1)

Project Panama for Newbies (Part 1)

Foojay Slack: bit.ly/join-foojay-slack

Beginning JavaFX Applications with IntelliJ IDE

SpringBoot 3.2 + CRaC

Debugging Java on the Command Line

Apache Kafka Performance on Azul Platform Prime vs Vanilla OpenJDK

Learn about a number of experiments that have been conducted with Apache Kafka performance on Azul Platform Prime, compared to vanilla OpenJDK. Roughly 40% improvements in performance, both throughput and latency, are achieved.

Stable, Secure, and Affordable Java

Azul Platform Core is the #1 Oracle Java alternative, offering OpenJDK support for more versions (including Java 6 & 7) and more configurations for the greatest business value and lowest TCO.

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Stable, Secure, and Affordable Java

Jakarta EE 11: Beyond the Era of Java EE

It’s 2AM Do you Know What Your Code is Doing?

Observability isn’t Pillars

Tracing and OpenTelemetry

Background Developer Observability

Enough Exposition

How Does This Help at 2AM?

Final Word

Shai Almog

Shai Almog

Thanks to our Sponsors!

Azul

Redis

CodeRabbit

Reo

Zencoder

Payara

Digma

adesso

Trending

Apache Kafka Performance on Azul Platform Prime vs Vanilla OpenJDK

Stable, Secure, and Affordable Java

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Jakarta EE 11: Beyond the Era of Java EE

Stable, Secure, and Affordable Java

Comments (0)

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Stable, Secure, and Affordable Java

Jakarta EE 11: Beyond the Era of Java EE

Do you want your ad here?

It’s 2AM Do you Know What Your Code is Doing?

Observability isn’t Pillars

Tracing and OpenTelemetry

Background Developer Observability

Enough Exposition

How Does This Help at 2AM?

Final Word

Shai Almog

Shai Almog

Thanks to our Sponsors!

Azul

Redis

CodeRabbit

Reo

Zencoder

Payara

Digma

adesso

Trending

Apache Kafka Performance on Azul Platform Prime vs Vanilla OpenJDK

Stable, Secure, and Affordable Java

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Jakarta EE 11: Beyond the Era of Java EE

Stable, Secure, and Affordable Java

Do you want your ad here?

Related Articles

Comments (0)

Set Event Reminder

Subscribe to foojay updates:

Share with