The Costs of Hidden Logging

October 11, 2021


A while ago I received a customer escalation ticket regarding performance degradation when using the Datadog Continuous Profiler for Java. The degradation showed up as increased CPU usage as well as unexpected latency.

The Beginning: Unusual Customer Escalation

To bootstrap the troubleshooting, I usually try to isolate the area which might be causing the regression.

The profiler is packaged as a Java agent and for ease of use it is bundled together with the Datadog Java tracer agent. The profiler itself uses JDK Flight Recorder (JFR) under the hood.

This led me to two main suspects:

either the code injected by the tracer was affecting the application in unexpected ways,
or the profiler simply generated too many JFR events, leading to extra strain on I/O as well as CPU resources (in case the events need to collect stack traces).

Both of these cases are easily verifiable: the tracer can be disabled so that the profiler runs in isolation, and the profiler can be tuned to reduce the number of events it generates by following the troubleshooting guide.

Unfortunately, the regression persisted no matter which profiler configuration was used.

To make things more interesting, the customer additionally reported that the system seemed to behave ‘normally’ even with both tracer and profiler enabled and with the default profiler settings, that is, until they ran a load test there. During the load test the latency would spike and after the load test had finished the CPU usage would stabilize at over 10% more than what had been observed before the load test run.

The CPU usage graph would look something like this.

Well, that makes things clear, doesn’t it?

And Down the Rabbit Hole...

So, there I was, with a confirmed performance degradation, obviously caused by the profiler, but with no explanation and, what was worse, no local reproducer. I had to rely on whatever data I could get from the collected profiles, preferably without disturbing the customer with too many restarts and redeployments. The number of initial hints was surprisingly small: just that the regression was triggered by a common load test run and that the application was running on JDK 8u.

With nothing to start with, I resorted to checking the JFR recordings and the profiling metrics in the hope of detecting some pattern. After some time spent randomly selecting profiles and looking for anything unexpected, it finally hit me. The profiler metrics were really strange.

Why is the "JFR Periodic Task" thread even shown here? Usually, JFR tries to stay as "quiet" as possible in terms of heap allocation so as not to disturb the observed application. So why is JFR suddenly THE memory hog here?

Elimination Round

The "JFR Periodic Task" thread takes care of emitting periodic events. That was the good news: there are only a few built-in periodic events and the customer was not adding any of their own. That meant that I would be able to identify the offender pretty fast, even considering that each change had to be discussed with the customer and then deployed by them.

As luck would have it, my first candidate was jdk.NativeExecutionSample, which is used to collect a sample of stack traces for threads executing native code on behalf of the JVM (e.g., JNI code). Lo and behold: after disabling that event the overhead was gone.
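For readers who want to try the same elimination step outside of the Datadog profiler, here is a minimal sketch using the plain jdk.jfr API. It is not how the event was disabled for the customer (the profiler has its own configuration); the class and method names are made up for illustration, and it assumes a JDK that ships the jdk.jfr module (JDK 11 or later, or an 8u build that includes the JFR backport).

```java
import jdk.jfr.Recording;

// Hypothetical helper class; only the event name comes from the article.
final class DisableNativeSample {
    static final String EVENT = "jdk.NativeExecutionSample";

    /** Creates a recording (not yet started) with the native sampling event switched off. */
    static Recording newRecordingWithoutNativeSamples() {
        Recording recording = new Recording();
        // Disabling by name works for built-in events as well as custom ones.
        recording.disable(EVENT);
        return recording;
    }

    public static void main(String[] args) {
        try (Recording recording = newRecordingWithoutNativeSamples()) {
            // Settings are exposed as "<event name>#<setting name>" keys.
            System.out.println(EVENT + "#enabled = "
                    + recording.getSettings().get(EVENT + "#enabled"));
        }
    }
}
```

Disabling one periodic event at a time in this way is what makes an elimination round cheap: each candidate needs only a one-line settings change.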

So, that would be it. Problem fixed!

Root Causing

Although disabling that particular event did resolve the escalation ticket, I was really curious what the real root cause was. I double-checked the event implementation and I was 100% confident that it was not doing any on-heap allocation, especially since it is a "native" JFR event, written completely in C++.

So, in order to gain more insight into what was happening in the system, I added the -Xlog:jfr JVM argument to enable JFR internal logging. The resulting logs didn’t show anything extraordinary, except for the fact that each time a periodic event was emitted a bunch of log lines were added. But log messages should only be built when logging is requested, shouldn’t they? Well… apparently not: https://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/4eae74c62a51/src/share/classes/jdk/jfr/internal/Logger.java

As it turns out, during the JDK 8 backport of JFR a shortcut was taken when adjusting the code to the missing JVM logging support. As a result, the logger would always invoke the log message supplier, regardless of the current logging settings, and forward the message to the internal logger. The internal logger, in turn, would promptly discard the message unless running with the required logging level.

This definitely has the potential to allocate a lot. And with a lot of garbage there comes quite high GC pressure, eating away the CPU and increasing the application latency.
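To make the cost concrete, here is a small sketch of that pattern. It is a hypothetical stand-in, not the actual JDK source: the class name EagerLogger and its counters are invented for illustration. The point is only that the supplier runs, and allocates its string, before anyone checks whether logging is on.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a logger that builds every message eagerly.
final class EagerLogger {
    private final boolean enabled;
    private int messagesBuilt;   // how many times a supplier was invoked
    private int messagesWritten; // how many messages reached the output

    EagerLogger(boolean enabled) {
        this.enabled = enabled;
    }

    void log(Supplier<String> messageSupplier) {
        // The supplier runs unconditionally, so the message string is
        // allocated even if it is about to be thrown away.
        String message = messageSupplier.get();
        messagesBuilt++;
        logInternal(message);
    }

    private void logInternal(String message) {
        if (!enabled) {
            return; // too late: the allocation has already happened
        }
        messagesWritten++;
        System.out.println(message);
    }

    int messagesBuilt() { return messagesBuilt; }
    int messagesWritten() { return messagesWritten; }

    public static void main(String[] args) {
        EagerLogger logger = new EagerLogger(false); // logging switched off
        for (int i = 0; i < 1_000; i++) {
            final int n = i;
            logger.log(() -> "Emitting periodic event #" + n);
        }
        System.out.println("built=" + logger.messagesBuilt()
                + ", written=" + logger.messagesWritten());
    }
}
```

With logging off, every one of the thousand messages is still built and none is written. Multiply that by a frequently sampled event and the garbage adds up quickly.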

Solution

Once it became clear that the culprit was the partially implemented logging facility in the JDK 8 backport of JFR (for which I was also partially responsible), I opened a new JDK bug to track the work, https://bugs.openjdk.java.net/browse/JDK-8266723, and went ahead to fix it.

The fix turned out not to be that complex: I just had to surface the log level check from native to Java and use it in the JFR Logger implementation to prevent the log message from being constructed if it is not going to be used.
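The shape of such a guard can be sketched as follows. This is an illustration under assumptions, not the actual patch: GuardedLogger and shouldLog are hypothetical names, and a plain boolean field stands in for the check that the real fix obtains from the native side.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a logger that checks before building the message.
final class GuardedLogger {
    private final boolean enabled; // stands in for the natively provided check
    private int messagesBuilt;     // how many times a supplier was invoked

    GuardedLogger(boolean enabled) {
        this.enabled = enabled;
    }

    boolean shouldLog() {
        return enabled;
    }

    void log(Supplier<String> messageSupplier) {
        if (!shouldLog()) {
            return; // the supplier never runs, so nothing is allocated
        }
        messagesBuilt++;
        System.out.println(messageSupplier.get());
    }

    int messagesBuilt() { return messagesBuilt; }

    public static void main(String[] args) {
        GuardedLogger logger = new GuardedLogger(false); // logging switched off
        for (int i = 0; i < 1_000; i++) {
            final int n = i;
            logger.log(() -> "Emitting periodic event #" + n);
        }
        System.out.println("built=" + logger.messagesBuilt());
    }
}
```

Taking a Supplier only pays off if the check happens before get() is called; with the guard in place, a disabled logger costs one boolean test per call.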

For the JDK 8u JFR backport there are no real log levels, though, just ‘enabled’ and ‘disabled’ logging, making the fix even simpler. You can see the actual code changes here and here. The fix was included in JDK 8u302, so that is the version from which you can enjoy minimal heap allocation from JFR events again.

Final words

I wish I had some great words of wisdom to share here. Instead, I’ll leave you with only this: never underestimate the danger of a partial implementation (even if it is just a logging implementation), and remember that ‘invisible’ bugs are pretty darned hard to find and fix!


Jaroslav Bachorik

Author

Software engineer at Datadog with a keen interest in Java and its performance, management and observability tooling. Long-time OpenJDK contributor and co-author and maintainer of BTrace, a dynamic tracing tool for Java.
