Handling JDK & GC Options Dynamically in Elasticsearch

October 28, 2020

Table of Contents

Configuring JVM Options with Elasticsearch
Ergonomic Defaults
Summary

Today we will dive into the startup of Elasticsearch: how it parses the configurable JVM options and how it can ergonomically switch between JVM options on startup.

Elasticsearch is a distributed search & analytics engine. Elasticsearch's full-text search capabilities are based on Apache Lucene. It is the heart of the Elastic Stack and powers its solutions (Enterprise Search, Observability and Security) as well as many well-known websites like Wikipedia, GitHub and Stack Overflow.

Elasticsearch tries to be a good JVM ecosystem citizen and ships with a recent distribution of the JVM: Elasticsearch 7.9.3 ships with OpenJDK 15. One of the core principles of Elasticsearch is to make getting up and running as simple as possible. This is why Elasticsearch bundles a JDK, so that users do not have the trouble of installing one. Not everyone is a Java expert after all! At some point, however, you need to become at least a bit of an expert, as you need to configure some JDK options, such as the heap size.

In order to configure JDK options for Elasticsearch, these options need to be parsed and evaluated before startup. When the user runs ./bin/elasticsearch or ./bin/elasticsearch.bat, a few other Java programs are started before the actual Elasticsearch process is fired up. First, a program that creates a temporary directory is launched; it behaves differently on Windows than on other operating systems. Second, the JvmOptionsParser class determines the Java options. Only after this is done is the output of the parser used to start the main Elasticsearch process. This approach also allows the helper programs to run with the JDK defaults and small heaps, which keeps them fast.

Let's dive into the mechanism to configure JVM options.

Configuring JVM Options with Elasticsearch

The most commonly used JVM option that needs to be configured before the Elasticsearch Java process is started is the heap size. To support this, Elasticsearch not only reads the config/jvm.options file but also reads the config/jvm.options.d directory, and appends the contents of all files to create one big list of JVM options. You could create a file like config/jvm.options.d/heap.options with the following content:

# make sure we configure 2gb of heap
-Xms2g
-Xmx2g

This would configure the heap on startup. However, the configuration and parsing mechanism is more powerful: not only can you configure options, you can also configure different options for different JDK major versions.

Side note: In case you are asking yourself why there is a jvm.options.d directory and not just a file: this caters properly for package upgrades of RPM or Debian packages, so that the original jvm.options can be replaced and does not need to be edited.
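
To make the merging step more concrete, here is a minimal sketch of how a base file plus a drop-in directory could be combined into one list of options. It is not the actual JvmOptionsParser code: the class name JvmOptionsFiles is made up for illustration, and the rules it applies (only files ending in .options are read, in path order, with blank lines and # comments skipped) are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not the Elasticsearch implementation. */
final class JvmOptionsFiles {

    /** Collects option lines from jvm.options and jvm.options.d below configDir. */
    static List<String> readAll(Path configDir) throws IOException {
        List<Path> files = new ArrayList<>();
        files.add(configDir.resolve("jvm.options"));

        Path dropInDir = configDir.resolve("jvm.options.d");
        if (Files.isDirectory(dropInDir)) {
            try (Stream<Path> entries = Files.list(dropInDir)) {
                // Assumption: only *.options files count, and they are read in sorted path order.
                files.addAll(entries
                    .filter(p -> p.getFileName().toString().endsWith(".options"))
                    .sorted()
                    .collect(Collectors.toList()));
            }
        }

        List<String> options = new ArrayList<>();
        for (Path file : files) {
            if (!Files.isRegularFile(file)) {
                continue; // a missing base file is simply skipped in this sketch
            }
            for (String line : Files.readAllLines(file)) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                options.add(trimmed);
            }
        }
        return options;
    }
}
```

Because the drop-in files are appended after the base file, a package upgrade can replace jvm.options without touching anything the user placed in jvm.options.d.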

So, why is this useful, you might ask yourself? Well, sometimes a new Java release deprecates features, and sometimes features get removed. One of those features was the CMS garbage collector, which was deprecated in Java 9 and finally removed more than two years later in Java 14. Elasticsearch has been a happy user of CMS for years, but with its removal there had to be a mechanism to start with another garbage collector from Java 14 onwards. To support this, the JVM options parser can apply certain options only to certain Java versions, like this:

## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

The same applies to the different GC logging options for Java 8 and for Java 9 and later:

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m

You can read more about setting JVM options in the official Elastic docs.
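
The version prefixes in the examples above follow three shapes: "8:" for exactly one major version, "8-13:" for an inclusive range, and "14-:" for a version and everything after it. The following sketch shows one way such lines could be filtered for a given Java major version. The class and method names are hypothetical and the parsing rules are inferred only from the examples in this post, so treat it as an illustration rather than the real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical filter for version-prefixed JVM option lines. */
final class VersionedJvmOptions {

    // Groups: 1 = lower bound, 2 = "-" when a range is given,
    //         3 = optional upper bound, 4 = the JVM option itself.
    private static final Pattern VERSIONED = Pattern.compile("(\\d+)(-)?(\\d+)?:(-.*)");

    /** Returns the options that apply to the given Java major version. */
    static List<String> select(List<String> lines, int javaMajorVersion) {
        List<String> selected = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            if (line.startsWith("-")) {
                selected.add(line); // no prefix: applies to every version
                continue;
            }
            Matcher m = VERSIONED.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Cannot parse JVM option line: " + line);
            }
            int lower = Integer.parseInt(m.group(1));
            int upper;
            if (m.group(2) == null) {
                upper = lower;                       // "8:"    -> exactly 8
            } else if (m.group(3) == null) {
                upper = Integer.MAX_VALUE;           // "14-:"  -> 14 and later
            } else {
                upper = Integer.parseInt(m.group(3)); // "8-13:" -> 8 to 13
            }
            if (lower <= javaMajorVersion && javaMajorVersion <= upper) {
                selected.add(m.group(4));
            }
        }
        return selected;
    }
}
```

Fed the GC lines shown earlier, a call such as select(lines, 11) would keep the CMS flags and the unified -Xlog line, while select(lines, 15) would keep the G1 flags instead.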

There is another safeguard: before starting Elasticsearch, all configured and dynamically created JVM flags are appended and a JVM is started with them to check that the options are compatible, in order to fail fast.

Elasticsearch also logs all JVM options on startup, so that they can easily be compared with what the user assumes is configured. Those options are not only logged, but can also be retrieved using the nodes info API.
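
A fail-fast check of this kind can be approximated by launching a short-lived JVM with the candidate flags and looking at its exit code. The sketch below does that with "java -version"; the class name is hypothetical, and the way Elasticsearch performs its own check may well differ.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fail-fast check: does a JVM start at all with these options? */
final class JvmOptionsCheck {

    /** Returns true if the JVM found under java.home starts and exits cleanly with the options. */
    static boolean isAccepted(List<String> jvmOptions) throws IOException, InterruptedException {
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();

        List<String> command = new ArrayList<>();
        command.add(java);
        command.addAll(jvmOptions);
        command.add("-version"); // start the JVM, print the version, exit

        Process process = new ProcessBuilder(command)
            .redirectErrorStream(true)                       // merge stderr into stdout
            .redirectOutput(ProcessBuilder.Redirect.DISCARD) // the output is not needed here
            .start();

        // A JVM that rejects an option (for example an unrecognized -XX flag) exits non-zero.
        return process.waitFor() == 0;
    }
}
```

Running the check in a separate, throwaway JVM means a bad flag surfaces as a clear error before the main process is launched, rather than as a failed server start.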
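
For comparison, any Java application can inspect the options its own JVM was started with through the standard RuntimeMXBean. The snippet below is a generic sketch with a hypothetical class name; it is not taken from the Elasticsearch code base.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper that reports the options of the running JVM. */
final class JvmOptionsLogger {

    /** Returns the arguments passed to the JVM, excluding the arguments to main(). */
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        // Log the options once on startup, so they can be compared with the configuration.
        System.out.println("JVM arguments: " + inputArguments());
    }
}
```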

Ergonomic Defaults

So, with an infrastructure in place like that, can we do more fancy things than just parsing JVM options? Of course we can! Ideas anyone?

One advantage is the ability to supply some useful standard JVM options when starting Elasticsearch. There is a SystemJvmOptions class that lists a couple of interesting options, like setting the default encoding to UTF-8 or configuring the DNS TTL caching - which is important as Elasticsearch always enables the Java Security Manager.

Also, we can enable some options only when a certain JDK version is in use. The following enables helpful NullPointerException messages, which name the variable that was null when it was dereferenced, on Java 14 and above:

private static String maybeShowCodeDetailsInExceptionMessages() {
    if (JavaVersion.majorVersion(JavaVersion.CURRENT) >= 14) {
        return "-XX:+ShowCodeDetailsInExceptionMessages";
    } else {
        return "";
    }
}

But this infrastructure can go even further, and become smarter over time. How about providing different JVM options depending on configuration settings like the heap?

This is exactly what has been worked on in a recent addition to Elasticsearch.

If a small heap is configured in combination with the G1 garbage collector, some additional options are configured.

final boolean tuneG1GCForSmallHeap = tuneG1GCForSmallHeap(heapSize);
final boolean tuneG1GCHeapRegion = 
    tuneG1GCHeapRegion(finalJvmOptions, tuneG1GCForSmallHeap);
final boolean tuneG1GCInitiatingHeapOccupancyPercent =
    tuneG1GCInitiatingHeapOccupancyPercent(finalJvmOptions);
final int tuneG1GCReservePercent =
    tuneG1GCReservePercent(finalJvmOptions, tuneG1GCForSmallHeap);

So, what happens here and why? If less than 8 GB of heap is configured, three additional options are set. This happens more often than you might think, as many users run smaller instances of Elasticsearch and there is an ongoing effort to use less heap and offload work to other parts of the system. Of course, everything can be overridden manually.

First, the size of a G1 heap region is set to 4 MB, using -XX:G1HeapRegionSize=4m.

Second, the heap occupancy threshold that triggers a marking cycle is set using -XX:InitiatingHeapOccupancyPercent=30, somewhat earlier than the default of 45.

Third, the G1ReservePercent option is set to 15 instead of 25 percent in the small heap case, in both cases deviating from the default of 10 percent.
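
Put together, the three rules can be sketched as a small function that looks at the heap size and the options the user already set. This is an illustration only: the class name is hypothetical, G1 is detected naively by looking for -XX:+UseG1GC, and applying the occupancy threshold regardless of heap size is an assumption read off the method signatures in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of heap-size dependent G1 defaults. */
final class G1Ergonomics {

    private static final long SMALL_HEAP_LIMIT_BYTES = 8L * 1024 * 1024 * 1024; // 8 GB

    /** Returns the extra options to append; anything the user set explicitly is left alone. */
    static List<String> choose(long heapBytes, List<String> userOptions) {
        List<String> extra = new ArrayList<>();
        if (!userOptions.contains("-XX:+UseG1GC")) {
            return extra; // assumption: only tune when G1 is explicitly enabled
        }
        boolean smallHeap = heapBytes < SMALL_HEAP_LIMIT_BYTES;

        if (smallHeap && !isSet(userOptions, "-XX:G1HeapRegionSize=")) {
            extra.add("-XX:G1HeapRegionSize=4m");
        }
        if (!isSet(userOptions, "-XX:InitiatingHeapOccupancyPercent=")) {
            extra.add("-XX:InitiatingHeapOccupancyPercent=30");
        }
        if (!isSet(userOptions, "-XX:G1ReservePercent=")) {
            extra.add("-XX:G1ReservePercent=" + (smallHeap ? 15 : 25));
        }
        return extra;
    }

    /** True if the user already supplied a value for the given option prefix. */
    private static boolean isSet(List<String> options, String prefix) {
        return options.stream().anyMatch(option -> option.startsWith(prefix));
    }
}
```

Checking the user's options first is what keeps the defaults overridable: an explicit -XX:G1ReservePercent=20 in jvm.options.d would win over both 15 and 25.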

It took months of benchmarking and testing to arrive at these numbers. If you are interested in the discussion, there is a lengthy GitHub issue. In case you are wondering how this kind of issue surfaces when testing Elasticsearch: Elasticsearch runs nightly benchmarks on bare metal hardware to easily spot and investigate regressions. The results of those benchmarks are publicly available. The tool used for this is called Rally, a macrobenchmarking framework for Elasticsearch. One of the great features of Rally is that you can use your own data and queries to test and benchmark, so having your own nightly benchmarks is possible.

So, why have those options been picked, you may ask yourself. The benchmark infrastructure made testing easy, but it was not the reason for testing. After switching from CMS to G1, a few benchmark results got worse and required investigation. One of the approaches was to test ParallelGC instead of G1 for really small heaps, but this was abandoned.

We even managed to find a bug in our G1 configuration options. In order to understand the issue, let's explain some Elasticsearch functionality. Elasticsearch uses circuit breakers to prevent a single node from being overloaded. They account for memory, for example when creating an aggregation response or receiving requests over the network. Once a certain limit is reached, the circuit breaker trips and returns an exception. The idea is to prevent the famous OutOfMemoryError, to tell the user that the request cannot be processed, and to indicate whether that is a temporary or a permanent issue. Since Elasticsearch 7.0 there is also a real memory circuit breaker, which takes the total heap usage into account instead of only the currently accounted data, which is more exact.

However, this circuit breaker did not work in combination with the shipped G1 settings: the configured settings assumed a heap bigger than 100% of what was configured, so the circuit breaker tripped before the garbage collector started collecting per the supplied configuration. In addition, the memory circuit breaker was enhanced with some G1-specific code to nudge G1 into doing a young GC at some point.
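
The two flavours of circuit breaker described here can be sketched in a few lines. The first class only counts the bytes that callers reserve through it; the second compares the actual heap usage, supplied by the caller, against the limit. Both class names and all method names are hypothetical, and the real Elasticsearch breakers are considerably more involved.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical breaker that only knows about memory reserved through it. */
final class AccountingCircuitBreaker {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    AccountingCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves the bytes, or throws and reserves nothing if the limit would be exceeded. */
    void reserve(long bytes, String label) {
        long newUsed = usedBytes.addAndGet(bytes);
        if (newUsed > limitBytes) {
            usedBytes.addAndGet(-bytes); // roll back the failed reservation
            throw new IllegalStateException(
                "[" + label + "] would use " + newUsed + " bytes, limit is " + limitBytes);
        }
    }

    /** Gives back bytes that were reserved earlier. */
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }
}

/** Hypothetical breaker that checks the real heap usage instead of its own bookkeeping. */
final class RealMemoryCircuitBreaker {
    private final long limitBytes;
    private final LongSupplier heapUsedBytes;

    RealMemoryCircuitBreaker(long limitBytes, LongSupplier heapUsedBytes) {
        this.limitBytes = limitBytes;
        this.heapUsedBytes = heapUsedBytes;
    }

    /** Throws if the current heap usage plus the new request would exceed the limit. */
    void check(long bytes, String label) {
        long wouldUse = heapUsedBytes.getAsLong() + bytes;
        if (wouldUse > limitBytes) {
            throw new IllegalStateException(
                "[" + label + "] heap would reach " + wouldUse + " bytes, limit is " + limitBytes);
        }
    }
}
```

In a running application the heap supplier could be something like () -> ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed(). Because that figure includes garbage that has not been collected yet, the garbage collector settings directly influence when such a breaker trips.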

Summary

As you can see, properly handling and parsing JDK options, as well as choosing good defaults, such as when switching from one garbage collector to another, involves quite a few steps: infrastructure, testing, running in production and verification. The same probably applies to your own applications as well.

The same applies to all the new generation garbage collectors like ZGC and Shenandoah. Those will require extensive testing, proper CI integration and maybe even a few changes in the code. Although those GCs promise huge improvements, make sure you test properly with your own workloads before jumping on them.

Also, never forget that a small portion of your users will want to set their own options. Cater for that properly, including across upgrades.

Alexander Reelsen

Author

Alexander Reelsen is a developer and advocate, a dad, has worked distributed at Elastic since 2013, and is interested in search, scale, the JVM, crystallang, serverless and basketball.
