Marko Topolnik, PhD, has been a Java professional since 2001. His current position is in the core team of Hazelcast Jet, where he co-wrote the core execution engine based on coroutine-like suspendable code that runs many concurrent tasks on a fixed thread pool. Marko is also an active contributor on Stack Overflow on the kotlin-coroutines tag.
We’re preparing a scientific paper on Hazelcast Jet, describing its architecture based on symmetric, data-local, non-blocking distributed event processing. As a part of this effort, we implemented the vendor-neutral NEXMark benchmark suite, consisting of 8 streaming queries that aim to capture typical kinds of questions you’re likely to ask about your real-time data.
The queries deal with a domain model of auctions, sellers, and bids. For example, Query 5 asks: “Which auctions have achieved the highest price in the last period?”M. Topolnik
In Part 3 we showed that a modern JVM running live stream aggregation can achieve a 99.99% latency lower than 10 milliseconds. The focus of that post was comparing the different GC options available for the JVM. In order to maintain a level playing field, we kept to the default settings as much possible.
In this round we want to look at the same problem from the opposite angle: what can we do to help Hazelcast Jet achieve the best performance available on a JVM? How much throughput can we get while staying within the tight 10 ms bound for 99.99th percentile latency? We found our opportunity in a distinct design feature of Jet: the Cooperative Thread Pool.M. Topolnik
This is a followup on Part 1 of the blog post series we started earlier this month, analyzing the performance of modern JVMs on workloads that are relevant to the use case of real-time stream processing.
As a quick recap, in Part 1 we tested the basic functionality of Hazelcast Jet (sliding window aggregation) on two types of workload: lightweight with a focus on low latency, and heavyweight with a focus on the data pipeline keeping up with high throughput and large aggregation state. For the low latency benchmarks we chose the JDK 14 as the most recent stable version and three of its garbage collectors: Shenandoah, ZGC, and G1 GC.M. Topolnik
A batch pipeline processes a finite amount of stored data. There are no running results, we need the output of the aggregate function applied to the entire dataset. This changes our performance requirements: the key factor in streaming, latency, doesn’t exist here since we are not processing data in real time. The only metric that matters is the total run time of the pipeline.
For this reason, we considered the Parallel GC as a relevant candidate. In the first testing round, on a single node, it actually delivered the best throughput (but only after GC tuning). However, it achieves that throughput at the expense of GC pause duration. In a cluster, whenever any node enters a GC pause, it stalls the whole data pipeline. Since individual nodes enter GC pauses at different times, the amount of time spent in GC goes up with every node you add to the cluster. We explored this effect by comparing single-node tests with tests in a three-node cluster.M. Topolnik
The Java runtime has been evolving more rapidly in recent years and, after 15 years, we finally got a new default garbage collector: the G1. Two more GCs are on their way to production and are available as experimental features: Oracle’s ZGC and OpenJDK’s Shenandoah.
We at Hazelcast thought it was time to put all these new options to the test and find which choices work well with workloads typical for our distributed stream processing engine, Hazelcast Jet.M. Topolnik