Billion Events Per Second with Millisecond Latency: Streaming Analytics at Giga-Scale

March 25, 2021

Table of Contents

Our Setup
Benchmark 1: 99.99th Percentile Latency with 20 ms Time Resolution
Benchmark 2: 99th Percentile Latency at a Billion Events per Second

This post is a part of a series:

Part 1 (Intro and high-throughput streaming benchmark)
Part 2 (batch workload benchmark)
Part 3 (low-latency benchmark)
Part 4 (concurrent GC with green threads)
Part 5 (you are here)

We're preparing a scientific paper on Hazelcast Jet, describing its architecture based on symmetric, data-local, non-blocking distributed event processing. As a part of this effort, we implemented the vendor-neutral NEXMark benchmark suite, consisting of 8 streaming queries that aim to capture typical kinds of questions you're likely to ask about your real-time data.

The queries deal with a domain model of auctions, sellers, and bids. For example, Query 5 asks: "Which auctions have achieved the highest price in the last period?"

In contrast to our previous benchmarks, which focused on single-node performance (especially GC-induced latency spikes), this time we focus on horizontal scalability: how do Jet's latency and throughput behave as you add more and more nodes?

In a cluster of 45 nodes and 720 vCPUs, Jet reached 1 billion events per second at a 99% latency of 26 milliseconds.

Our Setup

We used the same configuration as in our earlier benchmarks:

EC2 instance type c5.4xlarge (16 vCPUs)
Oracle OpenJDK 15.0.1 with a single GC parameter: -XX:MaxGCPauseMillis
Hazelcast 4.3
Jet's thread pool size: 12 or 14

In our previous experience, we found you don't need any low-level GC tuning parameters to get great latency results on the JVM, but you do have to use a recent JDK. We let the JVM use its default G1 collector and configured it with our desired GC pause target. Also, note that we size Jet's thread pool a bit below the system capacity: up to 14 threads on 16-vCPU machines. This important trick allows the GC to work in the background without interfering with Jet's real-time computations.
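
As a rough illustration of this sizing rule, the Python sketch below works out how many vCPUs stay free for the GC and which JVM flag carries the pause target. The helper name `plan_node` and its return layout are invented for this example and are not part of Jet; only the `-XX:MaxGCPauseMillis` flag and the 16-vCPU, 12-or-14-thread figures come from the setup described here.

```python
def plan_node(vcpus, jet_threads, gc_pause_target_ms):
    """Summarize one node's sizing (hypothetical helper, not a Jet API)."""
    if jet_threads >= vcpus:
        raise ValueError("leave at least one vCPU for GC and system work")
    return {
        "jet_threads": jet_threads,
        # vCPUs left over for background GC threads and the OS
        "spare_vcpus": vcpus - jet_threads,
        # The single GC parameter: a pause-time goal for the default G1 collector
        "jvm_flags": [f"-XX:MaxGCPauseMillis={gc_pause_target_ms}"],
    }

plan = plan_node(vcpus=16, jet_threads=12, gc_pause_target_ms=5)
print(plan)
```

With 12 Jet threads on a c5.4xlarge this leaves 4 vCPUs of headroom; with 14 threads it leaves 2.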

As for the measuring methodology, it is the same as before. Events come from a generator that lives inside the same JVM as Jet. Every event has a predetermined moment of occurrence, and it is the generator's duty to emit it as soon as possible once that time has arrived. Any latency in emitting the event counts towards the reported latency.
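
The generator's contract can be sketched as follows. This is not the benchmark code, just a minimal Python model under the assumption that events are evenly spaced at a fixed rate; the names `scheduled_time_ns` and `emit_due_events` are hypothetical. The key point it shows is that each event carries its scheduled timestamp, so a late generator inflates the measured latency instead of hiding it.

```python
def scheduled_time_ns(start_ns, index, events_per_second):
    """Predetermined moment of occurrence of the event with the given index."""
    return start_ns + index * 1_000_000_000 // events_per_second

def emit_due_events(start_ns, next_index, events_per_second, now_ns, emit):
    """Emit every event whose scheduled time has arrived; return the next index.

    Each event is stamped with its scheduled time, not the time it was
    actually emitted, so generator delays count towards the latency.
    """
    while scheduled_time_ns(start_ns, next_index, events_per_second) <= now_ns:
        emit(next_index, scheduled_time_ns(start_ns, next_index, events_per_second))
        next_index += 1
    return next_index

emitted = []
# Pretend the generator was stalled and first runs at t = 3 ms, at 1,000 events/s.
next_index = emit_due_events(
    start_ns=0, next_index=0, events_per_second=1_000, now_ns=3_000_000,
    emit=lambda i, ts: emitted.append((i, ts)),
)
print(emitted, next_index)
```

In this toy run the stalled generator catches up by emitting events 0 through 3 in one go, each still stamped with the time it was supposed to occur.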

On the output side, we stop the latency clock as soon as we observe some data item corresponding to the newest result. We compare the system time at that point with the moment at which the query result would ideally become known. For example, if the query involves a time window, then the moment at which the window "closes" is that ideal moment.
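
To make the output-side rule concrete, here is a small Python sketch. It assumes windows aligned to multiples of the update period, which is a simplification for illustration, and the function names are hypothetical rather than taken from the benchmark code.

```python
def window_close_ms(event_time_ms, update_period_ms):
    """End of the aligned window that the event falls into.

    This is the ideal moment at which the window's result could become known.
    """
    return (event_time_ms // update_period_ms + 1) * update_period_ms

def latency_ms(observed_at_ms, window_end_ms):
    """Latency: when the result was actually observed minus the ideal moment."""
    return observed_at_ms - window_end_ms

# An event at t = 1234 ms with a 20 ms update period belongs to the window
# closing at t = 1240 ms. If its result shows up at t = 1247 ms, that is 7 ms.
end = window_close_ms(1234, 20)
print(end, latency_ms(1247, end))
```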

We don't require Jet to send the results to any outside system, because we want our results to focus on Jet's performance alone. The code we used in the benchmark is available on GitHub.

Benchmark 1: 99.99th Percentile Latency with 20 ms Time Resolution

For the first benchmark we asked this question: using a time resolution of 50 result updates per second and an input rate of 1 million events per second, what is the 99.99% latency at different cluster sizes?
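
For readers who want to see what a "99.99% latency" means operationally, here is a minimal Python sketch using the nearest-rank definition of a percentile. The choice of nearest-rank is an assumption made for illustration; the post does not say which percentile estimator the benchmark uses.

```python
from fractions import Fraction
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Fraction avoids floating-point surprises such as 0.9999 * 10000 != 9999
    rank = math.ceil(Fraction(str(pct)) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# 10,000 made-up latency samples: mostly 2 ms, with two spikes.
latencies_ms = [2] * 9998 + [16, 40]
print(percentile(latencies_ms, 99), percentile(latencies_ms, 99.99))
```

With 10,000 samples, the 99.99th percentile discards only the single worst value, so even two spikes in the whole run are enough to show up in the result.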

The main point of this benchmark is to probe deep into the rare system events that may cause latency spikes. However, the original definitions of the NEXMark queries ask for a very low update frequency (once per minute). At this rate it would take 100 minutes to get barely enough data to report the 99th percentile latency, and a whole week to get the 99.99th percentile. Therefore we changed this to 50 times per second (a 20-millisecond update period).
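
The arithmetic behind these durations can be checked with a few lines of Python. The sketch assumes that "barely enough data" means the smallest sample count at which the percentile is no longer just the single worst value, that is 1 / (1 - p/100) samples; the function names are made up for this example.

```python
from fractions import Fraction
import math

def samples_needed(pct):
    """Fewest samples at which the pct-th percentile differs from the maximum."""
    tail = 1 - Fraction(str(pct)) / 100   # exact share of samples above the percentile
    return math.ceil(1 / tail)

def minutes_to_collect(pct, updates_per_minute):
    """Minutes of test time needed when each update yields one latency sample."""
    return samples_needed(pct) / updates_per_minute

for pct in (99, 99.99):
    slow = minutes_to_collect(pct, 1)         # original NEXMark: once per minute
    fast = minutes_to_collect(pct, 50 * 60)   # this benchmark: 50 times per second
    print(f"{pct}%: {slow:.0f} min ({slow / 1440:.1f} days) vs. {fast * 60:.0f} s")
```

The 99.99th percentile needs 10,000 samples, which is 10,000 minutes (about 6.9 days) at one update per minute, but only 200 seconds at 50 updates per second.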

Having to report another complete set of results (10,000 data points) every 20 milliseconds, Jet doesn't have time to recover from any hiccup greater than a few milliseconds. For this reason we configured the GC pause target to 5 milliseconds (-XX:MaxGCPauseMillis=5).

NEXMark doesn't define the number of distinct keys involved in the queries; we set it to 10,000 for this benchmark.

We used the setting of 12 Jet threads per node, mostly because that's what we used in the previous benchmarks. We got these results:

99.99% latency of NEXMark queries at 1M events/second vs. cluster size

The overall worst case was Query 5 in the largest cluster (20 nodes), and it came out as 16 milliseconds.

Benchmark 2: 99th Percentile Latency at a Billion Events per Second

Our second question was this: How much hardware would Jet need to reach a throughput of a billion events per second, while maintaining its millisecond-level latency?

To test this, we selected the NEXMark query that seemed the toughest on Jet given the results from Benchmark 1. This is Query 5, which showed a 99.99% latency of 16 milliseconds in a 20-node cluster.

Since this time we create much more pressure on the input side, aiming for the highest possible throughput, we relaxed the timings a bit. We asked for two updates per second (a new result set every 500 milliseconds), and accordingly adjusted our percentile to 99% to keep the test duration practical. We also relaxed the GC timing, setting -XX:MaxGCPauseMillis=50.

We started with just a single node, finding the maximum event rate Jet can keep up with. The definition of "keeping up" is that the latency stays bounded during the test and doesn't constantly increase. We repeated the process at increasing cluster sizes, until reaching our goal of one billion. Here is what we found:
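
One way to turn this definition into an automated search is sketched below in Python. Everything here is an assumption made for illustration: the post does not describe how the check was carried out, the "second half versus first half" test is just one simple heuristic for "doesn't constantly increase", and `simulated_run` is a toy stand-in for a real test run.

```python
from statistics import mean

def keeps_up(latencies_ms, tolerance_ms=5.0):
    """Heuristic: latency is bounded if the second half of the run is not
    noticeably slower on average than the first half."""
    half = len(latencies_ms) // 2
    return mean(latencies_ms[half:]) - mean(latencies_ms[:half]) <= tolerance_ms

def max_sustainable_rate(run_test, low, high, step):
    """Binary search for the highest rate (a multiple of step) that keeps up.

    Assumes `low` keeps up, `high` does not, and that keeping up is
    monotonic in the event rate.
    """
    lo, hi = low // step, high // step
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if keeps_up(run_test(mid * step)):
            lo = mid
        else:
            hi = mid
    return lo * step

def simulated_run(rate, capacity=25_000_000, samples=20):
    """Toy model: a backlog, and with it latency, grows only above capacity."""
    overload = max(0, rate - capacity) / capacity
    return [10 + 1000 * overload * t for t in range(samples)]

print(max_sustainable_rate(simulated_run, 1_000_000, 100_000_000, 1_000_000))
```

Note that a temporary spike which settles back down, such as `[10, 30, 12, 11, 10, 9]`, still counts as keeping up under this heuristic.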

Query 5 throughput vs. cluster size

First of all, Jet was able to handle 25 million events per second on a single node and, even more impressively, this number kept scaling linearly all the way to our target, which it reached at a size of 40 nodes.
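
Under the assumption of perfectly linear scaling, which is what the chart reports, the cluster size follows directly from the single-node rate. The Python sketch below just spells out that arithmetic; `nodes_needed` is a name invented for this example.

```python
def nodes_needed(target_events_per_s, per_node_events_per_s):
    """Smallest node count reaching the target, assuming linear scaling."""
    # Integer ceiling division, avoiding floating-point rounding
    return -(-target_events_per_s // per_node_events_per_s)

nodes = nodes_needed(1_000_000_000, 25_000_000)
print(nodes, "nodes,", nodes * 16, "vCPUs on c5.4xlarge")
```

At 25 million events per second per node, one billion events per second works out to 40 nodes, or 640 vCPUs.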

During the process we realized we could let Jet use more threads without impacting the latency, and since we wanted to minimize the number of EC2 instances, we let it use 14 threads. This still leaves 2 vCPUs for background GC work and other system needs. If you look carefully, you can see the curve in the chart bending slightly upwards at the 20-node mark. This is an artifact of the 40-node measurements being taken with Jet using 16.7% more threads per node (14 vs. 12).

The criterion for maximum throughput is just keeping up with the input, allowing the latency to temporarily reach higher values as long as it settles back down. In order to gauge Jet's stable latency at various scales, we did another round of tests at 80% of the maximum throughput determined for each cluster size. This gave us the following latency chart:

Query 5 latency at 80% of maximum throughput vs. cluster size

We can see that, although the latency rises with cluster size, it stays fairly flat across a range of sizes well exceeding the capacity you'd probably need from Jet in real life.

Finally, we started from the setup that supported a billion events per second (40 nodes) and continued adding nodes until we got a stable low latency.

This was our result: 45 nodes, 26 milliseconds.

If you enjoyed reading this post, check out Jet at GitHub and give us a star!

Marko Topolnik

Author

Marko Topolnik, PhD, has been a Java professional since 2001. His current position is in the core team of Hazelcast Jet, where he co-wrote the core execution engine based on coroutine-like suspendable code that runs many concurrent tasks on a fixed thread pool. Marko is also an active contributor on Stack Overflow on the kotlin-coroutines tag.
