7 Reasons to Choose Apache Pulsar over Apache Kafka

December 15, 2021

I wrote an earlier version of this article in 2019, while I was CEO of Kesque, a real-time messaging service built on Apache Pulsar, the cloud-native distributed messaging and streaming platform. A lot has changed since then; most significantly, the company I founded in early 2019 was acquired by DataStax in January 2021. One thing that hasn’t changed, however, is the rationale behind our choice of Apache Pulsar.

At Kesque, it was our mission to empower developers to build cloud-native distributed applications by making cloud-agnostic, high-performance messaging technology easily available to everyone. Developers want to write distributed applications or microservices but don’t want the hassle of managing complex messaging infrastructure or getting locked into a particular cloud vendor. They need a solution that just works. Everywhere.

When you set out to build the best messaging infrastructure service, the first step is to pick the right underlying messaging technology. There are lots of choices out there, from various open-source projects like RabbitMQ, ActiveMQ, and NATS to proprietary solutions such as IBM MQ or Red Hat AMQ. And, of course, there is Apache Kafka, which is almost synonymous with streaming. But we didn’t go with Apache Kafka; we went with Apache Pulsar.

So why did we build our messaging service using Apache Pulsar? Here are the top seven reasons why we chose Apache Pulsar over Apache Kafka.

1. Streaming and queuing come together

Apache Pulsar is like two products in one. Not only can it handle high-rate, real-time use cases like Kafka, but it also supports standard message queuing patterns, such as competing consumers, failover subscriptions, and easy message fan-out. Apache Pulsar automatically keeps track of each client’s read position in the topic and stores that information in its high-performance distributed ledger, Apache BookKeeper.

Unlike Kafka, Apache Pulsar can handle many of the use cases of a traditional queuing system, like RabbitMQ. So instead of running two systems — one for real-time streaming and one for queuing — you do both with Pulsar. It’s a two-for-one deal, and those are always good.
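
To make the queuing patterns above a little more concrete, here is a minimal toy model in plain Python. It is not Pulsar client code, and the names (`Subscription`, `dispatch`, `cursor`) are invented for this sketch. It assumes two simplifications: the cursor advances on delivery rather than on acknowledgment, and a shared subscription rotates strictly through its consumers. What it shows is that each subscription tracks its own read position, that every subscription sees every message (fan-out), and that a standby consumer in a failover subscription picks up where the failed one stopped.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """One named subscription: a mode, its consumers, and a broker-side cursor."""
    mode: str                      # "shared" or "failover" (toy subset of modes)
    consumers: list
    cursor: int = 0                # index of the next message to deliver
    received: dict = field(default_factory=dict)

    def dispatch(self, log):
        """Deliver every message past the cursor, then advance the cursor."""
        while self.cursor < len(log):
            if self.mode == "shared":
                # Competing consumers: rotate through the consumer list.
                target = self.consumers[self.cursor % len(self.consumers)]
            else:
                # Failover: only the first live consumer receives messages.
                target = self.consumers[0]
            self.received.setdefault(target, []).append(log[self.cursor])
            self.cursor += 1


log = ["m0", "m1", "m2", "m3"]                   # the topic's messages
workers = Subscription("shared", ["w1", "w2"])   # queue-style consumption
audit = Subscription("failover", ["a1", "a2"])   # one active consumer at a time

for sub in (workers, audit):                     # fan-out: each subscription sees all messages
    sub.dispatch(log)

audit.consumers.remove("a1")                     # the active consumer fails
log.append("m4")
audit.dispatch(log)                              # the standby resumes at the stored cursor
```

In this run `w1` and `w2` split the first four messages between them, while `a1` receives all four and `a2` receives only `m4`, because the cursor belongs to the subscription and not to any one consumer.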

2. Partitions, but not necessarily partitions

If you use Kafka, you know about partitions. All topics are partitioned in Kafka. Partitioning is important because it increases throughput. By spreading the work across partitions, and therefore across multiple brokers, a single topic can process a higher message rate. But what if you have some topics that don’t need high rates? In these simple cases, wouldn’t it be nice to not have to worry about partitions and the API and management complexity that comes along with them?

Well, with Apache Pulsar it can be that simple. If you just need a topic, then use a topic. You don’t have to specify the number of partitions or think about how many consumers the topic might have. Pulsar subscriptions allow you to add as many consumers as you want on a topic with Pulsar keeping track of it all. If your consuming application can’t keep up, you just use a shared subscription to distribute the load between multiple consumers.

And if you really do need the performance of a partitioned topic, you can do that, too. Pulsar has partitioned topics if you need them — but only if you need them.
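
The difference in how consumers scale can be sketched with two small functions. This is an illustrative model only: `consumer_group_assignment` and `shared_subscription_dispatch` are hypothetical names, and both use simple round-robin rather than any real assignor or dispatcher. The first models a consumer group in which each partition is owned by exactly one consumer, so consumers beyond the partition count sit idle. The second models a shared subscription on a single non-partitioned topic, where messages are handed out one at a time.

```python
def consumer_group_assignment(num_partitions, consumers):
    """Kafka-style group: each partition is owned by exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


def shared_subscription_dispatch(messages, consumers):
    """Pulsar-style shared subscription on one topic: round-robin per message."""
    delivered = {c: [] for c in consumers}
    for i, msg in enumerate(messages):
        delivered[consumers[i % len(consumers)]].append(msg)
    return delivered


consumers = ["c1", "c2", "c3", "c4"]

# Two partitions, four consumers: two consumers have nothing to read.
group = consumer_group_assignment(2, consumers)
idle = [c for c, parts in group.items() if not parts]

# One non-partitioned topic, four consumers: everyone shares the load.
shared = shared_subscription_dispatch(list(range(8)), consumers)
```

The trade-off is ordering: once messages from one topic are spread across several consumers, they are no longer processed in strict order, so a shared subscription fits work-queue style workloads best.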

3. Logs are good, distributed ledgers are better

The Kafka team deserves credit for the insight that a log is a great abstraction for a real-time data exchange system. Because logs are append-only, data can be written to them quickly, and because the data in a log is sequential, it can be extracted quickly in the order that it was written. Sequential reading and writing is fast; random access is not. Persistent storage interactions are a bottleneck in any system that offers data guarantees, and the log abstraction makes this about as efficient as possible.

Simple logs are great. But they can get you into trouble when they get large. Fitting a log on a single server becomes a challenge. What happens if it gets full and you need to scale out? And what happens if the server storing the log fails and needs to be recreated from a replica?

Copying a large log from one server to another, while efficient, can still take a long time. If your system is trying to do this while keeping up with real-time data, this can be quite a challenge. Check out “Adding a New Broker Results in Terrible Performance” in the blog post Stories from the Front: Lessons Learned from Supporting Apache Kafka for some color on this.

Apache Pulsar avoids the problem of copying large logs by breaking the log into segments. It distributes those segments across multiple servers while the data is being written by using Apache BookKeeper as its storage layer. This means that the log is never stored on a single server, so a single server is never a bottleneck. Failure scenarios are easier to deal with and scaling out is a snap. Just add another server. No rebalancing needed.
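
Here is a rough sketch of the segment idea in plain Python. It is a simplification under stated assumptions, not BookKeeper’s actual design: `SegmentedLog`, `SEGMENT_SIZE`, `REPLICAS`, and the "least-loaded servers first" placement rule are all invented for illustration. The point it tries to show is that when a new storage server joins, existing segments stay where they are and only newly opened segments land on it.

```python
from collections import Counter

SEGMENT_SIZE = 3      # entries per segment (tiny, for illustration)
REPLICAS = 2          # copies kept of each segment


class SegmentedLog:
    """A log stored as fixed-size segments spread over several servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.segments = []            # each item: {"entries": [...], "servers": [...]}

    def _open_segment(self):
        # Assumed placement rule: put the new segment on the least-loaded servers.
        load = Counter({s: 0 for s in self.servers})
        for seg in self.segments:
            load.update(seg["servers"])
        chosen = sorted(self.servers, key=lambda s: (load[s], s))[:REPLICAS]
        self.segments.append({"entries": [], "servers": chosen})

    def append(self, entry):
        # Roll over to a new segment when the current one is full.
        if not self.segments or len(self.segments[-1]["entries"]) == SEGMENT_SIZE:
            self._open_segment()
        self.segments[-1]["entries"].append(entry)


log = SegmentedLog(["bk1", "bk2", "bk3"])
for i in range(9):
    log.append(i)
before = [tuple(seg["servers"]) for seg in log.segments]

log.servers.append("bk4")             # scale out: add a storage server
for i in range(9, 12):
    log.append(i)                     # the next segment can use bk4 right away
```

After the new server is added, the placement of the first three segments is unchanged and the fourth segment includes `bk4`, which is the "no rebalancing" behavior described above in miniature.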

4. Stateless brokers, what?

Stateless is music to the ears of anyone building cloud-native applications. Stateless components start up quickly, are interchangeable, and scale seamlessly. Wouldn’t it be great if a message broker was stateless?

The Kafka broker is not stateless. Each broker contains the complete log for each of its partitions. If one broker fails, not just any broker can take over for it. If the load is getting too high, you can’t simply add another broker. A new broker must first synchronize state from the other brokers that hold replicas of its partitions.

In the Apache Pulsar architecture, the brokers are stateless. Yes, you heard that right. A completely stateless system wouldn’t be able to persist messages, so Apache Pulsar does maintain state, just not in the brokers. In the Pulsar architecture, the brokering of data is separated from the storing of data. The brokers accept data from producers and send data to consumers, but the data is stored in Apache BookKeeper.

Because Pulsar brokers are stateless, if the load gets high, you just need to add another broker. The broker starts up quickly and gets to work right away.
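
The separation of brokering from storage can be pictured with a small sketch. The `Storage` and `Broker` classes and the `owners` map below are hypothetical stand-ins, not Pulsar APIs: `Storage` plays the role of BookKeeper, and a `Broker` holds nothing but a reference to it. Under those assumptions, replacing a broker is just a change of ownership, with no message data to copy.

```python
class Storage:
    """Stands in for the storage layer (BookKeeper in Pulsar)."""

    def __init__(self):
        self.topics = {}

    def append(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def read(self, topic):
        return list(self.topics.get(topic, []))


class Broker:
    """Holds a reference to shared storage and no message data of its own."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage

    def publish(self, topic, message):
        self.storage.append(topic, message)

    def consume(self, topic):
        return self.storage.read(topic)


storage = Storage()
owners = {"orders": Broker("broker-a", storage)}   # topic -> broker serving it

owners["orders"].publish("orders", "o1")
owners["orders"].publish("orders", "o2")

# broker-a disappears; a fresh broker takes over the topic with no data copy.
owners["orders"] = Broker("broker-b", storage)
owners["orders"].publish("orders", "o3")
history = owners["orders"].consume("orders")
```

The replacement broker serves the full history (`o1`, `o2`, `o3`) immediately, because the messages never lived on the broker in the first place.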

5. Geo-replication for dummies

Geo-replication is a first-class feature in Pulsar. It’s not a bolt-on or a proprietary add-on. Pulsar was designed with geo-replication in mind. Configuring it is easy and it just works. So whether it’s a globally distributed application or disaster recovery scenario, you can set it up with Pulsar. No Ph.D. needed.

6. Consistently faster

Benchmark tests have shown that Pulsar delivers higher throughput than Kafka, along with lower and more consistent latency. Faster and more consistent is better. What else is there to say?

7. It’s all Apache open source

Pulsar has many of the same features as Kafka, such as geo-replication, in-stream message processing (Pulsar Functions), input and output connectors (Pulsar IO), SQL-based topic queries (Pulsar SQL), and a schema registry, as well as features Kafka doesn’t have, like tiered storage and multi-tenancy.

All these features are part of the Apache open source project.

Pulsar is not a collection of open-source and closed-source features or open-source features controlled by a commercial entity. All its many useful features are open-source under the Apache umbrella. And unless the unthinkable happens, all this goodness will stay open source.

Conclusion

As you can see, we had lots of reasons to pick Apache Pulsar for building our messaging infrastructure service. And we didn’t even go into the reasons it’s easier to build a service around Pulsar, such as multi-tenancy, namespaces, authentication and authorization, documentation, and Kubernetes friendliness.

If you’re looking to build out a messaging infrastructure service, give our managed Pulsar service, Astra Streaming, a try. You won’t regret it.

Chris Bartholomew

Author

DataStax Engineering Lead, Streaming Engineering, Apache Pulsar
