Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

November 24, 2021

Unique Views: 1,226

sinceNovember, 2021

Author(s)

Enrico Olivelli
Apache Pulsar PMC member

When DataStax started investing in streaming with Apache Pulsar™, we knew that one of the first things people would want to do was connect existing enterprise data sources to Apache Cassandra™ using Pulsar.

Apache Pulsar has a powerful framework called Pulsar IO to enable this kind of use case, and at DataStax we already had a best-in-class Kafka Connect Sink that enables you to store structured data coming from one or more Kafka topics into DataStax Enterprise, Apache Cassandra, and Astra.

In this post, I'll describe how we built a powerful and versatile Pulsar Sink to connect data streams to Cassandra by abstracting the core logic from our Kafka Sink and expanding it to support Pulsar. This resulted in the recent release of our Pulsar Sink Connector (read the docs here), which provides users with the same power and flexibility that they get with our Kafka connector.

Streams of structured data

Complex data flows can lead to the need to manage many types of data; this is reflected in database tables and schemas. However, you also have to deal with the same level of complexity when you exchange data using streaming technologies.

Apache Pulsar offers built-in mechanisms to deal with Schemas and enables you to consistently manage all of the data structures in your information system, by providing:

An embedded Schema registry, no the need for third-party Schema Registries
Low-level APIs to handle schema information with Messages
High-level APIs to handle schema information on Records
Schema evolution support and automatic validation of compatibility during schema modifications
Native support for Apache Avro, JSON and Google Protobuf

With the Cassandra connector you can leverage Schema information in order to manage the contents of the topics and write them to Cassandra.

You can also map an existing Pulsar schema to the schema of the Cassandra table—even when they don’t match perfectly—by:

selecting only a subset of fields
selecting fields of inner structures in the case of nested types
applying simple data transformations
handling Cassandra TTL and TIMESTAMP properties

From Kafka Connect Sink to Pulsar IO Sink

The Apache Kafka Connect framework is very similar to Apache Pulsar IO in terms of APIs.

The purpose of a Sink is to write to an external system (in this case Cassandra) the data that is coming from the stream.

In order to leverage all of the existing features of the Kafka Cassandra Sink and keep the two projects on feature parity going forward, we created an abstraction over the two systems: the Messaging Connectors Commons framework.

The Messaging Connectors Commons framework implements:

An abstraction over Kafka Connect and Pulsar IO APIs (Record and Schema APIs)
Cassandra Client Driver handling (configuration, security, and compatibility with a big matrix of Cassandra, DataStax DSE, and Astra)
Mapping and Data transformations
Failure handling
Multi-topic management

We created this framework by extracting all of the common business logic from the Kafka Sink.

After this pure engineering work, the road to building the Pulsar Sink was straightforward: we only had to implement the relevant APIs within the Pulsar Sink framework.

Differences between Kafka Connect and Pulsar IO

Even if at first glance the APIs are very similar, the Pulsar IO framework offers a slightly different model, with some cool features and some tradeoffs.

Single Record Acknowledgement

In Kafka you are exposed to the partitioned nature of a topic and you also have to manually track the offset you have reached, with one value for each partition.

In Pulsar, you process each record independently and concurrently, then you finish by acknowledging the processing or by calling a negative acknowledgement in order to tell Pulsar that the processing failed locally and the record should be reprocessed. This way the framework can handle dispatching records to the various instances of the Sink and it saves the Sink from having to handle all of the failure-tracking work.

In Pulsar, the Sink receives only one record at a time, but with the single acknowledgement feature, you are free to implement your own batching strategies; in Kafka the batching is done at the framework level so you do not have control over the amount of records to process at one time.

Structured Data into the Key

In Pulsar the key of a Record is always a String and it cannot contain structured data. But with the Cassandra sink you can interpret the String value as a JSON-encoded structure and access the contents of each field and of nested structures.

Schema API is lacking type information

The Pulsar Schema API framework also doesn’t report type information about generic data structures like Kafka Connect. (This GitHub Pull Request is open to add that.)

The Messaging Connectors Commons framework, which deals with mapping message data to the Cassandra model, needs detailed type information in order to transform each field to the correct Cassandra data type.

Because in Pulsar we don’t have direct access to this in the Schema, we decided to use Java Reflection and to dynamically build the schema. This approach works very well and it supports schema evolution, as it is expected that the data type does not change for a given field.

One problem arises with null values, because in Java from a null value you cannot know the original data type. In this case, we are initially supposing that the field is of the type String, and we allow it to move from String to another datatype as soon as we see a non-null value.

Sample usage

Here’s a simple example about how to map your Pulsar Topic to a Cassandra table:

configs:
 topics: mytopic 
 ignoreErrors: None
 ......
 topic:
   mytopic:
     mykeyspace:
       mytable:
         mapping: 'id=key.id,name=value.name,age=value.age'
         consistencyLevel: LOCAL_ONE
         ttl: -1
         ttlTimeUnit: SECONDS
         timestampTimeUnit: MICROSECONDS
         nullToUnset: true
         deletesEnabled: true
     codec:
       locale: en_US
       timeZone: UTC
       timestamp: CQL_TIMESTAMP
       date: ISO_LOCAL_DATE
       time: ISO_LOCAL_TIME
       unit: MILLISECONDS

Here we are reading AVRO or JSON encoded messages from the mytopic topic and we are writing to the mykeyspace.mytable table on Cassandra. (The connection, security, and performance configurations are omitted in order to focus on the mapping features.)

You can map multiple topics to multiple tables, and for each pair you can configure a different mapping, selecting fields from the message and binding each field to a column in Cassandra.

You can also store the full payload as a UDT to Cassandra or extract nested inner data structures.

The Sink automatically handles conversions and efficient writing to Cassandra.

Compatibility with Cassandra versions

The Cassandra Sink works well with Apache Cassandra, Datastax Enterprise (DSE), and Astra.

At Datastax we are running tests on CI against a matrix of many different versions of Cassandra and DSE as well as against the cloud Astra service in order to provide 100% compatibility.

Wrapping up

I described how we have been able to refactor the existing Kafka Connect Sink for Apache Cassandra and deliver the same level of features to Pulsar IO users.

Datastax Pulsar Cassandra sink enables you to connect your Pulsar cluster to Cassandra and it offers lots of flexibility in mapping the source data to Cassandra tables.

We are maintaining a full compatibility matrix against many versions of DSE, Apache Cassandra and DataStax Astra.

To learn more about how it works please check out our repository on GitHub, and feel free to post your questions on the issue tracker.

Want to try out Apache Pulsar? Sign up now for Astra Streaming, our fully managed Apache Pulsar service. We’ll give you access to its full capabilities entirely free through beta. See for yourself how easy it is to build modern data applications and let us know what you’d like to see to make your experience even better.

Don’t Forget to Share This Post!

Author Enrico Olivelli

Apache Pulsar PMC member

Topics:

View All

A Case for Databases on Kubernetes from a Former Skeptic

Looking back at the pitfalls of running databases on Kubernetes I encountered several years ago, most of them have been resolved.

All of these problems are hard and require technical finesse and careful thinking. Without choosing the right pieces, we’ll end up resigning both databases and Kubernetes to niche roles in our infrastructure, as well as the innovative engineers who have invested so much effort in building out all of these pieces and runbooks.
Read More
Christopher Bradford

July 13, 2021
Apache Cassandra
Databases
DataStax
DevOps
Kubernetes
A Maturity Model for Cloud-Native Databases

Companies have been leveraging Kubernetes and other technologies to move workloads to the clouds. But several nagging challenges have remained: what to do with the data layer, what technologies should you use, where should an organization keep data and how …
Read More
Jeff Carpenter

November 10, 2021
Databases
DataStax
DevOps
Kubernetes
Microservices
Apache Cassandra 4.0: Taming Tail Latencies with Java 16 ZGC

With Apache Cassandra 4.0, you not only get the direct improvements to performance added by the Apache Cassandra committers, you also unlock the ability to take advantage of seven years of improvements in the JVM itself.

This article focuses on improvements in Java garbage collection that Cassandra 4.0 coupled with Java 16 offers over Cassandra 3.11 on Java 8.
Read More
Jonathan Ellis

June 22, 2021
Apache Cassandra
Apache Pulsar
Performance

Author(s)

Enrico Olivelli
Apache Pulsar PMC member

Comments (0)

Cancel reply

Your email address will not be published. Required fields are marked *

Highlight your code snippets using [code lang="language name"] shortcode. Just insert your code between opening and closing tag: [code lang="java"] code [/code]. Or specify another language.

Save my name, email, and website in this browser for the next time I comment.

Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

Author(s)

Streams of structured data

From Kafka Connect Sink to Pulsar IO Sink

Differences between Kafka Connect and Pulsar IO

Single Record Acknowledgement

Structured Data into the Key

Schema API is lacking type information

Sample usage

Compatibility with Cassandra versions

Wrapping up

Related Articles

Author(s)

Comments (0)

Cancel reply

Set Event Reminder

Subscribe to foojay updates: