Apache Pulsar: A Unified Queueing and Streaming Platform

Nov 10th, 2021 6:00am by Addison Higham

Addison Higham is the Chief Architect at StreamNative, where he helps customers solve difficult data challenges. An Apache Pulsar Committer, Addison has been involved in the Pulsar community for three and a half years and in the streaming ecosystem for the last seven years of his career.

Here’s a bold statement — all signs indicate that the Apache Pulsar open source distributed messaging system is rapidly capturing the zeitgeist of modern application architecture and development.

So what are we talking about here? Let’s set the stage.

As engineering teams move to solve more and more complex challenges, the tech and tools needed to solve problems continue to evolve. One common tool in the engineering toolbox is messaging.

Messaging is based on the concept of reliable message queuing, where messages are queued asynchronously between client applications, with a “broker” acting as an intermediary between applications.

In the early days, brokers were pretty simple, but as needs have changed, so too have messaging systems. A distributed messaging system builds upon this concept and provides the benefits of reliability, scalability, and persistence, with multiple brokers that can help distribute the load.

Most distributed messaging systems support one of two types of semantics: streaming and queueing. Historically, each has been best suited for certain kinds of use cases. Apache Pulsar is unique in that it supports both streaming and queueing use cases.

Before exploring the benefits of using a unified streaming and queueing platform, it’s good to step back and look at queueing and streaming technologies individually.

Streaming platforms are a relatively new innovation in the industry, and they are well suited to moving large amounts of data in an ordered, relatively low-latency fashion. They are ideal for moving data (like logs, metrics, click data, etc.), centralizing it in one location, and doing so with high parallelism and throughput.

For example, imagine getting click or metrics data from 10,000 machines in a large cloud deployment — streaming platforms facilitate that.

A queueing platform is somewhat similar to a streaming platform in the sense that it’s about linking systems together. However, queueing systems have existed for a long time and are more often about point-to-point communication, allowing a wide range of applications to exchange information.

The access patterns are also different between the two systems. Streaming systems are focused on messages arriving in order and dealing with groups of messages that are processed together, perhaps for aggregation or transforming data.

In contrast, in a queueing system, events are typically handled one at a time, like in a work queue, where each message may represent some specific “task.” Put another way, while streaming is for moving and processing large amounts of data in groups, queueing is often about granular handling of individual messages to facilitate some work in a system.
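To make the two access patterns concrete, here is a minimal sketch in plain Python. It does not use any messaging library; the function names, the sample click counts, and the task names are all made up for illustration. The first function reads an ordered log in groups, the way a streaming job might aggregate data, and the second hands out individual tasks one at a time, the way a work queue would.

```python
from collections import deque


def stream_batches(log, batch_size):
    """Streaming style: walk an ordered log by offset and yield groups of messages."""
    for offset in range(0, len(log), batch_size):
        yield log[offset:offset + batch_size]


def work_queue(tasks, workers):
    """Queueing style: hand out tasks one at a time, round-robin across workers."""
    pending = deque(tasks)
    handled = {worker: [] for worker in workers}
    turn = 0
    while pending:
        task = pending.popleft()  # each task is taken off the queue exactly once
        worker = workers[turn % len(workers)]
        handled[worker].append(task)
        turn += 1
    return handled


# Streaming: aggregate click counts in ordered groups of three.
clicks = [3, 1, 4, 1, 5, 9]
batch_sums = [sum(batch) for batch in stream_batches(clicks, 3)]

# Queueing: each task goes to exactly one worker.
assignments = work_queue(
    ["resize-1", "email-2", "resize-3"],
    ["worker-a", "worker-b"],
)
```

In the streaming half, order and grouping are what matter, while in the queueing half the important property is that each task is delivered to exactly one worker.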

The most common streaming platforms are Apache Kafka and Amazon Web Services’ Kinesis. The most common queueing systems include RabbitMQ and ActiveMQ. In the cloud, there’s also Google Pub/Sub and AWS SQS and SNS.

Apache Pulsar Unifies Queueing and Streaming

First, a brief history lesson.

Pulsar was originally developed inside Yahoo around 2010, where there was a need for queueing workloads at a very high scale. Yahoo’s services were vast and spread across many different teams and data centers.

At the time, they were using the Java Message Service (JMS), a standard in the Java community. Yahoo needed a system that could facilitate JMS-style workloads but was designed in a more distributed, scalable way.

While the API was initially focused on messaging workloads, the architecture of this system also made it an ideal candidate for streaming workloads, allowing the Yahoo team to use the system very flexibly in a wide range of use cases.

This service, known as the “Cloud Messaging Service,” was very successful at Yahoo. It continued to be developed within Yahoo and was open sourced in 2016 as Pulsar. In 2018, the project became a top-level project in the Apache Software Foundation. Since then, Pulsar’s adoption has grown quickly. Many organizations, like Yahoo, have requirements for a more scalable messaging solution.

While streaming systems like Apache Kafka were capable of scaling (with a lot of manual effort around data rebalancing), the capabilities of a streaming API were not always the right fit. Developers had to work around the limitations of a pure streaming model while also learning a new way of thinking and designing, which made adoption for messaging use cases more difficult.

With Pulsar, the situation is different. Developers can use a familiar API that works in a familiar way while offering more scalability and the capabilities of a streaming system.
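One way to picture how a single system can serve both styles is a topic that keeps one ordered log and tracks a separate read position for each subscription. The toy model below sketches that idea in plain Python. It is not the Pulsar client API: the `ToyTopic` class, its method names, and the subscription names are hypothetical, and real Pulsar subscriptions add acknowledgments, redelivery, and persistence that are left out here.

```python
class ToyTopic:
    """A toy in-memory topic: one append-only log, many independent subscriptions."""

    def __init__(self):
        self.log = []
        self.cursors = {}  # subscription name -> index of the next unread message

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, name):
        # A new subscription starts at the beginning of the log in this sketch.
        self.cursors.setdefault(name, 0)

    def read_all(self, name):
        """Streaming-style read: return every unread message, in order."""
        start = self.cursors[name]
        self.cursors[name] = len(self.log)
        return self.log[start:]

    def receive_one(self, name):
        """Queue-style read: return the next unread message, or None if caught up."""
        position = self.cursors[name]
        if position >= len(self.log):
            return None
        self.cursors[name] = position + 1
        return self.log[position]


topic = ToyTopic()
topic.subscribe("analytics")  # one consumer that wants the full ordered stream
topic.subscribe("workers")    # several consumers sharing one subscription

for order in ["order-1", "order-2", "order-3"]:
    topic.publish(order)

# The analytics subscription sees the whole stream in order.
analytics_view = topic.read_all("analytics")

# Two workers take turns on the shared subscription; each message goes to one of them.
worker_a = [topic.receive_one("workers")]
worker_b = [topic.receive_one("workers")]
worker_a.append(topic.receive_one("workers"))
```

Because each subscription has its own cursor, the analytics reader and the worker pool consume the same published messages without interfering with each other.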

The need for scalable messaging plus streaming messaging is a challenge my team at Instructure faced. In the effort to solve this problem, we discovered Pulsar.

At Instructure, we were dealing with situations where we needed messaging at a higher scale. Initially, we tried to build this by re-architecting around streaming tech. Then, we found that Apache Pulsar was the perfect fit to help teams get the capabilities they needed but without the complexity of re-architecting around a streaming-based model.

When the teams at Instructure started using Pulsar, they saw its benefits immediately and adoption spread. The adoption of Pulsar at Instructure made it easier for applications to communicate and for us to share data across teams.

However, it isn’t just messaging workloads that Pulsar handles well. Streaming is a real need in most organizations, and here Pulsar offers a system that is easier to use, operate, and integrate than other streaming tech.

For example, Pulsar is easy to scale, without the need to “rebalance” your cluster when you need to increase your cluster size. It offers support for multi-tenancy and millions of topics without suffering from large increases in latency, making it easy for many teams to share the same cluster.

This means that organizations don’t need to invest lots of effort in their own tooling. Instead, they can focus on driving value from messages and data rather than managing infrastructure.

For Iterable (which provides a cross-channel marketing platform), Pulsar provided the right balance of scalability, reliability, and features to replace RabbitMQ and, ultimately, to replace its other messaging systems, including Kafka and Amazon SQS. As Iterable’s Greg Methvin notes, Apache Pulsar was not only great for streaming workloads; it was a great fit for their queueing needs as well.

Apache Pulsar Benefits

As those who have already adopted Apache Pulsar may have discovered, it provides more scalable message queuing capabilities than systems such as RabbitMQ or ActiveMQ, as well as more scalable and easier-to-use streaming capabilities, with built-in features such as geo-replication and multi-tenancy.

There is an added benefit, and it is based on simple math. One unified streaming and queueing platform is one less technology to run than separate streaming and queueing technologies. With one technology, it’s easier to develop products, get them to market, and get more utility out of the data that organizations already have.

Beyond the increased IT cost of running two separate technologies, the two are not well integrated, which results in data silos. With one unified system, application data, queueing data, data analytics, and streaming can all live in the same underlying system.

There’s no need to export data to another system or team because the organization is based around one central technology and one central store for data across its whole life cycle. With Apache Pulsar, users no longer have to manually deal with the different stages of the data life cycle, and architectures are simpler because data no longer has to be moved back and forth between separate systems.

Iterable Staff Engineer Greg Methvin sums it up perfectly: “Pulsar is unique in that it supports both streaming and queueing use cases, while also supporting a wide feature set that makes it a viable alternative to many other distributed messaging technologies currently being used in our architecture. Pulsar covers all of our use cases for Kafka, RabbitMQ, and SQS. This lets us focus on building expertise and tooling around a single unified system.”

Apparently, with Apache Pulsar, you can have your cake and eat it, too.