Distributed pub/sub platform
github.com/yahoo/pulsar
Matteo Merli — mmerli@yahoo-inc.com
Bay Area Hadoop Meetup — 10/19/2016
What is Pulsar?
▪ Hosted multi-tenant pub/sub messaging platform
▪ Simple messaging model
▪ Horizontally scalable - Topics, Message throughput
▪ Ordering, durability & delivery guarantees
▪ Geo-replication
▪ Easy to operate (Add capacity, replace machines)
▪ A few numbers from production usage:
› 1.5 years — 1.4 M topics — 100 B msg/day — Zero data loss
› Average publish latency < 5ms, 99pct 15ms
› 80+ applications onboarded — Self-serve provisioning
› Presence in 8 data centers
Common use cases
▪ Application integration
› Server-to-server control, status propagation, notifications
▪ Persistent queue
› Stream processing, buffering, feed ingestion, task dispatching
▪ Message bus for large scale data stores
› Durable log
› Replication within and across geo-locations
Main features
▪ REST / Java / Command line administrative APIs (see the admin sketch after this list)
› Provision users / grant permissions
› User self-administration
› Metrics for topic / broker usage
▪ Multi tenancy
› Authentication / Authorization
› Storage quota management
› Tenant isolation policies
› Message TTL
› Backlog and subscriptions management tools
▪ Message retention and replay
› Rollback to redeliver already acknowledged messages
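To make the administration features above concrete, here is a minimal, hypothetical Java sketch of creating a namespace and setting a message TTL. It assumes the com.yahoo.pulsar package names of this release, the broker admin URL used in the client examples later in this deck, and that PulsarAdmin exposes the namespaces() methods named below; it is an illustration, not an example taken from the deck.

import java.net.URL;
import com.yahoo.pulsar.client.admin.PulsarAdmin;
import com.yahoo.pulsar.client.api.ClientConfiguration;

public class AdminSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a broker's HTTP admin endpoint (hypothetical host)
        URL serviceUrl = new URL("http://broker.usw.example.com:8080");
        PulsarAdmin admin = new PulsarAdmin(serviceUrl, new ClientConfiguration());

        // Self-serve provisioning: create a namespace for the tenant "my-prop"
        admin.namespaces().createNamespace("my-prop/us-west/my-ns");

        // Message TTL: expire unacknowledged messages after 2 hours
        admin.namespaces().setNamespaceMessageTTL("my-prop/us-west/my-ns", 7200);

        admin.close();
    }
}

The same operations are also exposed through the REST and command-line admin interfaces mentioned above.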
Why build a new system?
▪ No existing solution to satisfy requirements
› Multi tenant — 1M topics — Low latency — Durability — Geo replication
▪ Kafka doesn’t scale well with many topics:
› Storage model based on individual directory per topic partition
› Enabling durability kills performance
▪ Ability to manage large backlogs
▪ Operations are not very convenient
› e.g. replacing a server requires manual commands to copy the data and involves the clients
› Client access to the ZK clusters is not desirable
▪ No scalable support for keeping consumer positions
Messaging Model
[Diagram: Producer-X and Producer-Y publish to Topic-T. Subscription-A is exclusive, with Consumer-A1 attached; Subscription-B is shared across Consumer-B1, Consumer-B2 and Consumer-B3. Consumer-A1 receives all messages published on T; B1, B2, B3 receive one third each.]
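As a minimal sketch of the two subscription modes in the diagram (assuming the ConsumerConfiguration and SubscriptionType names from this version of the Java client, plus the example broker URL and topic used on the next slide):

import com.yahoo.pulsar.client.api.*;

public class SubscriptionModes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.create("http://broker.usw.example.com:8080");
        String topic = "persistent://my-prop/us-west/my-ns/my-topic";

        // Subscription-A is exclusive (the default type):
        // Consumer-A1 receives every message published on the topic
        Consumer consumerA1 = client.subscribe(topic, "subscription-A");

        // Subscription-B is shared: messages are distributed
        // across Consumer-B1, B2 and B3
        ConsumerConfiguration shared = new ConsumerConfiguration();
        shared.setSubscriptionType(SubscriptionType.Shared);
        Consumer consumerB1 = client.subscribe(topic, "subscription-B", shared);
        Consumer consumerB2 = client.subscribe(topic, "subscription-B", shared);
        Consumer consumerB3 = client.subscribe(topic, "subscription-B", shared);

        client.close();
    }
}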
Client API
Producer
PulsarClient client = PulsarClient.create(
        "http://broker.usw.example.com:8080");

Producer producer = client.createProducer(
        "persistent://my-prop/us-west/my-ns/my-topic");

// Handles retries in case of failure
producer.send("my-message".getBytes());

// Async version:
producer.sendAsync("my-message".getBytes())
        .thenAccept(msgId -> {
            // Message was persisted
        });
Consumer
PulsarClient client = PulsarClient.create(
        "http://broker.usw.example.com:8080");

Consumer consumer = client.subscribe(
        "persistent://my-prop/us-west/my-ns/my-topic",
        "my-subscription-name");

while (true) {
    // Wait for a message
    Message msg = consumer.receive();

    // Process message ...

    // Acknowledge the message so that it can be deleted by the broker
    consumer.acknowledge(msg);
}
Main client library features
▪ Sync / Async operations
▪ Partitioned topics
▪ Transparent batching of messages
▪ Compression
▪ End-to-end checksum
▪ TLS encryption
▪ Individual and cumulative acknowledgment (see the sketch after this list)
▪ Client side stats
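To illustrate cumulative acknowledgment, here is a small sketch; it reuses the topic and subscription from the Client API slide and assumes the acknowledgeCumulative method is available in this version of the consumer API.

import com.yahoo.pulsar.client.api.*;

public class CumulativeAck {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.create("http://broker.usw.example.com:8080");
        Consumer consumer = client.subscribe(
                "persistent://my-prop/us-west/my-ns/my-topic",
                "my-subscription-name");

        Message last = null;
        for (int i = 0; i < 100; i++) {
            last = consumer.receive();
            // ... process message ...
        }

        // Cumulative acknowledgment: one call marks `last` and every earlier
        // message on this subscription as consumed, instead of acknowledging
        // each message individually
        consumer.acknowledgeCumulative(last);

        client.close();
    }
}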
Architecture
Separate layers between brokers and storage (bookies):
‣ Brokers and bookies can be added independently
‣ Traffic can be shifted very quickly across brokers
‣ New bookies will ramp up on traffic quickly
[Diagram: a Pulsar cluster; a producer and a consumer connect to Brokers 1 to 3, the brokers persist data on Bookies 1 to 5, and ZK coordinates the cluster]
Architecture
[Diagram: inside a broker of a Pulsar cluster; service discovery and a load balancer route producer and consumer apps (via the Pulsar lib) to a broker; the broker's managed ledger uses the BK client to read and write bookies, with a cache and dispatcher serving consumers; global replicators handle replication to remote clusters; local ZK and global ZK store metadata]
Broker:
‣ End-to-end async message processing
‣ Messages are relayed across producers, bookies and consumers with no copies
‣ Pooled ref-counted buffers
‣ Cache of recent messages
BookKeeper
▪ Replicated log service
▪ Offers consistency and durability
▪ Why is it a good choice for Pulsar?
› Very efficient storage for sequential data
› For each topic, multiple ledgers are created over time
› Very good distribution of IO across all bookies
› Isolation of writes and reads
› Flexible model for quorum writes with different tradeoffs (see the sketch below)
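The flexible quorum model is easiest to see in BookKeeper's own ledger API. A minimal sketch, assuming a ZooKeeper ensemble at zk1.example.com:2181 (Pulsar's managed ledger drives this API internally; applications normally never call it directly):

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the bookies via their ZooKeeper ensemble
        BookKeeper bk = new BookKeeper("zk1.example.com:2181");

        // Ensemble size 3, write quorum 3, ack quorum 2: every entry is
        // written to 3 bookies and the write completes once 2 of them
        // have persisted it, trading a little durability for latency
        LedgerHandle ledger = bk.createLedger(3, 3, 2,
                DigestType.CRC32, "secret".getBytes());

        ledger.addEntry("entry data".getBytes());

        ledger.close();
        bk.close();
    }
}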
BookKeeper - Storage
▪ A single bookie can serve and store thousands of ledgers
▪ Writes go to the journal; reads come from the ledger device:
› Prevents read activity from impacting write latency
› Writes are added to an in-memory write cache and committed to the journal
› The write cache is flushed in the background to the separate ledger device
▪ Entries are sorted to allow for mostly sequential reads
Performance — Single topic throughput and latency
Throughput and 99pct publish latency — 1 Topic — 1 Producer
[Chart: 99pct publish latency (ms, 0 to 6) vs. throughput (msg/s, 1,000 to 10,000,000, peaking around 1,800,000 msg/s) for 10-byte, 100-byte and 1 KB messages]
Final Remarks
• Check out the code and docs at github.com/yahoo/pulsar
• Give feedback or ask for more details on mailing lists:
• Pulsar-Users
• Pulsar-Dev
