streamnative.io
Stream/Segment - best way to access events in Pulsar
Neng Lu
Who Am I
❏ StreamNative Software Engineer
❏ Ex-Twitter
❏ Contributed to Apache Projects - Heron, Pulsar
❏ Interested in event streaming technologies
Pulsar 1.X
Apache Pulsar
“Flexible Pub/Sub Messaging Backed by Durable Log Storage”
Pulsar 2.X
Apache Pulsar
“Cloud-native Messaging and Event Streaming Platform”
Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing
Data Processing with Apache Pulsar
Data Processing Categories
❏ Interactive
❏ Time critical
❏ Medium data size
❏ Rerun on failures
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
❏ Streaming
❏ Long running jobs
❏ Time critical
❏ Scalable as well as fault tolerant
❏ Serverless
❏ Simple, light-weight processing
❏ Processing data with high velocity
Apache Pulsar Layered Architecture
Stateless Serving
Durable Storage
Pulsar Messaging API
❏ Read data from brokers with different Subscription Modes
❏ Consume / Seek / Receive
❏ Reprocessing data by rewinding (seeking) the cursors
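The receive/seek behaviour listed above can be illustrated with a toy in-memory model. This is only a sketch and not the Pulsar client API: `ToyTopic`, `ToyCursor`, and their methods are hypothetical names, and positions are plain list indexes rather than Pulsar message IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Append-only list of messages; stands in for a topic partition."""
    messages: list = field(default_factory=list)

    def publish(self, payload):
        self.messages.append(payload)
        return len(self.messages) - 1  # position of the new message


@dataclass
class ToyCursor:
    """Tracks the next position a subscription will read (hypothetical)."""
    topic: ToyTopic
    position: int = 0

    def receive(self):
        """Return the next message and advance, or None if caught up."""
        if self.position >= len(self.topic.messages):
            return None
        msg = self.topic.messages[self.position]
        self.position += 1
        return msg

    def seek(self, position):
        """Move the cursor; seeking backwards causes reprocessing."""
        if not 0 <= position <= len(self.topic.messages):
            raise ValueError("position out of range")
        self.position = position


topic = ToyTopic()
for event in ["a", "b", "c"]:
    topic.publish(event)

cursor = ToyCursor(topic)
first_pass = [cursor.receive() for _ in range(3)]
cursor.seek(1)  # rewind to reprocess from the second event
second_pass = [cursor.receive() for _ in range(2)]
```

Because the log itself is never modified, rewinding costs nothing more than moving an index; the same events are simply delivered again.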
Subscription Mode
❏ Exclusive
❏ Failover
❏ Shared
❏ Key_Shared
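As a rough illustration of how the four modes differ, the sketch below assigns messages to consumers in a toy dispatcher. `dispatch` is a hypothetical helper, not part of any Pulsar client; in particular, the modulo-on-CRC32 key routing is an assumption made for brevity and is simpler than what a real broker does.

```python
import zlib
from collections import defaultdict


def dispatch(mode, consumers, messages):
    """Assign (key, value) messages to consumers under a toy version of each mode."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assigned = defaultdict(list)
    if mode == "Exclusive":
        if len(consumers) > 1:
            raise ValueError("Exclusive allows a single consumer")
        for msg in messages:
            assigned[consumers[0]].append(msg)
    elif mode == "Failover":
        # The first consumer is active; the others are standbys.
        for msg in messages:
            assigned[consumers[0]].append(msg)
    elif mode == "Shared":
        # Round-robin across all attached consumers.
        for i, msg in enumerate(messages):
            assigned[consumers[i % len(consumers)]].append(msg)
    elif mode == "Key_Shared":
        # Messages with the same key always land on the same consumer.
        for key, value in messages:
            slot = zlib.crc32(key.encode()) % len(consumers)
            assigned[consumers[slot]].append((key, value))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(assigned)


msgs = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4)]
shared = dispatch("Shared", ["c1", "c2"], msgs)
key_shared = dispatch("Key_Shared", ["c1", "c2"], msgs)
failover = dispatch("Failover", ["c1", "c2"], msgs)
```

Shared spreads load but gives up ordering across consumers; Key_Shared keeps per-key ordering while still allowing several consumers to work in parallel.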
Pulsar Segment API
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
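The two ideas above, per-segment parallelism and pushdown on publish timestamp, can be sketched with a toy reader. Everything here is hypothetical (`ToySegment`, `read_range`); the sketch assumes each segment is non-empty and that publish timestamps do not decrease within a segment, so the first and last entries bound its time range.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySegment:
    segment_id: int
    entries: tuple  # (publish_timestamp, payload) pairs in append order

    @property
    def min_ts(self):
        return self.entries[0][0]

    @property
    def max_ts(self):
        return self.entries[-1][0]


def read_segment(segment, start_ts, end_ts):
    """Scan one segment and keep entries published in [start_ts, end_ts]."""
    return [(ts, p) for ts, p in segment.entries if start_ts <= ts <= end_ts]


def read_range(segments, start_ts, end_ts, workers=4):
    # Pushdown: skip segments whose timestamp range cannot match.
    candidates = [s for s in segments
                  if s.max_ts >= start_ts and s.min_ts <= end_ts]
    # Fine-grained parallelism: one task per remaining segment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: read_segment(s, start_ts, end_ts), candidates)
        rows = [row for part in parts for row in part]
    return [s.segment_id for s in candidates], rows


segments = [
    ToySegment(1, ((100, "a"), (110, "b"))),
    ToySegment(2, ((120, "c"), (130, "d"))),
    ToySegment(3, ((140, "e"), (150, "f"))),
]
scanned, rows = read_range(segments, 125, 145)
```

Segment 1 is never opened because its newest entry is older than the requested range; only segments 2 and 3 are scanned, each by its own worker.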
Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
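A minimal sketch of this layout, assuming a fixed entry count per segment and a simple rotating choice of bookies (both are simplifications chosen for illustration; `ToyLedger` and `ToyManagedLedger` are hypothetical names, not BookKeeper classes, and `replicas` is assumed to be no larger than the number of bookies):

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyLedger:
    ledger_id: int
    bookies: tuple  # replicas that store every entry of this segment
    entries: list = field(default_factory=list)
    closed: bool = False

    def append(self, entry):
        if self.closed:
            raise RuntimeError("closed ledgers are immutable")
        self.entries.append(entry)


class ToyManagedLedger:
    """One topic partition stored as an ordered list of ledgers (segments)."""

    def __init__(self, bookies, replicas=2, max_entries=2):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.max_entries = max_entries
        self.ledgers = []
        self._ids = count(1)
        self._roll()

    def _roll(self):
        """Close the current ledger and open a new one on a rotated bookie set."""
        if self.ledgers:
            self.ledgers[-1].closed = True
        ledger_id = next(self._ids)
        start = ledger_id % len(self.bookies)
        chosen = tuple(self.bookies[(start + i) % len(self.bookies)]
                       for i in range(self.replicas))
        self.ledgers.append(ToyLedger(ledger_id, chosen))

    def append(self, entry):
        # Only the newest ledger accepts writes (single writer, append-only).
        if len(self.ledgers[-1].entries) >= self.max_entries:
            self._roll()
        self.ledgers[-1].append(entry)


ml = ToyManagedLedger(["bookie1", "bookie2", "bookie3"])
for n in range(5):
    ml.append(f"msg-{n}")
layout = [(l.ledger_id, l.bookies, l.entries) for l in ml.ledgers]
```

Because each segment picks its bookies when it is opened, the partition as a whole is spread over the cluster, while older segments stay where they were written.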
Tiered Storage
❏ Long retention
❏ Low cost
❏ Easy to access
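The idea can be sketched as a two-tier store in which older closed segments move to cheaper storage while readers keep addressing them by segment id. `ToyTieredStore` is a hypothetical class, and the count-based offload rule is an assumption made to keep the example short; it is not how Pulsar decides when to offload.

```python
class ToyTieredStore:
    """Keeps recent segments in a hot tier and offloads older closed ones."""

    def __init__(self, hot_limit=2):
        self.hot_limit = hot_limit  # max closed segments kept on the hot tier
        self.hot = {}   # segment_id -> entries (stands in for bookies)
        self.cold = {}  # segment_id -> entries (stands in for an object store)

    def add_closed_segment(self, segment_id, entries):
        self.hot[segment_id] = list(entries)
        self._offload()

    def _offload(self):
        # Move the oldest segments to the cold tier until the hot tier fits.
        while len(self.hot) > self.hot_limit:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, segment_id):
        """Readers ask by segment id; the tier is an implementation detail."""
        if segment_id in self.hot:
            return self.hot[segment_id]
        return self.cold[segment_id]


store = ToyTieredStore(hot_limit=2)
for seg_id in range(1, 5):
    store.add_closed_segment(seg_id, [f"s{seg_id}-e0", f"s{seg_id}-e1"])
```

After four segments, the two oldest live in the cold tier and the two newest in the hot tier, yet `store.read(1)` and `store.read(4)` are called the same way.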
Apache Pulsar Data APIs
[Diagram: Producer and Consumer use the Messaging API through Broker 1, Broker 2, and Broker 3; the Segment API reads from Bookie1 to Bookie5 and from tiered storage (Hadoop, GCS, S3)]
Pulsar - Infinite Event Stream Storage
Pulsar - Topic
Pulsar - Topic Partitions
Pulsar - Segments
Pulsar - Stream
Benefits
❏ Unlimited Topic Partition Storage
❏ Instant Scaling without Data Rebalancing
❏ Broker Failure Recovery
❏ Bookie Failure Recovery
❏ Cluster Expansion
❏ Low latency reading for messaging data
❏ High throughput reading for batch data
❏ Reduced cost for whole data storage
Pulsar SQL Case
Pulsar Flink Case
[Diagram: Flink Job1 and Flink Job2, with segments numbered 1 to 12]
Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists: dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub: https://github.com/apache/pulsar
❏ Medium: https://medium.com/streamnative
Thank You!
