Elastic Data Processing with
Apache Flink and Apache Pulsar
Sijie Guo (sijieg)

2019-04-02
Who am I
• Apache Pulsar PMC Member

• Apache BookKeeper PMC Member

• Interested in technologies around Event Streaming
Agenda
• What is Apache Pulsar?

• A Pulsar View on Data - Segmented Stream

• Pulsar - Access Pattern & Tiered Storage

• Pulsar - Schema

• When Flink meets Pulsar
What is Apache Pulsar?
[Timeline of messaging systems, 2003-2012]
Pub/Sub Messaging
“Flexible Pub/Sub messaging
backed by durable log/stream storage”
Pulsar - Pub/Sub
Pulsar - Multi Tenancy
Pulsar - Queue + Streaming
Pulsar - Cloud Native
• Independent scalability
• Instant failure recovery
• No rebalancing on cluster expansion
Layered Architecture
A Pulsar View on Data
Batch - HDFS
Stream - Pub/Sub
A Flink View on Computing
“Batch processing is a special case of
Stream processing”
Pulsar = Segmented Stream
Topic
Producers append to a topic; consumers read from it over time.
Partitions
A topic is split into partitions (P0, P1, P2, P3), each written by producers and read by consumers over time.
Segments
Each partition is further split into segments (Segment 1, Segment 2, Segment 3, …).
Stream
A partition's segments, viewed together, form one continuous stream:
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4
Segmented Stream
• Segmented stream systems

• Apache Pulsar, Twitter EventBus, EMC Pravega

• All based on Apache BookKeeper

• Each uses BookKeeper differently

• Pulsar, EventBus - use BookKeeper as the segment store

• Pravega - uses BookKeeper as the journal only
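The segmented-stream idea can be sketched as a toy model. This is plain Python, not Pulsar's API; the class name, the entries-per-segment threshold, and the rollover rule are all illustrative assumptions:

```python
# Conceptual sketch: a topic partition modeled as a sequence of segments,
# as in BookKeeper-based segmented-stream systems. Not Pulsar's real API.

class SegmentedStream:
    def __init__(self, entries_per_segment=3):
        self.entries_per_segment = entries_per_segment
        self.segments = [[]]  # list of segments, each a list of entries

    def append(self, entry):
        # Roll over to a new segment once the current one is full.
        if len(self.segments[-1]) >= self.entries_per_segment:
            self.segments.append([])
        self.segments[-1].append(entry)
        return (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, segment_id, entry_id):
        # An entry is addressed by (segment, entry-within-segment).
        return self.segments[segment_id][entry_id]

stream = SegmentedStream()
positions = [stream.append(f"msg-{i}") for i in range(7)]
print(len(stream.segments))  # 3 segments
print(positions[-1])         # (2, 0)
```

Because each sealed segment is an independent unit, segments can be replicated, stored, and read independently of one another, which is what the BookKeeper layer exploits.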
Access Patterns

✓ Write

✓ Tailing Read

✓ Catchup Read

Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 (over time)
Writes append at the tail of the open segment; tailing reads follow the tail; catchup reads replay older segments.
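The three access patterns can be illustrated with a toy log. The log is a plain Python list standing in for a stream, and the function names are illustrative, not a real client API:

```python
# Conceptual sketch of the three stream access patterns. Not Pulsar's API.

log = []

def write(entry):
    log.append(entry)       # Write: append at the tail

def tailing_read(cursor):
    new = log[cursor:]      # Tailing read: consume only entries past the cursor
    return new, len(log)    # return new entries and the advanced cursor

def catchup_read():
    return list(log)        # Catchup read: scan from the beginning

for i in range(3):
    write(f"e{i}")
recent, cursor = tailing_read(1)
print(recent)           # ['e1', 'e2'] -> entries after the cursor
print(catchup_read())   # ['e0', 'e1', 'e2'] -> full history
```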
Kafka vs. Pulsar: Write, Tailing Read, Catchup Read, IO Isolation

Kafka: each partition lives on a broker as the leader, with follower replicas on other brokers; writes, tailing reads, and catchup reads all go through the partition leader.

Pulsar: brokers are stateless and segments are stored and replicated in the BookKeeper layer, so the write, tailing-read, and catchup-read paths are isolated from one another.
Infinite Stream

✓ Write

✓ Tailing Read

✓ Catchup Read

Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 (over time)
Recent segments live on bookies and are served by brokers; older segments move to tiered storage, so the stream can grow without bound.
Tiered Storage

• Offloader

• When: size-based, time-based, or triggered manually via pulsar-admin

• How: copy a segment to tiered storage, then delete it from BookKeeper

• Access: the broker knows how to read offloaded data back, or readers can bypass the broker and read the offloaded segments directly

• Available offloaders

• Cloud offloaders: AWS S3, GCS, Azure, …

• HDFS, Ceph, …
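A size-based offload trigger can be sketched as follows. This is a conceptual model of the policy, not Pulsar's implementation; the threshold value and function name are illustrative assumptions:

```python
# Conceptual sketch of a size-based offload decision. Sealed segments beyond
# the size threshold are moved to tiered storage; the newest segments stay
# on bookies. Not Pulsar's real offloader code.

def offload(segments, threshold_bytes):
    """segments: oldest-first list of (segment_id, size_bytes) on bookies.
    Keeps the newest segments that fit under the threshold; offloads the rest."""
    kept, total = [], 0
    for seg_id, size in reversed(segments):  # walk newest-first
        if total + size <= threshold_bytes:
            kept.append((seg_id, size))
            total += size
        else:
            break
    kept.reverse()
    offloaded = segments[:len(segments) - len(kept)]
    return offloaded, kept

segments = [(1, 100), (2, 100), (3, 100), (4, 100)]
offloaded, kept = offload(segments, threshold_bytes=250)
print([s for s, _ in offloaded])  # [1, 2] -> moved to tiered storage
print([s for s, _ in kept])       # [3, 4] -> stay on bookies
```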
Stream as a Unified View on Data
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
Producers write to the tail, consumers follow it, and segment readers access the sealed segments directly.
Data Processing on Pulsar
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
Ranges of sealed segments form bounded streams; the live topic as a whole is an unbounded stream.
When Flink meets Pulsar
Goals
• Flink + Pulsar

• Streaming Connectors

• Source Connectors

• PulsarCatalog: Schema Integration

• PulsarStateBackend

• Pulsar for the unified view of Data, Flink for the unified view of Computing
Done
Streaming Source -> Streaming Sink
Streaming Source -> Streaming Table Sink
Batch Sink
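The "streaming source → sink" shape above can be caricatured in a few lines. This is purely illustrative: every name below is a stand-in, and real code would use Flink's DataStream API with a Pulsar source and sink connector:

```python
# Toy "source -> transform -> sink" pipeline showing the shape of a streaming
# connector. All names are stand-ins, not Flink or Pulsar APIs.

def run_pipeline(source, transform, sink):
    for record in source:              # source connector: pull records in order
        sink.append(transform(record)) # apply the user function, emit to the sink

topic = ["job-1", "job-2", "job-3"]    # pretend Pulsar topic
results = []                           # pretend sink
run_pipeline(topic, str.upper, results)
print(results)  # ['JOB-1', 'JOB-2', 'JOB-3']
```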
Case Study - Zhaopin.com
Zhaopin.com
Zhaopin.com is the largest online recruitment service provider in China.
It offers job seekers a comprehensive resume service, the latest employment and career-development information, and in-depth online job search for positions throughout China.
Zhaopin.com provides professional HR services to over 2.2 million clients, and its average daily page views exceed 68 million.
Job Search
Data Processing
Metrics
50+ Namespaces
3000+ Topics
6+ billion Messages per day
3TB Storage per day
20+ Core Services
Roadmap
Batch Source
• Read Segments in Parallel

• Bypass Brokers

• Access tiered storage directly

• Scan Trimmer

• Select Segments by Publish Time
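The "select segments by publish time" roadmap item can be sketched as a filter over segment metadata. This is a conceptual model under the assumption that each segment records the publish-time range of its entries; the names are illustrative:

```python
# Conceptual sketch of segment selection by publish time: a batch source can
# skip whole segments whose time range lies outside the query window.

def select_segments(segments, start_ts, end_ts):
    """segments: list of (segment_id, first_publish_ts, last_publish_ts).
    Returns ids of segments whose time range overlaps [start_ts, end_ts]."""
    return [seg_id for seg_id, first, last in segments
            if last >= start_ts and first <= end_ts]

segments = [(1, 0, 9), (2, 10, 19), (3, 20, 29), (4, 30, 39)]
print(select_segments(segments, start_ts=15, end_ts=25))  # [2, 3]
```

Because segments not in the window are skipped entirely, the scan is trimmed before any data is read.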
Schema Integration
• Pulsar has a built-in schema registry

• Primitive types, Avro, JSON, Protobuf, …

• Schema evolution & multi-version schemas

• PulsarCatalog
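Multi-version schemas can be modeled as an append-only list of versions per topic. This is a conceptual sketch, not Pulsar's registry API; the class and method names are illustrative:

```python
# Conceptual sketch of a multi-version schema registry. Each registration
# appends a new version; consumers can look up any version. Not Pulsar's API.

class SchemaRegistry:
    def __init__(self):
        self.schemas = {}  # topic -> list of schema definitions, index = version

    def register(self, topic, schema):
        versions = self.schemas.setdefault(topic, [])
        versions.append(schema)
        return len(versions) - 1       # new version number

    def get(self, topic, version=-1):  # default: latest version
        return self.schemas[topic][version]

reg = SchemaRegistry()
v0 = reg.register("jobs", {"fields": ["title"]})
v1 = reg.register("jobs", {"fields": ["title", "salary"]})  # evolved schema
print(v0, v1)               # 0 1
print(reg.get("jobs"))      # {'fields': ['title', 'salary']}
print(reg.get("jobs", v0))  # {'fields': ['title']}
```

Keeping old versions around is what lets consumers decode records written under an earlier schema after the topic's schema has evolved.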
State Backend
• BookKeeperStateBackend

• Save State as Segments to BookKeeper
Unified Data Processing
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
A query spans past, now, and future: parallel segment reads serve the past (Segment 1 through Segment 4), while pub/sub streaming reads serve the present and future (Segment 5, Segment 6).
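The past/now/future split can be sketched as one query plan: sealed segments are read in parallel, then the open tail is consumed as a stream. A toy model, not real Pulsar or Flink code; the segment contents are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of a unified query: sealed segments are read in parallel (past),
# while the open tail segment is consumed as a stream (now/future).

sealed = [["a", "b"], ["c"], ["d", "e"]]  # sealed segments: parallel reads
tail = iter(["f", "g"])                   # open segment: streaming read

with ThreadPoolExecutor() as pool:
    # pool.map preserves segment order, so results stay in stream order.
    past = [e for seg in pool.map(list, sealed) for e in seg]
live = list(tail)
print(past + live)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```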
Community

✓ Twitter: @apache_pulsar

✓ WeChat subscription: ApachePulsar

✓ Mailing lists: dev@pulsar.apache.org, users@pulsar.apache.org

✓ Slack: https://apache-pulsar.slack.com

✓ Localization: https://crowdin.com/project/apache-pulsar

✓ GitHub: https://github.com/apache/pulsar, https://github.com/apache/bookkeeper
