Elastic Data Processing with
Apache Flink and Apache Pulsar
Sijie Guo (sijieg)

2019-04-02
Who am I
• Apache Pulsar PMC Member

• Apache BookKeeper PMC Member

• Interested in technologies around Event Streaming
Agenda
• What is Apache Pulsar?

• A Pulsar View on Data - Segmented Stream

• Pulsar - Access Pattern & Tiered Storage

• Pulsar - Schema

• When Flink meets Pulsar
What is Apache Pulsar?
[Timeline of messaging systems, 2003-2012]
Pub/Sub Messaging
“Flexible Pub/Sub messaging
backed by durable log/stream storage”
Pulsar - Pub/Sub
Pulsar - Multi Tenancy
Pulsar - Queue + Streaming
Pulsar - Cloud Native
• Independent scalability
• Instant failure recovery
• No rebalancing on cluster expansion
Layered Architecture
A Pulsar View on Data
Batch - HDFS
Stream - Pub/Sub
A Flink View on Computing
“Batch processing is a special case of
Stream processing”
Pulsar = Segmented Stream
Topic
Producers append to a topic; consumers read from it over time.
Partitions
A topic is split into partitions (P0, P1, P2, P3), each written by producers and read by consumers over time.
Segments
Each partition is further split into segments (Segment 1, Segment 2, Segment 3, …).
Stream
A partition's segments, viewed together, form one continuous stream:
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4
Segmented Stream
• Segmented stream systems

• Apache Pulsar, Twitter EventBus, EMC Pravega

• All based on Apache BookKeeper

• Each uses BookKeeper differently

• Pulsar, EventBus - use BookKeeper as the segment store

• Pravega - uses BookKeeper as the journal only
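The segmented-stream idea can be sketched as a toy model. This is plain Python, not Pulsar's API; the class name, the entries-per-segment threshold, and the rollover rule are all illustrative assumptions:

```python
# Conceptual sketch: a topic partition modeled as a sequence of segments,
# as in BookKeeper-based segmented-stream systems. Not Pulsar's real API.

class SegmentedStream:
    def __init__(self, entries_per_segment=3):
        self.entries_per_segment = entries_per_segment
        self.segments = [[]]  # list of segments, each a list of entries

    def append(self, entry):
        # Roll over to a new segment once the current one is full.
        if len(self.segments[-1]) >= self.entries_per_segment:
            self.segments.append([])
        self.segments[-1].append(entry)
        return (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, segment_id, entry_id):
        # An entry is addressed by (segment, entry-within-segment).
        return self.segments[segment_id][entry_id]

stream = SegmentedStream()
positions = [stream.append(f"msg-{i}") for i in range(7)]
print(len(stream.segments))  # 3 segments
print(positions[-1])         # (2, 0)
```

Because each sealed segment is an independent unit, segments can be replicated, stored, and read independently of one another, which is what the BookKeeper layer exploits.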
Access Patterns

✓ Write

✓ Tailing Read

✓ Catchup Read

Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 (over time)
Writes append at the tail of the open segment; tailing reads follow the tail; catchup reads replay older segments.
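The three access patterns can be illustrated with a toy log. The log is a plain Python list standing in for a stream, and the function names are illustrative, not a real client API:

```python
# Conceptual sketch of the three stream access patterns. Not Pulsar's API.

log = []

def write(entry):
    log.append(entry)       # Write: append at the tail

def tailing_read(cursor):
    new = log[cursor:]      # Tailing read: consume only entries past the cursor
    return new, len(log)    # return new entries and the advanced cursor

def catchup_read():
    return list(log)        # Catchup read: scan from the beginning

for i in range(3):
    write(f"e{i}")
recent, cursor = tailing_read(1)
print(recent)           # ['e1', 'e2'] -> entries after the cursor
print(catchup_read())   # ['e0', 'e1', 'e2'] -> full history
```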
Kafka vs. Pulsar: Write, Tailing Read, Catchup Read, IO Isolation

Kafka: each partition lives on a broker as the leader, with follower replicas on other brokers; writes, tailing reads, and catchup reads all go through the partition leader.

Pulsar: brokers are stateless and segments are stored and replicated in the BookKeeper layer, so the write, tailing-read, and catchup-read paths are isolated from one another.
Infinite Stream

✓ Write

✓ Tailing Read

✓ Catchup Read

Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 (over time)
Recent segments live on bookies and are served by brokers; older segments move to tiered storage, so the stream can grow without bound.
Tiered Storage

• Offloader

• When: size-based, time-based, or triggered manually via pulsar-admin

• How: copy a segment to tiered storage, then delete it from BookKeeper

• Access: the broker knows how to read offloaded data back, or readers can bypass the broker and read the offloaded segments directly

• Available offloaders

• Cloud offloaders: AWS S3, GCS, Azure, …

• HDFS, Ceph, …
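A size-based offload trigger can be sketched as follows. This is a conceptual model of the policy, not Pulsar's implementation; the threshold value and function name are illustrative assumptions:

```python
# Conceptual sketch of a size-based offload decision. Sealed segments beyond
# the size threshold are moved to tiered storage; the newest segments stay
# on bookies. Not Pulsar's real offloader code.

def offload(segments, threshold_bytes):
    """segments: oldest-first list of (segment_id, size_bytes) on bookies.
    Keeps the newest segments that fit under the threshold; offloads the rest."""
    kept, total = [], 0
    for seg_id, size in reversed(segments):  # walk newest-first
        if total + size <= threshold_bytes:
            kept.append((seg_id, size))
            total += size
        else:
            break
    kept.reverse()
    offloaded = segments[:len(segments) - len(kept)]
    return offloaded, kept

segments = [(1, 100), (2, 100), (3, 100), (4, 100)]
offloaded, kept = offload(segments, threshold_bytes=250)
print([s for s, _ in offloaded])  # [1, 2] -> moved to tiered storage
print([s for s, _ in kept])       # [3, 4] -> stay on bookies
```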
Stream as a Unified View on Data
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
Producers write to the tail, consumers follow it, and segment readers access the sealed segments directly.
Data Processing on Pulsar
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
Ranges of sealed segments form bounded streams; the live topic as a whole is an unbounded stream.
When Flink meets Pulsar
Goals
• Flink + Pulsar

• Streaming Connectors

• Source Connectors

• PulsarCatalog: Schema Integration

• PulsarStateBackend

• Pulsar for the unified view of Data, Flink for the unified view of Computing
Done
Streaming Source -> Streaming Sink
Streaming Source -> Streaming Table Sink
Batch Sink
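The "streaming source → sink" shape above can be caricatured in a few lines. This is purely illustrative: every name below is a stand-in, and real code would use Flink's DataStream API with a Pulsar source and sink connector:

```python
# Toy "source -> transform -> sink" pipeline showing the shape of a streaming
# connector. All names are stand-ins, not Flink or Pulsar APIs.

def run_pipeline(source, transform, sink):
    for record in source:              # source connector: pull records in order
        sink.append(transform(record)) # apply the user function, emit to the sink

topic = ["job-1", "job-2", "job-3"]    # pretend Pulsar topic
results = []                           # pretend sink
run_pipeline(topic, str.upper, results)
print(results)  # ['JOB-1', 'JOB-2', 'JOB-3']
```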
Case Study - Zhaopin.com
Zhaopin.com
Zhaopin.com is the largest online recruitment service provider in China.
It offers job seekers a comprehensive resume service, the latest employment and career-development information, and in-depth online job search for positions throughout China.
Zhaopin.com provides professional HR services to over 2.2 million clients, and its average daily page views exceed 68 million.
Job Search
Data Processing
Metrics
50+ Namespaces
3000+ Topics
6+ billion Messages per day
3TB Storage per day
20+ Core Services
Roadmap
Batch Source
• Read Segments in Parallel

• Bypass Brokers

• Access tiered storage directly

• Scan Trimmer

• Select Segments by Publish Time
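The "select segments by publish time" roadmap item can be sketched as a filter over segment metadata. This is a conceptual model under the assumption that each segment records the publish-time range of its entries; the names are illustrative:

```python
# Conceptual sketch of segment selection by publish time: a batch source can
# skip whole segments whose time range lies outside the query window.

def select_segments(segments, start_ts, end_ts):
    """segments: list of (segment_id, first_publish_ts, last_publish_ts).
    Returns ids of segments whose time range overlaps [start_ts, end_ts]."""
    return [seg_id for seg_id, first, last in segments
            if last >= start_ts and first <= end_ts]

segments = [(1, 0, 9), (2, 10, 19), (3, 20, 29), (4, 30, 39)]
print(select_segments(segments, start_ts=15, end_ts=25))  # [2, 3]
```

Because segments not in the window are skipped entirely, the scan is trimmed before any data is read.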
Schema Integration
• Pulsar has a built-in schema registry

• Primitive types, Avro, JSON, Protobuf, …

• Schema evolution & multi-version schemas

• PulsarCatalog
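Multi-version schemas can be modeled as an append-only list of versions per topic. This is a conceptual sketch, not Pulsar's registry API; the class and method names are illustrative:

```python
# Conceptual sketch of a multi-version schema registry. Each registration
# appends a new version; consumers can look up any version. Not Pulsar's API.

class SchemaRegistry:
    def __init__(self):
        self.schemas = {}  # topic -> list of schema definitions, index = version

    def register(self, topic, schema):
        versions = self.schemas.setdefault(topic, [])
        versions.append(schema)
        return len(versions) - 1       # new version number

    def get(self, topic, version=-1):  # default: latest version
        return self.schemas[topic][version]

reg = SchemaRegistry()
v0 = reg.register("jobs", {"fields": ["title"]})
v1 = reg.register("jobs", {"fields": ["title", "salary"]})  # evolved schema
print(v0, v1)               # 0 1
print(reg.get("jobs"))      # {'fields': ['title', 'salary']}
print(reg.get("jobs", v0))  # {'fields': ['title']}
```

Keeping old versions around is what lets consumers decode records written under an earlier schema after the topic's schema has evolved.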
State Backend
• BookKeeperStateBackend

• Save State as Segments to BookKeeper
Unified Data Processing
Stream: Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 (over time)
A query spans past, now, and future: parallel segment reads serve the past (Segment 1 through Segment 4), while pub/sub streaming reads serve the present and future (Segment 5, Segment 6).
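The past/now/future split can be sketched as one query plan: sealed segments are read in parallel, then the open tail is consumed as a stream. A toy model, not real Pulsar or Flink code; the segment contents are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of a unified query: sealed segments are read in parallel (past),
# while the open tail segment is consumed as a stream (now/future).

sealed = [["a", "b"], ["c"], ["d", "e"]]  # sealed segments: parallel reads
tail = iter(["f", "g"])                   # open segment: streaming read

with ThreadPoolExecutor() as pool:
    # pool.map preserves segment order, so results stay in stream order.
    past = [e for seg in pool.map(list, sealed) for e in seg]
live = list(tail)
print(past + live)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```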
Community

✓ Twitter: @apache_pulsar

✓ WeChat subscription: ApachePulsar

✓ Mailing lists: dev@pulsar.apache.org, users@pulsar.apache.org

✓ Slack: https://apache-pulsar.slack.com

✓ Localization: https://crowdin.com/project/apache-pulsar

✓ GitHub: https://github.com/apache/pulsar, https://github.com/apache/bookkeeper
