When Apache Pulsar meets Apache Flink
Sijie Guo (@sijieg)
Who am I
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Chair
❏ StreamNative Founder
❏ Ex-Twitter, Ex-Yahoo
❏ Interested in event streaming technologies
What is Apache Pulsar?
“Flexible Pub/Sub Messaging
Backed by durable log storage”
A brief history of Apache Pulsar
❏ 2012: Pulsar idea started
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 154 contributors, 900+ forks, 4000+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, ...
Pulsar 1.x
Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing
❏ ...
Pulsar 2.x
Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing
❏ ...
Data Processing with Apache Pulsar
Data Processing Categories
❏ Interactive
  ❏ Time critical
  ❏ Medium data size
  ❏ Rerun on failures
❏ Batch
  ❏ Huge amounts of data
  ❏ Can run on a huge cluster
  ❏ Fine-grained fault tolerance
❏ Streaming
  ❏ Long-running jobs
  ❏ Time critical
  ❏ Needs scalability as well as resilience to failures
❏ Serverless
  ❏ Simple, lightweight processing
  ❏ Processing high-velocity data
Streaming-First
Batch processing is a special case of stream processing
A Flink view on computing
Infinite segmented streams
(pub/sub + segment)
A Pulsar view on data
Flink + Pulsar = Streaming-first, unified data processing
Why Pulsar fits well with Flink
Pulsar - A cloud-native architecture
Stateless Serving
Durable Storage
Pulsar - Segment Centric Storage
❏ Topic Partition (Managed Ledger)
  ❏ The storage layer for a single topic partition
❏ Segment (Ledger)
  ❏ Single writer, append-only
  ❏ Replicated to multiple bookies (see the sketch below)
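A segment is a BookKeeper ledger, and a ledger is exactly the single-writer, append-only, replicated log described above. A minimal sketch against the BookKeeper client API (the ZooKeeper address, quorum sizes, payload, and password are illustrative, not Pulsar's internal values):

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class SegmentSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the BookKeeper cluster through its ZooKeeper ensemble.
        BookKeeper bk = new BookKeeper("localhost:2181");

        // One segment = one ledger: stored on an ensemble of 3 bookies,
        // each entry written to 3 of them and acknowledged by 2.
        LedgerHandle segment = bk.createLedger(
                3, 3, 2, BookKeeper.DigestType.CRC32, "passwd".getBytes());

        // Single writer, append-only: entries get monotonically increasing ids.
        long entryId = segment.addEntry("event-payload".getBytes());

        // Closing seals the ledger; it becomes immutable, and a new segment
        // would be opened for subsequent writes to the partition.
        segment.close();
        bk.close();
    }
}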
Pulsar - Infinite stream storage
Pulsar - Pub/Sub
Pulsar - Topic Partitions
Pulsar - Segments
Pulsar - Stream
Pulsar - Stream as a unified view on data
Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
  ❏ Read data from brokers
  ❏ Consume / Seek / Receive
  ❏ Subscription modes: Failover, Shared, Key_Shared
  ❏ Reprocess data by rewinding (seeking) the cursors
❏ Segment (Batch)
  ❏ Read data from storage (BookKeeper or tiered storage)
  ❏ Fine-grained parallelism
  ❏ Predicate pushdown (publish timestamp)
(both read paths are sketched below)
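Both read paths can be sketched with the Pulsar Java client. The Consumer below is the broker-mediated Pub/Sub path; the Reader approximates the segment-style path at the client level by scanning forward from an explicit position with no broker-side subscription state (the connector's actual batch path reads segments from BookKeeper or tiered storage directly). Topic, subscription, and service URL are illustrative:

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.SubscriptionType;

public class TwoReadPaths {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Pub/Sub path: the broker tracks a durable cursor per subscription.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("my-sub")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();
        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);
        consumer.seek(MessageId.earliest); // rewind the cursor to reprocess

        // Segment-style path: scan from an explicit position, no broker-side state.
        Reader<byte[]> reader = client.newReader()
                .topic("my-topic")
                .startMessageId(MessageId.earliest)
                .create();
        while (reader.hasMessageAvailable()) {
            Message<byte[]> m = reader.readNext();
        }
        client.close();
    }
}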
Unified data processing on Pulsar
When Pulsar Meets Flink
Flink Integration
❏ Available Connectors
❏ Streaming Source
❏ Streaming Sink
❏ Table Sink
❏ Flink 1.6.0
When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html
Flink 1.9 Integration
❏ Pulsar Schema Integration
❏ Table API as a first-class citizen
❏ Exactly-once source
❏ At-least-once sink
Pulsar Schema (1)
❏ Consensus on the data format, enforced at the server side
❏ Built-in schema registry
❏ Data schema on a per-topic basis
❏ Send and receive typed messages directly
❏ Validation
❏ Multi-version
❏ Schema evolution & compatibility
Pulsar Schema (2)
// Create a producer with a struct (Avro) schema and send typed messages
// (the topic name "users" is illustrative)
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
        .topic("users")
        .create();
producer.newMessage()
        .value(User.builder()
                .userName("pulsar-user")
                .userId(1L)
                .build())
        .send();

// Create a consumer with the same schema and receive typed messages
// (consumers are created with subscribe(), not create())
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
        .topic("users")
        .subscriptionName("user-sub")
        .subscribe();
User user = consumer.receive().getValue();
Pulsar Schema (3) - SchemaInfo
{
  "type": "JSON",
  "schema": {
    "type": "record",
    "name": "User",
    "namespace": "com.foo",
    "fields": [
      { "name": "file1", "type": ["null", "string"], "default": null },
      { "name": "file2", "type": "string", "default": null },
      { "name": "file3", "type": ["null", "string"], "default": "dfdf" }
    ]
  },
  "properties": {}
}
Pulsar Schema (4) - Producer
Pulsar Schema (5) - Consumer
Pulsar Schema (6) - Compatibility Strategy
Pulsar Schema (7) - Multi versions
Pulsar-Flink (1) - Schema <-> Row
https://github.com/streamnative/pulsar-flink
❏ Topics without schema or with primitive schemas
  ❏ `value` field for the message payload
❏ Topics with struct schemas (AVRO, JSON)
  ❏ Field names and types are kept in the row
❏ Metadata fields (see the query sketch below)
  ❏ __key: Binary
  ❏ __topic: String
  ❏ __messageId: Binary
  ❏ __publishTime: Timestamp
  ❏ __eventTime: Timestamp
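Assuming a Pulsar topic carrying the Avro `User` schema from earlier has already been registered as a table named `users` (the table name and the prior registration are assumptions for illustration), the payload fields and the metadata columns can be queried side by side:

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class MetadataColumns {
    // Sketch only: `tableEnv` is assumed to already contain a Pulsar-backed
    // table named "users" created through the pulsar-flink connector.
    static Table selectWithMetadata(StreamTableEnvironment tableEnv) {
        // Payload fields (userName, userId) next to connector metadata columns.
        return tableEnv.sqlQuery(
                "SELECT userName, userId, `__topic`, `__publishTime` FROM users");
    }
}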
Pulsar-Flink (2) - Schema Examples
Primitive Schema / Avro Schema
https://github.com/streamnative/pulsar-flink
Pulsar-Flink (3) - Pulsar Source
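The original slide shows the source setup as an image; as a stand-in, here is a hypothetical sketch of wiring a Pulsar source into a Flink 1.9 job. `FlinkPulsarSource`, its package, and its constructor arguments are assumptions about the streamnative/pulsar-flink API and may differ from the actual connector; the URLs and topic are illustrative:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSource; // assumed package
import org.apache.flink.types.Row;

public class PulsarSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("topic", "test-source-topic"); // illustrative topic

        // Hypothetical constructor: (service URL, admin URL, reader properties).
        FlinkPulsarSource<Row> source = new FlinkPulsarSource<>(
                "pulsar://localhost:6650", "http://localhost:8080", props);

        DataStream<Row> stream = env.addSource(source);
        stream.print();
        env.execute("pulsar-source-sketch");
    }
}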
Pulsar-Flink (4) - Streaming Tables
Pulsar-Flink (5) - Topic Partition Discovery
❏ Find matching topics
❏ Fetch the schema for each topic
❏ Build a schema-specific deserializer
❏ Each reader is responsible for one topic partition
❏ Each source task runs a partition discovery task that checks for newly added partitions (see the sketch below)
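A hedged sketch of what such a discovery task could look like with the Pulsar admin client; this illustrates the idea rather than the connector's actual implementation (the `<topic>-partition-<n>` naming is Pulsar's standard partition convention):

import java.util.HashSet;
import java.util.Set;

import org.apache.pulsar.client.admin.PulsarAdmin;

public class PartitionDiscovery {
    private final Set<String> knownPartitions = new HashSet<>();
    private final PulsarAdmin admin;
    private final String topic;

    PartitionDiscovery(PulsarAdmin admin, String topic) {
        this.admin = admin;
        this.topic = topic;
    }

    // Called periodically by the source task; returns partitions that have not
    // been seen before, so the task can start a reader for each of them.
    Set<String> discoverNewPartitions() throws Exception {
        int partitions = admin.topics()
                .getPartitionedTopicMetadata(topic).partitions;
        Set<String> added = new HashSet<>();
        for (int i = 0; i < partitions; i++) {
            String partition = topic + "-partition-" + i;
            if (knownPartitions.add(partition)) {
                added.add(partition);
            }
        }
        return added;
    }
}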
Pulsar-Flink (6) - Exactly-once Source
❏ Message order on a per-partition basis
❏ Seek & read
❏ Checkpoints with MessageId
  ❏ A durable cursor keeps un-checkpointed messages alive
  ❏ The cursor is moved when a checkpoint completes (see the sketch below)
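A hedged sketch of the cursor-movement idea using Flink's CheckpointListener callback and Pulsar's cumulative acknowledgment; it illustrates the mechanism, not the connector's actual code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;

public class CursorCommitter implements CheckpointListener {
    private final Consumer<?> consumer;
    // MessageId snapshotted by each pending checkpoint, keyed by checkpoint id.
    private final Map<Long, MessageId> pending = new ConcurrentHashMap<>();

    public CursorCommitter(Consumer<?> consumer) {
        this.consumer = consumer;
    }

    // Invoked from the source's snapshotState(): remember the position that
    // this checkpoint covers, but do NOT move the cursor yet.
    public void onSnapshot(long checkpointId, MessageId position) {
        pending.put(checkpointId, position);
    }

    // Only once Flink confirms the checkpoint is durable do we advance the
    // durable cursor; the broker may then release everything before it.
    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        MessageId position = pending.remove(checkpointId);
        if (position != null) {
            consumer.acknowledgeCumulative(position);
        }
    }
}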
Pulsar-Flink (7) - Pulsar Sink
Pulsar-Flink (8) - Write to streaming tables
Future directions
❏ Unified Source API for both batch and streaming execution
  ❏ FLIP-27
❏ Pulsar as a catalog
❏ Pulsar as a state backend
❏ Scale-out source parallelism
  ❏ Key_Shared & sticky consumer
❏ End-to-end exactly-once
  ❏ Pulsar transactions in 2.5.0
Key_Shared Subscription
❏ Key-based ordering
  ❏ The key can be the message key or a separate *order* key
❏ Hash-ring based routing
❏ Key-based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
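A minimal example of attaching a consumer to a Key_Shared subscription with the Pulsar Java client (service URL, topic, and subscription name are illustrative):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Several consumers can attach to the same Key_Shared subscription;
        // the broker hashes each key to exactly one of them, so per-key
        // ordering is preserved while throughput scales out.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("events")
                .subscriptionName("key-shared-sub")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        consumer.receive();
        client.close();
    }
}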
Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
  ❏ Multi-layered architecture
  ❏ Segment-centric storage
  ❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Apache Flink provides a unified view of computing
❏ Pulsar + Flink for streaming-first, unified data processing
Unified Data Processing
Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists: dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub: https://github.com/apache/pulsar
❏ Medium: https://medium.com/streamnative
Thanks!