© 2019 Ververica
Seth Wiesman
Senior Solutions Architect @ Ververica
Committer Apache Flink
Unified Data Processing with Apache Flink and Apache Pulsar
About Ververica (the company formerly known as “data Artisans”)
Original creators of Apache Flink®
Enterprise stream processing with Ververica Platform
Subsidiary of Alibaba Group
Apache Flink
Apache Flink at "Singles Day" (11/11/2019)
• 2M containers
• 985 PB data size
• 2.5 B events/sec throughput
• Sub-second latency
• 100 TB state size
Why Stream Processing?
Stream Processing is
real-time data processing
and real-time data-driven actions
Stream Processing is
the unification of real-time and
offline analytics
Stream Processing is
the intersection of data
analytics and applications
Stream Processing is
to event-driven applications what
the database is to request/response apps
Stream Processing is
a flexible and extensible architecture
for data-driven applications
Application /
Business Logic
Stream
Processor
(Datalake, Database)
Application /
Business Logic
Batch Proc. or Req/resp. Stream Processing
Stream Processing changes how Applications and Data interact
request/trigger result/response
event stream event stream
events are the data
events act as triggers
application logic triggered
by events/changes
What is Stream Processing for?
• Batch Proc. or Req/resp.: data changes slowly, query/logic changes fast (ad-hoc queries, data exploration, ML model training)
• Continuous Streaming: data changes fast, query/logic changes slowly (most business logic)
more lag time
data warehousing
OLAP / BI / reporting
continuous monitoring
(position, risk, …)
real-time ML model
training/evaluation
distributed
OLTP-style apps
more real time
continuous
ETL
real-time behavior modeling
(recommenders, pricing, ..)
The Spectrum of Streaming Data Use Cases
machine learning
model training
unified offline/
real-time analytics
real-time alerts
(fraud, security, …)
Stateful Single Record Processing
Everything is a Stream
Streams Of Records in a Log or MQ
Everything is a Stream
Stream of Requests/Responses to/from Services
Service
DB
→ event sourcing architecture
GET /a/b POST /b/c PUT /e/f 200 404 200 200 403
Everything is a Stream
Stream of Rows in a Table or in Files
2016-3-1 12:00 am
2016-3-1 1:00 am
2016-3-1 2:00 am
2016-3-11 10:00 pm
2016-3-11 11:00 pm
2016-3-12 12:00 am
2016-3-12 1:00 am
2016-3-12 2:00 am
2016-3-12 3:00 am
…
a batch: a bounded slice of this stream of rows
Everything is a Stream
Streams may span storage systems
2016-3-1 12:00 am
2016-3-1 1:00 am
2016-3-1 2:00 am
2016-3-11 10:00 pm
2016-3-11 11:00 pm
…
• more distant past (e.g., compressed files in DFS/Object Store): Parquet files
• recent past (e.g., events in MQ/Log): Avro records
Bounded and Unbounded Streams
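The idea that a batch is a bounded stream, and that one logical stream may span archived files and a live log, can be illustrated with a minimal sketch. This is plain Python, not the Flink API; `running_count`, `live_events`, and `archived` are hypothetical names chosen for illustration only.

```python
from itertools import chain, islice
from typing import Dict, Iterable, Iterator, List, Tuple


def running_count(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (key, count so far) for every event; works on any iterable."""
    counts: Dict[str, int] = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


def live_events() -> Iterator[str]:
    """Hypothetical unbounded source: never terminates on its own."""
    while True:
        yield "sensor-a"
        yield "sensor-b"


# Bounded stream: a finite list standing in for files in a DFS / object store.
archived: List[str] = ["sensor-a", "sensor-a", "sensor-b"]
batch_result = list(running_count(archived))

# One logical stream spanning both storage systems: history first, then live.
combined = running_count(chain(archived, live_events()))
first_five = list(islice(combined, 5))
```

The same `running_count` logic runs unchanged over the finite list and over the never-ending generator; only the caller decides how much of the unbounded result to consume.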
Components of a Streaming Data Architecture
• Event producers (applications, servers, databases, sensors)
• Log / Stream Storage (Pulsar)
• Stream Processing (Apache Flink)
• Results (Views) (K/V stores, databases)
• Triggered Applications
Apache Flink: Analytics and Applications on Streaming Data
• Stateful Stream Processing: Streams, State, Time
• Event-driven Applications: Stateful Functions
• Streaming Analytics: SQL and Tables
• Flink Runtime: Stateful Computations over Data Streams
Stateful Stream
Processing
Stateful Stream Processing
Computation
Computation
Computation
Computation
Source (Stream)
Source (Static)
Sink Sink
Transformation
State
State
State
Example Use Cases
•Real time search and recommendation models (e.g., Alibaba)
•Build a real-time session behavior profile of users (e.g., Netflix)
•Real time trade settlement dashboard (e.g., UBS)
•Real time revenue accounting (various AdTechs)
•Machine Learning-based anomaly/fraud detection (e.g., ING, Microsoft)
•Real-time data refinement and data pipelines (many)
DataStream API
Source
Transformation
Windowed Transformation
Sink
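import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
// Source: read raw lines from Kafka
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…))
// Transformation: parse each line into an Event
val events: DataStream[Event] = lines.map((line) => parse(line))
// Windowed transformation: aggregate per sensor over 5-second windows
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .aggregate(new MyAggregationFunction())
// Sink: write the statistics to files
stats.addSink(new RollingSink(path))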
Streaming
Dataflow
Source Transform Window
(state read/write)
Sink
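To make the keyed, windowed transformation concrete, here is a minimal sketch of what `keyBy("sensor")` followed by a 5-second tumbling window and an incremental aggregate computes. It is plain Python rather than the DataStream API; the event layout `(sensor, event_time_ms, reading)` and the choice of an average as the aggregate are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE_MS = 5_000  # mirrors Time.seconds(5) on the slide


def window_start(timestamp_ms: int, size_ms: int = WINDOW_SIZE_MS) -> int:
    """Start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def windowed_average(
    events: Iterable[Tuple[str, int, float]],
) -> List[Tuple[str, int, float]]:
    """Average reading per (sensor, 5-second window).

    State is kept per key and window as a (sum, count) pair, the way an
    incremental aggregate function would maintain it.
    """
    state: Dict[Tuple[str, int], Tuple[float, int]] = defaultdict(lambda: (0.0, 0))
    for sensor, ts, reading in events:
        key = (sensor, window_start(ts))
        total, count = state[key]
        state[key] = (total + reading, count + 1)
    # The input here is bounded, so every window is emitted at the end; a
    # streaming engine would emit a window once time passes the window end.
    return sorted((s, w, total / count) for (s, w), (total, count) in state.items())


events = [
    ("s1", 1_000, 20.0),
    ("s1", 4_999, 22.0),
    ("s2", 2_000, 30.0),
    ("s1", 5_000, 10.0),  # falls into the next window
]
stats = windowed_average(events)
```

Keeping only a `(sum, count)` pair per key and window is the design point: the state stays small no matter how many events arrive in a window.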
DataStream API Process Functions
Streaming Analytics
Example Use Cases
•Realtime Analytics Platforms (e.g., Alibaba, Uber, Lyft, Yelp!, Tencent)
•Materializing Views (dashboards, data marts)
•ETL - batch and continuous
•Machine Learning Training (Alibaba, new ML library)
SQL / Table API – Batch Queries
SQL
Query
Batch Query
Execution
SELECT
room,
TUMBLE_END(rowtime, INTERVAL '1' HOUR),
AVG(temperature)
FROM
sensors
GROUP BY
TUMBLE(rowtime, INTERVAL '1' HOUR), room
Full TPC-DS support
in Flink 1.10
Interpreting Streams as Tables
SQL / Table API – Streaming Data Case
SELECT
room,
TUMBLE_END(rowtime, INTERVAL '1' HOUR),
AVG(temperature)
FROM
sensors
GROUP BY
TUMBLE(rowtime, INTERVAL '1' HOUR), room
SQL Query
• Interpret stream as table
• Incremental query execution
• Output result changes as stream
• Update database with changes
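Incremental query execution can be sketched as follows: every input row updates a small piece of state and produces a change, and the stream of changes is upserted into an external view. This is a simplified illustration in plain Python, not Flink's Table API; it omits the hourly TUMBLE window of the query for brevity, and the names `incremental_avg` and `apply_changes` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def incremental_avg(
    rows: Iterable[Tuple[str, float]],
) -> Iterator[Tuple[str, float]]:
    """For each input row (room, temperature), emit the updated (room, avg)."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for room, temperature in rows:
        sums[room] = sums.get(room, 0.0) + temperature
        counts[room] = counts.get(room, 0) + 1
        yield room, sums[room] / counts[room]


def apply_changes(changes: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Upsert each change into a dict standing in for the external database."""
    view: Dict[str, float] = {}
    for room, avg in changes:
        view[room] = avg
    return view


sensors: List[Tuple[str, float]] = [
    ("kitchen", 20.0),
    ("lab", 18.0),
    ("kitchen", 24.0),
]
changelog = list(incremental_avg(sensors))
materialized = apply_changes(changelog)
```

The changelog carries one entry per input row, while the materialized view holds only the latest value per room.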
FLIP-72
Add Pulsar connectors and Catalog to Apache Flink
> CREATE CATALOG my_pulsar WITH (
'type' = 'pulsar',
'adminUrl' = 'localhost:9092'
);
> USE my_pulsar;
> INSERT INTO aggregations
SELECT
room,
TUMBLE_END(rowtime, INTERVAL '1' HOUR),
AVG(temperature)
FROM
sensors
GROUP BY
TUMBLE(rowtime, INTERVAL '1' HOUR), room
Materialized Views Example
log / CDC
Continuous
SQL Query
Continuous
SQL Query
Continuous
SQL Query
Materialized View
Materialized View
Archive
Materialized Views Example
log / CDC
Continuous
SQL Query Materialized Views
View Materialization
(streaming)
Dashboard:
Many short queries
(batch)
Many handy SQL features: Temporal Joins, Pattern Matching, …
SELECT tf.time,
  tf.price * rh.rate AS conv_fare
FROM taxiFare AS tf,
  LATERAL TABLE (Rates(tf.time)) AS rh
WHERE tf.currency = rh.currency;
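The temporal join in the query above pairs each fare with the rate that was valid at the fare's own time, not the latest rate. A minimal sketch of that lookup in plain Python, assuming a hypothetical in-memory rate history (`RATES_HISTORY`) with made-up values; it does not reflect how Flink stores or versions the table.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical rate history per currency: (valid_from_time, rate), sorted by time.
RATES_HISTORY: Dict[str, List[Tuple[int, float]]] = {
    "EUR": [(0, 1.5), (100, 1.25)],
    "GBP": [(0, 2.0)],
}


def rate_as_of(currency: str, time: int) -> Optional[float]:
    """Return the rate that was valid for the currency at the given time."""
    history = RATES_HISTORY.get(currency, [])
    times = [valid_from for valid_from, _ in history]
    index = bisect_right(times, time) - 1  # last version with valid_from <= time
    return history[index][1] if index >= 0 else None


def convert_fares(
    fares: List[Tuple[int, float, str]],
) -> List[Tuple[int, float]]:
    """Join each fare (time, price, currency) with the rate valid at its time."""
    result = []
    for time, price, currency in fares:
        rate = rate_as_of(currency, time)
        if rate is not None:  # inner-join semantics: drop fares without a rate
            result.append((time, price * rate))
    return result


taxi_fares = [(50, 10.0, "EUR"), (150, 10.0, "EUR"), (150, 10.0, "USD")]
converted = convert_fares(taxi_fares)
```

The two EUR fares have the same price but convert differently because each one is matched against the rate version valid at its own timestamp.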
Event-driven
Applications
Classical Tiered Application Architecture
App App App
Consistency in Database Applications
App App App
Consistency in Database Applications
App App App
For any failure in any call, it becomes
hard to reason about what effects did or did
not already happen
X
Applying the Stream Processing Approach to Applications
App App App
X
Stateful Functions
Stream Processing F-a-a-S
λ
λ
λ
λ
simplicity / generality
state management
composability
lightweight resources
performance
event-driven
Can we combine some of these properties?
Stateful Functions
f(a,b)
f(a,b)
f(a,b)
f(a,b)
f(a,b)
mass storage (S3, GCS, ECS, HDFS, …)
event ingress
event egress
f(a,b)
snapshot
state
Stateful Functions compared to Stream Proc. & Apache Flink
Apache Flink DataStream/Table
Stateful Functions
Pool of Resources (Apache Flink Cluster)
Solves two major challenges:
• Arbitrary function-to-function messaging. Not restricted to a DAG.
• Functions are multiplexed and share resources. Makes it possible to run many very small jobs.
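Arbitrary function-to-function messaging with per-function state can be sketched with a toy dispatcher. This is plain Python and not the Stateful Functions SDK; `Runtime`, `ping`, and `pong` are hypothetical names, and the single in-process mailbox stands in for the shared pool of resources.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Address = Tuple[str, str]  # (function type, instance id)


class Runtime:
    """Toy dispatcher: routes messages to functions and keeps state per address."""

    def __init__(self) -> None:
        self.functions: Dict[str, Callable] = {}
        self.state: Dict[Address, dict] = {}
        self.mailbox: Deque[Tuple[Address, int]] = deque()
        self.egress: List[Tuple[str, int]] = []

    def register(self, function_type: str, fn: Callable) -> None:
        self.functions[function_type] = fn

    def send(self, target: Address, message: int) -> None:
        self.mailbox.append((target, message))

    def run(self) -> None:
        # All function instances are multiplexed over this one loop.
        while self.mailbox:
            target, message = self.mailbox.popleft()
            state = self.state.setdefault(target, {})
            self.functions[target[0]](self, target, state, message)


def ping(rt: Runtime, me: Address, state: dict, message: int) -> None:
    state["seen"] = state.get("seen", 0) + 1
    rt.send(("pong", me[1]), message)


def pong(rt: Runtime, me: Address, state: dict, message: int) -> None:
    state["seen"] = state.get("seen", 0) + 1
    if message > 0:
        rt.send(("ping", me[1]), message - 1)  # cycle: pong messages ping again
    else:
        rt.egress.append((me[1], state["seen"]))  # event egress


runtime = Runtime()
runtime.register("ping", ping)
runtime.register("pong", pong)
runtime.send(("ping", "a"), 2)  # event ingress
runtime.run()
```

Because `pong` sends back to `ping`, the message flow forms a cycle, which a DAG-shaped dataflow could not express directly.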
Example: Ride Sharing App
Inputs: Driver status updates, Passenger ride requests, Ride status update
Functions: Driver, Ride, Passenger, Geo-index
Messages: update, create, bill, inform / book, bid, lookup, update cell
States: seeking, confirmed, riding, free, bidding, booked
Stream Processing / Streaming SQL (event/stream-centric)
• data preparation, combining knowledge/information
• filtering, enriching, aggregating, joining events
Stateful Functions (state-centric)
• coordination, (interacting) state machines
• complex event/state interactions
F-a-a-S (stateless / compute-centric)
• “occasional” actions or spiky loads
• compute-intensive or blocking
Putting it all together
FaaS
• render map/route image
• create a receipt PDF
• send email
Stateful Functions
• ride life-cycle
• driver-to-ride matching
Stream Processing
• traffic models
• demand forecast & pricing
Billing
Passenger updates
Driver position updates
Driver status updates
❤
Thank You!
