Continuous Stream Processing
with Pulsar and Swim
Simon Crosby,
CTO
Swim
swim.ai
SwimOS is an Apache 2.0 licensed platform that makes
it easy to build applications that deliver continuous
intelligence from streaming data, at scale
swimos.org
PM2.5 Pollution
NOx Pollution
• Apache Pulsar
• Apache Kafka
• Apache Beam
• CNCF NATS
• Amazon Kinesis
• Google pub/sub
• Azure Enterprise Data Bus
• Salesforce Kafka
• Confluent Cloud
• …
Streaming Platforms
➢ SwimOS is a stream processor that delivers continuous intelligence
from streaming data
• Support pub/sub at scale
• Buffer data between pubs & subs
• Event-time ordered delivery
• Events stored in arrival order
• Don’t run applications
• Stream processors subscribe to a broker to analyze
streaming event data
• Their insights can be asynchronously consumed by
publishing back to the broker
• The broker offers a low-latency API that gives the stream
processor events in real-time
• Pulsar does not control execution of the stream processor
Stream Processors
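The broker/processor split described on this slide can be sketched in a few lines. This is a toy, in-memory illustration only: `InMemoryBroker`, `OverheatProcessor`, the topic names `events` and `insights`, the `truck-1`/`truck-2` sources and the threshold of 250 are all hypothetical, and none of it is the Pulsar client API or the SwimOS API.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a pub/sub broker (hypothetical; not the Pulsar client API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # The broker only delivers events. It does not run or schedule the processor.
        for callback in list(self._subscribers[topic]):
            callback(event)


class OverheatProcessor:
    """Stream processor: subscribes to events, publishes insights back to the broker."""

    def __init__(self, broker, limit):
        self.broker = broker
        self.limit = limit
        broker.subscribe("events", self.on_event)

    def on_event(self, event):
        if event["engine_temp"] > self.limit:
            insight = {"source": event["source"], "alert": "overheat"}
            self.broker.publish("insights", insight)


broker = InMemoryBroker()
insights = []
broker.subscribe("insights", insights.append)  # any consumer can read the insights
OverheatProcessor(broker, limit=250)

broker.publish("events", {"source": "truck-1", "engine_temp": 290})
broker.publish("events", {"source": "truck-2", "engine_temp": 180})
print(insights)
```

The processor decides when and how to compute; the broker is only the delivery fabric in both directions.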
SwimOS is a Stateful, Real-time Stream Processor
• Builds and auto-scales apps from real-world event data, creating a
stateful graph that continuously computes – driven by data
• Automates infrastructure operation
• Load balances, secures, persists and auto-scales the application
• Apps are easy to develop
• Delivers unimaginable performance
Application: Distributed, stateful, concurrent graph of
Web Agents & real-time UIs
Infra: Distributed, p2p mesh of instances on k8s using
WebSockets
Major Mobile Provider
• > 150M devices
• > 10Gb/s of streaming data from Pulsar
• Continuous analysis, aggregation & reduction
• Millisecond latency
• Pervasively real-time UI
• Distributed across availability zones (AZs)
Pulsar’s Many Pros
• Event Processing
– Filtering
– Transformation
– Counts / Windows
– Alerts
• Serverless is a great abstraction
• SQL-style API
• Storage tiering
• Delivery guarantees
• Multi-tenancy
• Replication
• Scaling
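As a rough illustration of the four event-processing steps listed above (filtering, transformation, counts / windows, alerts), here is a minimal sketch in plain Python. It does not use Pulsar Functions or any Pulsar API; the function name `process`, the window size of 3 and the alert threshold of 250 are assumptions made for the example.

```python
from collections import deque


def process(readings, window=3, alert_above=250.0):
    """Filter, transform, window and alert over (field, raw_value) pairs."""
    recent = deque(maxlen=window)  # count-based sliding window
    alerts = []
    for field, raw in readings:
        if field != "engine_temp":   # filtering: keep one field of interest
            continue
        value = float(raw)           # transformation: normalize text to a number
        recent.append(value)         # windowing: only the last `window` values remain
        mean = sum(recent) / len(recent)
        if len(recent) == window and mean > alert_above:  # alerting on a full window
            alerts.append(round(mean, 1))
    return alerts


readings = [
    ("engine_temp", "240"), ("fan_temp", "188"), ("engine_temp", "260"),
    ("engine_temp", "290"), ("coolant_vol", "25"), ("engine_temp", "300"),
]
print(process(readings))  # [263.3, 283.3]
```

Each step here is stateless or holds only a small window, which is the kind of per-event work that fits well inside the broker.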
Database
• How many topics do you need?
Challenges…
engine_temp: 290
fan_temp: 188
coolant_vol: 25
Challenges…
Application
Client
Client
Client
Client
Client
• Databases don’t drive computation!
(though in-memory is faster)
• What DB architecture do you need?
• Scaling / clustering / consistency …
Streaming analytics (#solved !)
☞ polling → not real-time
engine_temp: 290
fan_temp: 188
coolant_vol: 25
Continuous Intelligence demands
• Data driven computation
• Analysis in context, everywhere, concurrently
• Stateful, in-memory, distributed
• Pervasively real-time computation
Challenges…
Palo Alto, CA
60 TB/day
~600 TB/day (mostly ephemeral)
There’s more data than your cloud could store
Intelligence is driven by (a flow of) state changes
- not raw data
Users Want Stateful, Continuous, Contextual Analysis
Streams are a sequence of state changes
They never stop… (so “store-then-analyze” is silly)
“Meaning” depends on granular contextual
relationships
Applications always have to have an answer
Introducing Swim Web Agents
• SwimOS subscribes to event streams from real-world sources
• It creates a stateful, concurrent web agent for each data source
• Each web agent cleans, labels, analyzes data from its real-world twin
• Agents dynamically link to related agents, creating a stateful in-memory graph
– Containment, proximity… logical relationships, e.g. pod/cluster …
– Computed relationships: correlated…
• Linked web agents share their states in real-time
• Web Agents are vertices in the graph
• Each continually computes on its own state & state of its links, as data flows
over the graph – and streams its results in real-time over its links
• This is data-driven, stateful, continuous computation
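The bullets above can be sketched as a toy model: one agent per data source, created on first sight of that source, linked to a related agent that computes on the state of its links. SwimOS itself is programmed in Java; the plain-Python classes `WebAgent` and `ClusterAgent`, the `route` helper, the `/pod/…` and `/cluster/a` addresses and the "max temperature" aggregate below are all hypothetical and only illustrate the idea.

```python
class WebAgent:
    """Toy stateful agent (hypothetical; not the SwimOS Java API)."""

    def __init__(self, uri):
        self.uri = uri
        self.state = None
        self.links = []  # agents that receive this agent's state in real time

    def link(self, other):
        self.links.append(other)

    def set_state(self, value):
        self.state = value
        for agent in self.links:  # stream the new state over the links
            agent.on_linked_change(self)

    def on_linked_change(self, source):
        pass  # leaf agents ignore their neighbours in this sketch


class ClusterAgent(WebAgent):
    """Computes on the state of its linked members (here: the hottest pod)."""

    def __init__(self, uri):
        super().__init__(uri)
        self.member_states = {}

    def on_linked_change(self, source):
        self.member_states[source.uri] = source.state
        self.set_state(max(self.member_states.values()))


cluster = ClusterAgent("/cluster/a")
agents = {}


def route(uri, value):
    """Create an agent the first time a source appears, then deliver the event."""
    if uri not in agents:
        agents[uri] = WebAgent(uri)
        agents[uri].link(cluster)  # containment relationship: pod -> cluster
    agents[uri].set_state(value)


for uri, temp in [("/pod/1", 70), ("/pod/2", 95), ("/pod/1", 80)]:
    route(uri, temp)

print(cluster.state, sorted(agents))  # 95 ['/pod/1', '/pod/2']
```

Nothing polls here: the arrival of an event is what creates agents and triggers computation along the links.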
Web Agents Continuously Compute - Driven by Data
MapReduce
Graph
Analytics
Learning & Prediction
Analyze data to determine state
Relational
Relational Analysis
Real-world Stateful Web Agent
• Web agents are stateful
I’m green
• Raw data can typically be discarded
I’m red
• Noisy / redundant updates are discarded
I’m still red
I’m still red
I’m green
No push
No push
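The traffic-light sequence above (green, red, still red, still red, green) can be sketched as a state-change filter: the agent keeps only its current state and pushes an update only when that state changes. `StatefulAgent` and `on_update` are hypothetical names for illustration, not the SwimOS API.

```python
class StatefulAgent:
    """Keeps current state only; redundant updates are dropped, not pushed."""

    def __init__(self):
        self.state = None
        self.pushed = []  # what downstream consumers would actually receive

    def on_update(self, value):
        if value == self.state:
            return False  # "I'm still red": no push
        self.state = value  # the raw update itself can now be discarded
        self.pushed.append(value)
        return True


light = StatefulAgent()
results = [light.on_update(c) for c in ["green", "red", "red", "red", "green"]]
print(light.pushed)  # ['green', 'red', 'green']
print(results)       # [True, True, False, False, True]
```

Five raw updates become three state changes, which is the reduction the slides describe.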
Streaming data auto-scales the application, which is
composed of concurrent web agents, at low cost, in
real time, as data arrives
A Swim Application is an Active Graph
A graph of linked active Web Agents
Web agents continuously compute on their own
state and the state of linked web agents,
enabling granular contextual analysis on the fly
① SwimOS creates a web agent
for each source in streaming data
② Agents interlink to reflect
real-world relationships
③ Powerful operators for analysis, learning &
prediction continuously compute on state
& stream results
Web Agents Link to Form a Computational Graph
E.g.: Unsupervised Training & Prediction
Back Propagation
Training
D
Predicted
Observed
• A scaled application is a graph
dynamically built from data
• Objects are stateful and
concurrent
SwimOS Eliminates “the Stack”
They continuously stream real-time
insights to UIs & applications
Web agents collaborate to
analyze, learn, predict and
respond on the fly
Swim builds a stateful, distributed
graph of concurrent web agents
that represent real-world
sources, from streaming data
*
Developer defines entities & their
relationships – as Java objects
Pulsar and Swim: Better Together
• Builds and auto-scales apps from real-world event data, creating a
stateful graph that continuously computes – driven by data
• Automates infrastructure operation
• Load balances, secures, persists and auto-scales the application
• Apps are easy to develop
• Delivers unimaginable performance
Application: Distributed, stateful, concurrent graph of
Web Agents & real-time UIs
Infra: Distributed, p2p mesh of instances on k8s using
WebSockets
Questions ?
swim.ai
Fabric
Pulsar Broker
Events
Insights
Mesh of SwimOS
Instances
Distributed graph
of web agents
Compute continuously
as data flows over the
graph
Web agent
address space
Clustered Stream Processor Operation

Easily Build a Smart Pulsar Stream Processor_Simon Crosby
