Data Platform at Twitter
Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan | Twitter | @krishnansriram
Hadoop Innovation Summit, Feb 12, 2015
Who am I?
• Engineering Manager, Data Platform at Twitter
• Formerly:
  • Tech Lead, Big Data Platform at Netflix
  • Group Lead & Senior Researcher at the San Diego Supercomputer Center
Twitter Scale (Feb 2015)
• More than 288M monthly active users
• More than 500M unique monthly logged-out users
• 500M tweets per day
• 80% mobile users
• Thousands of advertisers
Twitter Data Platform
• Enables use of large-scale resources to perform data analytics at scale
  • trending tweets
  • ad impressions
  • most clicked, most followed, most retweeted
  • platform health & statistics
  • experimentation
In real-time!
Use Case: Counting Impressions
Data Streams
• (user_id, tweet, timestamp, hashtags, …)
• (user_id, tweet, device_id, …)
[Diagram: data streams → event queue → Storm topology → real-time results; data streams → logs → Hadoop job → batch results]
Storm
• Streaming compute framework
• Jobs are represented as topologies
• Tuples are processed as they arrive
• Data comes in from Spouts
• Data is processed in Bolts
http://storm.apache.org
Hadoop
http://www.glennklockwood.com/di/hadoop-overview.php
• Batch MapReduce library
• Uses the Hadoop Distributed File System (HDFS) to store and process data
• Jobs consist of Map & Reduce phases
• Synchronization barriers between each stage
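The Map and Reduce phases, and the barrier between them, can be sketched with ordinary Scala collections. This is a single-machine illustration, not Hadoop API code; `Event`, `mapPhase`, `shuffle`, and `reducePhase` are hypothetical names chosen for the example.

```scala
object MapReduceSketch {
  // Hypothetical input record: one impression of a tweet.
  final case class Event(tweetId: Long)

  // Map phase: emit a (key, value) pair for each input record.
  def mapPhase(events: Seq[Event]): Seq[(Long, Long)] =
    events.map(e => (e.tweetId, 1L))

  // Barrier: all map output is grouped by key before any reducer runs.
  def shuffle(pairs: Seq[(Long, Long)]): Map[Long, Seq[Long]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: combine the values collected for each key.
  def reducePhase(grouped: Map[Long, Seq[Long]]): Map[Long, Long] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  // The whole job: map, then barrier, then reduce.
  def run(events: Seq[Event]): Map[Long, Long] =
    reducePhase(shuffle(mapPhase(events)))
}
```

The reduce step cannot start for a key until the shuffle has seen every pair for that key, which is why a batch job of this shape cannot report up-to-the-moment counts.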
Challenges
• Hadoop and Storm present two different programming models
• Jobs are written in different languages
• Each job has to be written twice
• Optimizations are hard and specific to each job & platform
• Hard to compute complete up-to-the-moment information
Goals
• Write job once
• Portable
• Fault tolerant
• Up-to-the-moment
TSAR
• The TimeSeries AggregatoR
• https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
We are still counting impressions
aggregate {
  onKeys (
    (TweetId)
  ) produce (
    Count
  ) sinkTo (
    Vertica
  )
} fromProducer {
  ClientEventSource("client_events")
    .filter { e => isImpressionEvent(e) }
    .map { e =>
      val impr = Impression(e.tweetId)
      (e.timestamp, impr)
    }
}
The TSAR job
• Dimensions: onKeys (TweetId)
• Metrics: produce (Count)
• Data Sinks: sinkTo (Vertica)
• Data Sources: fromProducer (ClientEventSource)
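As a rough illustration of what the job above asks for, the following plain-Scala sketch counts impressions per tweet per time bucket. It does not use TSAR; `ClientEvent`, `bucketMillis`, and `countImpressions` are hypothetical, the event schema is invented for the example, and the hourly bucket is an assumption.

```scala
object ImpressionCountSketch {
  // Hypothetical event shape; the real client event schema is not shown in the slides.
  final case class ClientEvent(tweetId: Long, timestamp: Long, eventType: String)

  // Assumed definition of an impression event, for illustration only.
  def isImpressionEvent(e: ClientEvent): Boolean = e.eventType == "impression"

  // Assumed bucket size: one hour in milliseconds.
  val bucketMillis: Long = 60L * 60L * 1000L

  // Dimension: tweetId. Metric: count. Result key: (bucket start, tweetId).
  def countImpressions(events: Seq[ClientEvent]): Map[(Long, Long), Long] =
    events
      .filter(isImpressionEvent)
      .map(e => ((e.timestamp / bucketMillis) * bucketMillis, e.tweetId))
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size.toLong) }
}
```

In the deck's setup, the declarative job replaces hand-written logic like this for both the Storm and the Hadoop side.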
TSAR will then:
• Configure and launch jobs on Storm & Hadoop
• Create services for querying results
• Create tables and views and staging jobs
• Create alerts and observability graphs
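Behind the scenes
[Diagram: TSAR sits on Summingbird, which runs on Scalding and Storm; results are written to Vertica and Manhattan]
• Scalding: batch (Hadoop), better accuracy, worse latency
• Storm: realtime, better latency, worse accuracy
• Vertica: manual data exploration
• Manhattan: dashboards & production services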
Glossary
• Summingbird - framework for integrating batch and online MapReduce computations
• Scalding - Scala library for running batch MapReduce jobs
• Manhattan - real-time multi-tenant distributed database
What is Summingbird?
1) Model for streaming
multi-stage map-reduce
What is streaming map-reduce?
[Diagram nodes: Source, Source, Map, Map, Lookup, Service, Merge, SumByKey, Store]
• Can push one tuple through at a time to update state
• => can work on batch and real-time streams of data
One-at-a-time semantics, run the job in realtime or in batch
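A minimal sketch of why one-at-a-time semantics allow the same job to run in either mode: the per-tuple update function is defined once, then applied either to a whole batch or incrementally as tuples arrive. The names `update`, `runBatch`, and `Realtime` are hypothetical and unrelated to the Summingbird API.

```scala
object OneAtATimeSketch {
  type State = Map[String, Long]

  // The job logic, written once: push a single (key, value) tuple into the state.
  def update(state: State, tuple: (String, Long)): State = {
    val (key, value) = tuple
    state.updated(key, state.getOrElse(key, 0L) + value)
  }

  // Batch mode: fold the same update over a complete data set.
  def runBatch(tuples: Seq[(String, Long)]): State =
    tuples.foldLeft(Map.empty[String, Long])(update)

  // Realtime mode: hold the state and apply the same update per arriving tuple.
  final class Realtime {
    private var state: State = Map.empty[String, Long]
    def onTuple(tuple: (String, Long)): Unit = {
      state = update(state, tuple)
    }
    def current: State = state
  }
}
```

Both modes end in the same state for the same input, because neither depends on anything other than the single-tuple update.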
What is Summingbird?
2) Implementations to run
on Storm, Scalding
(Hadoop), Spark, etc
Portable
Optimizers live at the Summingbird layer, so the same optimizations are leveraged across platforms
What is Summingbird?
3) Systematic
implementation of the
“Lambda Architecture”
Lambda Architecture with Summingbird (http://lambda-architecture.net)
[Diagram components: Kafka, summingbird-scalding, summingbird-storm, storehaus-algebra, storehaus-memcache, storehaus-manhattan]
Fault Tolerant
Up-to-the-moment
Reduce operators are restricted to a very general class (semigroups, monoids)
Monoid
1 + 2 + 3 = 6
(1 + 2) + 3 = 3 + 3 = 6
1 + (2 + 3) = 1 + 5 = 6
Example Monoids
• min: (a min b) min c = a min (b min c)
• max: (a max b) max c = a max (b max c)
• addition: (a + b) + c = a + (b + c)
• set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
• set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
• approximate unique count (HLL), approximate counter (CMS)
Algebird - https://github.com/twitter/algebird
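To make the associativity property concrete, here is a small self-contained sketch of a monoid abstraction. It mirrors the idea behind Algebird but does not use its API; the trait, the instances, and the helpers `sum` and `sumInChunks` are hypothetical names written for this example.

```scala
object MonoidSketch {
  // A monoid: an associative `plus` with an identity element `zero`.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  val longAddition: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  val longMax: Monoid[Long] = new Monoid[Long] {
    def zero: Long = Long.MinValue
    def plus(a: Long, b: Long): Long = math.max(a, b)
  }

  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a union b
  }

  // Sequential sum: fold from the identity element.
  def sum[T](values: Seq[T])(m: Monoid[T]): T =
    values.foldLeft(m.zero)(m.plus)

  // Chunked sum: partial sums are computed independently and merged later.
  // Associativity guarantees the same answer. chunkSize must be positive.
  def sumInChunks[T](values: Seq[T], chunkSize: Int)(m: Monoid[T]): T =
    sum(values.grouped(chunkSize).map(chunk => sum(chunk)(m)).toSeq)(m)
}
```

Because `sumInChunks` agrees with `sum` for any chunk size, the chunks can be processed on different machines or at different times, which is the property the batching scheme relies on.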
Batching and associativity yield reliability
[Diagram: the log is split into Batch0, Batch1, Batch2, Batch3; each batch is processed by a Hadoop job and by realtime (RT) jobs]
• Hadoop: slow but fault tolerant; keeps a total sum (reliably)
• Realtime: noisy but fast; sums from 0 for each batch
• Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise and bounded read/write size
• Done at query time
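The query-time merge can be sketched as follows. This is an illustration under simplifying assumptions (integer batch ids, in-memory maps standing in for the batch and realtime stores); `BatchTotal`, `batchTotals`, `realtimeDeltas`, and `query` are hypothetical names, not Summingbird APIs.

```scala
object QueryTimeMergeSketch {
  // Batch store entry: the last batch Hadoop completed for a key and the
  // reliable running total through that batch.
  final case class BatchTotal(lastBatch: Int, total: Long)

  // realtimeDeltas holds per (key, batch) partial sums, each starting from zero.
  def query(
      key: String,
      batchTotals: Map[String, BatchTotal],
      realtimeDeltas: Map[(String, Int), Long],
      currentBatch: Int
  ): Long = {
    val base = batchTotals.getOrElse(key, BatchTotal(-1, 0L))
    // Only batches Hadoop has not covered yet are read from the noisy
    // realtime store, so both the noise and the read size stay bounded.
    val recent = ((base.lastBatch + 1) to currentBatch)
      .map(b => realtimeDeltas.getOrElse((key, b), 0L))
      .sum
    base.total + recent
  }
}
```

Realtime sums for batches that Hadoop has already completed are ignored, so any errors in them never reach the answer.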
All of that sums up to
Summary
• Twitter Data Platform enables use of large-scale
resources to perform data analytics at scale
• Write jobs once that are portable, reliable, and up-to-the-moment
• A systematic implementation of the lambda
architecture
• Monoids & semigroups for parallelism & performance
• Batching & associativity for reliability
Thank you!
@krishnansriram
Acknowledgements
Ekaterina Gonina
Ian O’Connell
Gabriel Gonzalez
Oscar Boykin
Twitter Data Platform Team
We are hiring!
@summingbird
https://github.com/twitter/summingbird
@scalding
https://github.com/twitter/scalding
