The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters of several thousand nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of three tools: Scalding, a Scala DSL for batch jobs using MapReduce; Summingbird, a framework for combined real-time and batch processing; and Tsar, a framework for real-time time-series aggregations.
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
1. Data Platform at Twitter
Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan | Twitter | @krishnansriram
Hadoop Innovation Summit, Feb 12, 2015
2. Who am I?
• Engineering Manager, Data Platform at Twitter
• Formerly:
• Tech Lead, Big Data Platform at Netflix
• Group Lead & Senior Researcher at the San
Diego Supercomputer Center
3. Twitter Scale (Feb 2015)
• More than 288M monthly
active users
• More than 500M unique
monthly logged out users
• 500M tweets per day
• 80% mobile users
• Thousands of advertisers
4. Twitter Data Platform
• Enables use of large-scale resources to perform data
analytics at scale
• trending tweets
• ad impressions
• most clicked, most followed, most retweeted
• platform health & statistics
• experimentation
5. Twitter Data Platform
• Enables use of large-scale resources to perform data
analytics at scale
• trending tweets
• ad impressions
• most clicked, most followed, most retweeted
• platform health & statistics
• experimentation
In real-time!
8. Storm
• Streaming compute framework
• Jobs represented as
topologies
• Process tuples as they come
• Data comes in from Spouts
• Data is computed on in Bolts
http://storm.apache.org
10. Challenges
• Hadoop and Storm present two different
programming models
• Written in different languages
• Each job has to be written twice
• Optimizations are hard and specific to each job & platform
• Hard to compute complete up-to-the-moment
information
17. The TSAR job
aggregate {
  onKeys (
    (TweetId)                // Dimensions
  ) produce (
    Count                    // Metrics
  ) sinkTo (
    Vertica                  // Data Sinks
  )
} fromProducer {             // Data Sources
  ClientEventSource("client_events")
    .filter { e => isImpressionEvent(e) }
    .map { e =>
      val impr = Impression(e.tweetId)
      (e.timestamp, impr)
    }
}
18. TSAR will then:
• Configure and launch jobs on Storm & Hadoop
• Create services for querying results
• Create tables and views and staging jobs
• Create alerts and observability graphs
24. What is streaming map-reduce?
[Diagram: a dataflow graph with nodes Source, Source, Map, Map, Lookup, Merge, SumByKey, Store, and Service]
• Can push one tuple through at a time to update state
• => can work on batch and real-time streams of data
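The idea on this slide can be sketched in plain Scala. This is a minimal illustration, not Summingbird's API: the `Event` type, the `"impression"` kind, and all function names are hypothetical. It only shows that a per-tuple "map, then sum by key" step can be folded over a finite batch or over a stream with the same result.

```scala
object StreamingMapReduceSketch {
  // Hypothetical event type, for illustration only.
  final case class Event(tweetId: Long, kind: String)

  // "Map" step: turn each event into zero or more (key, value) pairs.
  def mapStep(e: Event): Seq[(Long, Long)] =
    if (e.kind == "impression") Seq(e.tweetId -> 1L) else Seq.empty

  // "SumByKey" step: merge one (key, value) pair into the store.
  def sumByKey(store: Map[Long, Long], kv: (Long, Long)): Map[Long, Long] = {
    val (k, v) = kv
    store.updated(k, store.getOrElse(k, 0L) + v)
  }

  // Push a single tuple through the pipeline and return the updated state.
  def push(store: Map[Long, Long], e: Event): Map[Long, Long] =
    mapStep(e).foldLeft(store)(sumByKey)

  // Batch mode: fold the per-tuple step over a finite collection.
  def runBatch(events: Seq[Event]): Map[Long, Long] =
    events.foldLeft(Map.empty[Long, Long])(push)

  // Streaming mode: the same step, consuming tuples one at a time.
  def runStream(events: Iterator[Event]): Map[Long, Long] =
    events.foldLeft(Map.empty[Long, Long])(push)
}
```

Because both modes reuse `push`, the job logic is written once; only the way tuples are delivered differs.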
37. Example Monoids
• (a min b) min c = a min (b min c)
• (a max b) max c = a max (b max c)
• addition: (a + b) + c = a + (b + c)
• set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
• set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
• approximate unique count (HLL), approximate
counter (CMS)
Algebird - https://github.com/twitter/algebird
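A few of the examples above can be written as a small, self-contained Scala sketch. This is not Algebird's API: the `Monoid` trait and every name below are defined here for illustration. It also assumes `Int.MinValue` and `Int.MaxValue` as identity elements for max and min; without an identity these operations are only semigroups. The `sumInChunks` helper mimics parallel workers by summing chunks independently and then combining the partial sums, which is safe only because `plus` is associative.

```scala
object MonoidSketch {
  // A monoid: an associative `plus` together with an identity element `zero`.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  val longAddition: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Assumption: Int.MinValue serves as the identity for max.
  val intMax: Monoid[Int] = new Monoid[Int] {
    def zero: Int = Int.MinValue
    def plus(x: Int, y: Int): Int = math.max(x, y)
  }

  // Assumption: Int.MaxValue serves as the identity for min.
  val intMin: Monoid[Int] = new Monoid[Int] {
    def zero: Int = Int.MaxValue
    def plus(x: Int, y: Int): Int = math.min(x, y)
  }

  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(x: Set[A], y: Set[A]): Set[A] = x ++ y
  }

  // Sequential sum: fold from the identity element.
  def sum[A](m: Monoid[A], xs: Seq[A]): A =
    xs.foldLeft(m.zero)(m.plus)

  // Sum each chunk on its own, then combine the partial results.
  // Associativity guarantees the same answer as the sequential sum.
  def sumInChunks[A](m: Monoid[A], xs: Seq[A], chunkSize: Int): A =
    xs.grouped(chunkSize).map(chunk => sum(m, chunk)).foldLeft(m.zero)(m.plus)
}
```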
38. Batching and associativity yield reliability
[Diagram: a timeline split into Batch0, Batch1, Batch2, Batch3; each batch has a Log, a Hadoop job, and realtime (RT) jobs]
• Hadoop: slow but fault tolerant
• Realtime: noisy but fast
• Realtime sums from 0, each batch
• Hadoop keeps a total sum (reliably)
39. Batching and associativity yield reliability
[Diagram: the same timeline of Batch0, Batch1, Batch2, Batch3, each with a Log, a Hadoop job, and realtime (RT) jobs]
• Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
• Done at query time
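The query-time merge can be sketched as follows. This is an illustration under stated assumptions, not Summingbird's implementation: `BatchStore`, `RealtimeStore`, and `query` are hypothetical names, keys are plain strings, and values are counts. The batch store holds a reliable running total through its last completed batch; the realtime store holds partial sums that restart from 0 in each batch. In the slide's case the batch store is exactly one batch behind, so the range below contains only batch i.

```scala
object QueryTimeMergeSketch {
  type Key = String

  // Hypothetical batch store: reliable totals through `lastBatch` (inclusive).
  final case class BatchStore(lastBatch: Int, totals: Map[Key, Long])

  // Hypothetical realtime store: per-batch partial sums, each starting from 0.
  final case class RealtimeStore(partials: Map[(Int, Key), Long])

  // At query time, add the reliable total to the realtime partial sums
  // for the batches that Hadoop has not yet processed.
  def query(batch: BatchStore, rt: RealtimeStore, key: Key, currentBatch: Int): Long = {
    val reliable = batch.totals.getOrElse(key, 0L)
    val recent = ((batch.lastBatch + 1) to currentBatch)
      .map(b => rt.partials.getOrElse((b, key), 0L))
      .sum
    reliable + recent
  }
}
```

Realtime partial sums for batches that Hadoop has already covered are ignored, which is why any realtime noise stays bounded to the most recent batches.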
41. Summary
• Twitter Data Platform enables use of large-scale
resources to perform data analytics at scale
• Write jobs once that are portable, reliable, and
up-to-the-moment
• A systematic implementation of the lambda
architecture
• Monoids & semi-groups for parallelism & performance
• Batching & associativity for reliability