Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale

Sriram Krishnan | Twitter | @krishnansriram
Hadoop Innovation Summit, Feb 12, 2015

Who am I?
• Engineering Manager, Data Platform at Twitter
• Formerly:
  • Tech Lead, Big Data Platform at Netflix
  • Group Lead & Senior Researcher at the San Diego Supercomputer Center

Twitter Scale (Feb 2015)
• More than 288M monthly active users
• More than 500M unique monthly logged-out users
• 500M tweets per day
• 80% mobile users
• Thousands of advertisers

Twitter Data Platform
• Enables use of large-scale resources to perform data analytics at scale
  • trending tweets
  • ad impressions
  • most clicked, most followed, most retweeted
  • platform health & statistics
  • experimentation

In real-time!

Use Case: Counting Impressions

Data Streams
• Events: (user_id, tweet, timestamp, hashtags, …) and (user_id, tweet, device_id, …)
• Event queue → Storm topology → Real-time results
• Logs → Hadoop job → Batch results

Storm
• Streaming compute framework
• Jobs represented as topologies
• Processes tuples as they come
• Data comes in from Spouts
• Data is processed in Bolts
http://storm.apache.org

Hadoop
• Batch MapReduce library
• Uses the Hadoop Distributed File System (HDFS) to store and process data
• Jobs consist of Map & Reduce phases
• Synchronization barriers between each stage
http://www.glennklockwood.com/di/hadoop-overview.php
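The Map and Reduce phases, and the barrier between them, can be sketched with plain Scala collections. This is only an illustration of the programming model, not Hadoop's API: the `Event` type, its fields, and the `"impression"` label are assumptions made for this example.

```scala
// Hypothetical event type; the slides only show tuples such as (user_id, tweet, ...).
final case class Event(userId: Long, tweetId: Long, kind: String)

object MapReduceSketch {
  // Map phase: emit a (tweetId, 1) pair for every impression event.
  def mapPhase(log: Seq[Event]): Seq[(Long, Long)] =
    log.filter(_.kind == "impression").map(e => (e.tweetId, 1L))

  // Shuffle: group pairs by key. This models the synchronization barrier:
  // the reduce function cannot run until all map output is available.
  def shuffle(pairs: Seq[(Long, Long)]): Map[Long, Seq[Long]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: sum the values collected for each key.
  def reducePhase(grouped: Map[Long, Seq[Long]]): Map[Long, Long] =
    grouped.map { case (k, vs) => k -> vs.sum }

  def countImpressions(log: Seq[Event]): Map[Long, Long] =
    reducePhase(shuffle(mapPhase(log)))

  def main(args: Array[String]): Unit = {
    val log = Seq(
      Event(1L, 100L, "impression"),
      Event(2L, 100L, "impression"),
      Event(1L, 200L, "click"),
      Event(3L, 200L, "impression")
    )
    println(countImpressions(log))
  }
}
```

Each phase is a pure function over the whole dataset, which is why results only appear once the entire batch has been processed.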

Challenges
• Hadoop and Storm present two different programming models
• Written in different languages
• Each job has to be written twice
• Hard to optimize: optimizations are specific to each job & platform
• Hard to compute complete up-to-the-moment information

Goals
• Write job once!
• Portable!
• Fault tolerant!
• Up-to-the-moment

TSAR
• The TimeSeries AggregatoR
• https://blog.twitter.com/2014/tsar-a-timeseries-aggregator

We are still counting impressions

The TSAR job

aggregate {
  onKeys (
    (TweetId)              // Dimensions
  ) produce (
    Count                  // Metrics
  ) sinkTo (
    Vertica                // Data Sinks
  )
} fromProducer {           // Data Sources
  ClientEventSource("client_events")
    .filter { e => isImpressionEvent(e) }
    .map { e =>
      val impr = Impression(e.tweetId)
      (e.timestamp, impr)
    }
}
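To make the job above concrete, here is a plain-Scala sketch of the aggregation it describes: filter impression events, pair each with its timestamp, and count per tweet within a time bucket. It does not use TSAR itself. `ClientEvent`, `Impression`, `isImpressionEvent`, and the one-hour bucket size are hypothetical stand-ins chosen for this example.

```scala
// Hypothetical stand-ins for the types referenced by the TSAR job.
final case class ClientEvent(tweetId: Long, timestamp: Long, name: String)
final case class Impression(tweetId: Long)

object TsarSketch {
  // Assumed bucket size (one hour in milliseconds); not taken from the slides.
  val HourMs: Long = 60L * 60L * 1000L

  def isImpressionEvent(e: ClientEvent): Boolean = e.name == "impression"

  // Mirrors fromProducer: keep impression events, pair each with its timestamp.
  def producer(events: Seq[ClientEvent]): Seq[(Long, Impression)] =
    events.filter(isImpressionEvent).map(e => (e.timestamp, Impression(e.tweetId)))

  // Mirrors onKeys(TweetId) produce(Count): a count per (hour bucket, tweet id).
  def aggregate(events: Seq[ClientEvent]): Map[(Long, Long), Long] =
    producer(events)
      .groupBy { case (ts, impr) => (ts / HourMs, impr.tweetId) }
      .map { case (key, rows) => key -> rows.size.toLong }
}
```

The result is a small time series per tweet, which is the shape of data a sink such as Vertica would then store.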

TSAR will then:
• Configure and launch jobs on Storm & Hadoop
• Create services for querying results
• Create tables and views and staging jobs
• Create alerts and observability graphs

Behind the scenes
• TSAR runs on Summingbird
• Summingbird runs on Scalding and Storm
  • Scalding: batch (Hadoop); better accuracy, worse latency
  • Storm: realtime; better latency, worse accuracy
• Vertica: manual data exploration
• Manhattan: dashboards & production services

Glossary
• Summingbird - framework for integrating batch and online MapReduce computations
• Scalding - Scala library for running batch MapReduce jobs
• Manhattan - real-time multi-tenant distributed database

What is Summingbird?
1) Model for streaming multi-stage map-reduce

What is streaming map-reduce?
[Diagram: job graph with nodes Source, Source, Map, Map, Lookup, Service, Merge, SumByKey, Store]
• Can push one tuple through at a time to update state
• => can work on batch and real-time streams of data

One-at-a-time semantics: run the job in realtime or in batch
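A small sketch of why one-at-a-time semantics allow the same job to run in either mode: pushing tuples through a state update one by one gives the same store as grouping and summing the whole dataset. This is plain Scala rather than Summingbird's API, and the `Store` alias and function names are invented for the example.

```scala
object OneAtATime {
  type Store = Map[String, Long]

  // Push a single (key, value) tuple through: read the old state, add, write back.
  def update(store: Store, tuple: (String, Long)): Store = {
    val (key, value) = tuple
    store.updated(key, store.getOrElse(key, 0L) + value)
  }

  // "Realtime" mode: tuples arrive one by one from an iterator.
  def runStreaming(tuples: Iterator[(String, Long)]): Store =
    tuples.foldLeft(Map.empty[String, Long])(update)

  // "Batch" mode: the whole dataset is available, so group first, then sum.
  def runBatch(tuples: Seq[(String, Long)]): Store =
    tuples.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

Both functions express the same logical job (sum by key); only the execution strategy differs.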

2) Implementations to run on Storm, Scalding (Hadoop), Spark, etc.
Portable

Optimizers sit at the Summingbird layer, so every platform can leverage them

3) Systematic implementation of the “Lambda Architecture”

Lambda Architecture with Summingbird
[Diagram components: Kafka, summingbird-scalding, summingbird-storm, storehaus-memcache, storehaus-algebra, storehaus-manhattan]
http://lambda-architecture.net

3) Systematic
Fault Tolerant
Up-to-the-moment

Restricting reduce operators to a very general class (semigroups, monoids)

Example Monoids
• min: (a min b) min c = a min (b min c)
• max: (a max b) max c = a max (b max c)
• addition: (a + b) + c = a + (b + c)
• set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
• set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
• approximate unique count (HyperLogLog, HLL), approximate counter (Count-Min Sketch, CMS)
Algebird - https://github.com/twitter/algebird
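The associativity laws above are what allow partial sums to be computed independently and combined later. The following is a minimal sketch with a hand-rolled `Monoid` trait, written for this example and much simpler than what Algebird provides; the instance and helper names are assumptions.

```scala
// A minimal monoid type class: an associative plus with an identity element.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object MonoidSketch {
  val longAdd: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  val longMax: Monoid[Long] = new Monoid[Long] {
    val zero: Long = Long.MinValue
    def plus(a: Long, b: Long): Long = math.max(a, b)
  }

  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    val zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a ++ b
  }

  // Sequential sum: fold everything in one pass.
  def sum[T](xs: Seq[T])(m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  // Partitioned sum: sum each chunk on its own, then combine the partials.
  // Associativity guarantees the same answer as the sequential sum.
  def sumPartitioned[T](xs: Seq[T], size: Int)(m: Monoid[T]): T =
    sum(xs.grouped(size).map(p => sum(p)(m)).toSeq)(m)
}
```

Because the chunking does not affect the result, the chunks can be mappers, Storm workers, or time batches.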

Batching and associativity yield reliability
[Diagram: the Log is split into Batch0, Batch1, Batch2, Batch3; each batch has a Hadoop run and realtime (RT) runs]
• Hadoop: slow but fault tolerant; keeps a total sum (reliably)
• Realtime: noisy but fast; sums from 0, each batch

Batching and associativity yield reliability
[Diagram: the Log is split into Batch0, Batch1, Batch2, Batch3; each batch has a Hadoop run and realtime (RT) runs]
• Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
• Done at query time
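The query-time merge can be sketched as follows, assuming counts keyed by tweet id. The store layout (maps indexed by batch number) and every name here are hypothetical; the only idea taken from the slide is that the answer is Hadoop's total through batch i-1 plus the realtime partial sum for batch i.

```scala
object LambdaMerge {
  type Counts = Map[Long, Long]

  // Associative, key-wise addition of two partial results.
  def plus(a: Counts, b: Counts): Counts =
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))
    }.toMap

  // hadoopTotals(i): reliable running total through the end of batch i.
  // realtimeDeltas(i): noisy partial sum for batch i only, started from zero.
  def query(hadoopTotals: Map[Int, Counts],
            realtimeDeltas: Map[Int, Counts],
            currentBatch: Int): Counts =
    plus(
      hadoopTotals.getOrElse(currentBatch - 1, Map.empty[Long, Long]),
      realtimeDeltas.getOrElse(currentBatch, Map.empty[Long, Long])
    )
}
```

Only one batch of realtime data ever contributes to an answer, which is why the noise and the read size stay bounded.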

Summary
• Twitter Data Platform enables use of large-scale resources to perform data analytics at scale
• Write jobs once that are portable, reliable, and up-to-the-moment
• A systematic implementation of the lambda architecture
• Monoids & semigroups for parallelism & performance
• Batching & associativity for reliability

Thank you!
@krishnansriram
Acknowledgements
Ekaterina Gonina
Ian O’Connell
Gabriel Gonzalez
Oscar Boykin
Twitter Data Platform Team
We are hiring!
@summingbird
https://github.com/twitter/summingbird
@scalding
https://github.com/twitter/scalding
