DataFu @ ApacheCon 2014
- 3. Apache DataFu
• Apache DataFu is a collection of libraries for working with
large-scale data in Hadoop.
• Currently consists of two libraries:
• DataFu Pig – a collection of Pig UDFs
• DataFu Hourglass – incremental processing
• Incubating
- 4. History
• LinkedIn had a number of teams who had developed
generally useful UDFs
• Problems:
• No centralized library
• No automated testing
• Solutions:
• Unit tests (PigUnit)
• Code coverage (Cobertura)
• Initially open-sourced in 2011; version 1.0 released in September 2013
- 5. What it’s all about
• Making it easier to work with large scale data
• Well-documented, well-tested code
• Easy to contribute
• Extensive documentation
• Getting started guide
• e.g. for DataFu Pig – it should be easy to add a UDF,
add a test, and ship it
- 6. DataFu community
• People who use Hadoop for working with data
• Used extensively at LinkedIn
• Included in Cloudera’s CDH
• Included in Apache Bigtop
- 8. DataFu Pig
• A collection of UDFs for data analysis covering:
• Statistics
• Bag Operations
• Set Operations
• Sessions
• Sampling
• General Utility
• And more..
- 9. Coalesce
• A common case: replace null values with a default
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
• Or, to return the first non-null value, nest the ternaries:
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 :
                              (val2 IS NOT NULL ? val2 :
                               (val3 IS NOT NULL ? val3 :
                                NULL))) as result;
- 10. Coalesce
• Using Coalesce to set a default of zero
• It returns the first non-null value
DEFINE Coalesce datafu.pig.util.Coalesce();
data = FOREACH data GENERATE Coalesce(val,0) as result;
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;
- 11. Compute session statistics
• Suppose we have a website, and we want to see how
long members spend browsing it
• We also want to know who are the most engaged
• Raw data is the click stream
pv = LOAD 'pageviews.csv' USING PigStorage(',')
    AS (memberId:int, time:long, url:chararray);
- 12. Compute session statistics
• First, what is a session?
• Session = sustained user activity
• Session ends after 10 minutes of no activity
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
DEFINE UnixToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
• Sessionize expects an ISO-formatted timestamp
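• The next slide orders by an isoTime field, so a conversion step along
these lines (not shown on the slide) is assumed in between:
pv = FOREACH pv GENERATE UnixToISO(time) as isoTime, time, memberId;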
- 13. Compute session statistics
• Sessionize appends a sessionId to each tuple
• All tuples in the same session get the same sessionId
pv_sessionized = FOREACH (GROUP pv BY memberId) {
    ordered = ORDER pv BY isoTime;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (isoTime, time, memberId, sessionId);
};

pv_sessionized = FOREACH pv_sessionized GENERATE
    sessionId, memberId, time;
- 14. Compute session statistics
• Statistics:
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
DEFINE VAR datafu.pig.stats.VAR();
• You can choose between streaming calculations (approximate) and
exact calculations (slower, and they require sorted input)
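• For reference, the exact counterparts are defined like this (class names
from the DataFu Pig statistics package; both require sorted input):
DEFINE Median datafu.pig.stats.Median();
DEFINE Quantile datafu.pig.stats.Quantile('0.90','0.95');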
- 15. Compute session statistics
• Compute the session length in minutes
session_times =
    FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
    GENERATE group.sessionId as sessionId,
        group.memberId as memberId,
        (MAX(pv_sessionized.time) -
         MIN(pv_sessionized.time))
            / 1000.0 / 60.0 as session_length;
- 16. Compute session statistics
• Compute the statistics
session_stats = FOREACH (GROUP session_times ALL) {
    ordered = ORDER session_times BY session_length;
    GENERATE
        AVG(ordered.session_length) as avg_session,
        SQRT(VAR(ordered.session_length)) as std_dev_session,
        Median(ordered.session_length) as median_session,
        Quantile(ordered.session_length) as quantiles_session;
};
- 17. Compute session statistics
• Find the most engaged users
long_sessions =
    FILTER session_times BY
        session_length >
            session_stats.quantiles_session.quantile_0_95;

very_engaged_users =
    DISTINCT (FOREACH long_sessions GENERATE memberId);
- 18. Pig Bags
• Pig represents collections as a bag
• In Pig Latin, the ways in which you can manipulate a bag
are limited
• Working with an inner bag (inside a nested block) can be
difficult
- 19. DataFu Pig Bags
• DataFu provides a number of operations to let you
transform bags
• AppendToBag – add a tuple to the end of a bag
• PrependToBag – add a tuple to the front of a bag
• BagConcat – combine two (or more) bags into one
• BagSplit – split one bag into multiple bags
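• For example, a minimal sketch combining two of these (the relation and
field names are illustrative):
DEFINE BagConcat datafu.pig.bags.BagConcat();
DEFINE AppendToBag datafu.pig.bags.AppendToBag();

-- concatenate two bag fields, then append a sentinel tuple to the result
result = FOREACH data GENERATE
    AppendToBag(BagConcat(bag1, bag2), TOTUPLE(-1)) as combined;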
- 20. DataFu Pig Bags
• It also provides UDFs that let you operate on bags similar
to how you might with relations
• BagGroup – group operation on a bag
• CountEach – count how many times a tuple appears
• BagLeftOuterJoin – join tuples in bags by key
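• As a sketch, BagGroup performs an in-memory group on a bag by a key
(the schema here is illustrative):
DEFINE BagGroup datafu.pig.bags.BagGroup();

data = LOAD 'input' AS (items: bag {T: tuple(k:int, v:int)});
-- group the tuples of the items bag by k, without another mapreduce job
grouped = FOREACH data GENERATE BagGroup(items, items.(k)) as grouped_items;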
- 21. Counting Events
• Let’s consider a system where a user is recommended
items of certain categories and can act to accept or reject
these recommendations
impressions = LOAD '$impressions' AS (user_id:int, item_id:int, timestamp:long);
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
- 22. Counting Events
• We want to know, for each user, how many times an item
was shown, accepted and rejected
features: {
    user_id:int,
    items: {(
        item_id:int,
        impression_count:int,
        accept_count:int,
        reject_count:int)}
}
- 23. Counting Events
• One approach…
-- First cogroup
features_grouped = COGROUP
    impressions BY (user_id, item_id),
    accepts BY (user_id, item_id),
    rejects BY (user_id, item_id);
-- Then count
features_counted = FOREACH features_grouped GENERATE
    FLATTEN(group) as (user_id, item_id),
    COUNT_STAR(impressions) as impression_count,
    COUNT_STAR(accepts) as accept_count,
    COUNT_STAR(rejects) as reject_count;
-- Then group again
features = FOREACH (GROUP features_counted BY user_id) GENERATE
    group as user_id,
    features_counted.(item_id, impression_count, accept_count, reject_count)
        as items;
- 24. Counting Events
• But it seems wasteful to have to group twice
• Even big data can get reasonably small once you start
slicing and dicing it
• Want to consider one user at a time – that should be small
enough to fit into memory
- 25. Counting Events
• Another approach: Only group once
• Bag manipulation UDFs to avoid the extra mapreduce job
DEFINE CountEach datafu.pig.bags.CountEach('flatten');
DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();
DEFINE Coalesce datafu.pig.util.Coalesce();
• CountEach – counts how many times a tuple appears in a
bag
• BagLeftOuterJoin – performs a left outer join across
multiple bags
- 26. Counting Events
• A DataFu approach…
features_grouped = COGROUP impressions BY user_id, accepts BY user_id,
    rejects BY user_id;

features_counted = FOREACH features_grouped GENERATE
    group as user_id,
    CountEach(impressions.item_id) as impressions,
    CountEach(accepts.item_id) as accepts,
    CountEach(rejects.item_id) as rejects;

features_joined = FOREACH features_counted GENERATE
    user_id,
    BagLeftOuterJoin(
        impressions, 'item_id',
        accepts, 'item_id',
        rejects, 'item_id'
    ) as items;
- 27. Counting Events
• We revisit Coalesce to supply the default values
features = FOREACH features_joined {
    projected = FOREACH items GENERATE
        impressions::item_id as item_id,
        impressions::count as impression_count,
        Coalesce(accepts::count, 0) as accept_count,
        Coalesce(rejects::count, 0) as reject_count;
    GENERATE user_id, projected as items;
};
- 28. Sampling
• Suppose we only wanted to run our script on a sample of
the previous input data
• We have a problem, because the cogroup will only produce complete
rows if the same keys (user_id) are sampled in each relation
impressions = LOAD '$impressions' AS (user_id:int, item_id:int,
    item_category:int, timestamp:long);
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
- 29. Sampling
• DataFu provides SampleByKey
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('a_salt','0.01');

impressions = FILTER impressions BY SampleByKey(user_id);
accepts = FILTER accepts BY SampleByKey(user_id);
rejects = FILTER rejects BY SampleByKey(user_id);
features = FILTER features BY SampleByKey(user_id);
- 30. Left outer joins
• Suppose we had three relations:
• And we wanted to do a left outer join on all three:
input1 = LOAD 'input1' USING PigStorage(',') AS (key:INT, val:INT);
input2 = LOAD 'input2' USING PigStorage(',') AS (key:INT, val:INT);
input3 = LOAD 'input3' USING PigStorage(',') AS (key:INT, val:INT);

joined = JOIN input1 BY key LEFT,
    input2 BY key,
    input3 BY key;

• Unfortunately, this is not legal Pig Latin
- 31. Left outer joins
• Instead, you need to join twice:
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
• This approach requires two MapReduce jobs, making it
inefficient, as well as inelegant
- 32. Left outer joins
• There is always cogroup:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
    FLATTEN(input1), -- left join on this
    FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
        as (input2::key, input2::val),
    FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
        as (input3::key, input3::val);
• But, it’s cumbersome and error-prone
- 33. Left outer joins
• So, we have EmptyBagToNullFields
• Cleaner, easier to use
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();

data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
    FLATTEN(input1), -- left join on this
    FLATTEN(EmptyBagToNullFields(input2)),
    FLATTEN(EmptyBagToNullFields(input3));
- 34. Left outer joins
• Can turn it into a macro
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3)
returns joined {
    cogrouped = COGROUP
        $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
    $joined = FOREACH cogrouped GENERATE
        FLATTEN($relation1),
        FLATTEN(EmptyBagToNullFields($relation2)),
        FLATTEN(EmptyBagToNullFields($relation3));
};

features = left_outer_join(input1, key, input2, key, input3, key);
- 35. Schema and aliases
• A common (bad) practice in Pig is to use positional notation
to reference fields
• Hard to maintain
• Script is tightly coupled to order of fields in input
• Inserting a field in the beginning breaks things
downstream
• UDFs can have this same problem
• Especially problematic because code is separated, so
the dependency is not obvious
- 36. Schema and aliases
• Suppose we are calculating monthly mortgage payments
for various interest rates
mortgage = LOAD 'mortgage.csv' USING PigStorage('|')
    AS (principal:double,
        num_payments:int,
        interest_rates: bag {tuple(interest_rate:double)});
- 37. Schema and aliases
• So, we write a UDF to compute the payments
• First, we need to get the input parameters:
@Override
public DataBag exec(Tuple input) throws IOException
{
    Double principal = (Double)input.get(0);
    Integer numPayments = (Integer)input.get(1);
    DataBag interestRates = (DataBag)input.get(2);

    // ...
- 38. Schema and aliases
• Then do some computation:
DataBag output = bagFactory.newDefaultBag();

for (Tuple interestTuple : interestRates) {
    Double interest = (Double)interestTuple.get(0);

    double monthlyPayment = computeMonthlyPayment(principal, numPayments,
        interest);

    output.add(tupleFactory.newTuple(monthlyPayment));
}
- 39. Schema and aliases
• The UDF then gets applied
payments = FOREACH mortgage GENERATE MortgagePayment($0,$1,$2);
• Or, a bit more understandably
payments = FOREACH mortgage GENERATE
    MortgagePayment(principal, num_payments, interest_rates);
- 40. Schema and aliases
• Later, the data changes, and a field is prepended to the
tuples in the interest_rates bag
mortgage = LOAD 'mortgage.csv' USING PigStorage('|')
    AS (principal:double,
        num_payments:int,
        interest_rates: bag {tuple(wow_change:double, interest_rate:double)});
• The script happily continues to work, and the bad output data
flows downstream, causing serious errors later
- 41. Schema and aliases
• Write the UDF to fetch arguments by name using the
schema
• AliasableEvalFunc can help
Double principal = getDouble(input, "principal");
Integer numPayments = getInteger(input, "num_payments");
DataBag interestRates = getBag(input, "interest_rates");

for (Tuple interestTuple : interestRates) {
    Double interest = getDouble(interestTuple,
        getPrefixedAliasName("interest_rates", "interest_rate"));
    // compute monthly payment...
}
- 42. Other awesome things
• New and upcoming things
• Functions for calculating entropy
• OpenNLP wrappers
• New and improved Sampling UDFs
• Additional Bag UDFs
• InHashSet
• More…
- 44. Event Collection
• Typically online websites have instrumented
services that collect events
• Events stored in an offline system (such as
Hadoop) for later analysis
• Using events, we can build dashboards with
metrics such as:
• # of page views over last month
• # of active users over last month
• Metrics derived from events can also be useful
in recommendation pipelines
• e.g. impression discounting
- 45. Event Storage
• Events can be categorized into topics, for example:
• page view
• user login
• ad impression/click
• Store events by topic and by day:
• /data/page_view/daily/2013/10/08
• /data/page_view/daily/2013/10/09
• Hourglass allows you to perform computation over specific time
windows of data stored in this format
- 49. Improving Fixed-Start Computations
• Suppose we must compute page view counts per member
• The job consumes all days of available input, producing one
output.
• We call this a partition-collapsing job.
• But, if the job runs tomorrow it has
to reprocess the same data.
- 50. Improving Fixed-Start Computations
• Solution: Merge new data with previous output
• We can do this because this is an arithmetic operation
• Hourglass provides a
partition-collapsing job that
supports output reuse.
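• For example, if the previous output covered days 1–30, then
count(days 1–31) = count(days 1–30) + count(day 31), so only the
new day needs to be processed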
- 52. Improving Fixed-Length Computations
• For a fixed-length job, can reuse output using a
similar trick:
• Add new day to previous
output
• Subtract old day from result
• We can subtract the old day
since this is arithmetic
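• For example, with a 30-day window:
count(days 2–31) = count(days 1–30) + count(day 31) − count(day 1)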
- 54. Improving Fixed-Length Computations
• But, for some operations, cannot subtract old data
• example: max(), min()
• Cannot reuse previous output, so how to reduce computation?
• Solution: a partition-preserving job
• Consumes partitioned input data
• Produces partitioned output data
• Aggregates the data in advance, per partition
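• For example, to track the per-member max over the last 30 days, store
each day's per-member max as its own partition; the final job takes the
max of 30 small daily aggregates, each computed only once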
- 56. MapReduce in Hourglass
• MapReduce is a fairly general programming model
• Hourglass requires that:
• reduce() must output a (key,value) pair
• reduce() must produce at most one value
• reduce() must be implemented by an accumulator
• Hourglass provides all the MapReduce boilerplate for
you for these types of jobs
- 57. Summary
• Two types of jobs:
• Partition-preserving: consume partitioned
input data, produce partitioned output data
• Partition-collapsing: consume partitioned
input data, produce single output
- 58. Summary
• You provide:
• Input: time range, input paths
• Implement: map(), accumulate()
• Optional: merge(), unmerge()
• Hourglass provides the rest, making it easier to
implement jobs that incrementally process data