Learn Hadoop and Big Data analytics: join Design Pathshala training programs on Big Data and analytics.
This slide deck covers advanced MapReduce concepts in Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Advanced MapReduce - Apache Hadoop Bigdata Training by Design Pathshala
11. Map-Reduce Execution Engine
(Example: Color Count)
(Diagram: input blocks on HDFS feed four Map tasks; each Map task produces (k', v') pairs such as (color, 1); a parse-hash step routes the pairs to the shuffle-and-sort phase, which groups them by key k; three Reduce tasks consume (k, [v]) lists such as (color, [1, 1, 1, ...]) and produce (k, v) totals such as (color, 100).)
Users only provide the "Map" and "Reduce" functions.
The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines.
14. Properties of MapReduce Engine
(Cont’d)
Task Tracker is the slave node (runs on each datanode)
Receives the task from Job Tracker
Runs the task until completion (either map or reduce task)
Always in communication with the Job Tracker reporting progress
(Diagram: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.)
15. Key-Value Pairs
Mappers and Reducers are users’ code (provided functions)
Just need to obey the Key-Value pairs interface
Mappers:
Consume <key, value> pairs
Produce <key, value> pairs
Reducers:
Consume <key, <list of values>>
Produce <key, value>
Shuffling and Sorting:
Hidden phase between mappers and reducers
Groups all values with the same key from all mappers, sorts them, and passes them to a particular
reducer in the form of <key, <list of values>>
17. Example 2: Color Filter
Job: Select only the blue color
(Diagram: input blocks on HDFS feed four Map tasks; each Map task produces (k, v) pairs such as (color, 1) and writes its output directly to HDFS.)
• Each map task will select only the blue color
• No need for reduce phase
The output file has 4 parts (Part0001, Part0002, Part0003, Part0004), probably on 4 different machines.
18. How does MapReduce work?
The runtime partitions the input and provides it to different Map instances:
Map(key, value) → (key', value')
The runtime collects the (key', value') pairs and distributes them to several Reduce functions, so that each
Reduce function gets the pairs with the same key'.
Each Reduce produces a single (or zero) output file.
Map and Reduce are user-written functions.
19. Example 3: Count Fruits
Job: Count the occurrences of each fruit in a data set
(Diagram: input records flow through the map tasks and then the reduce tasks.)
20. Word Count Example
Mapper
Input: value: lines of text of input
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
22. Example MapReduce: Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Text word = new Text();

  // Called once for each input record (here, one line of text).
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken().trim());
      context.write(word, new IntWritable(1)); // emit (word, 1)
    }
  }
}
23. Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  // Called once per key, with all of the values emitted for that key.
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result); // emit (word, total count)
  }
}
24. Job
// Driver (inside the WordCount class). Uses the new-API Job class so it matches
// the WordCountMapper and WordCountReducer shown on the previous slides.
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "wordcount");
  job.setJarByClass(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(WordCountMapper.class);
  job.setReducerClass(WordCountReducer.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait
}
25. Terminology Example
Running “Word Count” across 20 files is one job
20 input splits to be mapped imply 20 map tasks + some number of reduce
tasks
At least 20 map task attempts will be performed… more if a machine
crashes, etc.
27. MapReduce - Features
Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible
28. What is Writable?
Hadoop defines its own “box” classes for strings (Text), integers
(IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
29. Hadoop Data Types
Class | Size (bytes) | Description | Sort policy
BooleanWritable | 1 | Wrapper for a standard boolean | false sorts before true
ByteWritable | 1 | Wrapper for a single byte | Ascending order
DoubleWritable | 8 | Wrapper for a double | Ascending order
FloatWritable | 4 | Wrapper for a float | Ascending order
IntWritable | 4 | Wrapper for an integer | Ascending order
LongWritable | 8 | Wrapper for a long | Ascending order
Text | up to 2 GB | Wrapper for text stored in Unicode UTF-8 format | Alphabetic order
NullWritable | – | Placeholder when the key or value is not needed | Undefined
Your Writable | – | Implement the Writable interface for a value, or WritableComparable<T> for a key | Your sort policy
30. WritableComparable
Compares WritableComparable data
Will call compareTo method to do comparison
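A minimal sketch of a custom key type (hypothetical ColorCount name; the fields and descending-count sort order are chosen purely for illustration):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Serializes its fields in write()/readFields() and sorts by descending count in compareTo().
// For real use, also override hashCode() so the default hash partitioner distributes keys consistently.
public class ColorCount implements WritableComparable<ColorCount> {
  private String color = "";
  private long count;

  public void set(String color, long count) {
    this.color = color;
    this.count = count;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(color);
    out.writeLong(count);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    color = in.readUTF();
    count = in.readLong();
  }

  @Override
  public int compareTo(ColorCount other) {
    int byCount = Long.compare(other.count, this.count); // higher counts first
    return byCount != 0 ? byCount : this.color.compareTo(other.color);
  }
}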
32. Input Split Size
Input splits are a logical division of the records, whereas HDFS blocks are a physical division of the input data.
Processing is most efficient when the two coincide, but in practice they never align exactly.
The machine processing a particular split may therefore fetch a fragment of a record from a block other than
its "main" block, and that block may reside remotely.
FileInputFormat will divide large files into chunks.
The exact size is controlled by mapred.min.split.size.
RecordReaders receive the file, offset, and length of the chunk (i.e., the input split).
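A minimal sketch of setting this programmatically (assuming the new-API Job object from the WordCount driver; 128 MB is just an example value):
// Either set the property named above...
job.getConfiguration().setLong("mapred.min.split.size", 128L * 1024 * 1024);
// ...or use the FileInputFormat helper (org.apache.hadoop.mapreduce.lib.input):
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);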
33. Getting Data To The Mapper
(Diagram: the InputFormat divides the input files into InputSplits; each InputSplit is read by a RecordReader, which parses it into key-value records; each RecordReader feeds one Mapper, which produces intermediate output.)
34. Reading Data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
36. FileInputFormat
TextInputFormat – Treats each newline ('\n')-terminated line of a file as a value; the key is the byte offset
of the line.
Key: LongWritable
Value: Text
KeyValueTextInputFormat – Each line of the text file is a record. A separator character divides the line:
anything before the separator is the key, anything after it is the value.
The separator is set by the property key.value.separator.in.input.line
The default separator is the tab character ('\t')
Key: Text
Value: Text
SequenceFileInputFormat – Input format for reading sequence files. Keys and values are user defined.
Sequence files are Hadoop-specific compressed binary files.
NLineInputFormat – Same as TextInputFormat, but each split is guaranteed to have exactly N lines.
N is set by the property mapred.line.input.format.linespermap
Key: LongWritable
Value: Text
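A minimal sketch of selecting an input format programmatically (assuming the new-API Job from the WordCount driver; the comma separator is only an example):
// Use KeyValueTextInputFormat and override the default tab separator.
// The property name is the one quoted above; newer releases also accept
// mapreduce.input.keyvaluelinerecordreader.key.value.separator.
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.getConfiguration().set("key.value.separator.in.input.line", ",");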
37. Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them
to the mapper
It delegates filtering of this file list to a method that subclasses may override
e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
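A minimal sketch of the *.xyz case (hypothetical names; shown with a PathFilter, which FileInputFormat also supports, rather than a full InputFormat subclass):
// Accept only files ending in ".xyz" (uses org.apache.hadoop.fs.Path and PathFilter).
public static class XyzPathFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}
// In the driver: FileInputFormat.setInputPathFilter(job, XyzPathFilter.class);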
38. Record Readers
Each InputFormat provides its own RecordReader implementation
Responsible for parsing input splits into records
Then parsing each record into a key value pair
LineRecordReader – Reads a line from a text file
Used in TextInputFormat
KeyValueLineRecordReader – Used by KeyValueTextInputFormat
Custom Record Readers can be created by implementing RecordReader<K,V>
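A minimal sketch of a custom reader (hypothetical UpperCaseRecordReader name; it simply wraps the built-in LineRecordReader and upper-cases each line before handing it to the mapper):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Delegates the actual file reading to LineRecordReader and transforms each value.
public class UpperCaseRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader lineReader = new LineRecordReader();
  private final Text current = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    lineReader.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (!lineReader.nextKeyValue()) {
      return false;
    }
    current.set(lineReader.getCurrentValue().toString().toUpperCase());
    return true;
  }

  @Override
  public LongWritable getCurrentKey() { return lineReader.getCurrentKey(); }

  @Override
  public Text getCurrentValue() { return current; }

  @Override
  public float getProgress() throws IOException { return lineReader.getProgress(); }

  @Override
  public void close() throws IOException { lineReader.close(); }
}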
39. Creating the Mapper
Extends the Mapper abstract class:
class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the end of the task.
protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) – Called once
for each key/value pair in the input split.
void run(org.apache.hadoop.mapreduce.Mapper.Context context) – Expert users can override this method for more
complete control over the execution of the Mapper.
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the beginning of the task.
The Context object receives the output of the mapping process (the old API used an OutputCollector for this);
a sketch of setup() and cleanup() follows below.
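A minimal sketch of overriding setup() and cleanup() (hypothetical LineCountMapper; it emits a single line count per map task, purely for illustration):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// setup() runs once before the first map() call, cleanup() once after the last.
public class LineCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private long lines;

  @Override
  protected void setup(Context context) {
    lines = 0;   // initialize per-task state
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    lines++;     // one count per input record
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("lines"), new LongWritable(lines)); // one output per task
  }
}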
41. Mapper
public void map(
Object key,
Text value,
Context context)
Key types implement WritableComparable
Value types implement Writable
42. Some useful mappers
IdentityMapper<K,V> – Maps the input directly to the output
InverseMapper<K,V> – Reverses the key-value pair
RegexMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every
regular expression match
TokenCountMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair for every
token in the input value
46. Some useful reducers
IdentityReducer<K,V> – Passes the input directly to the output
LongSumReducer<K> – Implements Reducer<K,LongWritable,K,LongWritable> and computes the sum of all values
corresponding to the given key
47. OutputFormat
TextOutputFormat – Writes each record as a line of text. Keys and values are written as strings, separated by
a tab character ('\t').
SequenceFileOutputFormat – Writes keys and values in Hadoop's own sequence file format.
NullOutputFormat – Outputs nothing; useful if for any reason you want to suppress the output completely.
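For example, a sketch assuming the new-API Job object (the block-compression call is optional):
// Write the job output as a Hadoop sequence file instead of plain text.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);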
48. Apache Hadoop
Common MapReduce Algorithms
Design Pathshala
April 22, 2014
50. Some handy tools
Partitioners
Combiners
Compression
Zero Reduces
Distributed File Cache
51. Partitioners
Partitioners are application code that defines how keys are assigned to reduces.
Default partitioning spreads keys evenly, but randomly.
It uses key.hashCode() % num_reduces.
Custom partitioning is often required, for example, to produce a total order in the output.
A custom partitioner should implement the Partitioner interface.
It is set by calling conf.setPartitionerClass(MyPart.class) (or job.setPartitionerClass in the new API).
To get a total order, sample the map output keys and pick values that divide the keys into roughly equal
buckets, then use those split points in your partitioner (a simple custom partitioner is sketched below).
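A minimal sketch (hypothetical FirstLetterPartitioner; the two-bucket split is purely illustrative, not a real total-order partitioner):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words starting with a-m to one reducer and the rest to another,
// then wraps around so the result is always a valid partition number.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    int bucket = (first <= 'm') ? 0 : 1;   // two coarse buckets by first letter
    return bucket % numPartitions;         // stay valid for any reducer count
  }
}
// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);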
54. Combiners
When maps produce many repeated keys:
It is often useful to do a local aggregation following the map.
This is done by specifying a Combiner.
The goal is to decrease the size of the transient data.
Combiners have the same interface as Reducers, and often are the same class.
Combiners must have no side effects, because they may run an indeterminate number of times (zero, one, or many).
In WordCount: job.setCombinerClass(WordCountReducer.class);
55. Compression
Compressing the outputs and intermediate data will often yield huge
performance gains
Can be specified via a configuration file or set programmatically
Set mapred.output.compress to true to compress job output
Set mapred.compress.map.output to true to compress map outputs
Compression Types (mapred(.map)?.output.compression.type)
“block” - Group of keys and values are compressed together
“record” - Each value is compressed individually
Block compression is almost always best
Compression Codecs (mapred(.map)?.output.compression.codec)
Default (zlib) - slower, but more compression
LZO - faster, but less compression
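A minimal sketch of setting these programmatically (property names as listed on this slide; the values are only examples):
// Assuming a Configuration object named "conf" (e.g. job.getConfiguration()).
conf.setBoolean("mapred.output.compress", true);       // compress the job output
conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate map output
conf.set("mapred.output.compression.type", "BLOCK");   // block compression, as recommended above
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.GzipCodec");   // default zlib/gzip family; LZO is faster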
57. Zero Reduces
Frequently, we only need to run a filter on the input data
No sorting or shuffling required by the job
Set the number of reduces to 0
Output from maps will go directly to OutputFormat and disk
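In code this is a single call on the job (new-API Job object assumed):
// Map-only job: no shuffle or sort; map output goes straight to the OutputFormat.
job.setNumReduceTasks(0);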
58. Distributed File Cache
Sometimes need read-only copies of data on the local
computer
Downloading 1GB of data for each Mapper is expensive
Define list of files you need to download in JobConf
Files are downloaded once per computer
Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
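A minimal sketch of using the cached copy inside a task (new-API Mapper assumed; the file contents and parsing are illustrative):
// Inside the Mapper: open the locally cached file once, in setup().
// Uses java.io.BufferedReader/FileReader and org.apache.hadoop.fs.Path.
@Override
protected void setup(Context context) throws IOException {
  Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
  if (files != null && files.length > 0) {
    BufferedReader reader = new BufferedReader(new FileReader(files[0].toString()));
    // ... load the lookup data into memory here ...
    reader.close();
  }
}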