Apache Hadoop 
Design Pathshala 
April 22, 2014 
www.designpathshala.com 1
Apache Hadoop 
Interacting with HDFS 
Design Pathshala 
April 22, 2014 
Apache Hadoop Bigdata 
Training By Design Pathshala 
Contact us on: admin@designpathshala.com 
Or Call us at: +91 120 260 5512 or +91 98 188 23045 
Visit us at: http://designpathshala.com 
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | 
http://designpathshala.com 
3
Basic file commands 
 Commands for the HDFS user: 
 hadoop fs -mkdir /foodir 
 hadoop fs -ls / 
 hadoop fs -lsr / 
 hadoop fs -put abc.txt /usr/dp 
 hadoop fs -get /usr/dp/abc.txt . 
 hadoop fs -cat /foodir/myfile.txt 
 hadoop fs -rm /foodir/myfile.txt 
Reading & Writing Programmatically 
 Package: org.apache.hadoop.fs 
 Configuration conf = new Configuration(); 
 FileSystem hdfs = FileSystem.get(conf); 
 FileSystem local = FileSystem.getLocal(conf); 
public static void main(String[] args) throws IOException { 
  Configuration conf = new Configuration(); 
  FileSystem hdfs = FileSystem.get(conf); 
  FileSystem local = FileSystem.getLocal(conf); 
  Path inputDir = new Path(args[0]); 
  Path hdfsFile = new Path(args[1]); 
  try { 
    FileStatus[] inputFiles = local.listStatus(inputDir); 
    FSDataOutputStream out = hdfs.create(hdfsFile); 
    for (int i = 0; i < inputFiles.length; i++) { 
      System.out.println(inputFiles[i].getPath().getName()); 
      FSDataInputStream in = local.open(inputFiles[i].getPath()); 
      byte[] buffer = new byte[256]; 
      int bytesRead = 0; 
      while ((bytesRead = in.read(buffer)) > 0) { 
        out.write(buffer, 0, bytesRead); 
      } 
      in.close(); 
    } 
    out.close(); 
  } catch (IOException e) { 
    e.printStackTrace(); 
  } 
} 
Apache Hadoop 
Map Reduce Basics 
Design Pathshala 
April 22, 2014 
MapReduce - Dataflow 
Map-Reduce Execution Engine 
(Example: Color Count) 
[Diagram] Input blocks on HDFS feed four Map tasks, each producing (k, v) pairs such as (color, 1). A parse-hash step routes each pair, and shuffle & sorting based on k groups them so that each of three Reduce tasks consumes (k, [v]) lists such as (color, [1,1,1,1,1,1, ...]) and produces (k’, v’) totals such as (color, 100). 
Users only provide the “Map” and “Reduce” functions 
The output file has three parts (Part0001, Part0002, Part0003), probably on three different machines 
Map <key, 1> Reducers (say, Count) 
[Diagram] Large-scale data splits feed Map tasks that emit <key, 1> pairs; parse-hash steps route the pairs to Count reducers, which write partitions P-0000 (count1), P-0001 (count2), and P-0002 (count3). 
Properties of MapReduce Engine 
(Cont’d) 
 Task Tracker is the slave node (runs on each datanode) 
 Receives the task from Job Tracker 
 Runs the task until completion (either map or reduce task) 
 Always in communication with the Job Tracker reporting progress 
[Diagram] Four Map tasks feed, via parse-hash, three Reduce tasks: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks. 
Key-Value Pairs 
 Mappers and Reducers are users’ code (provided functions) 
 Just need to obey the Key-Value pairs interface 
 Mappers: 
 Consume <key, value> pairs 
 Produce <key, value> pairs 
 Reducers: 
 Consume <key, <list of values>> 
 Produce <key, value> 
 Shuffling and Sorting: 
 Hidden phase between mappers and reducers 
 Groups all similar keys from all mappers, sorts and passes them to a certain 
reducer in the form of <key, <list of values>> 
Example 2: Color Filter 
Job: Select only the blue colors 
[Diagram] Input blocks on HDFS feed four Map tasks; each produces (k, v) pairs such as (color, 1) and writes its output directly to HDFS. 
• Each map task will select only 
the blue color 
• No need for a reduce phase 
• The output file has four parts (Part0001 to Part0004), probably on four different machines 
How does MapReduce work? 
 The run time partitions the input and provides it to 
different Map instances; 
 Map (key, value) → (key’, value’) 
 The run time collects the (key’, value’) pairs and 
distributes them to several Reduce functions so that each 
Reduce function gets the pairs with the same key’. 
 Each Reduce produces a single (or zero) file output. 
 Map and Reduce are user-written functions 
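The dataflow above can be sketched end-to-end in plain Java, with no Hadoop dependency. `MiniMapReduce` is an illustrative name, not a Hadoop class: it maps lines to (word, 1) pairs, groups them by key the way the shuffle/sort phase would, and reduces each group to a sum.

```java
import java.util.*;

// Toy in-memory simulation of the MapReduce dataflow described above.
// map() emits (word, 1) pairs; shuffle() groups values by key;
// reduce() sums each key's value list.
public class MiniMapReduce {

    // Map phase: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty())
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // Shuffle & sort: group all values for the same key, sorted by key
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: (key, [v1, v2, ...]) -> (key, sum)
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    static Map<String, Integer> wordCount(String... lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(wordCount("the quick fox", "the lazy dog the"));
    }
}
```

In a real cluster each phase runs on different machines; this sketch only mirrors the data movement, not the distribution.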
Example 3: Count Fruits 
 Job: Count the occurrences of each word in a data set 
[Diagram] Map tasks feed Reduce tasks. 
Word Count Example 
 Mapper 
 Input: value: lines of text of input 
 Output: key: word, value: 1 
 Reducer 
 Input: key: word, value: set of counts 
 Output: key: word, value: sum 
 Launching program 
 Defines this job 
 Submits job to cluster 
Example MapReduce: Mapper 

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> { 
  Text word = new Text(); 
  public void map(LongWritable key, Text value, Context context) 
      throws IOException, InterruptedException { 
    String line = value.toString(); 
    StringTokenizer tokenizer = new StringTokenizer(line); 
    while (tokenizer.hasMoreTokens()) { 
      word.set(tokenizer.nextToken().trim()); 
      context.write(word, new IntWritable(1)); 
    } 
  } 
} 
Reducer 

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { 
  private IntWritable result = new IntWritable(); 
  public void reduce(Text key, Iterable<IntWritable> values, Context context) 
      throws IOException, InterruptedException { 
    int sum = 0; 
    for (IntWritable val : values) { 
      sum += val.get(); 
    } 
    result.set(sum); 
    context.write(key, result); 
  } 
} 
Job 

public static void main(String[] args) throws Exception { 
  Configuration conf = new Configuration(); 
  // Use the new-API Job class to match the Mapper/Reducer shown above 
  Job job = Job.getInstance(conf, "wordcount"); 
  job.setJarByClass(WordCount.class); 
  job.setMapperClass(WordCountMapper.class); 
  job.setReducerClass(WordCountReducer.class); 
  job.setOutputKeyClass(Text.class); 
  job.setOutputValueClass(IntWritable.class); 
  job.setInputFormatClass(TextInputFormat.class); 
  job.setOutputFormatClass(TextOutputFormat.class); 
  FileInputFormat.setInputPaths(job, new Path(args[0])); 
  FileOutputFormat.setOutputPath(job, new Path(args[1])); 
  System.exit(job.waitForCompletion(true) ? 0 : 1); 
} 
Terminology Example 
 Running “Word Count” across 20 files is one job 
 20 input splits to be mapped imply 20 map tasks + some number of reduce 
tasks 
 At least 20 map task attempts will be performed… more if a machine 
crashes, etc. 
MapReduce - Features 
 Fine grained Map and Reduce tasks 
 Improved load balancing 
 Faster recovery from failed tasks 
 Automatic re-execution on failure 
 In a large cluster, some nodes are always slow or flaky 
 Framework re-executes failed tasks 
 Locality optimizations 
 With large data, bandwidth to data is a problem 
 Map-Reduce + HDFS is a very effective solution 
 Map-Reduce queries HDFS for locations of input data 
 Map tasks are scheduled close to the inputs when possible 
What is Writable? 
 Hadoop defines its own “box” classes for strings (Text), integers 
(IntWritable), etc. 
 All values are instances of Writable 
 All keys are instances of WritableComparable 
Hadoop Data Types 

Class           | Size in bytes | Description                                                        | Sort policy 
BooleanWritable | 1             | Wrapper for a standard boolean variable                            | false before true 
ByteWritable    | 1             | Wrapper for a single byte                                          | Ascending order 
DoubleWritable  | 8             | Wrapper for a double                                               | Ascending order 
FloatWritable   | 4             | Wrapper for a float                                                | Ascending order 
IntWritable     | 4             | Wrapper for an integer                                             | Ascending order 
LongWritable    | 8             | Wrapper for a long                                                 | Ascending order 
Text            | up to 2 GB    | Wrapper to store text in Unicode UTF-8 format                      | Alphabetic order 
NullWritable    | -             | Placeholder when the key or value is not needed                    | Undefined 
Your Writable   | -             | Implement Writable for a value, or WritableComparable<T> for a key | Your sort policy 
WritableComparable 
 Compares WritableComparable data 
 Will call compareTo method to do comparison 
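As a dependency-free sketch of this idea, a plain-Java box class can mimic IntWritable’s ascending sort policy via compareTo. `IntBox` is an illustrative name, not a Hadoop type.

```java
import java.util.*;

// Sketch of how a WritableComparable drives sort order: the framework
// calls compareTo() during the sort phase, so whatever ordering
// compareTo() defines is the order keys reach the reducer in.
public class IntBox implements Comparable<IntBox> {
    private final int value;

    public IntBox(int value) { this.value = value; }
    public int get() { return value; }

    // Ascending numeric order, like IntWritable's sort policy
    @Override
    public int compareTo(IntBox other) {
        return Integer.compare(this.value, other.value);
    }

    public static void main(String[] args) {
        List<IntBox> keys = new ArrayList<>(List.of(new IntBox(42), new IntBox(7), new IntBox(19)));
        Collections.sort(keys); // uses compareTo, like the shuffle/sort phase
        for (IntBox k : keys)
            System.out.println(k.get()); // prints 7, 19, 42 on separate lines
    }
}
```

A real WritableComparable additionally implements readFields()/write() for serialization, which this sketch omits.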
Input Split Size 
 Input splits are logical divisions of records, whereas HDFS blocks are physical 
divisions of the input data. 
 Processing is most efficient when the two coincide, but in practice they rarely align. 
 A machine processing a particular split may have to fetch a fragment of a record from a 
block other than its “main” block, and that block may reside remotely. 
 FileInputFormat will divide large files into chunks 
 Exact size controlled by mapred.min.split.size 
 RecordReaders receive the file, offset, and length of the chunk (or input split) 
Getting Data To The Mapper 
[Diagram] The InputFormat divides each input file into InputSplits; a RecordReader turns each split into records, and each Mapper consumes those records and emits intermediates. 
Reading Data 
 Data sets are specified by InputFormats 
 Defines input data (e.g., a directory) 
 Identifies partitions of the data that form an InputSplit 
 Factory for RecordReader objects to extract (k, v) records from the input source 
FileInputFormat 
 TextInputFormat – Treats each ‘\n’-terminated line of a file as a value; the key is the byte offset 
of the line 
 Key: LongWritable 
 Value: Text 
 KeyValueTextInputFormat – Each line in the text file is a record. A separator character 
divides each line: anything before the separator is the key, anything after it is the value. 
 Separator is set by the key.value.separator.in.input.line property 
 Default separator is tab (“\t”) 
 Key: Text 
 Value: Text 
 SequenceFileInputFormat – Input format for reading sequence files, Hadoop’s compressed 
binary file format. Keys and values are user defined. 
 NLineInputFormat – Same as TextInputFormat, but each split is guaranteed to have exactly N 
lines 
 N is set by the mapred.line.input.format.linespermap property 
 Key: LongWritable 
 Value: Text 
Filtering File Inputs 
 FileInputFormat will read all files out of a specified directory and send them 
to the mapper 
 Delegates filtering this file list to a method subclasses may override 
 e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list 
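The filtering hook can be approximated in plain Java: a predicate that keeps only the hypothetical *.xyz extension from the slide. Class and method names here are illustrative, not the FileInputFormat API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the file-list filtering idea: given a directory
// listing, keep only the files with the wanted extension, the way an
// "xyzFileInputFormat" subclass would filter before creating splits.
public class XyzFilter {

    static boolean accept(String fileName) {
        return fileName.endsWith(".xyz");
    }

    static List<String> filter(List<String> listing) {
        List<String> kept = new ArrayList<>();
        for (String name : listing)
            if (accept(name)) kept.add(name);
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = List.of("a.xyz", "b.txt", "c.xyz", "_logs");
        System.out.println(filter(listing)); // prints [a.xyz, c.xyz]
    }
}
```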
Record Readers 
 Each InputFormat provides its own RecordReader implementation 
 Responsible for parsing input splits into records 
 Then parsing each record into a key value pair 
 LineRecordReader – Reads a line from a text file 
 Used in TextInputFormat 
 KeyValueRecordReader – Used by KeyValueTextInputFormat 
 Custom Record Readers can be created by implementing RecordReader<K,V> 
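A dependency-free sketch of what TextInputFormat’s LineRecordReader hands to the mapper: one (byte offset, line text) record per line. `LineRecords` is an illustrative name, not a Hadoop class, and it returns records as strings for readability where the real reader yields typed (LongWritable, Text) pairs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of LineRecordReader semantics: for each line, emit the byte
// offset of the line start as the key and the line text as the value.
public class LineRecords {

    // Returns "offset<TAB>line" strings; a real RecordReader exposes
    // records via nextKeyValue()/getCurrentKey()/getCurrentValue().
    static List<String> read(String content) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            long offset = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(offset + "\t" + line);
                offset += line.length() + 1; // +1 for the '\n' terminator
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory reader
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : read("first line\nsecond line\n"))
            System.out.println(record);
    }
}
```

This offset-based keying is why TextInputFormat’s keys are LongWritable byte offsets rather than line numbers.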
Creating the Mapper 
 Extends Mapper Abstract Class 
 Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
 protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) - 
Called once at the end of the task. 
 protected void map(KEYIN key, VALUEIN value, 
org.apache.hadoop.mapreduce.Mapper.Context context) - Called once for each 
key/value pair in the input split. 
 void run(org.apache.hadoop.mapreduce.Mapper.Context context) - Expert users 
can override this method for more complete control over the execution of the 
Mapper. 
 protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) - 
Called once at the beginning of the task. 
 The Context object receives the output of the mapping process (the old API 
used OutputCollector for this role) 
Mapper 
 public void map( 
Object key, 
Text value, 
Context context) 
 Key types implement WritableComparable 
 Value types implement Writable 
Some useful mappers 
 IdentityMapper<K,V> - Maps the input directly to output 
 InverseMapper<K,V> - Reverse key value pair 
 RegexMapper<K> - Implements Mapper<K,Text,Text,LongWritable> and 
generates a (match,1) pair for every regular expression match. 
 TokenCountMapper<K> - Implements Mapper<K,Text,Text,LongWritable> and 
generates a (token,1) pair for each token as the input value is tokenized. 
Reducer 
 void reduce( 
Text key, 
Iterable<IntWritable> values, 
Context context) 
 Key types implement WritableComparable 
 Value types implement Writable 
Finally: Writing The Output 
[Diagram] Each Reducer hands its records to a RecordWriter, which writes one output file; the OutputFormat ties them together. 
Some useful reducers 
 IdentityReducer<K,V> - Maps the input directly to output 
 LongSumReducer<K> - Implements Reducer<K,LongWritable, K,LongWritable> 
and determines sum of all values corresponding to the given key. 
OutputFormat 
 TextOutputFormat – Writes each record as a line of text. Keys and values are 
written as strings, separated by a tab (“\t”) 
 SequenceFileOutputFormat – Writes keys and values in Hadoop’s proprietary 
sequence file format 
 NullOutputFormat – Outputs nothing; use it if for any reason you want to suppress 
the output completely 
Apache Hadoop 
Common MapReduce Algorithms 
Design Pathshala 
April 22, 2014 
Some handy tools 
 Partitioners 
 Combiners 
 Compression 
 Zero Reduces 
 Distributed File Cache 
Partitioners 
 Partitioners are application code that define how keys are 
assigned to reduce tasks 
 Default partitioning spreads keys evenly, but randomly 
 Uses key.hashCode() % num_reduces 
 Custom partitioning is often required, for example, to 
produce a total order in the output 
 Should implement the Partitioner interface 
 Set by calling conf.setPartitionerClass(MyPart.class) 
 To get a total order, sample the map output keys and pick values 
to divide the keys into roughly equal buckets, then use those in your 
partitioner 
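The default partitioning rule above amounts to one line of arithmetic. This dependency-free sketch mirrors the logic of Hadoop’s HashPartitioner; the class name here is illustrative.

```java
// How the default partitioner assigns a key to a reduce task:
// hash the key, clear the sign bit, and take the result modulo
// the number of reduce tasks.
public class HashPartitionDemo {

    // The & with Integer.MAX_VALUE keeps the result non-negative
    // even when hashCode() returns a negative value.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"})
            System.out.println(key + " -> reducer " + getPartition(key, 3));
    }
}
```

Every occurrence of the same key hashes to the same reducer, which is what guarantees all values for a key meet in one reduce call.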
Partition And Shuffle 
[Diagram] Each Mapper’s intermediates pass through a Partitioner; shuffling then delivers each partition’s intermediates to the appropriate Reducer. 
Combiners 
 When maps produce many repeated keys 
 It is often useful to do a local aggregation following the map 
 Done by specifying a Combiner 
 Goal is to decrease the size of the transient data 
 Combiners have the same interface as Reducers, and often are the same 
class 
 Combiners must have no side effects, because they run an indeterminate 
number of times 
 In WordCount: conf.setCombinerClass(Reduce.class); 
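Why the local aggregation shrinks the transient data can be shown in plain Java: summing (word, 1) pairs on the map side means far fewer records cross the network. `CombinerDemo` is an illustrative sketch, not the Hadoop API.

```java
import java.util.*;

// Sketch of a word-count combiner's effect: collapse repeated
// (word, 1) map outputs into (word, count) before the shuffle.
public class CombinerDemo {

    // Local aggregation, the same logic the combiner/reducer applies
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys)
            combined.merge(key, 1, Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "the", "the", "fox", "the");
        Map<String, Integer> combined = combine(emitted);
        // prints 5 records -> 2 records: {fox=1, the=4}
        System.out.println(emitted.size() + " records -> "
                + combined.size() + " records: " + combined);
    }
}
```

Because this is ordinary addition, running it zero, one, or several times per map task yields the same final sums, which is exactly the no-side-effects requirement above.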
Compression 
 Compressing the outputs and intermediate data will often yield huge 
performance gains 
 Can be specified via a configuration file or set programmatically 
 Set mapred.output.compress to true to compress job output 
 Set mapred.compress.map.output to true to compress map outputs 
 Compression Types (mapred(.map)?.output.compression.type) 
 “block” - Group of keys and values are compressed together 
 “record” - Each value is compressed individually 
 Block compression is almost always best 
 Compression Codecs (mapred(.map)?.output.compression.codec) 
 Default (zlib) - slower, but more compression 
 LZO - faster, but less compression 
Zero Reduces 
 Frequently, we only need to run a filter on the input data 
 No sorting or shuffling required by the job 
 Set the number of reduces to 0 
 Output from maps will go directly to OutputFormat and disk 
Distributed File Cache 
 Sometimes you need read-only copies of data on the local 
computer 
 Downloading 1 GB of data for each Mapper is expensive 
 Define the list of files you need to download in JobConf 
 Files are downloaded once per computer 
 Add to the launching program: 
DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf); 
 Add to the task: 
Path[] files = DistributedCache.getLocalCacheFiles(conf); 

More Related Content

Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala

  • 1. Apache Hadoop Design Pathshala April 22, 2014 www.designpathshala.com 1
  • 2. Apache Hadoop Interacting with HDFS Design Pathshala April 22, 2014 www.designpathshala.com 2
  • 3. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 3
  • 4. Basic file commands  Commads for HDFS User:  hadoop fs -mkdir /foodir  hadoop fs –ls /  hadoop fs –lsr /  hadoop fs –put abc.txt /usr/dp  hadoop fs –get /usr/dp/abc.txt .  hadoop fs -cat /foodir/myfile.txt  hadoop fs -rm /foodir/myfile.txt www.designpathshala.com 4
  • 5. Reading & Writing Programatically  org.apache.hadoop.fs  Configuration conf = new Configuration();  FileSystem hdfs = FileSystem.get(conf);  FileSystem local = FileSystem.getLocal(conf); www.designpathshala.com 5
  • 6.  public static void main(String[] args) throws IOException {  Configuration conf = new Configuration();  FileSystem hdfs = FileSystem.get(conf);  FileSystem local = FileSystem.getLocal(conf);  Path inputDir = new Path(args[0]);  Path hdfsFile = new Path(args[1]);  try {  FileStatus[] inputFiles = local.listStatus(inputDir);  FSDataOutputStream out = hdfs.create(hdfsFile); www.designpathshala.com 6
  • 7. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 7
  • 8.  for (int i = 0; i < inputFiles.length; i++) {  System.out.println(inputFiles[i].getPath().getName());  FSDataInputStream in = local.open(inputFiles[i].getPath());  byte buffer[] = new byte[256];  int bytesRead = 0;  while ((bytesRead = in.read(buffer)) > 0) {  out.write(buffer, 0, bytesRead);  }  in.close();  }  out.close();  } catch (IOException e) {  e.printStackTrace();  }  } www.designpathshala.com 8
  • 9. Apache Hadoop Map Reduce Basics Design Pathshala April 22, 2014 www.designpathshala.com 9
  • 10. MapReduce - Dataflow www.designpathshala.com 10
  • 11. Map-Reduce Execution Engine (Example: Color Count) Shuffle & Sorting based on k Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Reduce Reduce Reduce Produces (k, v) ( , 1) Map Map Map Map Input blocks on HDFS Parse-hash Parse-hash Parse-hash Parse-hash Produces(k’, v’) ( , 100) www.designpathshala.com 11 Users only provide the “Map” and “Reduce” functions Part0001 Part0002 Part0003 That’s the output file, it has 3 parts on probably 3 different machines
  • 12. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 12
  • 13. Map <key, 1> Reducers (say, Count) Count Count Count Large scale data splits Parse-hash Parse-hash Parse-hash Parse-hash P-0000 , count1 P-0001 , count2 P-0002 ,count3 www.designpathshala.com 13
  • 14. Properties of MapReduce Engine (Cont’d)  Task Tracker is the slave node (runs on each datanode)  Receives the task from Job Tracker  Runs the task until completion (either map or reduce task)  Always in communication with the Job Tracker reporting progress Reduce Reduce Reduce Map Map Map Map Parse-hash Parse-hash Parse-hash Parse-hash In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks www.designpathshala.com 14
  • 15. Key-Value Pairs  Mappers and Reducers are users’ code (provided functions)  Just need to obey the Key-Value pairs interface  Mappers:  Consume <key, value> pairs  Produce <key, value> pairs  Reducers:  Consume <key, <list of values>>  Produce <key, value>  Shuffling and Sorting:  Hidden phase between mappers and reducers  Groups all similar keys from all mappers, sorts and passes them to a certain reducer in the form of <key, <list of values>> www.designpathshala.com 15
  • 16. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 16
  • 17. Example 2: Color Filter Job: Select only the blue colorS Input blocks Produces (k, v) on HDFS ( , 1) Map Map Map Map Write to HDFS Write to HDFS Write to HDFS Write to HDFS • Each map task will select only the blue color • No need for reduce phase Part0001 Part0002 Part0003 Part0004 That’s the output file, it has 4 parts on probably 4 different machines www.designpathshala.com 17
  • 18. How does MapReduce work?  The run time partitions the input and provides it to different Map instances;  Map (key, value)  (key’, value’)  The run time collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.  Each Reduce produces a single (or zero) file output.  Map and Reduce are user written functions www.designpathshala.com 18
  • 19. Example 3: Count Fruits  Job: Count the occurrences of each word in a data set Map Tasks Reduce Tasks www.designpathshala.com 19
  • 20. Word Count Example  Mapper  Input: value: lines of text of input  Output: key: word, value: 1  Reducer  Input: key: word, value: set of counts  Output: key: word, value: sum  Launching program  Defines this job  Submits job to cluster www.designpathshala.com 20
  • 21. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 21
  • 22. Example MapReduce: Mapper public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> { Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken().trim()); context.write(word, new IntWritable(1)); }}} www.designpathshala.com 22
  • 23. Reducer public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result);}} www.designpathshala.com 23
  • 24. Job public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);} www.designpathshala.com 24
  • 25. Terminology Example  Running “Word Count” across 20 files is one job  20 input splits to be mapped imply 20 map tasks + some number of reduce tasks  At least 20 map task attempts will be performed… more if a machine crashes, etc. www.designpathshala.com 25
  • 26. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 26
  • 27. MapReduce - Features  Fine grained Map and Reduce tasks  Improved load balancing  Faster recovery from failed tasks  Automatic re-execution on failure  In a large cluster, some nodes are always slow or flaky  Framework re-executes failed tasks  Locality optimizations  With large data, bandwidth to data is a problem  Map-Reduce + HDFS is a very effective solution  Map-Reduce queries HDFS for locations of input data  Map tasks are scheduled close to the inputs when possible www.designpathshala.com 27
  • 28. What is Writable?  Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.  All values are instances of Writable  All keys are instances of WritableComparable www.designpathshala.com 28
  • 29. Hadoop Data Types Class Size in bytes Description Sort Policy BooleanWritable 1 Wrapper for a standard Boolean variable False before and true after ByteWritable 1 Wrapper for a single byte Ascending order DoubleWritable 8 Wrapper for a Double Ascending order FloatWritable 4 Wrapper for a Float Ascending order IntWritable 4 Wrapper for a Integer Ascending order LongWritable 8 Wrapper for a Long Ascending order Text 2GB Wrapper to store text using the unicode UTF8 format Alphabetic order NullWritable Placeholder when the key or value is not needed Undefined Your Writable Implement the Writable Interface for a value or WritableComparable<T> for a key Your sort policy www.designpathshala.com 29
  • 30. WritableComparable  Compares WritableComparable data  Will call compareTo method to do comparison www.designpathshala.com 30
  • 31. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 31
  • 32. Input Split Size  Input splits are logical division of records whereas HDFS blocks are physical division of the input data.  Its extremely efficient when they are same but in practice it’s never align.  Machine processing a particular split may fetch a fragment of a record from a block other than its “main” block and which may reside remotely.  FileInputFormat will divide large files into chunks  Exact size controlled by mapred.min.split.size  RecordReaders receive file, offset, and length of chunk (or input splits) www.designpathshala.com 32
  • 33. Getting Data To The Mapper Input file Input file InputSplit InputSplit InputSplit InputSplit RecordReader RecordReader RecordReader RecordReader Mapper (intermediates) Mapper (intermediates) Mapper (intermediates) Mapper (intermediates) InputFormat www.designpathshala.com 33
  • 34. Reading Data  Data sets are specified by InputFormats  Defines input data (e.g., a directory)  Identifies partitions of the data that form an InputSplit  Factory for RecordReader objects to extract (k, v) records from the input source www.designpathshala.com 34
  • 35. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 35
  • 36. FileInputFormat  TextInputFormat – Treats each ‘n’-terminated line of a file as a value & Key is the byte offset of line  Key – LongWritable  Value - Text  KeyValueTextInputFormat – Each line in the text files is a record. Separator character deivides each line. Any thing before separator is a key and after that is a value.  Separator is set by “key.value.separator.in.input.line.property”  Default separator is “t”  Key: Text  Value: Text  SequenceFileInputFormat – Input format for reading in sequence files. Key and values are user defined. These are specific compression binary file format.  NLineInputFormat – Same as TextInputFormat, but each split is guaranteed to have exactly N lines.  Mapred.line.input.format.linespermap.property  Key: LongWritable  Value: Text www.designpathshala.com 36
  • 37. Filtering File Inputs
 FileInputFormat will read all files out of a specified directory and send them to the mapper
 Delegates filtering of this file list to a method subclasses may override
 e.g., create your own “xyzFileInputFormat” to read only *.xyz files from the directory list
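The filtering idea can be sketched as a plain predicate. In real Hadoop code this logic would live in an org.apache.hadoop.fs.PathFilter registered via FileInputFormat.setInputPathFilter; the class here is a dependency-free stand-in for the slide's hypothetical "xyzFileInputFormat":

```java
// Accept only input paths with the .xyz extension, the way a custom
// FileInputFormat's path filter might.
public class XyzPathFilter {
    public static boolean accept(String path) {
        return path.endsWith(".xyz");
    }
}
```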
  • 38. Record Readers
 Each InputFormat provides its own RecordReader implementation
 Responsible for parsing input splits into records
 Then parsing each record into a key/value pair
 LineRecordReader – Reads a line from a text file
 Used in TextInputFormat
 KeyValueRecordReader – Used by KeyValueTextInputFormat
 Custom record readers can be created by implementing RecordReader<K,V>
  • 39. Creating the Mapper
 Extends the Mapper abstract class
 Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
 protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the beginning of the task
 protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) – Called once for each key/value pair in the input split
 protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the end of the task
 void run(org.apache.hadoop.mapreduce.Mapper.Context context) – Expert users can override this method for more complete control over the execution of the Mapper
 The Context receives the output of the mapping process (via context.write)
  • 41. Mapper
 public void map(Object key, Text value, Context context)
 Key types implement WritableComparable
 Value types implement Writable
  • 42. Some useful mappers
 IdentityMapper<K,V> – Maps the input directly to the output
 InverseMapper<K,V> – Swaps the key and value
 RegexMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every regular expression match
 TokenCountMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair for each token when the input value is tokenized
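What TokenCountMapper does for one input value can be sketched without Hadoop classes (in the real mapper the pairs go to the Context as Text/LongWritable; the class name here is ours):

```java
import java.util.ArrayList;
import java.util.List;

// Dependency-free sketch of the token-count map step: split the input
// text on whitespace and emit a (token, 1) pair per token.
public class TokenCountSketch {
    public static List<String[]> map(String value) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : value.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new String[] { token, "1" });
            }
        }
        return pairs;
    }
}
```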
  • 43. Reducer
 void reduce(Text key, Iterable<IntWritable> values, Context context)
 Key types implement WritableComparable
 Value types implement Writable
  • 45. Finally: Writing The Output
 [Diagram: each Reducer writes through a RecordWriter supplied by the OutputFormat, producing one output file per reducer]
  • 46. Some useful reducers
 IdentityReducer<K,V> – Writes each input value directly to the output
 LongSumReducer<K> – Implements Reducer<K,LongWritable,K,LongWritable> and computes the sum of all values for a given key
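What LongSumReducer does for one key can also be sketched dependency-free: the framework hands the reducer every value grouped under that key, and the reducer emits the key with their sum (the class name here is ours):

```java
import java.util.List;

// Dependency-free sketch of the sum-reduce step: total all values
// grouped under a single key.
public class LongSumSketch {
    public static long reduce(List<Long> valuesForKey) {
        long sum = 0;
        for (long v : valuesForKey) {
            sum += v;
        }
        return sum;
    }
}
```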
  • 47. OutputFormat
 TextOutputFormat – Writes each record as a line of text. Keys and values are written as strings, separated by a tab (“\t”)
 SequenceFileOutputFormat – Writes keys and values in Hadoop’s proprietary sequence file format
 NullOutputFormat – Outputs nothing; use it if for any reason you want to suppress the output completely
  • 48. Apache Hadoop: Common MapReduce Algorithms
Design Pathshala, April 22, 2014
  • 50. Some handy tools
 Partitioners
 Combiners
 Compression
 Zero Reduces
 Distributed File Cache
  • 51. Partitioners
 Partitioners are application code that define how keys are assigned to reduces
 Default partitioning spreads keys evenly, but randomly
 Uses key.hashCode() % num_reduces
 Custom partitioning is often required, for example, to produce a total order in the output
 Should implement the Partitioner interface
 Set by calling conf.setPartitionerClass(MyPart.class)
 To get a total order, sample the map output keys and pick values that divide the keys into roughly equal buckets, then use those in your partitioner
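The default rule above can be sketched in plain Java. One detail the slide's formula glosses over: hashCode() can be negative, so Hadoop's HashPartitioner masks off the sign bit before taking the modulus:

```java
// Dependency-free sketch of default (hash) partitioning: mask the sign
// bit so negative hash codes still map to a valid partition, then take
// the result modulo the number of reduces.
public class HashPartitionSketch {
    public static int partition(Object key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }
}
```

Every occurrence of the same key lands in the same partition, which is what lets one reducer see all values for a key.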
  • 52. Partition And Shuffle
 [Diagram: each Mapper’s intermediates pass through a Partitioner; the shuffle then routes each partition to the corresponding Reducer]
  • 54. Combiners
 When maps produce many repeated keys
 It is often useful to do a local aggregation following the map
 Done by specifying a Combiner
 Goal is to decrease the size of the transient data
 Combiners have the same interface as Reducers, and often are the same class
 Combiners must not have side effects, because they run an indeterminate number of times
 In WordCount: conf.setCombinerClass(Reduce.class);
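The "indeterminate number of times" point can be demonstrated for a summing combiner: partially summing chunks of the values and then summing the partials gives the same total as summing everything directly, so the framework may apply the combine step zero, one, or many times without changing the result. A dependency-free sketch (class name ours):

```java
import java.util.ArrayList;
import java.util.List;

// Show that a sum "combiner" is safe to apply any number of times.
public class CombinerSketch {
    public static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    // Apply the combine step to chunks of the input, as Hadoop might
    // between map and reduce, then reduce the partial results.
    public static long sumWithCombine(List<Long> values, int chunkSize) {
        List<Long> partials = new ArrayList<>();
        for (int i = 0; i < values.size(); i += chunkSize) {
            partials.add(sum(values.subList(i, Math.min(i + chunkSize, values.size()))));
        }
        return sum(partials);
    }
}
```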
  • 55. Compression
 Compressing the outputs and intermediate data will often yield huge performance gains
 Can be specified via a configuration file or set programmatically
 Set mapred.output.compress to true to compress job output
 Set mapred.compress.map.output to true to compress map outputs
 Compression Types (mapred(.map)?.output.compression.type)
 “block” – Groups of keys and values are compressed together
 “record” – Each value is compressed individually
 Block compression is almost always best
 Compression Codecs (mapred(.map)?.output.compression.codec)
 Default (zlib) – slower, but more compression
 LZO – faster, but less compression
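The programmatic route can be sketched with the property names from the slide (the pre-Hadoop-2 spellings; newer releases use the mapreduce.* equivalents):

```java
import org.apache.hadoop.conf.Configuration;

// Configuration sketch: compress both the job output and the
// intermediate map output, with block-level compression for the output.
public class CompressionConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.output.compress", true);
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        return conf;
    }
}
```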
  • 57. Zero Reduces
 Frequently, we only need to run a filter on the input data
 No sorting or shuffling required by the job
 Set the number of reduces to 0
 Output from maps will go directly to the OutputFormat and disk
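A minimal sketch of a map-only (filter) job setup; the job name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// With zero reduce tasks there is no sort or shuffle: each map's
// output goes straight through the OutputFormat to disk.
public class MapOnlyJob {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-only");
        job.setNumReduceTasks(0);
        return job;
    }
}
```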
  • 58. Distributed File Cache
 Sometimes tasks need read-only copies of data on the local computer
 Downloading 1 GB of data for each Mapper is expensive
 Define the list of files you need to download in JobConf
 Files are downloaded once per computer
 Add to the launching program: DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf);
 Add to the task: Path[] files = DistributedCache.getLocalCacheFiles(conf);
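Putting the two calls from the slide side by side; hdfs://nn:8020/foo is the slide's example path, and the class wrapping them is ours:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Launcher side registers the file; task side looks up the local copy
// that the framework downloaded once per machine.
public class CacheUsage {
    // In the launching program:
    public static void register(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
    }

    // Inside a map or reduce task:
    public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
```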