Apache Hadoop 
Design Pathshala 
April 22, 2014 
www.designpathshala.com 1
Apache Hadoop 
Interacting with HDFS 
Design Pathshala 
April 22, 2014 
Apache Hadoop Bigdata 
Training By Design Pathshala 
Contact us on: admin@designpathshala.com 
Or Call us at: +91 120 260 5512 or +91 98 188 23045 
Visit us at: http://designpathshala.com 
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | 
http://designpathshala.com 
3
Basic file commands 
 Commands for the HDFS user: 
 hadoop fs -mkdir /foodir 
 hadoop fs -ls / 
 hadoop fs -lsr / 
 hadoop fs -put abc.txt /usr/dp 
 hadoop fs -get /usr/dp/abc.txt . 
 hadoop fs -cat /foodir/myfile.txt 
 hadoop fs -rm /foodir/myfile.txt 
Reading & Writing Programmatically 
 Package: org.apache.hadoop.fs 
 Configuration conf = new Configuration(); 
 FileSystem hdfs = FileSystem.get(conf); 
 FileSystem local = FileSystem.getLocal(conf); 
public static void main(String[] args) throws IOException { 
  Configuration conf = new Configuration(); 
  FileSystem hdfs = FileSystem.get(conf); 
  FileSystem local = FileSystem.getLocal(conf); 
  Path inputDir = new Path(args[0]); 
  Path hdfsFile = new Path(args[1]); 
  try { 
    FileStatus[] inputFiles = local.listStatus(inputDir); 
    FSDataOutputStream out = hdfs.create(hdfsFile); 
    for (int i = 0; i < inputFiles.length; i++) { 
      System.out.println(inputFiles[i].getPath().getName()); 
      FSDataInputStream in = local.open(inputFiles[i].getPath()); 
      byte[] buffer = new byte[256]; 
      int bytesRead = 0; 
      while ((bytesRead = in.read(buffer)) > 0) { 
        out.write(buffer, 0, bytesRead); 
      } 
      in.close(); 
    } 
    out.close(); 
  } catch (IOException e) { 
    e.printStackTrace(); 
  } 
} 
Apache Hadoop 
Map Reduce Basics 
Design Pathshala 
April 22, 2014 
MapReduce - Dataflow 
Map-Reduce Execution Engine 
(Example: Color Count) 
[Diagram] Input blocks on HDFS feed four Map tasks, each producing (k, v) pairs such as (color, 1). A parse-hash step routes each pair, and shuffle & sorting based on k groups them so that each of three Reduce tasks consumes (k, [v]) lists such as (color, [1,1,1,1,1,1, ...]) and produces (k’, v’) totals such as (color, 100). 
Users only provide the “Map” and “Reduce” functions 
The output file has three parts (Part0001, Part0002, Part0003), probably on three different machines 
Map <key, 1> Reducers (say, Count) 
[Diagram] Large-scale data splits feed Map tasks that emit <key, 1> pairs; parse-hash steps route the pairs to Count reducers, which write partitions P-0000 (count1), P-0001 (count2), and P-0002 (count3). 
Properties of MapReduce Engine 
(Cont’d) 
 Task Tracker is the slave node (runs on each datanode) 
 Receives the task from Job Tracker 
 Runs the task until completion (either map or reduce task) 
 Always in communication with the Job Tracker reporting progress 
[Diagram] Four Map tasks feed, via parse-hash, three Reduce tasks: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks. 
Key-Value Pairs 
 Mappers and Reducers are users’ code (provided functions) 
 Just need to obey the Key-Value pairs interface 
 Mappers: 
 Consume <key, value> pairs 
 Produce <key, value> pairs 
 Reducers: 
 Consume <key, <list of values>> 
 Produce <key, value> 
 Shuffling and Sorting: 
 Hidden phase between mappers and reducers 
 Groups all similar keys from all mappers, sorts and passes them to a certain 
reducer in the form of <key, <list of values>> 
Example 2: Color Filter 
Job: Select only the blue colors 
[Diagram] Input blocks on HDFS feed four Map tasks; each produces (k, v) pairs such as (color, 1) and writes its output directly to HDFS. 
• Each map task will select only 
the blue color 
• No need for a reduce phase 
• The output file has four parts (Part0001 to Part0004), probably on four different machines 
How does MapReduce work? 
 The run time partitions the input and provides it to 
different Map instances; 
 Map (key, value) → (key’, value’) 
 The run time collects the (key’, value’) pairs and 
distributes them to several Reduce functions so that each 
Reduce function gets the pairs with the same key’. 
 Each Reduce produces a single (or zero) file output. 
 Map and Reduce are user-written functions 
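The dataflow above can be sketched end-to-end in plain Java, with no Hadoop dependency. `MiniMapReduce` is an illustrative name, not a Hadoop class: it maps lines to (word, 1) pairs, groups them by key the way the shuffle/sort phase would, and reduces each group to a sum.

```java
import java.util.*;

// Toy in-memory simulation of the MapReduce dataflow described above.
// map() emits (word, 1) pairs; shuffle() groups values by key;
// reduce() sums each key's value list.
public class MiniMapReduce {

    // Map phase: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty())
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // Shuffle & sort: group all values for the same key, sorted by key
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: (key, [v1, v2, ...]) -> (key, sum)
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    static Map<String, Integer> wordCount(String... lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(wordCount("the quick fox", "the lazy dog the"));
    }
}
```

In a real cluster each phase runs on different machines; this sketch only mirrors the data movement, not the distribution.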
Example 3: Count Fruits 
 Job: Count the occurrences of each word in a data set 
[Diagram] Map tasks feed Reduce tasks. 
Word Count Example 
 Mapper 
 Input: value: lines of text of input 
 Output: key: word, value: 1 
 Reducer 
 Input: key: word, value: set of counts 
 Output: key: word, value: sum 
 Launching program 
 Defines this job 
 Submits job to cluster 
Example MapReduce: Mapper 

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> { 
  Text word = new Text(); 
  public void map(LongWritable key, Text value, Context context) 
      throws IOException, InterruptedException { 
    String line = value.toString(); 
    StringTokenizer tokenizer = new StringTokenizer(line); 
    while (tokenizer.hasMoreTokens()) { 
      word.set(tokenizer.nextToken().trim()); 
      context.write(word, new IntWritable(1)); 
    } 
  } 
} 
Reducer 

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { 
  private IntWritable result = new IntWritable(); 
  public void reduce(Text key, Iterable<IntWritable> values, Context context) 
      throws IOException, InterruptedException { 
    int sum = 0; 
    for (IntWritable val : values) { 
      sum += val.get(); 
    } 
    result.set(sum); 
    context.write(key, result); 
  } 
} 
Job 

public static void main(String[] args) throws Exception { 
  Configuration conf = new Configuration(); 
  // Use the new-API Job class to match the Mapper/Reducer shown above 
  Job job = Job.getInstance(conf, "wordcount"); 
  job.setJarByClass(WordCount.class); 
  job.setMapperClass(WordCountMapper.class); 
  job.setReducerClass(WordCountReducer.class); 
  job.setOutputKeyClass(Text.class); 
  job.setOutputValueClass(IntWritable.class); 
  job.setInputFormatClass(TextInputFormat.class); 
  job.setOutputFormatClass(TextOutputFormat.class); 
  FileInputFormat.setInputPaths(job, new Path(args[0])); 
  FileOutputFormat.setOutputPath(job, new Path(args[1])); 
  System.exit(job.waitForCompletion(true) ? 0 : 1); 
} 
Terminology Example 
 Running “Word Count” across 20 files is one job 
 20 input splits to be mapped imply 20 map tasks + some number of reduce 
tasks 
 At least 20 map task attempts will be performed… more if a machine 
crashes, etc. 
MapReduce - Features 
 Fine grained Map and Reduce tasks 
 Improved load balancing 
 Faster recovery from failed tasks 
 Automatic re-execution on failure 
 In a large cluster, some nodes are always slow or flaky 
 Framework re-executes failed tasks 
 Locality optimizations 
 With large data, bandwidth to data is a problem 
 Map-Reduce + HDFS is a very effective solution 
 Map-Reduce queries HDFS for locations of input data 
 Map tasks are scheduled close to the inputs when possible 
What is Writable? 
 Hadoop defines its own “box” classes for strings (Text), integers 
(IntWritable), etc. 
 All values are instances of Writable 
 All keys are instances of WritableComparable 
Hadoop Data Types 

Class           | Size in bytes | Description                                                        | Sort policy 
BooleanWritable | 1             | Wrapper for a standard boolean variable                            | false before true 
ByteWritable    | 1             | Wrapper for a single byte                                          | Ascending order 
DoubleWritable  | 8             | Wrapper for a double                                               | Ascending order 
FloatWritable   | 4             | Wrapper for a float                                                | Ascending order 
IntWritable     | 4             | Wrapper for an integer                                             | Ascending order 
LongWritable    | 8             | Wrapper for a long                                                 | Ascending order 
Text            | up to 2 GB    | Wrapper to store text in Unicode UTF-8 format                      | Alphabetic order 
NullWritable    | -             | Placeholder when the key or value is not needed                    | Undefined 
Your Writable   | -             | Implement Writable for a value, or WritableComparable<T> for a key | Your sort policy 
WritableComparable 
 Compares WritableComparable data 
 Will call compareTo method to do comparison 
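As a dependency-free sketch of this idea, a plain-Java box class can mimic IntWritable’s ascending sort policy via compareTo. `IntBox` is an illustrative name, not a Hadoop type.

```java
import java.util.*;

// Sketch of how a WritableComparable drives sort order: the framework
// calls compareTo() during the sort phase, so whatever ordering
// compareTo() defines is the order keys reach the reducer in.
public class IntBox implements Comparable<IntBox> {
    private final int value;

    public IntBox(int value) { this.value = value; }
    public int get() { return value; }

    // Ascending numeric order, like IntWritable's sort policy
    @Override
    public int compareTo(IntBox other) {
        return Integer.compare(this.value, other.value);
    }

    public static void main(String[] args) {
        List<IntBox> keys = new ArrayList<>(List.of(new IntBox(42), new IntBox(7), new IntBox(19)));
        Collections.sort(keys); // uses compareTo, like the shuffle/sort phase
        for (IntBox k : keys)
            System.out.println(k.get()); // prints 7, 19, 42 on separate lines
    }
}
```

A real WritableComparable additionally implements readFields()/write() for serialization, which this sketch omits.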
Input Split Size 
 Input splits are logical divisions of records, whereas HDFS blocks are physical 
divisions of the input data. 
 Processing is most efficient when the two coincide, but in practice they rarely align. 
 A machine processing a particular split may have to fetch a fragment of a record from a 
block other than its “main” block, and that block may reside remotely. 
 FileInputFormat will divide large files into chunks 
 Exact size controlled by mapred.min.split.size 
 RecordReaders receive the file, offset, and length of the chunk (or input split) 
Getting Data To The Mapper 
[Diagram] The InputFormat divides each input file into InputSplits; a RecordReader turns each split into records, and each Mapper consumes those records and emits intermediates. 
Reading Data 
 Data sets are specified by InputFormats 
 Defines input data (e.g., a directory) 
 Identifies partitions of the data that form an InputSplit 
 Factory for RecordReader objects to extract (k, v) records from the input source 
FileInputFormat 
 TextInputFormat – Treats each ‘\n’-terminated line of a file as a value; the key is the byte offset 
of the line 
 Key: LongWritable 
 Value: Text 
 KeyValueTextInputFormat – Each line in the text file is a record. A separator character 
divides each line: anything before the separator is the key, anything after it is the value. 
 Separator is set by the key.value.separator.in.input.line property 
 Default separator is tab (“\t”) 
 Key: Text 
 Value: Text 
 SequenceFileInputFormat – Input format for reading sequence files, Hadoop’s compressed 
binary file format. Keys and values are user defined. 
 NLineInputFormat – Same as TextInputFormat, but each split is guaranteed to have exactly N 
lines 
 N is set by the mapred.line.input.format.linespermap property 
 Key: LongWritable 
 Value: Text 
Filtering File Inputs 
 FileInputFormat will read all files out of a specified directory and send them 
to the mapper 
 Delegates filtering this file list to a method subclasses may override 
 e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list 
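The filtering hook can be approximated in plain Java: a predicate that keeps only the hypothetical *.xyz extension from the slide. Class and method names here are illustrative, not the FileInputFormat API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the file-list filtering idea: given a directory
// listing, keep only the files with the wanted extension, the way an
// "xyzFileInputFormat" subclass would filter before creating splits.
public class XyzFilter {

    static boolean accept(String fileName) {
        return fileName.endsWith(".xyz");
    }

    static List<String> filter(List<String> listing) {
        List<String> kept = new ArrayList<>();
        for (String name : listing)
            if (accept(name)) kept.add(name);
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = List.of("a.xyz", "b.txt", "c.xyz", "_logs");
        System.out.println(filter(listing)); // prints [a.xyz, c.xyz]
    }
}
```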
Record Readers 
 Each InputFormat provides its own RecordReader implementation 
 Responsible for parsing input splits into records 
 Then parsing each record into a key value pair 
 LineRecordReader – Reads a line from a text file 
 Used in TextInputFormat 
 KeyValueRecordReader – Used by KeyValueTextInputFormat 
 Custom Record Readers can be created by implementing RecordReader<K,V> 
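A dependency-free sketch of what TextInputFormat’s LineRecordReader hands to the mapper: one (byte offset, line text) record per line. `LineRecords` is an illustrative name, not a Hadoop class, and it returns records as strings for readability where the real reader yields typed (LongWritable, Text) pairs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of LineRecordReader semantics: for each line, emit the byte
// offset of the line start as the key and the line text as the value.
public class LineRecords {

    // Returns "offset<TAB>line" strings; a real RecordReader exposes
    // records via nextKeyValue()/getCurrentKey()/getCurrentValue().
    static List<String> read(String content) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            long offset = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(offset + "\t" + line);
                offset += line.length() + 1; // +1 for the '\n' terminator
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory reader
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : read("first line\nsecond line\n"))
            System.out.println(record);
    }
}
```

This offset-based keying is why TextInputFormat’s keys are LongWritable byte offsets rather than line numbers.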
Creating the Mapper 
 Extends Mapper Abstract Class 
 Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
 protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) - 
Called once at the end of the task. 
 protected void map(KEYIN key, VALUEIN value, 
org.apache.hadoop.mapreduce.Mapper.Context context) - Called once for each 
key/value pair in the input split. 
 void run(org.apache.hadoop.mapreduce.Mapper.Context context) - Expert users 
can override this method for more complete control over the execution of the 
Mapper. 
 protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) - 
Called once at the beginning of the task. 
 The Context object receives the output of the mapping process (the old API 
used OutputCollector for this role) 
Mapper 
 public void map( 
Object key, 
Text value, 
Context context) 
 Key types implement WritableComparable 
 Value types implement Writable 
Some useful mappers 
 IdentityMapper<K,V> - Maps the input directly to output 
 InverseMapper<K,V> - Reverse key value pair 
 RegexMapper<K> - Implements Mapper<K,Text,Text,LongWritable> and 
generates a (match,1) pair for every regular expression match. 
 TokenCountMapper<K> - Implements Mapper<K,Text,Text,LongWritable> and 
generates a (token,1) pair for each token as the input value is tokenized. 
Reducer 
 void reduce( 
Text key, 
Iterable<IntWritable> values, 
Context context) 
 Key types implement WritableComparable 
 Value types implement Writable 
Finally: Writing The Output 
[Diagram] Each Reducer hands its records to a RecordWriter, which writes one output file; the OutputFormat ties them together. 
Some useful reducers 
 IdentityReducer<K,V> - Maps the input directly to output 
 LongSumReducer<K> - Implements Reducer<K,LongWritable, K,LongWritable> 
and determines sum of all values corresponding to the given key. 
OutputFormat 
 TextOutputFormat – Writes each record as a line of text. Keys and values are 
written as strings, separated by a tab (“\t”) 
 SequenceFileOutputFormat – Writes keys and values in Hadoop’s proprietary 
sequence file format 
 NullOutputFormat – Outputs nothing; use it if for any reason you want to suppress 
the output completely 
Apache Hadoop 
Common MapReduce Algorithms 
Design Pathshala 
April 22, 2014 
Some handy tools 
 Partitioners 
 Combiners 
 Compression 
 Zero Reduces 
 Distributed File Cache 
Partitioners 
 Partitioners are application code that define how keys are 
assigned to reduce tasks 
 Default partitioning spreads keys evenly, but randomly 
 Uses key.hashCode() % num_reduces 
 Custom partitioning is often required, for example, to 
produce a total order in the output 
 Should implement the Partitioner interface 
 Set by calling conf.setPartitionerClass(MyPart.class) 
 To get a total order, sample the map output keys and pick values 
to divide the keys into roughly equal buckets, then use those in your 
partitioner 
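The default partitioning rule above amounts to one line of arithmetic. This dependency-free sketch mirrors the logic of Hadoop’s HashPartitioner; the class name here is illustrative.

```java
// How the default partitioner assigns a key to a reduce task:
// hash the key, clear the sign bit, and take the result modulo
// the number of reduce tasks.
public class HashPartitionDemo {

    // The & with Integer.MAX_VALUE keeps the result non-negative
    // even when hashCode() returns a negative value.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"})
            System.out.println(key + " -> reducer " + getPartition(key, 3));
    }
}
```

Every occurrence of the same key hashes to the same reducer, which is what guarantees all values for a key meet in one reduce call.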
Partition And Shuffle 
[Diagram] Each Mapper’s intermediates pass through a Partitioner; shuffling then delivers each partition’s intermediates to the appropriate Reducer. 
Combiners 
 When maps produce many repeated keys 
 It is often useful to do a local aggregation following the map 
 Done by specifying a Combiner 
 Goal is to decrease the size of the transient data 
 Combiners have the same interface as Reducers, and often are the same 
class 
 Combiners must have no side effects, because they run an indeterminate 
number of times 
 In WordCount: conf.setCombinerClass(Reduce.class); 
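Why the local aggregation shrinks the transient data can be shown in plain Java: summing (word, 1) pairs on the map side means far fewer records cross the network. `CombinerDemo` is an illustrative sketch, not the Hadoop API.

```java
import java.util.*;

// Sketch of a word-count combiner's effect: collapse repeated
// (word, 1) map outputs into (word, count) before the shuffle.
public class CombinerDemo {

    // Local aggregation, the same logic the combiner/reducer applies
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys)
            combined.merge(key, 1, Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "the", "the", "fox", "the");
        Map<String, Integer> combined = combine(emitted);
        // prints 5 records -> 2 records: {fox=1, the=4}
        System.out.println(emitted.size() + " records -> "
                + combined.size() + " records: " + combined);
    }
}
```

Because this is ordinary addition, running it zero, one, or several times per map task yields the same final sums, which is exactly the no-side-effects requirement above.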
Compression 
 Compressing the outputs and intermediate data will often yield huge 
performance gains 
 Can be specified via a configuration file or set programmatically 
 Set mapred.output.compress to true to compress job output 
 Set mapred.compress.map.output to true to compress map outputs 
 Compression Types (mapred(.map)?.output.compression.type) 
 “block” - Group of keys and values are compressed together 
 “record” - Each value is compressed individually 
 Block compression is almost always best 
 Compression Codecs (mapred(.map)?.output.compression.codec) 
 Default (zlib) - slower, but more compression 
 LZO - faster, but less compression 
Zero Reduces 
 Frequently, we only need to run a filter on the input data 
 No sorting or shuffling required by the job 
 Set the number of reduces to 0 
 Output from maps will go directly to OutputFormat and disk 
Distributed File Cache 
 Sometimes you need read-only copies of data on the local 
computer 
 Downloading 1 GB of data for each Mapper is expensive 
 Define the list of files you need to download in JobConf 
 Files are downloaded once per computer 
 Add to the launching program: 
DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf); 
 Add to the task: 
Path[] files = DistributedCache.getLocalCacheFiles(conf); 

More Related Content

Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala

  • 1. Apache Hadoop Design Pathshala April 22, 2014 www.designpathshala.com 1
  • 2. Apache Hadoop Interacting with HDFS Design Pathshala April 22, 2014 www.designpathshala.com 2
  • 3. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 3
  • 4. Basic file commands  Commads for HDFS User:  hadoop fs -mkdir /foodir  hadoop fs –ls /  hadoop fs –lsr /  hadoop fs –put abc.txt /usr/dp  hadoop fs –get /usr/dp/abc.txt .  hadoop fs -cat /foodir/myfile.txt  hadoop fs -rm /foodir/myfile.txt www.designpathshala.com 4
  • 5. Reading & Writing Programatically  org.apache.hadoop.fs  Configuration conf = new Configuration();  FileSystem hdfs = FileSystem.get(conf);  FileSystem local = FileSystem.getLocal(conf); www.designpathshala.com 5
  • 6.  public static void main(String[] args) throws IOException {  Configuration conf = new Configuration();  FileSystem hdfs = FileSystem.get(conf);  FileSystem local = FileSystem.getLocal(conf);  Path inputDir = new Path(args[0]);  Path hdfsFile = new Path(args[1]);  try {  FileStatus[] inputFiles = local.listStatus(inputDir);  FSDataOutputStream out = hdfs.create(hdfsFile); www.designpathshala.com 6
  • 7. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 7
  • 8.  for (int i = 0; i < inputFiles.length; i++) {  System.out.println(inputFiles[i].getPath().getName());  FSDataInputStream in = local.open(inputFiles[i].getPath());  byte buffer[] = new byte[256];  int bytesRead = 0;  while ((bytesRead = in.read(buffer)) > 0) {  out.write(buffer, 0, bytesRead);  }  in.close();  }  out.close();  } catch (IOException e) {  e.printStackTrace();  }  } www.designpathshala.com 8
  • 9. Apache Hadoop Map Reduce Basics Design Pathshala April 22, 2014 www.designpathshala.com 9
  • 10. MapReduce - Dataflow www.designpathshala.com 10
  • 11. Map-Reduce Execution Engine (Example: Color Count) Shuffle & Sorting based on k Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Reduce Reduce Reduce Produces (k, v) ( , 1) Map Map Map Map Input blocks on HDFS Parse-hash Parse-hash Parse-hash Parse-hash Produces(k’, v’) ( , 100) www.designpathshala.com 11 Users only provide the “Map” and “Reduce” functions Part0001 Part0002 Part0003 That’s the output file, it has 3 parts on probably 3 different machines
  • 12. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 12
  • 13. Map <key, 1> Reducers (say, Count) Count Count Count Large scale data splits Parse-hash Parse-hash Parse-hash Parse-hash P-0000 , count1 P-0001 , count2 P-0002 ,count3 www.designpathshala.com 13
  • 14. Properties of MapReduce Engine (Cont’d)  Task Tracker is the slave node (runs on each datanode)  Receives the task from Job Tracker  Runs the task until completion (either map or reduce task)  Always in communication with the Job Tracker reporting progress Reduce Reduce Reduce Map Map Map Map Parse-hash Parse-hash Parse-hash Parse-hash In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks www.designpathshala.com 14
  • 15. Key-Value Pairs  Mappers and Reducers are users’ code (provided functions)  Just need to obey the Key-Value pairs interface  Mappers:  Consume <key, value> pairs  Produce <key, value> pairs  Reducers:  Consume <key, <list of values>>  Produce <key, value>  Shuffling and Sorting:  Hidden phase between mappers and reducers  Groups all similar keys from all mappers, sorts and passes them to a certain reducer in the form of <key, <list of values>> www.designpathshala.com 15
  • 16. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 16
  • 17. Example 2: Color Filter Job: Select only the blue colorS Input blocks Produces (k, v) on HDFS ( , 1) Map Map Map Map Write to HDFS Write to HDFS Write to HDFS Write to HDFS • Each map task will select only the blue color • No need for reduce phase Part0001 Part0002 Part0003 Part0004 That’s the output file, it has 4 parts on probably 4 different machines www.designpathshala.com 17
  • 18. How does MapReduce work?  The run time partitions the input and provides it to different Map instances;  Map (key, value)  (key’, value’)  The run time collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.  Each Reduce produces a single (or zero) file output.  Map and Reduce are user written functions www.designpathshala.com 18
  • 19. Example 3: Count Fruits  Job: Count the occurrences of each word in a data set Map Tasks Reduce Tasks www.designpathshala.com 19
  • 20. Word Count Example  Mapper  Input: value: lines of text of input  Output: key: word, value: 1  Reducer  Input: key: word, value: set of counts  Output: key: word, value: sum  Launching program  Defines this job  Submits job to cluster www.designpathshala.com 20
  • 21. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 21
  • 22. Example MapReduce: Mapper public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> { Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken().trim()); context.write(word, new IntWritable(1)); }}} www.designpathshala.com 22
  • 23. Reducer public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result);}} www.designpathshala.com 23
  • 24. Job public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);} www.designpathshala.com 24
  • 25. Terminology Example  Running “Word Count” across 20 files is one job  20 input splits to be mapped imply 20 map tasks + some number of reduce tasks  At least 20 map task attempts will be performed… more if a machine crashes, etc. www.designpathshala.com 25
  • 26. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 26
  • 27. MapReduce - Features  Fine grained Map and Reduce tasks  Improved load balancing  Faster recovery from failed tasks  Automatic re-execution on failure  In a large cluster, some nodes are always slow or flaky  Framework re-executes failed tasks  Locality optimizations  With large data, bandwidth to data is a problem  Map-Reduce + HDFS is a very effective solution  Map-Reduce queries HDFS for locations of input data  Map tasks are scheduled close to the inputs when possible www.designpathshala.com 27
  • 28. What is Writable?  Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.  All values are instances of Writable  All keys are instances of WritableComparable www.designpathshala.com 28
  • 29. Hadoop Data Types Class Size in bytes Description Sort Policy BooleanWritable 1 Wrapper for a standard Boolean variable False before and true after ByteWritable 1 Wrapper for a single byte Ascending order DoubleWritable 8 Wrapper for a Double Ascending order FloatWritable 4 Wrapper for a Float Ascending order IntWritable 4 Wrapper for a Integer Ascending order LongWritable 8 Wrapper for a Long Ascending order Text 2GB Wrapper to store text using the unicode UTF8 format Alphabetic order NullWritable Placeholder when the key or value is not needed Undefined Your Writable Implement the Writable Interface for a value or WritableComparable<T> for a key Your sort policy www.designpathshala.com 29
  • 30. WritableComparable  Compares WritableComparable data  Will call compareTo method to do comparison www.designpathshala.com 30
  • 31. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 31
  • 32. Input Split Size  Input splits are logical division of records whereas HDFS blocks are physical division of the input data.  Its extremely efficient when they are same but in practice it’s never align.  Machine processing a particular split may fetch a fragment of a record from a block other than its “main” block and which may reside remotely.  FileInputFormat will divide large files into chunks  Exact size controlled by mapred.min.split.size  RecordReaders receive file, offset, and length of chunk (or input splits) www.designpathshala.com 32
  • 33. Getting Data To The Mapper Input file Input file InputSplit InputSplit InputSplit InputSplit RecordReader RecordReader RecordReader RecordReader Mapper (intermediates) Mapper (intermediates) Mapper (intermediates) Mapper (intermediates) InputFormat www.designpathshala.com 33
  • 34. Reading Data  Data sets are specified by InputFormats  Defines input data (e.g., a directory)  Identifies partitions of the data that form an InputSplit  Factory for RecordReader objects to extract (k, v) records from the input source www.designpathshala.com 34
  • 35. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 35
  • 36. FileInputFormat  TextInputFormat – Treats each ‘n’-terminated line of a file as a value & Key is the byte offset of line  Key – LongWritable  Value - Text  KeyValueTextInputFormat – Each line in the text files is a record. Separator character deivides each line. Any thing before separator is a key and after that is a value.  Separator is set by “key.value.separator.in.input.line.property”  Default separator is “t”  Key: Text  Value: Text  SequenceFileInputFormat – Input format for reading in sequence files. Key and values are user defined. These are specific compression binary file format.  NLineInputFormat – Same as TextInputFormat, but each split is guaranteed to have exactly N lines.  Mapred.line.input.format.linespermap.property  Key: LongWritable  Value: Text www.designpathshala.com 36
  • 37. Filtering File Inputs
 FileInputFormat will read all files out of a specified directory and send them to the mapper
 Delegates filtering of this file list to a method subclasses may override
 e.g., create your own “xyzFileInputFormat” to read only *.xyz files from the directory list
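The filtering idea can be sketched as a plain predicate. In real Hadoop code this logic would live in an org.apache.hadoop.fs.PathFilter registered via FileInputFormat.setInputPathFilter; the class here is a dependency-free stand-in for the slide's hypothetical "xyzFileInputFormat":

```java
// Accept only input paths with the .xyz extension, the way a custom
// FileInputFormat's path filter might.
public class XyzPathFilter {
    public static boolean accept(String path) {
        return path.endsWith(".xyz");
    }
}
```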
  • 38. Record Readers
 Each InputFormat provides its own RecordReader implementation
 Responsible for parsing input splits into records
 Then parsing each record into a key/value pair
 LineRecordReader – Reads a line from a text file
 Used in TextInputFormat
 KeyValueRecordReader – Used by KeyValueTextInputFormat
 Custom record readers can be created by implementing RecordReader<K,V>
  • 39. Creating the Mapper
 Extends the Mapper abstract class
 Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
 protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the beginning of the task
 protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) – Called once for each key/value pair in the input split
 protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) – Called once at the end of the task
 void run(org.apache.hadoop.mapreduce.Mapper.Context context) – Expert users can override this method for more complete control over the execution of the Mapper
 The Context receives the output of the mapping process (via context.write)
  • 41. Mapper
 public void map(Object key, Text value, Context context)
 Key types implement WritableComparable
 Value types implement Writable
  • 42. Some useful mappers
 IdentityMapper<K,V> – Maps the input directly to the output
 InverseMapper<K,V> – Swaps the key and value
 RegexMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every regular expression match
 TokenCountMapper<K> – Implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair for each token when the input value is tokenized
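What TokenCountMapper does for one input value can be sketched without Hadoop classes (in the real mapper the pairs go to the Context as Text/LongWritable; the class name here is ours):

```java
import java.util.ArrayList;
import java.util.List;

// Dependency-free sketch of the token-count map step: split the input
// text on whitespace and emit a (token, 1) pair per token.
public class TokenCountSketch {
    public static List<String[]> map(String value) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : value.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new String[] { token, "1" });
            }
        }
        return pairs;
    }
}
```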
  • 43. Reducer
 void reduce(Text key, Iterable<IntWritable> values, Context context)
 Key types implement WritableComparable
 Value types implement Writable
  • 45. Finally: Writing The Output
 [Diagram: each Reducer writes through a RecordWriter supplied by the OutputFormat, producing one output file per reducer]
  • 46. Some useful reducers
 IdentityReducer<K,V> – Writes each input value directly to the output
 LongSumReducer<K> – Implements Reducer<K,LongWritable,K,LongWritable> and computes the sum of all values for a given key
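What LongSumReducer does for one key can also be sketched dependency-free: the framework hands the reducer every value grouped under that key, and the reducer emits the key with their sum (the class name here is ours):

```java
import java.util.List;

// Dependency-free sketch of the sum-reduce step: total all values
// grouped under a single key.
public class LongSumSketch {
    public static long reduce(List<Long> valuesForKey) {
        long sum = 0;
        for (long v : valuesForKey) {
            sum += v;
        }
        return sum;
    }
}
```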
  • 47. OutputFormat
 TextOutputFormat – Writes each record as a line of text. Keys and values are written as strings, separated by a tab (“\t”)
 SequenceFileOutputFormat – Writes keys and values in Hadoop’s proprietary sequence file format
 NullOutputFormat – Outputs nothing; use it if for any reason you want to suppress the output completely
  • 48. Apache Hadoop: Common MapReduce Algorithms
Design Pathshala, April 22, 2014
  • 50. Some handy tools
 Partitioners
 Combiners
 Compression
 Zero Reduces
 Distributed File Cache
  • 51. Partitioners
 Partitioners are application code that define how keys are assigned to reduces
 Default partitioning spreads keys evenly, but randomly
 Uses key.hashCode() % num_reduces
 Custom partitioning is often required, for example, to produce a total order in the output
 Should implement the Partitioner interface
 Set by calling conf.setPartitionerClass(MyPart.class)
 To get a total order, sample the map output keys and pick values that divide the keys into roughly equal buckets, then use those in your partitioner
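The default rule above can be sketched in plain Java. One detail the slide's formula glosses over: hashCode() can be negative, so Hadoop's HashPartitioner masks off the sign bit before taking the modulus:

```java
// Dependency-free sketch of default (hash) partitioning: mask the sign
// bit so negative hash codes still map to a valid partition, then take
// the result modulo the number of reduces.
public class HashPartitionSketch {
    public static int partition(Object key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }
}
```

Every occurrence of the same key lands in the same partition, which is what lets one reducer see all values for a key.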
  • 52. Partition And Shuffle
 [Diagram: each Mapper’s intermediates pass through a Partitioner; the shuffle then routes each partition to the corresponding Reducer]
  • 54. Combiners
 When maps produce many repeated keys
 It is often useful to do a local aggregation following the map
 Done by specifying a Combiner
 Goal is to decrease the size of the transient data
 Combiners have the same interface as Reducers, and often are the same class
 Combiners must not have side effects, because they run an indeterminate number of times
 In WordCount: conf.setCombinerClass(Reduce.class);
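The "indeterminate number of times" point can be demonstrated for a summing combiner: partially summing chunks of the values and then summing the partials gives the same total as summing everything directly, so the framework may apply the combine step zero, one, or many times without changing the result. A dependency-free sketch (class name ours):

```java
import java.util.ArrayList;
import java.util.List;

// Show that a sum "combiner" is safe to apply any number of times.
public class CombinerSketch {
    public static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    // Apply the combine step to chunks of the input, as Hadoop might
    // between map and reduce, then reduce the partial results.
    public static long sumWithCombine(List<Long> values, int chunkSize) {
        List<Long> partials = new ArrayList<>();
        for (int i = 0; i < values.size(); i += chunkSize) {
            partials.add(sum(values.subList(i, Math.min(i + chunkSize, values.size()))));
        }
        return sum(partials);
    }
}
```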
  • 55. Compression
 Compressing the outputs and intermediate data will often yield huge performance gains
 Can be specified via a configuration file or set programmatically
 Set mapred.output.compress to true to compress job output
 Set mapred.compress.map.output to true to compress map outputs
 Compression Types (mapred(.map)?.output.compression.type)
 “block” – Groups of keys and values are compressed together
 “record” – Each value is compressed individually
 Block compression is almost always best
 Compression Codecs (mapred(.map)?.output.compression.codec)
 Default (zlib) – slower, but more compression
 LZO – faster, but less compression
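The programmatic route can be sketched with the property names from the slide (the pre-Hadoop-2 spellings; newer releases use the mapreduce.* equivalents):

```java
import org.apache.hadoop.conf.Configuration;

// Configuration sketch: compress both the job output and the
// intermediate map output, with block-level compression for the output.
public class CompressionConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.output.compress", true);
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        return conf;
    }
}
```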
  • 57. Zero Reduces
 Frequently, we only need to run a filter on the input data
 No sorting or shuffling required by the job
 Set the number of reduces to 0
 Output from maps will go directly to the OutputFormat and disk
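A minimal sketch of a map-only (filter) job setup; the job name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// With zero reduce tasks there is no sort or shuffle: each map's
// output goes straight through the OutputFormat to disk.
public class MapOnlyJob {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-only");
        job.setNumReduceTasks(0);
        return job;
    }
}
```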
  • 58. Distributed File Cache
 Sometimes tasks need read-only copies of data on the local computer
 Downloading 1 GB of data for each Mapper is expensive
 Define the list of files you need to download in JobConf
 Files are downloaded once per computer
 Add to the launching program: DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf);
 Add to the task: Path[] files = DistributedCache.getLocalCacheFiles(conf);
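Putting the two calls from the slide side by side; hdfs://nn:8020/foo is the slide's example path, and the class wrapping them is ours:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Launcher side registers the file; task side looks up the local copy
// that the framework downloaded once per machine.
public class CacheUsage {
    // In the launching program:
    public static void register(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
    }

    // Inside a map or reduce task:
    public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
```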