MapReduce
Rahul Agarwal
irahul.com
Attributions: Dean, Ghemawat: http://labs.google.com/papers/mapreduce.html
Agenda
Hadoop Principles
MapReduce
Programming model
Examples
Execution
Refinements
Q&A
Hadoop principles and MapReduce
HDFS: fault-tolerant, high-bandwidth, clustered storage
Automatically and transparently route around failure
Master (NameNode) – slave architecture
Speculatively execute redundant tasks if certain nodes are detected to be slow
Move compute to data: lower latency, lower bandwidth
HDFS: Hadoop Distributed File System
Block size = 64 MB
Replication factor = 3
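These defaults normally come from the cluster's hdfs-site.xml, but the HDFS client API can also set them per file. The sketch below is illustrative only; the path and values are placeholder choices, not part of the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaultsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Override the cluster defaults (64 MB blocks, 3 replicas) for a single file.
    FSDataOutputStream out = fs.create(
        new Path("/tmp/example.dat"),   // hypothetical path
        true,                           // overwrite if it already exists
        4096,                           // I/O buffer size in bytes
        (short) 3,                      // replication factor
        64L * 1024 * 1024);             // block size: 64 MB
    out.writeUTF("hello HDFS");
    out.close();
  }
}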
MapReduce
Patented by Google: “programming model… for processing and generating large data sets”
Allows such programs to be “automatically parallelized and executed on a large cluster”
Works with structured and unstructured data
Map function processes a key/value pair to generate a set of intermediate key/value pairs
Reduce function merges all intermediate values with the same intermediate key
MapReduce
map (in_key, in_value) -> list(intermediate_key, intermediate_value)
reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
Example: count word occurrences

map (String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce (String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
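The pseudocode above mirrors the Google paper. As a rough sketch of the same job against Hadoop's Java MapReduce API (assuming the org.apache.hadoop.mapreduce classes and TextInputFormat, which feeds each mapper a byte offset plus one line of text), it might look like this; class names are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (offset, line) -> list(<word, 1>)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // EmitIntermediate(w, "1")
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> <word, total count>
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                  // result += ParseInt(v)
      }
      result.set(sum);
      context.write(key, result);          // Emit(AsString(result))
    }
  }
}

The framework sorts and groups the intermediate <word, 1> pairs by key between the two phases, so the reducer sees all counts for one word at once.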
Example: distributed grep

map (String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern):
      EmitIntermediate(key, line);

reduce (String key, Iterator values):
  // key: document name
  // values: a list of matching lines
  for each v in values:
    Emit(v);
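A comparable Hadoop mapper is sketched below; grep works well as a map-only job (zero reduce tasks), so matching lines go straight to the output files. The grep.pattern configuration key is a made-up name for this sketch:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every input line that matches a regular expression.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "grep.pattern" is a hypothetical key; "ERROR" is just a default for the sketch.
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(line.toString()).find()) {
      context.write(line, NullWritable.get());
    }
  }
}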

Example: URL access frequency

map (String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");

reduce (String key, Iterator values):
  // key: URL
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Example: reverse web-link graph

map (String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);

reduce (String key, Iterator values):
  // key: a target link
  // values: a list of sources
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
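As a hedged sketch of the reduce side in Hadoop's Java API, assuming the map phase emits <target link, source document> pairs as Text, the reducer simply concatenates the sources for each target (class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each target URL, collect every source page that links to it.
public class LinkSourcesReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text target, Iterable<Text> sources, Context context)
      throws IOException, InterruptedException {
    StringBuilder sourceList = new StringBuilder();
    for (Text source : sources) {
      if (sourceList.length() > 0) {
        sourceList.append(',');          // comma-separated list of sources
      }
      sourceList.append(source.toString());
    }
    context.write(target, new Text(sourceList.toString()));
  }
}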
Execution Optimization
Locality
  Network bandwidth is scarce
  Compute on local copies, which are distributed by HDFS
Task granularity
  Ratio of Map (M) to Reduce (R) workers
  Ideally M and R should be much larger than the number of machines in the cluster
  Typically M is chosen so that there is one task per 64 MB block; R is a small multiple of the machine count
  E.g. 200,000 M and 5,000 R for 2,000 machines
Backup tasks
  Re-execute the work of 'straggling' workers
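In Hadoop, R is set when the job is configured, and the backup-task idea corresponds to speculative execution. A small sketch follows; the speculative-execution property name varies by Hadoop version, and the job shown is not fully wired up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Speculatively re-run slow map tasks; older releases used
    // "mapred.map.tasks.speculative.execution" for the same switch.
    conf.setBoolean("mapreduce.map.speculative", true);

    Job job = Job.getInstance(conf, "word count");
    // R: keep the number of reduce tasks a small multiple of the cluster size.
    job.setNumReduceTasks(32);
    // ... set mapper/reducer classes, input/output paths, then job.waitForCompletion(true)
  }
}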
Refinements
Partitioning function
  How to distribute the intermediate results to Reduce workers?
  Default: hash(key) mod R
  E.g. hash(Hostname(URL)) mod R
Combiner function
  Partial merging of data before the reduce step
  Saves bandwidth
  E.g. lots of <the, 1> pairs in word counting
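A hedged Java sketch of both refinements: a custom Partitioner that hashes on the URL's hostname, plus reusing the summing reducer as a combiner. Class names here are illustrative, not from the slides.

import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every URL from the same host to the same reduce worker,
// i.e. hash(Hostname(URL)) mod R.
public class HostPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text url, IntWritable count, int numReduceTasks) {
    String host = null;
    try {
      host = new URI(url.toString()).getHost();
    } catch (URISyntaxException e) {
      // fall through and use the raw key below
    }
    if (host == null) {
      host = url.toString();
    }
    // Mask the sign bit so the partition index is non-negative.
    return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Wiring it into a job would look like job.setPartitionerClass(HostPartitioner.class) and job.setCombinerClass(IntSumReducer.class); the combiner pre-sums the many <the, 1> pairs on the map side before they cross the network.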
Refinements
Ordering
  Process values and produce ordered results
Strong typing
  Strongly typed input/output values
Skip bad records
  Skip data that consistently causes failures
Counters
  Shared counters that may be updated from any map/reduce worker
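Shared counters map directly onto Hadoop's Counter API. Below is a small sketch (class and counter names are assumptions) of a mapper that counts, rather than fails on, empty input lines:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.isEmpty()) {
      // Counters are aggregated by the framework across all map/reduce workers.
      context.getCounter("Quality", "EMPTY_LINES").increment(1);
      return;
    }
    context.write(new Text(line), new IntWritable(1));
  }
}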

Editor's Notes

  1. HDFS takes care of:
     Details of data partitioning
     Scheduling program execution
     Handling machine failures
     Handling inter-machine communication and data transfers
  2. Pool commodity servers in a single hierarchical namespace.
     Designed for large files that are written once and read many times.
     The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes.
     A typical Hadoop node has eight cores, 16 GB of RAM, and four 1 TB SATA disks.
     The default block size is 64 MB, though most folks now set it to 128 MB.