This document provides an overview of MapReduce, including:
- MapReduce is a programming model for processing large datasets in parallel across clusters of computers.
- It works by breaking the processing into map and reduce functions that can be run on many machines.
- Examples include word counting, distributed grep, and analyzing web server logs.
10. Hadoop principles and MapReduce
- HDFS: fault-tolerant, high-bandwidth clustered storage
- Automatically and transparently routes around failure
- Master (NameNode) – slave (DataNode) architecture
- Speculatively executes redundant tasks if certain nodes are detected to be slow
- Moves compute to the data: lower latency, lower bandwidth use
12. MapReduce
- Patented by Google: a "programming model… for processing and generating large data sets"
- Allows such programs to be "automatically parallelized and executed on a large cluster"
- Works with structured and unstructured data
- The Map function processes a key/value pair to generate a set of intermediate key/value pairs
- The Reduce function merges all intermediate values associated with the same intermediate key
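The model is small enough to simulate in a few lines. Below is a minimal single-process Python sketch of the execution flow (map, then a shuffle that groups intermediate pairs by key, then reduce); run_mapreduce and its signature are illustrative inventions, not the Google or Hadoop API. The examples on the following slides can be plugged into it, and a short demo appears after the reverse web-link example.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every (key, value) input pair.
        # Shuffle phase: group intermediate values by intermediate key.
        groups = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        # Reduce phase: merge all values that share an intermediate key.
        for ikey, ivalues in sorted(groups.items()):
            yield from reduce_fn(ikey, ivalues)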
14. Example: count word occurrences

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
15. Example: distributed grep

map(String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern):
      EmitIntermediate(key, line);

reduce(String key, Iterator values):
  // key: document name
  // values: a list of matching lines
  for each v in values:
    Emit(v);
16. Example: URL access frequency

map(String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");

reduce(String key, Iterator values):
  // key: a URL
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
17. Example: reverse web-link graph

map(String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);

reduce(String key, Iterator values):
  // key: a target link
  // values: a list of source documents
  source_list = new List;
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
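For concreteness, here are two of the examples above (word count and the reverse web-link graph) translated into Python and run through the run_mapreduce sketch given after the MapReduce slide; the documents and link lists are toy data, and every name is illustrative.

    def wc_map(doc, contents):            # count word occurrences
        for word in contents.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))

    def link_map(doc, links):             # reverse web-link graph
        for target in links:
            yield (target, doc)

    def link_reduce(target, sources):
        yield (target, sorted(sources))

    docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
    print(list(run_mapreduce(docs, wc_map, wc_reduce)))
    # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]

    pages = [("d1", ["d2", "d3"]), ("d2", ["d3"])]
    print(list(run_mapreduce(pages, link_map, link_reduce)))
    # [('d2', ['d1']), ('d3', ['d1', 'd2'])]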
19. Execution Optimization
- Locality: network bandwidth is scarce, so compute on local copies of the data, which HDFS distributes across the cluster
- Task granularity: the ratio of map (M) to reduce (R) tasks. Ideally M and R are much larger than the number of machines in the cluster; typically M is chosen so there is one task per 64 MB input block, and R is a small multiple of the machine count. E.g. M = 200,000 and R = 5,000 on 2,000 machines (see the arithmetic sketch below)
- Backup tasks: speculatively re-execute 'straggling' workers
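A back-of-envelope check of the numbers in that example (a sketch; the input size is derived from the slide's figures, not stated on it):

    BLOCK = 64 * 2**20                 # one map task per 64 MB input block
    M, R, MACHINES = 200_000, 5_000, 2_000
    print(M * BLOCK / 10**12)          # ~13.4 TB of input implied by M map tasks
    print(M / MACHINES, R / MACHINES)  # 100.0 map and 2.5 reduce tasks per machine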
20. Refinements
- Partitioning function: decides how intermediate results are distributed among the R reduce workers. Default: hash(key) mod R. E.g. hash(Hostname(URL)) mod R sends all URLs from the same host to the same reduce worker
- Combiner function: partial merging of map output before the reduce step, to save bandwidth. E.g. the many <the, 1> pairs produced in word counting
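A minimal sketch of both refinements, assuming word-count-style <word, count> pairs; R, the helper names, and the use of CRC32 as a stable stand-in for the framework's hash function are all illustrative:

    import zlib
    from collections import Counter

    R = 4  # number of reduce workers

    def partition(key):
        # Default partitioning: a stable hash of the key, mod R,
        # picks the reduce worker that receives this pair.
        return zlib.crc32(key.encode()) % R

    def combine(pairs):
        # Partial merge on the map worker: many ('the', 1) pairs
        # leave the machine as a single ('the', n) pair.
        merged = Counter()
        for key, count in pairs:
            merged[key] += count
        return sorted(merged.items())

    for key, count in combine([("the", 1), ("fox", 1), ("the", 1)]):
        print(partition(key), key, count)  # reduce worker index, key, merged count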
21. Refinements
- Ordering: process values in key order, so each partition produces ordered results
- Strong typing: strongly type input/output values
- Skip bad records: skip records that consistently cause failures
- Counters: shared counters that may be updated from any map or reduce worker
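A hedged sketch of the last two refinements; the retry threshold and every name here are invented for illustration (in the real system the master tracks failing records across task attempts and tells workers to skip them):

    from collections import Counter

    counters = Counter()  # shared counters, updatable from any worker

    def apply_map(map_fn, key, value, max_attempts=2):
        # Retry a record a bounded number of times, then skip it
        # rather than letting one bad record fail the whole job.
        for _ in range(max_attempts):
            try:
                return list(map_fn(key, value))
            except Exception:
                counters["record_failures"] += 1
        counters["records_skipped"] += 1
        return []

    def bad_map(key, value):
        raise ValueError("consistently failing record")

    print(apply_map(bad_map, "k", "v"))  # [] -- the record was skipped
    print(dict(counters))  # {'record_failures': 2, 'records_skipped': 1}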
Editor's Notes
Hadoop takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate DataNodes. A typical Hadoop node has eight cores, 16 GB of RAM, and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.
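A quick arithmetic sketch of what those defaults imply for storage; the 1 GiB file size is an assumption picked for illustration:

    import math

    block = 128 * 2**20      # block size (the older default was 64 MB)
    replication = 3          # each block is stored on 3 separate DataNodes
    file_size = 1 * 2**30    # a hypothetical 1 GiB write-once file

    blocks = math.ceil(file_size / block)
    print(blocks, blocks * replication)  # 8 blocks, 24 replicas cluster-wide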