This document provides an overview of MapReduce, including:
- MapReduce is a programming model for processing large datasets in parallel across clusters of computers.
- It works by breaking the processing into map and reduce functions that can be run on many machines.
- Examples include word counting, distributed grep, and analyzing web server logs.
10. Hadoop principles and MapReduce
- HDFS: fault-tolerant, high-bandwidth clustered storage
- Automatically and transparently routes around failure
- Master (NameNode) – slave (DataNode) architecture
- Speculatively executes redundant tasks if certain nodes are detected to be slow
- Moves compute to the data: lower latency, lower bandwidth use
12. MapReduce
- Patented by Google: a "programming model… for processing and generating large data sets"
- Allows such programs to be "automatically parallelized and executed on a large cluster"
- Works with structured and unstructured data
- The Map function processes a key/value pair to generate a set of intermediate key/value pairs
- The Reduce function merges all intermediate values associated with the same intermediate key
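The model is small enough to simulate in a few lines. Below is a minimal single-process Python sketch of the execution flow (map, then a shuffle that groups intermediate pairs by key, then reduce); run_mapreduce and its signature are illustrative inventions, not the Google or Hadoop API. The examples on the following slides can be plugged into it, and a short demo appears after the reverse web-link example.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every (key, value) input pair.
        # Shuffle phase: group intermediate values by intermediate key.
        groups = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        # Reduce phase: merge all values that share an intermediate key.
        for ikey, ivalues in sorted(groups.items()):
            yield from reduce_fn(ikey, ivalues)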
14. Example: count word occurrences

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
15. Example: distributed grep

map(String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern):
      EmitIntermediate(key, line);

reduce(String key, Iterator values):
  // key: document name
  // values: a list of matching lines
  for each v in values:
    Emit(v);
16. Example: URL access frequency

map(String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");

reduce(String key, Iterator values):
  // key: a URL
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
17. Example: reverse web-link graph

map(String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);

reduce(String key, Iterator values):
  // key: a target link
  // values: a list of source documents
  source_list = new List;
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
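For concreteness, here are two of the examples above (word count and the reverse web-link graph) translated into Python and run through the run_mapreduce sketch given after the MapReduce slide; the documents and link lists are toy data, and every name is illustrative.

    def wc_map(doc, contents):            # count word occurrences
        for word in contents.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))

    def link_map(doc, links):             # reverse web-link graph
        for target in links:
            yield (target, doc)

    def link_reduce(target, sources):
        yield (target, sorted(sources))

    docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
    print(list(run_mapreduce(docs, wc_map, wc_reduce)))
    # [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]

    pages = [("d1", ["d2", "d3"]), ("d2", ["d3"])]
    print(list(run_mapreduce(pages, link_map, link_reduce)))
    # [('d2', ['d1']), ('d3', ['d1', 'd2'])]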
19. Execution Optimization
- Locality: network bandwidth is scarce, so compute on local copies of the data, which HDFS distributes across the cluster
- Task granularity: the ratio of map (M) to reduce (R) tasks. Ideally M and R are much larger than the number of machines in the cluster; typically M is chosen so there is one task per 64 MB input block, and R is a small multiple of the machine count. E.g. M = 200,000 and R = 5,000 on 2,000 machines (see the arithmetic sketch below)
- Backup tasks: speculatively re-execute 'straggling' workers
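A back-of-envelope check of the numbers in that example (a sketch; the input size is derived from the slide's figures, not stated on it):

    BLOCK = 64 * 2**20                 # one map task per 64 MB input block
    M, R, MACHINES = 200_000, 5_000, 2_000
    print(M * BLOCK / 10**12)          # ~13.4 TB of input implied by M map tasks
    print(M / MACHINES, R / MACHINES)  # 100.0 map and 2.5 reduce tasks per machine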
20. Refinements
- Partitioning function: decides how intermediate results are distributed among the R reduce workers. Default: hash(key) mod R. E.g. hash(Hostname(URL)) mod R sends all URLs from the same host to the same reduce worker
- Combiner function: partial merging of map output before the reduce step, to save bandwidth. E.g. the many <the, 1> pairs produced in word counting
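A minimal sketch of both refinements, assuming word-count-style <word, count> pairs; R, the helper names, and the use of CRC32 as a stable stand-in for the framework's hash function are all illustrative:

    import zlib
    from collections import Counter

    R = 4  # number of reduce workers

    def partition(key):
        # Default partitioning: a stable hash of the key, mod R,
        # picks the reduce worker that receives this pair.
        return zlib.crc32(key.encode()) % R

    def combine(pairs):
        # Partial merge on the map worker: many ('the', 1) pairs
        # leave the machine as a single ('the', n) pair.
        merged = Counter()
        for key, count in pairs:
            merged[key] += count
        return sorted(merged.items())

    for key, count in combine([("the", 1), ("fox", 1), ("the", 1)]):
        print(partition(key), key, count)  # reduce worker index, key, merged count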
21. Refinements
- Ordering: process values in key order, so each partition produces ordered results
- Strong typing: strongly type input/output values
- Skip bad records: skip records that consistently cause failures
- Counters: shared counters that may be updated from any map or reduce worker
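A hedged sketch of the last two refinements; the retry threshold and every name here are invented for illustration (in the real system the master tracks failing records across task attempts and tells workers to skip them):

    from collections import Counter

    counters = Counter()  # shared counters, updatable from any worker

    def apply_map(map_fn, key, value, max_attempts=2):
        # Retry a record a bounded number of times, then skip it
        # rather than letting one bad record fail the whole job.
        for _ in range(max_attempts):
            try:
                return list(map_fn(key, value))
            except Exception:
                counters["record_failures"] += 1
        counters["records_skipped"] += 1
        return []

    def bad_map(key, value):
        raise ValueError("consistently failing record")

    print(apply_map(bad_map, "k", "v"))  # [] -- the record was skipped
    print(dict(counters))  # {'record_failures': 2, 'records_skipped': 1}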
Editor's Notes
Hadoop takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate DataNodes. A typical Hadoop node has eight cores, 16 GB of RAM, and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.
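A quick arithmetic sketch of what those defaults imply for storage; the 1 GiB file size is an assumption picked for illustration:

    import math

    block = 128 * 2**20      # block size (the older default was 64 MB)
    replication = 3          # each block is stored on 3 separate DataNodes
    file_size = 1 * 2**30    # a hypothetical 1 GiB write-once file

    blocks = math.ceil(file_size / block)
    print(blocks, blocks * replication)  # 8 blocks, 24 replicas cluster-wide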