Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Created in 2005, it is designed to reliably handle large volumes of data and complex computations in a distributed fashion. The core of Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing data in parallel across large clusters. It is widely adopted by companies handling big data, such as Yahoo!, Facebook, Amazon, and Netflix.
2. What is Apache Hadoop?
An open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
Created by Doug Cutting and Mike Cafarella in 2005
Cutting named the project after his son's toy elephant
3. Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
5. The Hadoop Ecosystem
• Hadoop Common: contains libraries and other shared modules
• Hadoop HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
6. How much data?
Facebook: 500 TB per day
Yahoo: over 170 PB
eBay: over 6 PB
Getting the data to the processors becomes the bottleneck
7. Hadoop:
• An open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault tolerance
• Move computation rather than data
9. Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lie
• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and a DataNode to store needed blocks as close to the computation as possible
• A central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to the TaskTrackers
10. Hadoop’s Architecture
• Hadoop Distributed File System (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many streaming reads of large files
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB by default)
• Location awareness of DataNodes in the network
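To make the replication and block-size settings concrete, here is a minimal sketch (not part of the original deck) that writes a file through the HDFS Java API with an explicit 3x replication factor and a 64 MB block size. The path and contents are hypothetical; the FileSystem.create overload shown is part of the standard org.apache.hadoop.fs API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Create a file with an explicit replication factor and block size,
        // overriding the cluster-wide defaults for this one file.
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(path,
                true,                // overwrite if the file exists
                4096,                // I/O buffer size in bytes
                (short) 3,           // replication factor (HDFS default)
                64L * 1024 * 1024);  // 64 MB block size
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```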
11. • Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc.
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
o Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012), growing at roughly ½ PB per day (Nov 2012)
12. Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS
• The server holding the NameNode instance is quite crucial, as there is only one per cluster
• Keeps a transaction log for file deletes/adds, etc.; transactions cover only metadata, not whole blocks or file streams
• Handles creation of additional replica blocks when necessary after a DataNode failure
13. Hadoop’s Architecture
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of which blocks it holds
• The NameNode places replicas so that, by default, two copies live in one rack and the third in a different rack
16. Hadoop’s Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits work into smaller tasks (“Map”) and sends them to the TaskTracker process on each node
• Each TaskTracker reports back to the JobTracker with job progress, sends output data (“Reduce”), or requests new tasks
17. HDFS Basic Concepts
HDFS is a file system written in Java, based on Google’s GFS
Provides redundant storage for massive amounts of data
18. HDFS Basic Concepts
HDFS works best with a smaller number of large files
Millions, as opposed to billions, of files
Typically 100 MB or more per file
Files in HDFS are write-once
Optimized for streaming reads of large files, not random reads
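As an illustrative sketch of that streaming access pattern, the snippet below opens a (hypothetical) HDFS file and copies it sequentially to stdout; a single front-to-back pass like this is exactly what HDFS is optimized for, while random seeks within the file are not.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Stream the whole file sequentially -- the access
        // pattern HDFS is designed around.
        InputStream in = fs.open(new Path("/user/demo/sample.txt")); // hypothetical path
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```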
19. How are Files Stored
Files are split into blocks
Blocks are distributed across many machines at load time
Different blocks from the same file will be stored on different machines
Blocks are replicated across multiple machines
The NameNode keeps track of which blocks make up a file and where they are stored
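To see this block bookkeeping from a client's point of view, the following sketch (hypothetical path, standard FileSystem API) asks the NameNode which blocks make up a file and which DataNodes hold each replica:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // hypothetical

        // One BlockLocation per block: its offset in the file, its length,
        // and the hosts (DataNodes) that store a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```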
22. MapReduce Overview
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases: Map and Reduce
23. MapReduce Features
Automatic parallelization and distribution
Fault tolerance
Provides a clean abstraction for programmers to use
24. The Mapper
Reads data as key/value pairs
The key is often discarded
Outputs zero or more key/value pairs
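As an illustration, here is the canonical word-count mapper written against the org.apache.hadoop.mapreduce API; note how the input key (the line's byte offset in the file) is simply discarded, while zero or more (word, 1) pairs are emitted per line:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line (discarded); value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}
```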
25. Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to the same machine
26. The Reducer
Called once for each unique key
Receives a list of all values associated with that key as input
Outputs zero or more final key/value pairs
Usually just one output per input key
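The matching word-count reducer: it is invoked once per unique word with the full list of that word's counts, and emits a single (word, total) pair:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum every 1 emitted by the mappers for this word...
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // ...and emit exactly one output pair for this key.
        total.set(sum);
        context.write(key, total);
    }
}
```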
29. Overview
NameNode: holds the metadata for HDFS
Secondary NameNode: performs housekeeping functions for the NameNode
DataNode: stores the actual HDFS data blocks
JobTracker: manages MapReduce jobs
TaskTracker: monitors individual Map and Reduce tasks
30. The NameNode
Stores the HDFS file system metadata in an fsimage file
Updates to the file system (block adds/removes) do not change the fsimage file directly
They are instead written to an edit log
When starting, the NameNode loads the fsimage file and then applies the changes from the edit log
31. The Secondary NameNode
NOT a backup for the NameNode
Periodically reads the edit log and applies its changes to the fsimage file, bringing it up to date
Allows the NameNode to restart faster when required
32. JobTracker and TaskTracker
JobTracker
Determines the execution plan for the job
Assigns individual tasks
TaskTracker
Keeps track of the performance of an individual mapper or reducer
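For completeness, a minimal driver sketch (using the org.apache.hadoop.mapreduce Job API) that wires the word-count mapper and reducer from the earlier slides into a job and submits it for execution, dispatched by the JobTracker in Hadoop 1.x or by YARN in later versions. Input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block until the job finishes; the framework handles
        // splitting the input, scheduling tasks, and retrying failures.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```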
34. Why do these tools exist?
MapReduce is very powerful, but can be awkward to master
These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce
35. Other Tools
Hive: Hadoop processing with SQL
Pig: Hadoop processing with a scripting language
Cascading: pipe-and-filter processing model
HBase: database model built on top of Hadoop
Flume: designed for large-scale data movement