Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Created in 2005, it is designed to reliably handle large volumes of data and complex computations in a distributed fashion. The core of Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing data in parallel across large clusters. It is widely adopted by companies handling big data, such as Yahoo!, Facebook, Amazon, and Netflix.
2. What is Apache Hadoop?
An open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
Created by Doug Cutting and Mike Cafarella in 2005
Cutting named the project after his son's toy elephant
3. Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
5. The Hadoop Ecosystem
• Hadoop Common: contains libraries and other shared modules
• Hadoop HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
6. How much data?
Facebook: 500 TB per day
Yahoo: over 170 PB
eBay: over 6 PB
Getting the data to the processors becomes the bottleneck
7. Hadoop:
• An open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault tolerance
• Move computation rather than data
9. Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lie
• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and a DataNode to store needed blocks as close to the computation as possible
• A central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to the TaskTrackers
10. Hadoop’s Architecture
• Hadoop Distributed File System (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many streaming reads of large files
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB by default)
• Location awareness of DataNodes in the network
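To make the replication and block-size settings concrete, here is a minimal sketch (not part of the original deck) that writes a file through the HDFS Java API with an explicit 3x replication factor and a 64 MB block size. The path and contents are hypothetical; the FileSystem.create overload shown is part of the standard org.apache.hadoop.fs API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Create a file with an explicit replication factor and block size,
        // overriding the cluster-wide defaults for this one file.
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(path,
                true,                // overwrite if the file exists
                4096,                // I/O buffer size in bytes
                (short) 3,           // replication factor (HDFS default)
                64L * 1024 * 1024);  // 64 MB block size
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```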
11. • Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc.
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
o Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012), growing at roughly ½ PB per day (Nov 2012)
12. Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS
• The server holding the NameNode instance is quite crucial, as there is only one per cluster
• Keeps a transaction log for file deletes/adds, etc.; transactions cover only metadata, not whole blocks or file streams
• Handles creation of additional replica blocks when necessary after a DataNode failure
13. Hadoop’s Architecture
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of which blocks it holds
• The NameNode places replicas so that, by default, two copies live in one rack and the third in a different rack
16. Hadoop’s Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits work into smaller tasks (“Map”) and sends them to the TaskTracker process on each node
• Each TaskTracker reports back to the JobTracker with job progress, sends output data (“Reduce”), or requests new tasks
17. HDFS Basic Concepts
HDFS is a file system written in Java, based on Google’s GFS
Provides redundant storage for massive amounts of data
18. HDFS Basic Concepts
HDFS works best with a smaller number of large files
Millions, as opposed to billions, of files
Typically 100 MB or more per file
Files in HDFS are write-once
Optimized for streaming reads of large files, not random reads
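As an illustrative sketch of that streaming access pattern, the snippet below opens a (hypothetical) HDFS file and copies it sequentially to stdout; a single front-to-back pass like this is exactly what HDFS is optimized for, while random seeks within the file are not.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Stream the whole file sequentially -- the access
        // pattern HDFS is designed around.
        InputStream in = fs.open(new Path("/user/demo/sample.txt")); // hypothetical path
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```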
19. How are Files Stored
Files are split into blocks
Blocks are distributed across many machines at load time
Different blocks from the same file will be stored on different machines
Blocks are replicated across multiple machines
The NameNode keeps track of which blocks make up a file and where they are stored
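To see this block bookkeeping from a client's point of view, the following sketch (hypothetical path, standard FileSystem API) asks the NameNode which blocks make up a file and which DataNodes hold each replica:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // hypothetical

        // One BlockLocation per block: its offset in the file, its length,
        // and the hosts (DataNodes) that store a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```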
22. MapReduce Overview
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases: Map and Reduce
23. MapReduce Features
Automatic parallelization and distribution
Fault tolerance
Provides a clean abstraction for programmers to use
24. The Mapper
Reads data as key/value pairs
The key is often discarded
Outputs zero or more key/value pairs
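As an illustration, here is the canonical word-count mapper written against the org.apache.hadoop.mapreduce API; note how the input key (the line's byte offset in the file) is simply discarded, while zero or more (word, 1) pairs are emitted per line:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line (discarded); value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}
```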
25. Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to the same machine
26. The Reducer
Called once for each unique key
Receives a list of all values associated with that key as input
Outputs zero or more final key/value pairs
Usually just one output per input key
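The matching word-count reducer: it is invoked once per unique word with the full list of that word's counts, and emits a single (word, total) pair:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum every 1 emitted by the mappers for this word...
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // ...and emit exactly one output pair for this key.
        total.set(sum);
        context.write(key, total);
    }
}
```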
29. Overview
NameNode: holds the metadata for HDFS
Secondary NameNode: performs housekeeping functions for the NameNode
DataNode: stores the actual HDFS data blocks
JobTracker: manages MapReduce jobs
TaskTracker: monitors individual Map and Reduce tasks
30. The NameNode
Stores the HDFS file system metadata in an fsimage file
Updates to the file system (block adds/removes) do not change the fsimage file directly
They are instead written to an edit log
When starting, the NameNode loads the fsimage file and then applies the changes from the edit log
31. The Secondary NameNode
NOT a backup for the NameNode
Periodically reads the edit log and applies its changes to the fsimage file, bringing it up to date
Allows the NameNode to restart faster when required
32. JobTracker and TaskTracker
JobTracker
Determines the execution plan for the job
Assigns individual tasks
TaskTracker
Keeps track of the performance of an individual mapper or reducer
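For completeness, a minimal driver sketch (using the org.apache.hadoop.mapreduce Job API) that wires the word-count mapper and reducer from the earlier slides into a job and submits it for execution, dispatched by the JobTracker in Hadoop 1.x or by YARN in later versions. Input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block until the job finishes; the framework handles
        // splitting the input, scheduling tasks, and retrying failures.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```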
34. Why do these tools exist?
MapReduce is very powerful, but can be awkward to master
These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce
35. Other Tools
Hive: Hadoop processing with SQL
Pig: Hadoop processing with a scripting language
Cascading: pipe-and-filter processing model
HBase: database model built on top of Hadoop
Flume: designed for large-scale data movement