Er. Jay Nagar (Technology Researcher)
+91-9601957620
What is Apache Hadoop?
 Open source software framework designed for
storage and processing of large scale data on
clusters of commodity hardware
 Created by Doug Cutting and Mike Cafarella in
2005.
 Cutting named the program after his son’s toy
elephant.
Uses for Hadoop
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
Who Uses Hadoop?
The Hadoop Ecosystem
• Hadoop Common: contains the libraries and other
modules the rest of Hadoop depends on
• HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for
large scale data processing
How much data?
 Facebook
 500 TB per day
 Yahoo
 Over 170 PB
 eBay
 Over 6 PB
 Getting the data to the processors becomes the
bottleneck
• Hadoop: an open-source software framework that supports
data-intensive distributed applications, licensed under the
Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or
rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
Hadoop Framework Tools
Hadoop’s Architecture
• Distributed, with some centralization
• Most of the system's computational power and
storage lives on the cluster's main (worker) nodes
• Main nodes run a TaskTracker to accept and
execute MapReduce tasks, and a DataNode to
store needed blocks as close to the computation
as possible
• A central control node runs the NameNode to keep
track of HDFS directories and files, and the JobTracker
to dispatch compute tasks to the TaskTrackers
Hadoop’s Architecture
• Hadoop Distributed Filesystem
• Tailored to needs of MapReduce
• Targeted towards many sequential reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB by default; see the configuration sketch below)
• Location awareness of DataNodes in network
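
The replication factor and block size above are ordinary configuration knobs. A minimal sketch of overriding them from client code, assuming the standard property names ("dfs.blocksize" is the Hadoop 2.x name; Hadoop 1.x used "dfs.block.size"):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsDefaults {
        public static Configuration clientConf() {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);                // three copies of each block (the default)
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB blocks, as described above
            return conf;
        }
    }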
• Hadoop is in use at many of the organizations
that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc…
• Some examples of scale:
o Yahoo!’s Search Webmap runs on a 10,000-core
Linux cluster and powers Yahoo!
Web search
o Facebook’s Hadoop cluster hosted 100+ PB of
data (July 2012) and was growing at roughly
½ PB/day (Nov 2012)
Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory
structure of a typical FS.
• The server holding the NameNode instance is quite
crucial, as there is only one.
• Keeps a transaction log for file adds/deletes, etc.;
the log records only metadata changes, never whole
blocks or file streams
• Handles creation of more replica blocks when
necessary after a DataNode failure
Hadoop’s Architecture
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
Hadoop’s Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits the input into smaller
tasks (“Map”) and sends them to the TaskTracker
process on each node
• Each TaskTracker reports progress back to the
JobTracker, sends data onward (“Reduce”), or
requests new tasks
HDFS Basic Concepts
 HDFS is a file system written in Java, based on
Google’s GFS (Google File System)
 Provides redundant storage for massive amounts
of data
HDFS Basic Concepts
 HDFS works best with a smaller number of large
files
 Millions as opposed to billions of files
 Typically 100 MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files rather
than random access (see the sketch below)
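
As an illustration, a minimal sketch of a streaming read with the standard HDFS Java client (the file path is hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamingRead {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration()); // the default FS (HDFS when so configured)
            // Hypothetical path; HDFS favors one long sequential scan over random seeks.
            try (FSDataInputStream in = fs.open(new Path("/data/logs/part-00000"))) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    // process buffer[0..n) sequentially
                }
            }
        }
    }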
How are Files Stored
 Files are split into blocks
 Blocks are distributed across many machines at load
time
 Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple machines
 The NameNode keeps track of which blocks
make up a file and where they are stored (see the sketch below)
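
That block-to-machine mapping is queryable through the client API. A sketch (the file path is hypothetical) that asks the NameNode where each block of a file lives:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file
            // Answered by the NameNode from its block map.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
            }
        }
    }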
Data Replication
 Default replication is 3-fold
MapReduce
Distributing computation across nodes
MapReduce Overview
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
 Map
 Reduce
MapReduce Features
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers to
use
The Mapper
 Reads data as key/value pairs
 The key is often discarded
 Outputs zero or more key/value pairs
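
As a concrete sketch, the mapper half of the classic word-count example (class and variable names are illustrative, not from the deck):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input key = byte offset of the line (discarded), input value = one line of text.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // emit (word, 1) per occurrence
            }
        }
    }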
Shuffle and Sort
 Output from the mapper is sorted by key
 All values with the same key are guaranteed to
go to the same machine
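
That guarantee comes from the partitioner. Hadoop's default HashPartitioner boils down to the following (a sketch of its logic, not the exact source):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Records with equal keys get equal partition numbers, so they all
    // land on the same reducer, and therefore the same machine.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }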
The Reducer
 Called once for each unique key
 Gets a list of all values associated with a key as
input
 The reducer outputs zero or more final key/value
pairs
 Usually just one output per input key
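
The matching reducer half of the word-count sketch, again with illustrative names:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per unique word with all of its counts; emits one total per word.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // usually one output pair per key
        }
    }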
MapReduce: Word Count
[Diagram: four Mappers emit intermediate key/value pairs; four Partitioners shuffle them by key to three Reducers, which produce the final output]
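
A minimal driver wiring the two sketches above into a runnable job (paths are placeholders; Job.getInstance is the Hadoop 2.x API, older releases used new Job(conf, name)):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);   // from the mapper sketch above
            job.setReducerClass(WordCountReducer.class); // from the reducer sketch above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder paths
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }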
Overview
 NameNode
 Holds the metadata for the HDFS
 Secondary NameNode
 Performs housekeeping functions for the
NameNode
 DataNode
 Stores the actual HDFS data blocks
 JobTracker
 Manages MapReduce jobs
 TaskTracker
 Monitors individual Map and Reduce tasks
The NameNode
 Stores the HDFS file system information in an
fsimage file
 Updates to the file system (add/remove blocks)
do not change the fsimage file
 They are instead written to a log file
 When starting, the NameNode loads the fsimage
file and then applies the changes from the log file
The Secondary NameNode
 NOT a backup for the NameNode
 Periodically reads the log file and applies the
changes to the fsimage file, bringing it up to date
 Allows the NameNode to restart faster when
required
JobTracker and TaskTracker
 JobTracker
 Determines the execution plan for the job
 Assigns individual tasks
 TaskTracker
 Keeps track of the performance of an individual
mapper or reducer
Hadoop Ecosystem
Other available tools
Why do these tools exist?
 MapReduce is very powerful, but can be awkward
to master
 These tools allow programmers who are familiar
with other programming styles to take advantage
of the power of MapReduce
Other Tools
 Hive
 Hadoop processing with SQL
 Pig
 Hadoop processing with scripting
 Cascading
 Pipe and Filter processing model
 HBase
 Database model built on top of Hadoop
 Flume
 Designed for large scale data movement