Hadoop and the term 'Big Data' go hand in hand. The information explosion brought
about by cloud and distributed computing has led to growing interest in processing and
analyzing massive amounts of data, an effort that can add value to an organization and
yield valuable insights.
The current Hadoop implementation assumes that the computing nodes in a cluster are
homogeneous. Hadoop relies on its ability to move computation to the nodes rather than
migrating data between them, which could cause significant network overhead. This strategy
pays off in a homogeneous environment, but it may not be suitable in a heterogeneous one:
the time taken to process data on a slower node can significantly exceed the sum of the
network overhead and the processing time on a faster node. It is therefore worth studying
a data placement policy that distributes data according to the processing power of each
node. This project explores such a policy and notes the ramifications of the strategy by
running a few benchmark applications.
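The core idea is to give each node a share of the data proportional to its measured computing power, so faster nodes hold (and locally process) more blocks. A minimal sketch, assuming per-node computing ratios have already been measured; the node names, ratios, and block count are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: split a data set across heterogeneous nodes in
 * proportion to their computing power, so faster nodes
 * receive more blocks than slower ones.
 */
public class ProportionalPlacement {

    /** Compute each node's block share from its computing ratio. */
    static Map<String, Long> shares(Map<String, Double> computingRatios,
                                    long totalBlocks) {
        double totalPower = computingRatios.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        Map<String, Long> blocksPerNode = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : computingRatios.entrySet()) {
            blocksPerNode.put(e.getKey(),
                    Math.round(totalBlocks * e.getValue() / totalPower));
        }
        return blocksPerNode;
    }

    public static void main(String[] args) {
        // Hypothetical ratios: higher means the node completes
        // the same workload faster.
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("nodeA-fast", 3.0);
        ratios.put("nodeB-medium", 2.0);
        ratios.put("nodeC-slow", 1.0);
        // 600 blocks -> {nodeA-fast=300, nodeB-medium=200, nodeC-slow=100}
        System.out.println(shares(ratios, 600));
    }
}
```

Under this placement, the slow node processes only the fraction of data it can handle in roughly the same time as the fast node handles its larger share, reducing the straggler effect described above.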
1. Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters
Sanket Reddy Chintapalli
Advisor - Dr. Xiao Qin
18. Design
● Evaluate the default Hadoop distribution by running grep and wordcount together on all nodes
● Run the CRBalancer to balance the data across the nodes according to their computing ratios (see the sketch after this list)
● Finally, re-run the applications to note the ramifications of the data placement strategy
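The slides do not show CRBalancer's internals, but a plausible sketch of its input is deriving each node's computing ratio from its benchmark response time: a node that finishes grep or wordcount in half the time gets twice the ratio, and hence twice the data share. All node names and timings below are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: derive per-node computing ratios from measured
 * benchmark response times, normalized so the slowest node
 * has ratio 1.0.
 */
public class ComputingRatio {

    /** Ratio of each node relative to the slowest node. */
    static Map<String, Double> fromResponseTimes(Map<String, Double> secondsPerNode) {
        double slowest = secondsPerNode.values().stream()
                .mapToDouble(Double::doubleValue).max().getAsDouble();
        Map<String, Double> ratios = new LinkedHashMap<>();
        secondsPerNode.forEach((node, t) -> ratios.put(node, slowest / t));
        return ratios;
    }

    public static void main(String[] args) {
        // Hypothetical wordcount runtimes per node, in seconds.
        Map<String, Double> times = new LinkedHashMap<>();
        times.put("nodeA-fast", 100.0);
        times.put("nodeB-medium", 150.0);
        times.put("nodeC-slow", 300.0);
        // -> {nodeA-fast=3.0, nodeB-medium=2.0, nodeC-slow=1.0}
        System.out.println(fromResponseTimes(times));
    }
}
```

Feeding these ratios into a proportional placement (as sketched earlier) and then re-running the benchmarks is what allows the before/after comparison the design calls for.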