VOLUME 9, ISSUE 3, MAY 2016    ISSN 2347-8047
INTERNATIONAL JOURNAL OF COGNITIVE SCIENCE,
ENGINEERING AND TECHNOLOGY
Available online at: http://airnetjournal.org/
A Survey on Data Mapping Strategy for Data
Stored in the Storage Cloud
Navneet Kumar, Naseeruddin V N, Murali Krishna V, S K Manu, Swathi K
Department of CSE, K S Institute of Technology, Bengaluru, India
Abstract— In the recent past, the volume of data processed over the internet has been increasing exponentially, so it is difficult to store such huge amounts of data and computationally inefficient to analyze them. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations over massive amounts of data to be expressed concisely, and it is designed for large-scale data processing on clusters of commodity hardware. As a prominent parallel data processing tool, MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. In this paper we propose a method for processing huge amounts of data over the internet: the data to be processed is stored in a storage cloud and processed in a Hadoop multi-cluster environment.
Keywords— Storage Cloud, Hadoop cluster, Hadoop,
Distributed File System, Parallel Processing, MapReduce
I. Introduction
Analyzing big data is a very challenging problem. For the effective handling of such massive data and applications, the MapReduce framework has come into wide focus. Over the last few years, MapReduce has emerged as the most popular computing paradigm for parallel, batch-style analysis of large amounts of data, and it is used in many areas where massive data analysis is required. The number of applications that handle big data keeps growing, but handling such huge collections of data remains a challenging problem today. MapReduce, and its open-source equivalent Hadoop, is a powerful tool for building such applications. Data-intensive processing is fast becoming a necessity for handling large databases efficiently, and the algorithms involved must be capable of scaling to real-world datasets. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations on massive amounts of data to be expressed concisely, and it is designed for large-scale data processing on clusters of commodity hardware. MapReduce is used in areas where the volume of data to analyze grows rapidly. Despite these abilities, there is still debate about its performance, effectiveness, and conceptual simplicity. Given the present outburst of data, parallel processing is essential for handling such massive volumes in a timely manner. MapReduce gained its popularity when it was used successfully by Google. It is a scalable and fault-tolerant data processing tool that can process huge volumes of data in parallel across many low-end computing nodes. By virtue of its simplicity, scalability, and fault tolerance, MapReduce is becoming ubiquitous, gaining significant momentum from both industry and academia. However, MapReduce has inherent limitations on its performance and efficiency, and many studies have endeavoured to overcome these limitations. The goal of this analysis is to provide a timely remark on the status of MapReduce studies and related work, focusing on current research aimed at improving and enhancing the MapReduce framework, and to assist the database community in understanding its various technical aspects. In this paper, we focus on the working of the MapReduce framework and examine its built-in advantages and drawbacks. We then introduce applications and effective ways to improve its properties so that optimized results can be obtained, and we highlight the issues and challenges raised against MapReduce. MapReduce is well known for its simplicity, effectiveness, and capability to handle “Big Data” in a timely manner; even with all these valuable features, it still has some limitations that need to be sorted out.
II. Working
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks [3]. The MapReduce paradigm of parallel programming provides simplicity while at the same time offering load balancing and fault tolerance. The Google File System (GFS) that typically underlies a MapReduce system provides the efficient and reliable distributed data storage needed for applications involving large databases [10]. MapReduce is inspired by the map and reduce primitives present in functional languages. Various implementations of the MapReduce interface are possible, depending on the desired context. Some currently available implementations target shared-memory multi-core systems [11][12], asymmetric multi-core processors [13], graphics processors, and clusters of networked machines [4]. The most popular implementation is probably the one introduced by Google, which utilizes large clusters of commodity computers connected with switched Ethernet. In essence, Google's MapReduce technique simplifies the development and lowers the cost of large-scale distributed applications on clusters of commodity machines. The MapReduce framework executes its tasks based on a runtime scheduling scheme, meaning that it does not build an execution plan specifying which tasks will run on which nodes before execution [14]. The MapReduce model is capable of processing large data sets distributed across many nodes in parallel. The main goal is to simplify large-scale data processing by using inexpensive cluster computers and to make this easy for users while achieving both load balancing and fault tolerance. MapReduce has two primary functions: the Map function and the Reduce function, both defined by the user to meet the specific requirements of the application. The original MapReduce software is a proprietary system of Google and is therefore not available for public use [15]. Although distributed computing is largely simplified by the notions of the Map and Reduce primitives, the underlying infrastructure is non-trivial to build if the desired performance is to be achieved [2]. A key piece of infrastructure in Google's MapReduce is the underlying distributed file system, which ensures data locality and availability [3]. By combining the MapReduce programming technique with an efficient distributed file system, one can easily achieve the goal of distributed computing with data parallelism over thousands of computing nodes.
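To make the division of labour between the two user-defined functions concrete, the sketch below shows the classic word-count example written against the Hadoop MapReduce Java API, the open-source implementation discussed above. It is a minimal illustration rather than code from this project: the class names are our own, and the job configuration that wires the functions together is deferred to the sketch in the next section.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

  // Map: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}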
III. Methodology
The architecture above illustrates the layout of the project. The user uploads the data to the cloud over the internet and then selects the operation to be carried out. A controller, acting as middleware, interprets the request and forwards it to the Hadoop master. The Hadoop master starts the JobTracker, connects the cloud as the data node, and runs the MapReduce algorithm, which maps the data and reduces it according to the algorithm implemented. The result is collected, concatenated, and stored back onto the cloud for the user to download; a sketch of the job-submission step follows below.
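As a rough illustration of the job-submission step just described, the sketch below shows a driver class of the kind the Hadoop master could run, reusing the Map and Reduce classes sketched in Section II. The class name, job name, and input/output path arguments, which would point at the user's cloud containers, are assumptions for illustration, not the exact implementation used in this project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CloudJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cloud-data-mapping");

    // Wire in the user-defined Map and Reduce functions from Section II.
    job.setJarByClass(CloudJobDriver.class);
    job.setMapperClass(WordCountFunctions.TokenizerMapper.class);
    job.setCombinerClass(WordCountFunctions.IntSumReducer.class);
    job.setReducerClass(WordCountFunctions.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input location uploaded by the user and output location for results;
    // both paths are hypothetical and depend on the cloud connector in use.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Under these assumptions the job could be launched with, for example, hadoop jar cloudjob.jar CloudJobDriver <input-path> <output-path>, after which the JobTracker schedules the map and reduce tasks across the cluster nodes.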
The use of a storage cloud allows the user to upload and download data from any location connected to the internet without knowing where the actual processing is done.
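To illustrate the upload step itself, the following minimal sketch PUTs a local file into an object-storage container over HTTP, in the style of an OpenStack Swift-like API. The endpoint URL, container name, object name, and authentication token are hypothetical placeholders; a real client would obtain the token from the provider's identity service and handle retries and large objects.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CloudUploader {
  public static void upload(String endpoint, String container,
                            String objectName, Path localFile,
                            String authToken) throws Exception {
    // PUT <endpoint>/<container>/<objectName> creates the object in the container.
    URL url = new URL(endpoint + "/" + container + "/" + objectName);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("X-Auth-Token", authToken);

    // Stream the local file to the storage cloud.
    try (InputStream in = Files.newInputStream(localFile);
         OutputStream out = conn.getOutputStream()) {
      in.transferTo(out);
    }

    int status = conn.getResponseCode();  // a 2xx status indicates success
    if (status / 100 != 2) {
      throw new IllegalStateException("Upload failed with HTTP " + status);
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical values for illustration only.
    upload("https://storage.example.com/v1/AUTH_demo", "input-container",
           "dataset.txt", Paths.get("dataset.txt"), "REPLACE_WITH_TOKEN");
  }
}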
IV. Design
The goal of the application is to provide an easy-to-use interface so that even a user with little knowledge of websites can use it through a browser. We have designed a few models and structures to explain the design and structure of the application under discussion.
Data Flow Model
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modeling its process aspects. A DFD is often a preliminary step used to create an overview of the system, which can later be elaborated, and DFDs can also be used for the visualization of data processing (structured design).
A DFD shows what kinds of information will be input to and
output from the system, where the data will come from and
go to, and where the data will be stored. It does not show
information about the timing of processes, or information
about whether processes will operate in sequence or in
parallel.
Data Flow: A data flow shows the flow of information from its source to its destination. It is represented by a line, with arrowheads showing the direction of flow.
Data Store: A data store is a holding place for information within the system. It is represented by an open-ended narrow rectangle.
External Entities: It is normal for all information represented within a system to have been obtained from, and/or to be passed on to, an external source or recipient.
Processes: When naming processes, avoid glossing over them without really understanding their role; vague terms in the descriptive title area, such as 'process' or 'update', are a sign of this.
Data Flows: Double-headed arrows can be used on all but bottom-level diagrams. Furthermore, in common with most of the other symbols used, a data flow at a particular level of a diagram may be decomposed into multiple data flows at lower levels.
V. Snapshots
Description: The admin logs in using the admin name and password. If the admin name or password does not match, a "Wrong admin name or password" message is displayed; if they match, a "Login Success" message is displayed.
Description: The website presents options through which the user can carry out the necessary tasks.
Description: The user can select a container and upload the data to the cloud.
Description: The user can select the output container where the result is stored and download the data from the cloud.
Acknowledgment
The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mentioning the individuals to whom we are greatly indebted, whose guidance and facilities have served as a beacon of light and crowned our efforts with success. We are thankful to Mrs. Swathi K, Assistant Professor, CSE, KSIT, for being our project guide, under whose able guidance this project work has been carried out and completed successfully. We thank the management, the principal, and the Department of Computer Science and Engineering, KSIT. We thank VGST (Vision Group on Science and Technology), Government of Karnataka, India, for providing infrastructure facilities through the K-FIST Level II project at the KSIT CSE R&D Department, Bengaluru.
References
[1] S. Maitrey and Jha. An Integrated Approach for CURE Clustering using Map-Reduce Technique. In Proceedings of Elsevier, ISBN 978-81-910691-6-3, August 2013.
[2] K. Shim. MapReduce Algorithms for Big Data Analysis. Proceedings of the VLDB Endowment, 5(12), August 2012, Istanbul, Turkey.
[3] J. Dean et al. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX OSDI, pages 137–150, 2004.
[4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 1, 2008.
[6] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD, pages 165–178, 2009.
[7] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[8] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[9] A. F. Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.
[10] S. Ghemawat et al. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.
[11] OpenStack Installation Guide for Ubuntu 14.04, February 26, 2015.
[12] http://www.stackoverflow.com/
[13] https://github.com/
