VOLUME 9, ISSUE 3, MAY 2016    ISSN 2347-8047
INTERNATIONAL JOURNAL OF COGNITIVE SCIENCE,
ENGINEERING AND TECHNOLOGY
Available online at: http://airnetjournal.org/
A Survey on Data Mapping Strategy for Data
Stored in the Storage Cloud
Navneet Kumar, Naseeruddin V N, Murali Krishna V, S K Manu, Swathi K
Department of CSE, K S Institute of Technology, Bengaluru, India
Abstract— In the recent past, the volume of data processed over the internet has been increasing exponentially, so it is difficult to store such huge amounts of data and computationally inefficient to analyze them. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations over massive amounts of data to be expressed concisely, and it is designed for large-scale data processing on clusters of commodity hardware. As a prominent parallel data processing tool, MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. In this paper we propose a method for processing huge amounts of data over the internet: the data to be processed is stored in a storage cloud and processed in a Hadoop multi-cluster environment.
Keywords— Storage Cloud, Hadoop cluster, Hadoop,
Distributed File System, Parallel Processing, MapReduce
I. Introduction
Analyzing big data is a very challenging problem. For the effective handling of such massive data and applications, the MapReduce framework has come into wide focus. Over the last few years, MapReduce has emerged as the most popular computing paradigm for parallel, batch-style analysis of large amounts of data, and it is used in many areas where massive data analysis is required. The number of applications that handle big data keeps growing, but handling such huge collections of data remains a challenging problem today. MapReduce, and its open-source equivalent Hadoop, is a powerful tool for building such applications. Data-intensive processing is fast becoming a necessity for handling large databases efficiently, and the algorithms involved must be capable of scaling to real-world datasets. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations on massive amounts of data to be expressed concisely, and it is designed for large-scale data processing on clusters of commodity hardware. MapReduce is used in areas where the volume of data to analyze grows rapidly. Despite these abilities, there is still debate about its performance, effectiveness, and conceptual simplicity. Given the present outburst of data, parallel processing is essential for handling such massive volumes in a timely manner. MapReduce gained its popularity when it was used successfully by Google. It is a scalable and fault-tolerant data processing tool that can process huge volumes of data in parallel across many low-end computing nodes. By virtue of its simplicity, scalability, and fault tolerance, MapReduce is becoming ubiquitous, gaining significant momentum from both industry and academia. However, MapReduce has inherent limitations on its performance and efficiency, and many studies have endeavoured to overcome these limitations. The goal of this analysis is to provide a timely remark on the status of MapReduce studies and related work, focusing on current research aimed at improving and enhancing the MapReduce framework, and to assist the database community in understanding its various technical aspects. In this paper, we focus on the working of the MapReduce framework and examine its built-in advantages and drawbacks. We then introduce applications and effective ways to improve its properties so that optimized results can be obtained, and we highlight the issues and challenges raised against MapReduce. MapReduce is well known for its simplicity, effectiveness, and capability to handle “Big Data” in a timely manner; even with all these valuable features, it still has some limitations that need to be sorted out.
II. Working
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks [3]. The MapReduce paradigm of parallel programming provides simplicity while at the same time offering load balancing and fault tolerance. The Google File System (GFS) that typically underlies a MapReduce system provides the efficient and reliable distributed data storage needed for applications involving large databases [10]. MapReduce is inspired by the map and reduce primitives present in functional languages. Various implementations of the MapReduce interface are possible, depending on the desired context. Some currently available implementations target shared-memory multi-core systems [11][12], asymmetric multi-core processors [13], graphics processors, and clusters of networked machines [4]. The most popular implementation is probably the one introduced by Google, which utilizes large clusters of commodity computers connected with switched Ethernet. In essence, Google's MapReduce technique simplifies the development and lowers the cost of large-scale distributed applications on clusters of commodity machines. The MapReduce framework executes its tasks based on a runtime scheduling scheme, meaning that it does not build an execution plan specifying which tasks will run on which nodes before execution [14]. The MapReduce model is capable of processing large data sets distributed across many nodes in parallel. The main goal is to simplify large-scale data processing by using inexpensive cluster computers and to make this easy for users while achieving both load balancing and fault tolerance. MapReduce has two primary functions: the Map function and the Reduce function, both defined by the user to meet the specific requirements of the application. The original MapReduce software is a proprietary system of Google and is therefore not available for public use [15]. Although distributed computing is largely simplified by the notions of the Map and Reduce primitives, the underlying infrastructure is non-trivial to build if the desired performance is to be achieved [2]. A key piece of infrastructure in Google's MapReduce is the underlying distributed file system, which ensures data locality and availability [3]. By combining the MapReduce programming technique with an efficient distributed file system, one can easily achieve the goal of distributed computing with data parallelism over thousands of computing nodes.
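To make the division of labour between the two user-defined functions concrete, the sketch below shows the classic word-count example written against the Hadoop MapReduce Java API, the open-source implementation discussed above. It is a minimal illustration rather than code from this project: the class names are our own, and the job configuration that wires the functions together is deferred to the sketch in the next section.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

  // Map: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}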
III. Methodology
The architecture above illustrates the layout of the project. The user uploads the data to the cloud over the internet and then selects the operation to be carried out. A controller, acting as middleware, interprets the request and forwards it to the Hadoop master. The Hadoop master starts the JobTracker, connects the cloud as the data node, and runs the MapReduce algorithm, which maps the data and reduces it according to the algorithm implemented. The result is collected, concatenated, and stored back onto the cloud for the user to download; a sketch of the job-submission step follows below.
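As a rough illustration of the job-submission step just described, the sketch below shows a driver class of the kind the Hadoop master could run, reusing the Map and Reduce classes sketched in Section II. The class name, job name, and input/output path arguments, which would point at the user's cloud containers, are assumptions for illustration, not the exact implementation used in this project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CloudJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cloud-data-mapping");

    // Wire in the user-defined Map and Reduce functions from Section II.
    job.setJarByClass(CloudJobDriver.class);
    job.setMapperClass(WordCountFunctions.TokenizerMapper.class);
    job.setCombinerClass(WordCountFunctions.IntSumReducer.class);
    job.setReducerClass(WordCountFunctions.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input location uploaded by the user and output location for results;
    // both paths are hypothetical and depend on the cloud connector in use.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Under these assumptions the job could be launched with, for example, hadoop jar cloudjob.jar CloudJobDriver <input-path> <output-path>, after which the JobTracker schedules the map and reduce tasks across the cluster nodes.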
The use of a storage cloud allows the user to upload and download data from any location connected to the internet without knowing where the actual processing is done.
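To illustrate the upload step itself, the following minimal sketch PUTs a local file into an object-storage container over HTTP, in the style of an OpenStack Swift-like API. The endpoint URL, container name, object name, and authentication token are hypothetical placeholders; a real client would obtain the token from the provider's identity service and handle retries and large objects.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CloudUploader {
  public static void upload(String endpoint, String container,
                            String objectName, Path localFile,
                            String authToken) throws Exception {
    // PUT <endpoint>/<container>/<objectName> creates the object in the container.
    URL url = new URL(endpoint + "/" + container + "/" + objectName);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("X-Auth-Token", authToken);

    // Stream the local file to the storage cloud.
    try (InputStream in = Files.newInputStream(localFile);
         OutputStream out = conn.getOutputStream()) {
      in.transferTo(out);
    }

    int status = conn.getResponseCode();  // a 2xx status indicates success
    if (status / 100 != 2) {
      throw new IllegalStateException("Upload failed with HTTP " + status);
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical values for illustration only.
    upload("https://storage.example.com/v1/AUTH_demo", "input-container",
           "dataset.txt", Paths.get("dataset.txt"), "REPLACE_WITH_TOKEN");
  }
}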
IV. Design
The goal of the application is to provide an easy-to-use interface so that even a user with little knowledge of websites can use it through a browser. We have designed a few models and structures to explain the design and structure of the application under discussion.
Data Flow Model
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modeling its process aspects. A DFD is often a preliminary step used to create an overview of the system, which can later be elaborated, and DFDs can also be used for the visualization of data processing (structured design).
A DFD shows what kinds of information will be input to and
output from the system, where the data will come from and
go to, and where the data will be stored. It does not show
information about the timing of processes, or information
about whether processes will operate in sequence or in
parallel.
Data Flow: A data flow shows the flow of information from its source to its destination. It is represented by a line, with arrowheads showing the direction of flow.
Data Store: A data store is a holding place for information within the system. It is represented by an open-ended narrow rectangle.
External Entities: It is normal for all information represented within a system to have been obtained from, and/or to be passed on to, an external source or recipient.
Processes: When naming processes, avoid glossing over them without really understanding their role; vague terms in the descriptive title area, such as 'process' or 'update', are a sign of this.
Data Flows: Double-headed arrows can be used on all but bottom-level diagrams. Furthermore, in common with most of the other symbols used, a data flow at a particular level of a diagram may be decomposed into multiple data flows at lower levels.
V. Snapshots
Description: The admin logs in using the admin name and password. If the admin name or password does not match, a "Wrong admin name or password" message is displayed; if they match, a "Login Success" message is displayed.
Description: The website presents options through which the user can carry out the necessary tasks.
Description: The user can select a container and upload the data to the cloud.
Description: The user can select the output container where the result is stored and download the data from the cloud.
Acknowledgment
The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mentioning the individuals to whom we are greatly indebted, whose guidance and facilities have served as a beacon of light and crowned our efforts with success. We are thankful to Mrs. Swathi K, Assistant Professor, CSE, KSIT, for being our project guide, under whose able guidance this project work has been carried out and completed successfully. We thank the management, the principal, and the Department of Computer Science and Engineering, KSIT. We thank VGST (Vision Group on Science and Technology), Government of Karnataka, India, for providing infrastructure facilities through the K-FIST Level II project at the KSIT CSE R&D Department, Bengaluru.
References
[1] S. Maitrey and Jha. An Integrated Approach for CURE Clustering using Map-Reduce Technique. In Proceedings of Elsevier, ISBN 978-81-910691-6-3, August 2013.
[2] K. Shim. MapReduce Algorithms for Big Data Analysis. Proceedings of the VLDB Endowment, 5(12), August 2012, Istanbul, Turkey.
[3] J. Dean et al. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX OSDI, pages 137–150, 2004.
[4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 1, 2008.
[6] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD, pages 165–178, 2009.
[7] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[8] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[9] A. F. Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.
[10] S. Ghemawat et al. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.
[11] OpenStack Installation Guide for Ubuntu 14.04, February 26, 2015.
[12] http://www.stackoverflow.com/
[13] https://github.com/
