About Streaming Data Solutions for Hadoop

Fast Big Data and Streaming Analytics: Discerning Differences and Selecting the Best Approach
Lynn Langit
April 2015

TABLE OF CONTENTS
Executive summary
Introduction
High-Volume Real-time Data Analytics
Streaming Analytics Project Design
Architectural Considerations
Component Selection
Overall Architecture
Enterprise-grade Streaming Engine
Ease of Use and Development
Creating the Proof of Concept
Management/DevOps
Key Findings
Summary Comparisons
About Lynn Langit

Executive summary
An increasing number of big data projects include one or more streaming components. The architectures
of these streaming solutions are complex, often containing multiple data repositories (data at rest),
streaming pipelines (data in motion), and other processing components (such as real- or near-real time
analytics). In addition to performing analytics in real time (or near-real-time), streaming platforms can
interact with external systems (databases, files, message buses, etc.) for data enrichment or for storing
data after processing. Architects must consider many types of big data streaming solutions that can
include open source projects as well as commercial products. Now they also have a next-generation
reference architecture for big data, also known as fast big data.
This report will help IT decision makers at all levels understand the common technologies,
implementation patterns, and architectural approaches for creating streaming big data solutions. It will
also describe each solution’s tradeoffs and help determine which approach best fits specific needs.
Key takeaways from this report include:
• Use commercial streaming solutions to implement complex Event Stream Processing (ESP)
projects. Organizations should match the solution’s complexity to the component’s maturity.
• Teams that have already implemented pure open source Hadoop solutions are most capable of
adding pure open source streaming solutions. An organization should match its team’s skill level
to solution complexity and component maturity.
• Organizations should test solutions at production levels of load during the proof-of-concept phase
and determine whether they will host the solution on premises, in the cloud, or as a hybrid project.
• Organizations should select tools or plan for coding appropriate types of visualization solutions.

Introduction
Data continues to grow in variety, volume, and velocity. Enterprises are handling the first two
components, with Hadoop emerging as the de facto big data technology of choice. However, they now
realize that processing and gaining insights from the increased velocity of data creation yields greater
value as it enables better operational efficiency (cost reduction) and personalized customer offerings
(revenue growth). Streaming analytic solutions are the technology that supports processing fast big data.
Note that the term “fast big data” still includes “big data,” so streaming solutions must also adhere to the
tenets of big data: (near) linear scalability, fault tolerance, distributed computing, no single point of
failure, security, operability, ease of use, etc. These abilities become more critical for fast big data. Fast big
data also requires seamless integration with external systems, in-memory processing, advanced
partitioning schemes, etc. The fast big data architecture stack must integrate seamlessly into the
enterprise’s existing data processing methodology.
Contrasting streaming with traditional analytic data architecture helps determine whether a business can benefit
from processing fast big data.
• The traditional workflow first entails ingest (usually via batch processing), then store-and-process,
and finally query. This workflow provides results in hours or days.
• The action-based architecture of streaming, which is based on a pipeline of steps for continuously
ingesting, transforming, processing, alerting, and then storing data, provides insights and takes
actions in seconds, minutes, or hours.
The following definitions are helpful in understanding solution architectures designed to support one or
more streaming data components.
• Event-stream processing (ESP). Streaming data is continuously ingested into one or more
data systems via a flow or stream. By contrast, non-streaming data is ingested via individual
record processing (insert, update, or delete) or batched record processing. Considerations for
streaming include:
o The size and duration of each streaming window or chunk of data.
o The stream’s volume and velocity; is it predictable and regular or variable and prone to
spikes?
o The design must account for the sources, sizes, and types of data in the streams.
o In general, ESP solutions contain many input streams that can include different types and
volumes of data. Common types of stream data include sensor data, transactional data,
web clickstream, or log data. Internet of Things (IoT) data is increasing the demand for
streaming architectures as well. All streaming applications also require access to data at
rest for enrichment or to provide context—such as customer data, purchase history, and
support history.
o Along with multiple data input streams, ESP solutions are often implemented to answer
many mission-critical business questions and are placed into the operational data stream
requiring fault tolerance.
o This type of solution can include many data-pipeline-processing steps that vary from simple
aggregation to complex machine learning processes. An example is using predictive
analysis of live and stored stream data to process all airline engine sensor data for all
flights for an airline, with the business goals of improved flight safety and reduced engine
maintenance.
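The window-size consideration listed above can be made concrete with a small sketch. It groups timestamped sensor readings into fixed, non-overlapping (tumbling) windows and averages each one. The function name `tumbling_window_averages`, the 60-second window, and the sample readings are illustrative assumptions, not part of any product discussed in this report.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average (timestamp, value) events per fixed, non-overlapping window.

    Returns a dict mapping each window's start time to the mean of the
    values whose timestamps fall inside that window.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in events:
        # Integer division assigns every event to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical engine-temperature readings: (seconds since start, degrees).
readings = [(0, 600.0), (20, 610.0), (59, 620.0), (60, 700.0), (119, 720.0)]
averages = tumbling_window_averages(readings, window_seconds=60)
```

A shorter window surfaces changes sooner but produces more results to store and act on; a longer window smooths spikes at the cost of latency.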
High-Volume Real-time Data Analytics
Since Hadoop is the focal point of the big data ecosystem, emerging fast big data platforms must be evaluated
based on their interaction with Hadoop. Including Apache Hadoop components in streaming solutions is
common, so the following definitions of major components are helpful.
• Apache Hadoop core. The core services of Hadoop are the Hadoop Distributed File System
(HDFS) and YARN (Yet Another Resource Negotiator). This separation of HDFS and YARN has
enabled the emergence of streaming platforms native to Hadoop. MapReduce (which previously
provided cluster management services in place of YARN) is now a user-side library and completely
separated from YARN. It is not relevant to streaming platforms.
• Apache Flume. A distributed service for efficiently collecting, aggregating, and moving
streaming event data.
• Apache Kafka. A distributed, highly scalable publish/subscribe messaging system. It maintains
feeds of messages in topics. Kafka is one of the most popular message buses in the big data
ecosystem. Though not part of core Hadoop, it is very widely used by the open source community.
• Apache ZooKeeper. This centralized coordination service maintains configuration
information and naming, and provides distributed synchronization and group services. It is
commonly used in the Hadoop ecosystem.
• Apache Storm and Spark Streaming. These streaming data processing libraries are defined
and differentiated in the body of this report. Storm and Spark Streaming predate YARN. Storm
uses its own scheduler rather than YARN’s, and while Spark Streaming has YARN integration, it
is not designed to work exclusively with HDFS.
• Commercial native Hadoop streaming platforms. Some commercial platforms run
natively in YARN and leverage all the Hadoop semantics, operability, and other features.
The following figure shows a subset of Apache Hadoop’s components. Selecting the appropriate Hadoop
components for a solution is a key consideration in architecting the streaming solution.
Architectural view of typical Hadoop components

Streaming Analytics Project Design
The common phases of streaming data projects are: architecture, component selection, proof of concept,
and management/DevOps (or moving the solution to production). Although design phases follow familiar,
standard architectural patterns, a closer examination of the tasks performed in each phase is useful in
understanding solution design at a deeper level.
Because the landscape of streaming data solutions is changing rapidly as more open source libraries and
commercial products become available and because many technical teams lack experience creating these
types of solutions, following best practices is essential.
Architectural Considerations
The architecture phase includes sub-phases of design. The figure below shows a simple diagram of the
fast big data pipeline.
A Fast Big Data Pipeline

Phase 1: Scalable Ingestion. Identify all of the streaming data sources for ingestion. The critical
requirements of ingestion are fault tolerance and scalability. These include recovering from faults with
no data loss, no loss of application state, and in-order processing, all without manual intervention or
dependence on components external to the platform. The platform should have connectors to various
sources, as well as the capability to add new sources in the future.
Phase 2: Real-time ETL. “What is the quality of the incoming data?” For example, will
duplicate records appear in a stream? If so, which records will need to be de-duplicated? How will that
transformation be accomplished? Will the team write code to de-duplicate the data, or will a commercial
product with data manipulation capabilities do that?
Another set of considerations involves compliance. Are timestamps required on data for compliance
requirements? Does the streaming platform have connectors to various external sources for data
enrichment? If error checking depends on the order of data, or needs the context of other data, then fault
tolerance and in-order processing with no data loss become critical.
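As an illustration of the hand-coded option for de-duplication raised in Phase 2, the following minimal sketch drops repeated events by identifier while keeping memory bounded. It assumes each event carries an `id` and a timestamp `ts`, and that events arrive roughly in timestamp order; the field names, the function name `deduplicate`, and the 10-second retention period are hypothetical.

```python
def deduplicate(events, ttl_seconds):
    """Yield events whose id has not been seen within the last ttl_seconds."""
    last_seen = {}  # event id -> timestamp of its most recent occurrence
    for event in events:
        event_id, timestamp = event["id"], event["ts"]
        # Forget ids older than the retention period so memory stays bounded.
        # (A linear scan is fine for a sketch; a real system would index by time.)
        expired = [k for k, ts in last_seen.items() if timestamp - ts > ttl_seconds]
        for key in expired:
            del last_seen[key]
        if event_id not in last_seen:
            yield event
        # Record the sighting even for duplicates, which extends the retention.
        last_seen[event_id] = timestamp


stream = [
    {"id": "a", "ts": 0},
    {"id": "b", "ts": 1},
    {"id": "a", "ts": 2},    # duplicate inside the retention period: dropped
    {"id": "a", "ts": 100},  # same id long after expiry: treated as new
]
unique = list(deduplicate(stream, ttl_seconds=10))
```

The retention period is the main design choice: it must be at least as long as the largest delay at which a source might resend a record.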
Phase 3: Real-time Analytics. “What are the most complex analytics that need to be performed and
can the business SLA be met to have the job finish in an hour, one minute, or one second, as needed?”
Examples of analytics computations involve dimensional cube computations, aggregates, reconciliation,
etc.
Fast big data analytics are computation-intensive and affect latency. Performance, scalability, and fault
tolerance are of utmost importance in these use cases. Spikes in data have a great impact on analytics, so
scaling automatically (rather than hand-coding such scalability) is important.
Analytics require that intermediate results be retained, so fault tolerance is critical. A fault-tolerant
platform must ensure that there is no loss of events, application state, or application data. As with many
other considerations around streaming architectures, a choice exists between custom coding fault
tolerance on a per-application basis or letting the streaming platform provide that capability.
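To show what custom-coded fault tolerance involves, here is a minimal sketch of checkpoint-and-replay for a stateful operator. It assumes the input is a replayable log addressed by offset; the class `CountingOperator`, the helper `run`, and the sample log are hypothetical and do not represent any platform's actual mechanism.

```python
import copy


class CountingOperator:
    """Keeps a running count per key and can snapshot or restore its state."""

    def __init__(self):
        self.counts = {}
        self.next_offset = 0  # index of the next log entry to process

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.next_offset += 1

    def checkpoint(self):
        # Save the state together with the input position it reflects.
        return {"counts": copy.deepcopy(self.counts), "next_offset": self.next_offset}

    def restore(self, snapshot):
        self.counts = copy.deepcopy(snapshot["counts"])
        self.next_offset = snapshot["next_offset"]


def run(operator, log, start, stop):
    """Feed log entries in the half-open range [start, stop) to the operator."""
    for key in log[start:stop]:
        operator.process(key)


log = ["a", "b", "a", "c", "a"]  # replayable input, addressed by offset

operator = CountingOperator()
run(operator, log, 0, 3)
snapshot = operator.checkpoint()
run(operator, log, 3, 4)  # this progress is lost when the process crashes

recovered = CountingOperator()
recovered.restore(snapshot)
run(recovered, log, recovered.next_offset, len(log))  # replay from the checkpoint
```

Because the snapshot stores the offset alongside the counts, replay resumes exactly where the saved state left off, so no event is lost or counted twice in this sketch.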
Another consideration is whether to run some or all of the solution on a public cloud where offerings for
streaming data and persisting it differ considerably. Commercial solutions can run only on-premises, only
on a particular cloud, or both in the cloud and on-premises.
Phase 4: Alerts and Actions. Address the business need for notification when events or
activities occur. Alerts can be provided as part of a visual dashboard, via SMTP (email), SMS (text), or any
other message bus. Taking action is usually an automated business process that occurs without human
intervention, based on rules or policy that must be formulated. For example, for “smart building” data
from sensors, at what point should the local building maintenance team be alerted, and for which types of
event thresholds?
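The rules and thresholds described in Phase 4 could be expressed as data and evaluated against each incoming reading, as in the sketch below. The metric names, threshold values, and action names are invented for illustration; a real deployment would take them from the organization's own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # field of the sensor reading to inspect
    threshold: float  # the rule fires when the value exceeds this
    action: str       # name of the alert or automated action


def evaluate(reading, rules):
    """Return the actions triggered by one sensor reading, in rule order."""
    return [
        rule.action
        for rule in rules
        if rule.metric in reading and reading[rule.metric] > rule.threshold
    ]


# Hypothetical smart-building policy.
rules = [
    Rule("temperature_c", 30.0, "notify_maintenance"),
    Rule("temperature_c", 45.0, "shut_down_hvac_zone"),
    Rule("co2_ppm", 1000.0, "increase_ventilation"),
]
actions = evaluate({"room": "4B", "temperature_c": 32.5, "co2_ppm": 1200.0}, rules)
```

Keeping rules as data rather than code means thresholds can be adjusted without redeploying the pipeline.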
Phase 5: Visualization and Distribution. Design for both storage and integration of processes;
result data must be in formats consumable by end-user groups. For example, will manufacturing line
metrics be incorporated into a dashboard, a phone application, or a wearable device? Will the developers
create these visualizations or will the organization integrate commercial visualization solutions? A
streaming platform that has connectors to various external systems and provides the ability to integrate with
a visualization technology, or provides its own, helps in this phase.

Component Selection
The next stage in creating a streaming solution is selecting the particular components for the streaming
data architectures. Gigaom suggests that the component selection be evaluated with these considerations:
• The overall architecture of the platform, both in terms of simplicity and reliability
• The enterprise-grade capabilities of the core streaming engine
• Ease of application development and the ability to process data in motion and data at rest
• Management and operation of the platform once it is in production
Following are key questions to consider that will assist with component selection along with a summary
table of the most critical features compared across representative technologies.
Overall Architecture
• Is production operability natively built into the platform? Streaming analytic solutions
run in operational capacity and have high operational requirements, including fault tolerance,
scalability, security, native integration with external sources, web services, CLI, visual dashboard,
etc. A platform that supports these features natively offers time-to-solution advantages over a
platform that does not.
• Does the platform depend on external components? The number of external components
a streaming platform depends on impacts operability. Each added component introduces another
possible point of failure, and may require additional expertise. Stitching together disparate
components makes architecture more complex and possibly more brittle.
• What is the general attitude toward using open source software? When selecting
streaming components, evaluate the level of developer talent that is available. For some teams,
particularly those that already have deep expertise in developing and deploying Hadoop-based
solutions into production, adding streaming functionality via open source libraries may be a good
fit. This is because those teams are already familiar with patterns for working with Apache
Hadoop libraries.
However, for other, more traditional enterprise teams, using pure open source technologies can
result in hidden project costs such as resource hours to set up, test, and implement the solution.
In some cases, underestimating the knowledge required to set up, configure, integrate, tune, and
test can derail an entire project, as the complexity becomes overwhelming.
• What is the true cost of creating, deploying, and managing a big data streaming
solution? Are there hidden costs? When comparing commercial solutions with open source,
organizations must compare the licensing costs, support costs, and development resources required for
both. Typically, commercial products have an upfront licensing fee that includes support and
requires less engineering expertise, while open source has no licensing fees, but requires support
fees, services fees, and internal development resources. Organizations running in a commercial
cloud must analyze their monthly bills carefully for potential cost savings.
Enterprise-grade Streaming Engine
The core of any selection process is ensuring that the platform will meet the business needs for scalability,
high-availability, performance, and security. The many dimensions to consider vary based on use case.
Here are the most common areas of consideration.
• What is the forecasted data volume and variety of sources for ingestion? The available
solution components vary widely in their ability to ingest at scale from multiple data sources. The
ingestion-volume requirements alone could drive a decision to a particular commercial product or
set of libraries, or some combination of both, because there are known upper limits to the
solutions. The number and scope of data sources requires enterprise-quality adapters to handle
the variety of data types. The platform must be based on a very common programming language,
such as Java, to enable re-use of current code within the streaming platform. A Java-based
commercial product designed and tested for that kind of scale, with fault tolerance and a large
number of connectors would be a far better fit.
• What are the use case requirements for event processing regarding data processing
guarantees and event order? Fast big data consists of a series of events that occur over
time. Architectural decisions on how a streaming platform processes those events will have a
direct impact on performance, latency, and scalability. Many fast big data use cases require that
event order be maintained. For example, some predictive analytics use cases must compare event
order to determine what might happen next. Streaming platform architectural decisions can
impact the ability to guarantee the event order, and whether an event will be processed exactly
one time, at most one time, or at least one time. Below, we look at three architectural methods of
processing event streams and the implications of each method.
1. Event-at-a-time. Apache Storm uses this method, which processes each event and uses an
acknowledgement signal for each. It can impact performance and the cost of hardware.
Source: DataTorrent
2. Micro-batching. Apache Spark Streaming uses this method, which processes tiny groups of
events. This method cannot provide a guarantee of in-order processing.

Source: DataTorrent
3. Streaming event window processing. This method processes windows in the stream and
differs from micro-batching by relying on a more lightweight process (markers in the stream)
rather than batching. This approach is a non-blocking high-performance implementation that
can guarantee event order and provide only-once, at-least-once, and at-most-once event
processing with no data loss.

Source: DataTorrent
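The structural difference between the three methods above can be sketched in a few lines. This is only a toy model of the shape of each approach, not a description of how Storm, Spark Streaming, or any commercial engine is implemented; the function names and the `END_WINDOW` marker are assumptions made for illustration.

```python
def event_at_a_time(events, handle):
    """Process each event individually and acknowledge each one."""
    acks = 0
    for event in events:
        handle(event)
        acks += 1  # one acknowledgement per event
    return acks


def micro_batches(events, batch_size):
    """Group events into small lists; each list is handled as one unit."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


END_WINDOW = object()  # hypothetical in-stream marker


def with_window_markers(events, window_size):
    """Pass events through unchanged, inserting a marker after each window."""
    count = 0
    for event in events:
        yield event
        count += 1
        if count % window_size == 0:
            yield END_WINDOW


events = list(range(7))
processed = []
acks = event_at_a_time(events, processed.append)
batches = list(micro_batches(events, batch_size=3))
marked = list(with_window_markers(events, window_size=3))
```

In this model the marker approach never holds events back: each event flows downstream immediately in its original order, and only the lightweight marker signals a window boundary, whereas the micro-batch generator withholds events until a batch fills.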
What are the fault tolerance requirements of the use case? Fast big data use cases are typically
implemented in an operational environment. A streaming application, unlike a batch application, does
not have an end. It runs continuously 24/7. An organization must know if the platform it has selected to
process incoming data streams meets its requirements for fault tolerance. Does the ESP system manage
the application fault tolerance or does the engineer need to hand-code it? In the event of a failure, does
the ESP system guarantee that events will be processed in the order in which they are generated?
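The processing guarantees discussed in this section differ mainly in when an event is acknowledged relative to when it is processed. The following simulation, a sketch under simplified assumptions, injects a single crash between the two steps and shows the outcome for each ordering. The function `deliver` and its parameters are hypothetical.

```python
def deliver(events, ack_first, crash_on):
    """Simulate one crash between acknowledging and processing `crash_on`.

    Unacknowledged events stay pending and are redelivered after the crash.
    Returns the events in the order their processing took effect.
    """
    processed = []
    pending = list(events)
    crashed = False
    while pending:
        event = pending[0]
        if ack_first:
            pending.pop(0)           # step 1: acknowledge
        else:
            processed.append(event)  # step 1: process
        if event == crash_on and not crashed:
            crashed = True
            continue                 # crash between step 1 and step 2
        if ack_first:
            processed.append(event)  # step 2: process
        else:
            pending.pop(0)           # step 2: acknowledge
    return processed


at_most_once = deliver([1, 2, 3], ack_first=True, crash_on=2)
at_least_once = deliver([1, 2, 3], ack_first=False, crash_on=2)
# De-duplicating at-least-once output yields an only-once effect.
exactly_once_effect = list(dict.fromkeys(at_least_once))
```

Acknowledging first loses the event that was in flight during the crash, while processing first repeats it; removing the repeats downstream is one common way to obtain an only-once result.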

Will a use case and business logic change over time? As an organization learns more about the
customer or operational aspects of the streaming application, it will typically want to change or
supplement the current business logic of its application. For example, in a financial services fraud
detection application, it may want to add another algorithm for detecting fraud, or change an alert
process. Since transactions occur continuously, the streaming application needs to handle being updated
without impacting the analysis of the data flow. This can be thought of as an extension of fault tolerance.
Ease of Use and Development
Development process. In addition to the complexity of setting up the development environment,
when an organization begins creating a POC, it should also understand the tradeoffs involved in the details
of solution implementation. These details include items such as selection of programming language. For
example, will coding be done in a proprietary vendor-created language or in a general-purpose language,
such as Java? What features of the solution (for example, event guarantees, parallel partitioning, and
fault-tolerance capabilities) will be manually coded?
Commercial vendors, such as DataTorrent, Informatica, and others, include visual data pipeline creation
tools for rapid prototyping. This can be a significant factor in successful pipeline POC creation. For
example, when a time-sensitive POC is needed (due to competitive pressure, regulatory
changes, or some other business concern), the ability to use visual tools can significantly reduce
the time to create a prototype.
Data visualization. Another decision-point is how output data, insights, and action will be presented to
the project stakeholders for validation.
• What type of dash-boarding solution will be used? Will alerts be generated to demonstrate the
viability of the project?
• Will the developer team be expected to create visualizations for results or will one or more
visualization tools be used instead?
• If using a commercial visualization solution, such as Tableau, what is the complexity of connecting
the output data to it and of creating visualizations that make sense to the subject matter experts?
• How is integration with processes, such as indexing or dimensional attribute processing, achieved?
This is another factor that separates commercial products from pure open source libraries.
• What steps are involved in optimizing real-time analytics? Can dimensional pre-calculations or some
type of indexing be added or adjusted? And, again, is optimization manual or automated via tools?
Creating the Proof of Concept
The first consideration in the proof of concept phase of a streaming project is revalidating the first few business
questions the solution should address and what action the organization would like to take as a result of
the insights it creates. For example, in an IoT smart-building sensor data-streaming project, typical
questions are: Which building sections or rooms seem to be outside the normal range for HVAC, and is there any
related data (changes in weather, number of people in the area, etc.) that normally correlates with such a
change? Or does this seem to be an anomaly that requires investigation by a technician? If a technician is
required, the appointment should be scheduled automatically.
Next, begin the proof of concept (POC) that will be driven by the data sources and methods of making
sense of these streams. Developers will be eager to build a POC at this point. Understanding developer
knowledge of the systems being proposed is key. Another approach is to ask candidate vendors, “What is
the developer story when creating applications on different types of streaming data systems?”

Management/DevOps
A new set of priorities and decisions arises after the POC has been validated and planning begins for
production deployment. Top among these concerns is considering what enterprise-level services are
necessary to make the solution operate and meet SLA requirements. These often include processes and
tools for monitoring and maintaining security, availability, scalability, and recoverability. Next is deciding
whether to build or buy. Which open source projects and which products best meet these enterprise
service requirements? Is the best solution an all-in-one or an amalgamation of various open source
libraries and vendor tools?
How are production solutions monitored, scaled, and managed? Does the streaming platform
have rich and deep web services and other integration methods to enable ease of integration into
monitoring systems? Commercial solutions include strong support for integration with monitoring
systems. Open source projects may not have strong monitoring integration points. Is the set of streams
spiky? Is dynamic or even automated scaling needed? Commercial products may include such auto-
scaling capabilities, including the ability to add operations into, or remove operations from, a running
pipeline.
What steps are involved in optimizing the ingest pipeline? Is it a manual process? Does code
have to be re-written, tested, and deployed or does the solution include tooling that can automate this?

Key Findings
The addition of a streaming platform to a big data stack adds significant complexity to big data solutions.
Given this, a solid understanding of technology choices around streaming data solutions is essential for
designing and delivering solutions that provide business value to the organization.
• Consider use of commercial streaming solutions for complex ESP projects. Match the
solution complexity to component maturity. Consider the volume, velocity, and variety of
both data and data streams. Consider the true costs of building on top of pure open source Hadoop
projects versus buying vendor tools and solutions that include Hadoop plus enterprise services.
• Teams that have already implemented pure open source Hadoop solutions are most
capable of adding pure open source streaming solutions. Match a team’s skill level
to solution complexity and component maturity. Know whether the team has production
experience working with Hadoop solutions, and has deployed Hadoop solutions with low time to
market. What programming languages do they use? What are their expectations about integrating
this new project into their current development environment, and what are their requirements around
monitoring and scaling? Do they expect to work with tools or are they comfortable working with
libraries from the command line?
• Select tools or plan for coding libraries to perform the types of analytics required.
Match the types of analysis expected on the event stream data. Organizations must
understand whether they will be performing simple aggregations or anticipate using machine
learning algorithms or some other type of predictive component. Know whether a team should
write this logic or whether the organization will purchase a solution that integrates or contains
these components. Figure out how much of the analytics must be returned in real-time.
• Test solutions at production levels of load during the proof-of-concept phase.
Determine whether to host the solution on premises, in the cloud, or as a hybrid
project. Examine different vendors’ cloud solutions. Is the solution tested at scale on any/all
clouds? Know the potential service costs.
• Select tools or plan for coding appropriate types of visualization solutions. If the
organization plans to build its own visualization solution, the technical team must have the talent
to create what is envisioned. If not, what is the cost of hiring, re-training, or getting additional
help to implement this part of the solution?

Summary Comparisons
Core Streaming Engine Capabilities

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
|---|---|---|---|---|---|
| In-memory compute engine | Yes | Yes | Yes | Yes | Yes |
| Native Hadoop architecture | No | Yes | Yes | No | No |
| Sub-second event processing | Yes | No | Yes | Yes | Yes |
| Extremely linear scalability (billion(s) of events/second) | No | No | Yes | No | No |
| Stream partitioning for parallel processing | No | No | Yes | No | No |
| Auto-scaling input and event processing | No | No | Yes | Yes | Yes |
| Event processing guarantees | Only Once, At Least Once, At Most Once | Only Once, At Least Once, At Most Once | Only Once, At Least Once, At Most Once | At Most Once | Only Once, At Most Once |
| Event order guaranteed | No | No | Yes | No | No |
| End-to-end stateful fault tolerance | No | No | Yes | No | No |
| Incremental recovery | No | No | Yes | No | No |
| Dynamic application updates | No | No | Yes | No | No |
| Data loss potential | Yes | Yes | No | Yes | Yes |
| Complete separation of business logic, event acknowledgement, and fault tolerance | No | No | Yes | No | No |

Streaming Application Development Tools

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
|---|---|---|---|---|---|
| Native application programming language | Java | Scala, Java API | Java | Proprietary Streams Query Language | Proprietary StreamSQL EventFlow |
| Open source pre-built connectors | <10 | <10 | >75 | No | No |
| Graphical application builder | No | No | Yes (beta) | Yes | Yes |
| Visual real-time dashboard | No | No | Yes (beta) | No | Yes |

Operations and Management Tools

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
|---|---|---|---|---|---|
| Fully functional GUI-based management | No | No | Yes | Yes | Yes |
| Ease of integration with monitoring systems | No | No | Yes | Yes | Yes |
| Ease of integration with external systems via REST APIs | No | No | Yes | Yes | Yes |
| Simple install and upgrade | No | No | Yes | No | No |

About Lynn Langit
Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more
than 15 years. Over the past 4 years, she’s been working as an independent architect using these
technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done
POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has
done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB,
Neo4j and many other database systems. In addition to building solutions, Lynn also partners with all
major cloud vendors, providing early technical feedback into their Big Data and Cloud offerings.
She is a Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is
also a Cloudera certified instructor (for MapReduce Programming).
Prior to re-entering the consulting world 4 years ago, Lynn spent over 10 years as a Microsoft
Certified instructor and a Microsoft vendor, and then 4 years as a Microsoft employee. She’s published 3 books
on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft.
She continues to write and screencast and hosts a BigData channel on YouTube
(http://www.youtube.com/SoCalDevGal) with over 150 different technical videos on Cloud and BigData
topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).
