Big Data and the Tools Used
As the data produced by social media sites has grown larger, the processing time, cost, and storage space of traditional systems have become insufficient. Big data systems came into the picture to overcome these limitations: their goal is to surface connections from large volumes of heterogeneous data.
What is big data?
"Big data" refers to data of enormous quantity or volume, often measured in exabytes or zettabytes. It is complex, unprocessed data that is difficult to handle using traditional methods.
Characteristics of Big data:
Big data is commonly characterised by three major V's:
1. Volume: The datasets can be orders of magnitude larger than traditional datasets.
2. Velocity: Another way in which big data differs from other datasets is the speed at which information moves through the system.
3. Variety: The formats and types of data can vary significantly, and a big data system must handle them regardless of their sources.
Common big data tools include Hadoop, Hive, Sqoop, HBase, and Pig; each is used to ingest, store, process, or analyse data according to need.
APPLICATIONS OF BIG DATA:
Some real-world examples are:
• Personalized health plans for cancer patients
• Real-time data monitoring and cyber security protocols
• Personalized marketing
• Fuel optimization tools for the transportation industry
• Monitoring health conditions through data from wearables
• Live road mapping for autonomous vehicles
BIG DATA LIFE CYCLE:
The general categories of activity involved in big data processing are:
1. Ingesting data into the system:
Data ingestion is the process of taking raw data and adding it to the system. The complexity of this operation depends heavily on the format and quality of the data sources and on how far the data is from the desired state prior to processing.
Tool used: Apache Sqoop
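As a rough sketch of this step, the snippet below drives a Sqoop import from Java through Sqoop's runTool entry point; Sqoop is more often invoked from the command line with the same arguments. The JDBC URL, database name, credentials, table, and HDFS target directory are all hypothetical placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Hypothetical connection details; replace with your own database.
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/salesdb",
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",
            "--num-mappers", "4"   // parallel map tasks doing the copy
        };
        // runTool parses the arguments and launches the import,
        // which writes the table's rows into HDFS as delimited files.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```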
2. Persisting the data in storage:
The ingestion processes typically hand the data off to the components that manage the storage system. While this seems like a simple operation, the volume of incoming data, the requirements for availability, and the distributed computing layer make more complex storage systems necessary.
Tool used: Apache Hadoop’s HDFS
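As a minimal sketch of this step, the snippet below copies a local file into HDFS using Hadoop's FileSystem API. HDFS splits the file into blocks and replicates them across DataNodes, which is what provides the availability mentioned above. The NameNode address and file paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPersistExample {
    public static void main(String[] args) throws Exception {
        // Configuration normally picks up core-site.xml from the classpath;
        // the NameNode address is set explicitly here for illustration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

        FileSystem fs = FileSystem.get(conf);

        // Copy a local raw-data file into HDFS, where it is split into
        // blocks and replicated across DataNodes for availability.
        Path local = new Path("/tmp/orders.csv");        // hypothetical local file
        Path remote = new Path("/data/raw/orders.csv");  // hypothetical HDFS path
        fs.copyFromLocalFile(local, remote);

        // Confirm the file landed and report its size.
        System.out.println("Stored: " + fs.getFileStatus(remote).getLen() + " bytes");
        fs.close();
    }
}
```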
3. Analysing and computing data:
The computation layer is perhaps the most diverse part of the system, as the requirements and best approach can vary significantly depending on the type of insight desired.
Tool used: Apache Hadoop’s MapReduce
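The canonical introductory MapReduce computation is word counting: the map phase emits a count of 1 for every word it sees, and the reduce phase sums those counts per word. The sketch below follows the standard Hadoop WordCount example; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input HDFS path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output HDFS path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```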
4. Visualizing the results:
Due to the type of information being processed in big data systems, recognizing trends or changes in data over time is often more important than the values themselves. Visualizing data is one of the most useful ways to spot trends and make sense of a large number of data points.
Tool used: Apache Pig
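Pig itself is a dataflow tool rather than a charting library; its role at this stage is to boil a large number of raw data points down to a compact aggregated series that can then be plotted. The sketch below uses Pig's embedded Java API (PigServer) and assumes a hypothetical pageviews.csv of (day, user) records, printing a per-day count suitable for a trend chart.

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class TrendSummary {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE runs on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input: one (day, user) record per page-view event.
        pig.registerQuery("views = LOAD '/data/raw/pageviews.csv' USING PigStorage(',') "
                + "AS (day:chararray, user:chararray);");
        pig.registerQuery("by_day = GROUP views BY day;");
        pig.registerQuery("daily = FOREACH by_day GENERATE group AS day, COUNT(views) AS hits;");

        // The aggregated series is small enough to feed directly into a chart.
        Iterator<Tuple> rows = pig.openIterator("daily");
        while (rows.hasNext()) {
            Tuple t = rows.next();
            System.out.println(t.get(0) + "\t" + t.get(1));
        }
        pig.shutdown();
    }
}
```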
In conclusion, big data systems are uniquely suited for surfacing difficult-to-detect patterns and providing insight into behaviours that are impossible to find through conventional means. By correctly implementing systems that deal with big data, organisations can gain incredible value from data that is already available.
