Overview:
The age of Big Data has arrived, but traditional data analytics may not be able to handle such large quantities of data, which range from terabytes (TB) to zettabytes (ZB). The questions that arise nowadays are: how do we build a high-performance platform to analyze big data efficiently, how do we design appropriate mining algorithms to find useful information in it, and how do we store and process such large and complex data efficiently?
Introduction:
As information technology spreads rapidly, most data is now generated continuously on the internet. The size of this data increases day by day, and so do the problems of storing and extracting it. In simple terms, Big Data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools.
Big Data is a phrase describing massive volumes of structured, semi-structured, and unstructured data that are so large they are difficult to process using traditional databases.
Classifications are as follows:
i. Structured Data:
Data that has a well-defined format associated with it is referred to as structured data, e.g. data in relational databases, CSV files, and XLS files.
ii. Semi-Structured Data:
Data that has some organizing structure but no fixed, formal format is referred to as semi-structured data, e.g. emails, LOG files, and DOC files.
iii. Unstructured Data:
Data that has no proper format associated with it is referred to as unstructured data, e.g. image files (.jpg), audio files (.mp3), and video files (.mp4).
Invention & Working:
Famous technology companies such as Google, Facebook, and IBM all face problems with storing and extracting huge amounts of data. Because these companies operate at the very top of the market, data is generated every day, and until a few years ago they lacked the capacity to store and process the data that was being generated continuously. Nowadays, Facebook collects nearly 500 TB of data and Google collects more than 10 exabytes (10 billion GB) every day, which is an enormous amount of data.
Data generated by users per minute (2016 figures):
- Google receives over 4,000,000 search queries.
- Facebook users share 2,460,000 pieces of content.
- Tinder users swipe 416,667 times.
- WhatsApp users share 347,222 photos.
- Twitter users tweet 277,000 times.
- Instagram users post 216,000 new photos.
- Amazon makes $83,000 in online sales.
- Apple users download 48,000 apps.
- Skype users connect for 23,300 hours.
- Email users send 204,000,000 messages.
- YouTube users upload 72 hours of new video.
Similar figures for 2017-18 show what happens in just one second on the internet. So how could these companies store and process all of this data?
This is why Big Data technologies were created. These companies need both the capacity to store enormous volumes of data and the speed to process them, which leads to the characteristics of Big Data shown below.
Big Data can be described by the
following characteristics:
i. Volume:
Volume refers to the amount of data being generated. The size of the data determines its potential value and insight, and whether it can actually be considered big data.
ii. Velocity:
Velocity refers to the speed at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
iii. Variety:
Variety refers to the different types of data being generated, e.g. text files (.txt), audio files (.mp3), raw files (.raw), image files (.jpg), and video files (.mp4).
Traditional Approach to Storing & Processing Big Data:
In the traditional approach, the data generated by organizations, banks, stock markets, and hospitals is given as input to an ETL (Extract, Transform, Load) system. The ETL system extracts the data, converts it into a proper format, and finally loads it into a database.
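As a rough illustration only (the file names and the transformation rule here are hypothetical, not taken from any particular system), the following Java sketch shows the three ETL stages applied to a small CSV file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// A minimal, hypothetical ETL sketch: extract rows from a CSV file,
// transform them into a normalized pipe-delimited format, and load
// them into a target file standing in for a database table.
public class SimpleEtl {
    public static void main(String[] args) throws IOException {
        // Extract: read the raw records produced by the source system.
        List<String> rawRows = Files.readAllLines(Path.of("transactions.csv"));

        // Transform: clean and normalize each record into the target format.
        List<String> cleanRows = rawRows.stream()
                .filter(row -> !row.isBlank())
                .map(row -> row.trim().toUpperCase().replace(",", "|"))
                .collect(Collectors.toList());

        // Load: write the transformed records to the target store.
        Files.write(Path.of("warehouse_table.txt"), cleanRows);
    }
}

In practice the load step would write into a relational database or data warehouse rather than a flat file, but the extract-transform-load flow is the same.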
End users then generate reports and perform analytics by querying this data, but as the data grows it becomes a very challenging task to manage and process it using the traditional approach; hence, the market needs Big Data technologies for storing and processing data.
To meet the challenges of storing and processing Big Data, Hadoop was created.
Hadoop:
Hadoop is an open-source software framework for the distributed storage and distributed processing of very large data sets across clusters of computers. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006, and it takes its name and baby-elephant logo from a toy elephant belonging to Cutting's son. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce.
1) HDFS (Hadoop Distributed File System):
HDFS takes care of storing and managing data within a Hadoop cluster. It stores large files (typically ranging from gigabytes to terabytes) across multiple machines.
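As a brief, hedged illustration of how an application talks to HDFS, the sketch below writes a small file and reads it back through Hadoop's Java FileSystem API; the NameNode address and the file path are assumptions made for this example, not values from any real cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of writing and reading a file in HDFS via the Java API.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URI; a real cluster sets this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS splits large files into blocks and
        // replicates them across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}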
2) MapReduce:
MapReduce takes care of processing and computing the data stored in HDFS. It is a programming model for large-scale data processing: a map phase transforms input records into key-value pairs, and a reduce phase aggregates the values for each key.
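To make the model concrete, here is a minimal word-count job in Java, closely following the standard Hadoop MapReduce example: the mapper emits a (word, 1) pair for every word in the input, and the reducer sums the counts for each word. The input and output HDFS paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums all the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is packaged as a JAR and submitted to the cluster, where the map and reduce tasks are scheduled on the worker nodes described in the next section.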
Hadoop Cluster:
Important notes:
A. Node: a technical term for a machine or computer that is part of the cluster.
B. Daemon: a technical term for a background process running on Linux.
A Hadoop cluster is divided into two parts: master nodes and slave nodes. The master node runs the NameNode and JobTracker daemons, while each slave (worker) node acts as both a DataNode and a TaskTracker daemon. Data-only worker nodes and compute-only worker nodes are normally used only in non-standard applications. The JobTracker daemon manages job scheduling across the nodes.
Conclusion:
In the information era we are currently living in, voluminous varieties of high-velocity data are produced daily, and within them lie intrinsic details and patterns of hidden knowledge that should be extracted and utilized.
Big data is difficult to deal with: in addition to all the problems faced in traditional data management, it exponentially increases those difficulties through the additional volumes, velocities, and varieties of data and sources that have to be handled. Therefore, future research can focus on providing a roadmap or framework for big data management that encompasses the previously stated difficulties.