Overview:
The age of Big Data has arrived, but traditional data analytics may not be able to handle such large quantities of data, which range from terabytes (TB) to zettabytes (ZB). The questions that arise nowadays are: how do we build a high-performance platform to analyze big data efficiently, how do we design appropriate mining algorithms to find useful information in it, and how do we store and process such large and complex data efficiently?
Introduction:
As information technology spreads rapidly, most data is now generated continuously on the internet. The size of this data increases day by day, and so do the problems of storing and extracting it. In simple terms, Big Data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools.
Big Data is a phrase describing massive volumes of structured, semi-structured, and unstructured data that are so large they are difficult to process using traditional databases.
Classifications are as follows:
i. Structured Data:
Data that has a well-defined format associated with it is referred to as structured data, e.g. data in relational databases, CSV files, and XLS files.
ii. Semi-Structured Data:
Data that has some organizing structure but no fixed, formal format is referred to as semi-structured data, e.g. emails, LOG files, and DOC files.
iii. Unstructured Data:
Data that has no proper format associated with it is referred to as unstructured data, e.g. image files (.jpg), audio files (.mp3), and video files (.mp4).
Invention & Working:
Famous technology companies such as Google, Facebook, and IBM all face problems with storing and extracting huge amounts of data. Because these companies operate at the very top of the market, data is generated every day, and until a few years ago they lacked the capacity to store and process the data that was being generated continuously. Nowadays, Facebook collects nearly 500 TB of data and Google collects more than 10 exabytes (10 billion GB) every day, which is an enormous amount of data.
Data generated by users per minute (2016 figures):
- Google receives over 4,000,000 search queries.
- Facebook users share 2,460,000 pieces of content.
- Tinder users swipe 416,667 times.
- WhatsApp users share 347,222 photos.
- Twitter users tweet 277,000 times.
- Instagram users post 216,000 new photos.
- Amazon makes $83,000 in online sales.
- Apple users download 48,000 apps.
- Skype users connect for 23,300 hours.
- Email users send 204,000,000 messages.
- YouTube users upload 72 hours of new video.
Similar figures for 2017-18 show what happens in just one second on the internet. So how could these companies store and process all of this data?
This is why Big Data technologies were created. These companies need both the capacity to store enormous volumes of data and the speed to process them, which leads to the characteristics of Big Data shown below.
Big Data can be described by the
following characteristics:
i. Volume:
Volume refers to the amount of data being generated. The size of the data determines its potential value and insight, and whether it can actually be considered big data.
ii. Velocity:
Velocity refers to the speed at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
iii. Variety:
Variety refers to the different types of data being generated, e.g. text files (.txt), audio files (.mp3), raw files (.raw), image files (.jpg), and video files (.mp4).
Traditional Approach to Storing & Processing Big Data:
In the traditional approach, the data generated by organizations, banks, stock markets, and hospitals is given as input to an ETL (Extract, Transform, Load) system. The ETL system extracts the data, converts it into a proper format, and finally loads it into a database.
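As a rough illustration only (the file names and the transformation rule here are hypothetical, not taken from any particular system), the following Java sketch shows the three ETL stages applied to a small CSV file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// A minimal, hypothetical ETL sketch: extract rows from a CSV file,
// transform them into a normalized pipe-delimited format, and load
// them into a target file standing in for a database table.
public class SimpleEtl {
    public static void main(String[] args) throws IOException {
        // Extract: read the raw records produced by the source system.
        List<String> rawRows = Files.readAllLines(Path.of("transactions.csv"));

        // Transform: clean and normalize each record into the target format.
        List<String> cleanRows = rawRows.stream()
                .filter(row -> !row.isBlank())
                .map(row -> row.trim().toUpperCase().replace(",", "|"))
                .collect(Collectors.toList());

        // Load: write the transformed records to the target store.
        Files.write(Path.of("warehouse_table.txt"), cleanRows);
    }
}

In practice the load step would write into a relational database or data warehouse rather than a flat file, but the extract-transform-load flow is the same.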
End users then generate reports and perform analytics by querying this data, but as the data grows it becomes a very challenging task to manage and process it using the traditional approach; hence, the market needs Big Data technologies for storing and processing data.
To meet the challenges of storing and processing Big Data, Hadoop was created.
Hadoop:
Hadoop is an open-source software framework for the distributed storage and distributed processing of very large data sets across clusters of computers. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006, and it takes its name and baby-elephant logo from a toy elephant belonging to Cutting's son. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce.
1) HDFS (Hadoop Distributed File System):
HDFS takes care of storing and managing data within a Hadoop cluster. It stores large files (typically ranging from gigabytes to terabytes) across multiple machines.
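As a brief, hedged illustration of how an application talks to HDFS, the sketch below writes a small file and reads it back through Hadoop's Java FileSystem API; the NameNode address and the file path are assumptions made for this example, not values from any real cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of writing and reading a file in HDFS via the Java API.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URI; a real cluster sets this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS splits large files into blocks and
        // replicates them across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}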
2) MapReduce:
MapReduce takes care of processing and computing the data stored in HDFS. It is a programming model for large-scale data processing: a map phase transforms input records into key-value pairs, and a reduce phase aggregates the values for each key.
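To make the model concrete, here is a minimal word-count job in Java, closely following the standard Hadoop MapReduce example: the mapper emits a (word, 1) pair for every word in the input, and the reducer sums the counts for each word. The input and output HDFS paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums all the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is packaged as a JAR and submitted to the cluster, where the map and reduce tasks are scheduled on the worker nodes described in the next section.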
Hadoop Cluster:
Important notes:
A. Node: a technical term for a machine or computer that is part of the cluster.
B. Daemon: a technical term for a background process running on Linux.
A Hadoop cluster is divided into two parts: master nodes and slave nodes. The master node runs the NameNode and JobTracker daemons, while each slave (worker) node acts as both a DataNode and a TaskTracker daemon. Data-only worker nodes and compute-only worker nodes are normally used only in non-standard applications. The JobTracker daemon manages job scheduling across the nodes.
Conclusion:
In the information era we are currently living in, voluminous varieties of high-velocity data are produced daily, and within them lie intrinsic details and patterns of hidden knowledge that should be extracted and utilized.
Big data is difficult to deal with: in addition to all the problems faced in traditional data management, it exponentially increases those difficulties through the additional volumes, velocities, and varieties of data and sources that have to be handled. Therefore, future research can focus on providing a roadmap or framework for big data management that encompasses the previously stated difficulties.