What is Data?

Introduction

Before trying to understand data analytics, it is important to understand what data is, and why the world is chasing it, whether structured or unstructured, big or small. Data is essentially a chunk of information generated by a given set of sources; it is a record of events, along with their inputs and outputs. Such data is only as valuable as the benefit one can obtain by processing and analyzing it. For example, every internet purchase generates information about the buyer, the seller and the purchase. Such information, if processed and analyzed properly, can provide valuable insights into the buyer’s choices, the seller’s capabilities and the popularity of the item sold. Ever since businesses understood this value, they have been investing in and developing ever more efficient ways of capturing, storing, processing and analyzing such data.

Structured Data

Information is unstructured by default. Any event contains a huge amount of information of various kinds. It is impossible to capture all of it, let alone store, process and analyze it. Traditional data analysis followed the basic approach of picking just what was necessary and storing it in a way that made it easy to retrieve, process and analyze.
This gave rise to traditional relational databases. A relational database has predefined tables, with predefined columns and predefined relations between records in different tables. The information picked from each recorded event was put into this predefined format and stored in the database. Such data was easy to store, access and process to obtain results in predefined formats. The disadvantage was that any information not picked up was lost.
For example, a telephone call can generate a huge amount of information. But a billing system would record only the phone numbers, the billing plan used and the duration of the call. It would pick just what was necessary and generate the bill with just the required result: the amount due. It was easy to pick, store and process such data. The limitation was that the data had very little value beyond the bill amount. Some analysis of the data did serve as an input to the network team, for predicting the kind of network traffic to expect.
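The billing example can be sketched in a few lines of Python. This is only an illustration: the field names, the plan names and the per-minute rates are all made up for this sketch, and the rule of rounding partial minutes up is an assumption, not something a real billing system is known to do.

```python
import math
from dataclasses import dataclass

# Hypothetical per-minute rates, in cents, for two made-up billing plans.
RATE_CENTS_PER_MINUTE = {"basic": 10, "premium": 5}


@dataclass(frozen=True)
class CallRecord:
    """The only facts the billing system keeps about a call."""
    caller: str
    callee: str
    plan: str
    duration_seconds: int


def amount_due_cents(record: CallRecord) -> int:
    """Bill whole minutes, rounding any partial minute up (assumed rule)."""
    minutes = math.ceil(record.duration_seconds / 60)
    return minutes * RATE_CENTS_PER_MINUTE[record.plan]


# Everything else about the call (audio, location, signal quality) is discarded.
call = CallRecord("555-0100", "555-0199", "basic", 125)
bill_cents = amount_due_cents(call)
```

The record holds exactly four fields, which is what makes it easy to store and process, and also why it is of little use for anything other than computing the bill.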
Structured data may be generated by humans or machines, as long as the data is created within the templates and follows the structure defined by the tables. This format is eminently searchable, both with human-generated queries and via algorithms, using field names and data types such as alphabetic, numeric, currency or date.
Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases. Even today, a huge number of software applications continue to use structured data.
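As a rough illustration of a predefined table and an SQL query over it, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, the column names and the sample rows are hypothetical and loosely follow the telephone billing example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is fixed up front: every row must fit these typed columns.
conn.execute(
    """
    CREATE TABLE calls (
        caller           TEXT    NOT NULL,
        callee           TEXT    NOT NULL,
        plan             TEXT    NOT NULL,
        duration_seconds INTEGER NOT NULL
    )
    """
)

# Made-up sample rows.
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("555-0100", "555-0199", "basic", 125),
        ("555-0100", "555-0142", "basic", 40),
        ("555-0177", "555-0100", "premium", 600),
    ],
)

# Because the field names and types are known, the query is straightforward.
rows = conn.execute(
    "SELECT caller, SUM(duration_seconds) "
    "FROM calls GROUP BY caller ORDER BY caller"
).fetchall()
conn.close()
```

Here `rows` holds the total call duration per caller. The query works only because every record follows the same predefined structure.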

Unstructured Data

As people realized what they were missing out on, research on storing and processing unstructured data gained momentum. Even today, we are at a nascent stage in this. But industries have started implementing solutions and generating revenue from them.
What has changed today is the awareness of the importance of data, the available storage capacity, and the algorithms for processing unstructured data. The essential difference between structured and unstructured data is the phase in which the data is captured. Raw data is always unstructured. But due to limitations in sensing, storage and processing capabilities, we were forced to extract just what we needed and forget the rest.
In essence, structured data was extracted from unstructured data after some amount of processing, and this processed, structured data was stored for analysis. Today, the order has changed. Raw data is now pumped into storage with very little initial processing, in the hope (or knowledge) that it will be useful. This is possible because we now have ways of storing such volumes of data, and we have enough processing power and sufficiently mature algorithms to make sense of such data at query time.
Each such data unit has its own internal structure, but that structure is not fixed by a predefined data model or schema. The data may be textual or non-textual, and generated by humans or machines. Such data can be stored in non-relational (NoSQL) databases.
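The idea of storing raw data first and interpreting it only when a question is asked can be sketched as follows. This is a simplified illustration, not how any particular NoSQL database works: the events are plain JSON strings in a Python list, and the event types and field names are invented for the example.

```python
import json

# Raw events stored as-is. Each has its own internal structure,
# and no shared schema is enforced when they are written.
raw_events = [
    '{"type": "purchase", "buyer": "alice", "item": "book", "price": 12.5}',
    '{"type": "call", "caller": "555-0100", "duration_seconds": 125}',
    '{"type": "purchase", "buyer": "bob", "item": "pen", "price": 1.5, "coupon": "X1"}',
]


def total_purchase_value(events):
    """Interpret the raw events at query time and sum the purchase prices."""
    total = 0.0
    for line in events:
        event = json.loads(line)  # structure is discovered only now
        if event.get("type") == "purchase":
            total += event.get("price", 0.0)
    return total


total = total_purchase_value(raw_events)
```

A different question, such as total call duration, could be asked later of the same stored events without changing how they were captured.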
The most inclusive Big Data analysis makes use of both structured and unstructured data.

Big Data

Big data refers to digital information that has the 3 V's: high volume, velocity and variety. Big data analytics refers to the process of identifying trends, patterns, correlations or other useful insights in such data using various software tools. Data analytics is not a new concept; businesses have been analyzing their own data for decades. But the software tools used for analysis have improved so much in capability and performance that they can handle a much wider variety and much larger volumes of data, at a much higher velocity. This is partly because of improved algorithms and partly because of improved performance of the underlying hardware.