Introduction to Data Mining

What is Data Mining?

Data mining is the science of discovering "models" for data: identifying trends in the data and extracting a model that best describes them. Modelling means capturing the useful information in a form that can be used in further analysis. In that sense, data mining is also the art of breaking the available data into usable pieces of information.
Naturally, the kind of model we generate depends on the available data and on the approach to modelling. For example, if the data is merely a set of numbers, a statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation completely characterise the distribution and would become the model of the data. Of course, most data worth analysing is far more complex than a sample from a single Gaussian distribution.
Many people consider Data Mining synonymous with Machine Learning. The two share many techniques and algorithms, but they differ fundamentally in approach and purpose. The intention of Machine Learning is to let the machine analyse and understand the available data and draw inferences from it. The goal of Data Mining is to get valuable information out of the available data. The two often feed each other: Data Mining provides "good" input data for Machine Learning, and Machine Learning helps implement Data Mining.
Two classic data mining problems that are quoted everywhere are:
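The Gaussian example can be made concrete with a small sketch. The function name fit_gaussian and the sample values are made up for illustration; the formulas are the standard maximum-likelihood estimates for a Gaussian.

```python
import math


def fit_gaussian(samples):
    """Return the maximum-likelihood mean and standard deviation."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mean = sum(samples) / n
    # The maximum-likelihood variance divides by n, not n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


# Illustrative data only; any list of numbers works.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_gaussian(data)
print(f"mean={mean:.2f}, std={std:.2f}")  # mean=5.00, std=2.00
```

The two returned numbers are the whole "model" here: eight data points have been summarised by a mean and a standard deviation.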

Google's Page Ranking

We need to rank all the pages available on the internet, based on the other pages that link to them. This iterative process runs on a huge amount of data and hence requires carefully designed algorithms so that the computation completes reliably.
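To give a feel for the iterative process, here is a minimal sketch of the textbook power-iteration form of PageRank on a four-page toy graph. It is not Google's production system. The damping factor of 0.85, the iteration count and the page names are assumptions chosen for illustration, and the sketch assumes every linked page also appears as a key in the input.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Assumes every link target is also a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked to by three of the four pages.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On a real web graph the same loop has to run over billions of pages, which is why the data cannot live on one machine.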

Feature Extraction

Given a raw chunk of data, it is important to identify the features that can then be fed to a learning algorithm. Again, this task typically takes in several petabytes of input data and hence requires a lot of effort in carefully designing an efficient algorithm.
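As one small illustration of what "features" can look like, the sketch below turns raw text into word-count vectors (a bag-of-words representation). This is only one of many possible feature extraction schemes; the function name and the sample sentences are hypothetical.

```python
from collections import Counter


def extract_features(text, vocabulary):
    """Turn raw text into a fixed-length vector of word counts."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]


# Illustrative documents; the vocabulary is every distinct word, sorted.
documents = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
vectors = [extract_features(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of numbers of the same length, which is the kind of input a learning algorithm can work with.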

Big Data

"Big Data" refers to digital information that has the 3 V's: high volume, velocity and variety. Big data analytics refers to the process of identifying trends, patterns, correlations or other useful insights in such data using various software tools. Data analytics is not a new concept; businesses have been analysing their own data for decades. But the software tools used for analysis have improved so much in capability and performance that they can now handle a much wider variety and much larger volumes of data, at a much higher velocity. This is partly because of improved algorithms and partly because of improved performance of the underlying hardware.

Data Analytics and Data Mining

The revolution around Big Data has opened up many opportunities to improve the algorithms that make sense of the available data. Two main domains that show up are Data Analytics and Data Mining. Data Analytics generally has a purpose in mind: to extract specific information out of the huge amount of raw data. Data Mining, on the other hand, is a more generic process of trying to understand the data.


Map Reduce

Map Reduce is one of the greatest algorithmic breakthroughs that made Big Data processing feasible. No single machine can process such huge volumes of data at the speed that we need. Hence, it is necessary to split the available task into multiple jobs that can be processed on individual machines. When we run jobs on several machines, we have to take extreme care of one major aspect: redundancy.
Any machine can fail at any time. With thousands of machines running in parallel, it is practically certain that at least a few of them will crash every day. This should not break the application. Hence the processing as well as the data should have redundant copies. You need a master to take care of merging the results of the redundant jobs, and the master itself should be backed up by redundant copies.
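A quick back-of-the-envelope calculation shows why failures are a daily certainty at this scale. The numbers are assumptions for illustration only: 5000 machines, each failing independently with a probability of 0.1% per day.

```python
def prob_at_least_one_failure(machines, daily_failure_rate):
    """Probability that one or more machines fail on a given day.

    Assumes machines fail independently with the same daily rate.
    """
    return 1.0 - (1.0 - daily_failure_rate) ** machines


# Assumed cluster size and failure rate, not measured values.
machines = 5000
daily_failure_rate = 0.001

p = prob_at_least_one_failure(machines, daily_failure_rate)
expected_failures = machines * daily_failure_rate

print(f"P(at least one failure today) = {p:.4f}")
print(f"Expected failures per day = {expected_failures:.1f}")
```

Under these assumptions a day without any failure is the rare case, so the system has to treat failure as normal operation.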
The application that runs on this distributed system should be designed with these aspects in mind. Map Reduce helps you with that. Of course, not every application fits Map Reduce, but it is specially designed to help big data processing. In the simplest terms, Map Reduce splits the processing into map tasks that turn the input into key-value pairs, followed by a step that sorts these pairs and groups them by key, and reduce tasks that combine each group to generate the required output.
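The classic illustration of this pattern is counting words. The sketch below simulates the map, group and reduce steps inside a single Python process. It is not a real distributed framework, has no redundancy or master, and all function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document would be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs))


counts = word_count(["big data is big", "data mining uses data"])
print(counts)
```

Because each map call looks at only one document and each reduce call looks at only one key, both steps can be spread over many machines and simply re-run when one of them fails.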