Statistics - Some Important Concepts

Introduction

Statistics and linear algebra are the basic pillars of data science, data analytics and machine learning. This blog provides a refresher on some of the basic concepts that one should know in order to work with these technologies.

Bayes' theorem

Bayes' theorem is a fundamental concept in statistics. It may sound trivial once you understand it, but it has been key to solving complex problems in many domains, including machine learning, so it is worth grasping well. Bayes' theorem is used to calculate conditional probability: the probability of an event A occurring given that a related event B has already occurred, written P(A|B). Consider, for example, a clinic that wants to assess cancer risk among its patients. In formal statistical terms, let A represent the event "Person has cancer" and let B represent the event "Person is a smoker". The clinic wishes to calculate the proportion of smokers who have cancer, that is, P(A|B). To do so, use Bayes' theorem (also known as Bayes' rule), which is as follows:
P(A|B) = P(B|A) * P(A) / P(B)
That is: the probability of A given B is the probability of B given A multiplied by the ratio of the probability of A to the probability of B.
In our case, we can use this to find the probability that a smoker has cancer. From the data available in the clinic, one can estimate the probability that a patient has cancer, P(A), and the probability that a patient is a smoker, P(B). From the data, we also know P(B|A), the probability that a cancer patient is a smoker. With these three values, Bayes' theorem gives P(A|B), the probability that a smoker has cancer.
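To make the clinic example concrete, here is a minimal Python sketch. The patient counts are hypothetical, invented purely for illustration, and the function name bayes_posterior is not from any library. It computes P(A|B) with Bayes' theorem and checks the result against counting smokers with cancer directly.

```python
from fractions import Fraction


def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical clinic records (made-up numbers, for illustration only)
total_patients = 1000
cancer_patients = 50        # event A: person has cancer
smokers = 100               # event B: person is a smoker
smokers_with_cancer = 10    # both A and B

p_a = Fraction(cancer_patients, total_patients)               # P(A)
p_b = Fraction(smokers, total_patients)                       # P(B)
p_b_given_a = Fraction(smokers_with_cancer, cancer_patients)  # P(B|A)

# P(A|B) via Bayes' theorem
p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)

# P(A|B) by direct counting: smokers with cancer out of all smokers
p_a_given_b_direct = Fraction(smokers_with_cancer, smokers)

print(p_a_given_b, p_a_given_b_direct)
```

Fractions are used instead of floats so that the two routes can be compared exactly. With real data, the probabilities would be estimates and ordinary floats would do.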
This is the essence of Bayes' theorem. As mentioned, it sounds trivial once it is understood, but this principle is the basis for many other derivations in statistics, machine learning and data analytics.
If you are interested in a more detailed analysis of the subject, you can check out this Link

Binomial Distribution

This is the probability distribution of the number of successes in an experiment with a fixed number of trials. It applies only to discrete random variables (random variables that take distinct, countable values rather than a continuous range). Formally,
  • The experiment should have a fixed, finite number of trials
  • There should be exactly two outcomes in each trial: success and failure
  • Trials are independent, and the probability of success (p) remains constant across trials
All of the above must be satisfied for a distribution to qualify as binomial. For example, consider the number of heads in 20 tosses of a coin. This meets all the criteria: the number of tosses is fixed, each toss has only two possible outcomes (success or failure), and each toss is independent, so succeeding or failing once does not affect the next attempt in any way. The formula to calculate a probability using the binomial distribution is:
P(X=r) = nCr * p^r * (1-p)^(n-r)
n : Number of trials
r : Number of successes
p : Probability of success
1 – p : Probability of failure
nCr : Binomial coefficient, given by n! / (r! * (n-r)!)
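As an illustration of the binomial formula, here is a short Python sketch applied to the coin example. The function name binomial_pmf and the choice of a fair coin (p = 0.5) are assumptions made for this example. It relies on math.comb from the standard library (Python 3.8 or later) for the binomial coefficient.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    if not 0 <= r <= n:
        raise ValueError("r must be between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # nCr * p^r * (1-p)^(n-r)
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Assumed setup: a fair coin tossed 20 times
n_tosses = 20
p_heads = 0.5

# Probability of getting exactly 10 heads
prob_10_heads = binomial_pmf(10, n_tosses, p_heads)
print(round(prob_10_heads, 4))

# Sanity check: probabilities over all possible outcomes sum to 1
total = sum(binomial_pmf(r, n_tosses, p_heads) for r in range(n_tosses + 1))
print(round(total, 6))
```

Summing the probabilities over every possible value of r is a quick way to confirm that the formula has been typed in correctly.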
If you are interested in more, check out this link

References

The following video series provides good in-depth training on various aspects of statistics
If you like reference books, check out this one