Elementary Probability Theory

Probability is a big topic, and it is not possible to cover everything here. This tutorial touches on the fundamentals that give you the conceptual framework required for data analysis and inferential statistics.


In a random process, we know what outcomes could happen but we don’t know which particular outcome will happen.


Examples include tossing a coin, rolling a die, shuffle mode on your music player, and the stock market.

If you toss a coin, you know that only two outcomes are possible, but you don’t know which one will occur. Similarly, in shuffle mode you know which songs are stored on your music player, so you know the possible outcomes: the next song will be something from your music library, but you don’t know which one. Sometimes it is helpful to model a process as random even though it is not truly random. The stock market is an example.

We write P(A) for the probability of event A. There are several possible interpretations of probability, but they (almost) completely agree on the mathematical rules probability must follow, starting with 0 <= P(A) <= 1. That is, the probability of an event is always between 0 and 1.

The traditional interpretation of probability is as a relative frequency. This is called the frequentist interpretation.

Frequentist Interpretation of Probability:

The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times. An alternative is the Bayesian interpretation.

Bayesian interpretation of Probability:

A Bayesian interprets probability as a subjective degree of belief. For the same event, two people may have different viewpoints and so assign different probabilities to it. This interpretation allows prior information to be integrated into the inferential framework. It has been largely popularized by revolutionary advances in computational technology and methods during the last twenty years.

Law of large numbers:

The law of large numbers states that as more observations are collected, the proportion of occurrences of a particular outcome converges to the probability of that outcome. For example, if you roll a die 6 times, there is no guarantee that you will get at least one five. But if you roll the die 600 or 6000 times, you can expect the proportion of fives to be close to 1/6.
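
The die example can be tried out with a short simulation. This is only a sketch: the function name `proportion_of_fives`, the fixed seed, and the particular roll counts are choices made here for illustration, not part of the original text.

```python
import random

def proportion_of_fives(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the share of fives."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    fives = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 5)
    return fives / n_rolls

# The proportion should drift toward 1/6 (about 0.1667) as n grows.
for n in (6, 600, 6000, 60000):
    print(n, round(proportion_of_fives(n), 4))
```

With only 6 rolls the proportion can be far from 1/6, while the larger runs should land much closer to it.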

Disjoint Events:

Disjoint events cannot happen at the same time. A synonym for this is mutually exclusive.

  • The outcome of a single coin toss cannot be a head and tail at the same time.
  • A student can’t both fail and pass a class.
  • A single card drawn from a deck cannot be an ace and a queen.
  • The events don’t join, hence the term disjoint.
  • For disjoint events A and B, P(A and B) = 0.

Non-Disjoint Events:

Non-disjoint events can happen at the same time. For example, a student can get an A in both Statistics and Econ in the same semester. For non-disjoint events, P(A and B) is not equal to 0.

Union of disjoint events:

What is the probability of drawing a Jack or a three from a well-shuffled full deck of cards?

P(J or 3) = P(J) + P(3) = 4/52 + 4/52 ≈ 0.154

For disjoint events A and B, P(A or B) = P(A) + P(B)
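
As a check on the Jack-or-three calculation, the deck can be enumerated directly. The sketch below assumes a standard 52-card deck; the names `DECK`, `prob`, `is_jack` and `is_three` are illustrative helpers introduced here.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]  # 52 cards

def prob(event):
    """Probability that one card drawn at random satisfies `event`."""
    favourable = sum(1 for card in DECK if event(card))
    return Fraction(favourable, len(DECK))

def is_jack(card):
    return card[0] == "J"

def is_three(card):
    return card[0] == "3"

# A single card cannot be both a Jack and a three, so the events are disjoint.
assert prob(lambda c: is_jack(c) and is_three(c)) == 0

p_jack_or_three = prob(lambda c: is_jack(c) or is_three(c))
assert p_jack_or_three == prob(is_jack) + prob(is_three)
print(p_jack_or_three, round(float(p_jack_or_three), 3))
```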

Union of non-disjoint events:

What is the probability of drawing a Jack or a red card from a well-shuffled full deck of cards? How is this different from the previous question?

Here we have 4 Jacks and 26 red cards in the deck, and note that there is an overlap: the two red Jacks meet both criteria. We need to account for this overlap, because we do not want to count those two cards twice when calculating the probability.

P(J or red) = P(J) + P(red) - P(J and red) = 4/52 + 26/52 - 2/52 ≈ 0.538

For non-disjoint events A and B, P(A or B) = P(A) + P(B) - P(A and B)

The general Addition rule:

P(A or B) = P(A) + P(B) - P(A and B)

Note that when A and B are disjoint, P(A and B) = 0, so the formula simplifies to P(A or B) = P(A) + P(B)
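
The general addition rule can be checked against the Jack-or-red example by counting cards. This is a sketch under the assumption of a standard 52-card deck with hearts and diamonds as the red suits; the helper names are introduced here for illustration.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

def prob(event):
    """Probability that one card drawn at random satisfies `event`."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def is_jack(card):
    return card[0] == "J"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

p_jack = prob(is_jack)                                # 4 of 52 cards
p_red = prob(is_red)                                  # 26 of 52 cards
p_both = prob(lambda c: is_jack(c) and is_red(c))     # the two red Jacks
p_either = prob(lambda c: is_jack(c) or is_red(c))    # counted directly

# Subtracting the overlap once makes the rule agree with the direct count.
assert p_either == p_jack + p_red - p_both
print(p_either, round(float(p_either), 3))
```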

Sample Space:

A sample space is the collection of all possible outcomes of a trial. For example, if a couple has two kids, what is the sample space for the sex of these kids? Assume that sex can only be male or female.

{MM, FF, FM, MF} is the sample space for the sex of a couple’s two kids.

As a second example, if you toss a coin two times, what is the sample space? It is

{HH, TT, HT, TH}. Since each outcome is equally likely, each has a 25% chance of occurring. A probability distribution lists all possible outcomes in the sample space and the probabilities with which they occur.

Note that this is a probability distribution for discrete events. The next section covers probability distributions of continuous variables. Probability distributions follow three broad rules.

Probability distribution rules:

  1. The events listed must be disjoint.
  2. Each probability must be between 0 and 1
  3. The probabilities must total 1.
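
The rules above can be checked mechanically for the two-coin-toss distribution. In this sketch the function `is_valid_distribution` is a hypothetical helper; it assumes each dictionary key is a distinct single outcome, which is what makes the listed events disjoint.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a coin: HH, HT, TH, TT.
sample_space = ["".join(tosses) for tosses in product("HT", repeat=2)]

# Equally likely outcomes, so each one gets probability 1/4.
distribution = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

def is_valid_distribution(dist):
    """Check rules 2 and 3; rule 1 is assumed because keys are distinct outcomes."""
    each_in_range = all(0 <= p <= 1 for p in dist.values())
    totals_one = sum(dist.values()) == 1
    return each_in_range and totals_one

print(sample_space)
print(is_valid_distribution(distribution))
```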

Complementary events:

Complementary events are two mutually exclusive events whose probabilities add up to 1.

Note that complementary and disjoint events are not the same. The probabilities of two disjoint outcomes do not necessarily add up to 1, but the probabilities of two complementary outcomes always do.

Independent Events:

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. For repeated trials, this means an outcome does not depend on the previous outcomes. Let’s say you toss a coin 10 times, and it lands on heads each time. What do you think the chance is that another head will come up on the next toss? The probability is still 50%.

P(H on the 11th toss) = P(H on the 10th toss) = 0.5

Put another way, you can think of an independent event as memoryless: it doesn’t remember what happened in the past.

On the other hand, suppose you draw a card from a deck, do not put it back, and the 1st draw is a J. In the 2nd draw, the probability of a J is now P(J) = 3/51: one card has already been drawn, so 52 - 1 = 51 cards remain, and the number of Js is now 4 - 1 = 3. Before the 1st draw the probability of a J was 4/52, but in the 2nd draw it becomes 3/51. So this is an example of dependent events.
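
The 3/51 figure can be confirmed by listing every ordered pair of cards drawn without replacement. This is a sketch assuming a standard 52-card deck; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

# Every ordered pair of distinct cards: 52 * 51 equally likely two-card draws.
draws = list(permutations(DECK, 2))

# Keep only the draws whose first card is a Jack, then count Jack second cards.
first_is_jack = [d for d in draws if d[0][0] == "J"]
both_jacks = [d for d in first_is_jack if d[1][0] == "J"]

p_second_jack_given_first_jack = Fraction(len(both_jacks), len(first_is_jack))
p_first_jack = Fraction(len(first_is_jack), len(draws))

print(p_first_jack, p_second_jack_given_first_jack)
```

The two printed fractions differ, which is what makes the second draw dependent on the first.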

Checking for independence:

If the probability of A given B equals the probability of A, then A and B are independent events. In other words, knowing B tells us nothing about A.

If P(A|B) = P(A), then A and B are independent.

Multiplication rule for independent events:

The product rule for independent events says that if A and B are independent, then the probability of both A and B happening is simply the product of their probabilities.

If A and B are independent, P(A and B) = P(A) * P(B)

If you toss a coin twice what is the probability of getting two tails in a row?

P(two tails in a row) = P(T on the 1st toss) * P(T on the 2nd toss) = 0.5 * 0.5 = 0.25

If A1, A2, A3, ..., Ak are independent, P(A1 and A2 and A3 and ... and Ak) = P(A1) * P(A2) * P(A3) * ... * P(Ak)
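
The product rule for several independent tosses can be compared with a brute-force count. The sketch below assumes a fair coin; `prob_all_tails` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

def prob_all_tails(k):
    """P(k tails in a row) for a fair coin, by enumerating all 2**k sequences."""
    sequences = list(product("HT", repeat=k))
    favourable = sum(1 for seq in sequences if all(toss == "T" for toss in seq))
    return Fraction(favourable, len(sequences))

p_tail = Fraction(1, 2)

# The enumerated answer matches P(T) multiplied by itself k times.
for k in (1, 2, 3, 5):
    assert prob_all_tails(k) == p_tail ** k

print(prob_all_tails(2))
```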

Marginal Probability:

The study titled ADOLESCENTS’ UNDERSTANDING OF SOCIAL CLASS examines teens’ beliefs about their social class. The sample consists of 98 16-year-olds: 48 working class and 50 upper middle class.

The study was designed in the following way:

  • “Objective” assignment to social class based on self-reported measures of both parents’ occupation and education and household income.
  • “Subjective” association based on survey questions.

The results of the study are summarized in a contingency table.

What is the probability that a student’s objective social class position is upper middle? The objective upper middle class column of the table shows a total of 50 students in this category, so P(objective upper middle class) = 50/98 ≈ 0.51. The term marginal probability comes from the fact that the counts used to calculate it come from the margins of the contingency table. Here 50 and 98 are both totals, which sit in the margin of the table.

Joint Probability:

What is the probability that a student’s objective position and subjective identity are both upper middle class?

P(objective upper middle class and subjective upper middle class) = 37/98 ≈ 0.38 (the circled cell in the contingency table). The important term in joint probability is AND. Here we consider the students who are in the intersection of the two events of interest.

Conditional Probability:

What is the probability that a student who is objectively in the working class associates with the upper middle class?

P(subjective upper middle class | objective working class) = 8/48 ≈ 0.17

The important thing to note here is the vertical line, read as “given”, which separates what we are looking for from what we know to be true about the students. This is called conditional probability because we first condition on the working class only, and then calculate the probability from the counts in that column alone.
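
The three kinds of probability can be computed side by side from the counts quoted in the text. This sketch uses only those quoted counts (98, 48, 50, 37 and 8); the other cells of the table are not needed, and the variable names are introduced here for illustration.

```python
from fractions import Fraction

# Counts quoted in the text; the remaining table cells are not used.
total_students = 98
objective_working = 48
objective_upper_middle = 50
objective_and_subjective_upper_middle = 37
objective_working_subjective_upper_middle = 8

# Marginal: a margin total divided by the grand total.
p_marginal = Fraction(objective_upper_middle, total_students)
# Joint: one interior cell divided by the grand total.
p_joint = Fraction(objective_and_subjective_upper_middle, total_students)
# Conditional: one interior cell divided by the total of the column we condition on.
p_conditional = Fraction(objective_working_subjective_upper_middle, objective_working)

for name, p in [("marginal", p_marginal), ("joint", p_joint), ("conditional", p_conditional)]:
    print(f"{name}: {p} = {float(p):.2f}")
```

The only thing that changes between the three is which count goes in the numerator and which total goes in the denominator.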

Bayes’ Theorem:

Formally, we calculate conditional probability with the following formula, referred to here as Bayes’ theorem. (Strictly speaking, it is the definition of conditional probability, from which the usual statement of Bayes’ theorem follows.)

P(A|B) = P(A and B) / P(B)

Here the joint probability is in the numerator and the probability of what we condition on is in the denominator. Consider the same question from the conditional probability section and calculate the probability using Bayes’ theorem.

P(subjective upper middle class | objective working class) = P(subjective upper middle class and objective working class) / P(objective working class) = (8/98) / (48/98) = 8/48 ≈ 0.17. This is the same answer we reached earlier by simply reasoning through the contingency table. But if we don’t have the counts neatly organized in a table, using Bayes’ theorem to calculate the conditional probability is much more practical.
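
The same calculation can be written as a small function that takes probabilities rather than counts. This is a minimal sketch; `conditional_probability` is a hypothetical helper, and exact fractions are used so that the comparison with 8/48 is not affected by rounding.

```python
from fractions import Fraction

def conditional_probability(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_a_and_b / p_b

# Probabilities from the survey example: joint in the numerator,
# the event we condition on in the denominator.
p_subjective_umc_and_objective_wc = Fraction(8, 98)
p_objective_wc = Fraction(48, 98)

result = conditional_probability(p_subjective_umc_and_objective_wc, p_objective_wc)
assert result == Fraction(8, 48)  # the grand total of 98 cancels out
print(result, round(float(result), 2))
```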

In a card game, suppose a player needs to draw two cards of the same suit in order to win. Of the 52 cards, there are 13 cards in each suit. Suppose first the player draws a heart. Now the player wishes to draw a second heart. Since one heart has already been chosen, there are now 12 hearts remaining in a deck of 51 cards. So the conditional probability P(Draw second heart|First card a heart) = 12/51.

General Product rule:

Previously, it was shown that the product rule for independent events is P(A and B) = P(A) * P(B). If A and B are not independent, the joint probability needs to be calculated slightly differently. Since Bayes’ theorem does not require independence, we can simply rearrange it to get the joint probability P(A and B) as the product of the conditional probability P(A|B) and the probability P(B).

General product rule:  P(A and B) = P(A|B) * P(B)

Here we are rearranging Bayes’ theorem to get a new rule for joint probability. Consider the example below.

Suppose an individual applying to a college determines that he has an 80% chance of being accepted, and he knows that dormitory housing will only be provided for 60% of all of the accepted students. The chance of the student being accepted and receiving dormitory housing is defined by

P(Accepted and Dormitory Housing) = P(Dormitory Housing|Accepted)P(Accepted) = (0.60)*(0.80) = 0.48.
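
The college example can be expressed as a one-line function. This is only a sketch; the name `joint_probability` and the range check are additions made here, not part of the original example.

```python
def joint_probability(p_a_given_b, p_b):
    """General product rule: P(A and B) = P(A|B) * P(B)."""
    for p in (p_a_given_b, p_b):
        if not 0 <= p <= 1:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_a_given_b * p_b

p_accepted = 0.80
p_housing_given_accepted = 0.60

p_accepted_and_housing = joint_probability(p_housing_given_accepted, p_accepted)
print(round(p_accepted_and_housing, 2))
```

Note that the rule does not assume independence: the 60% figure applies only to accepted students, which is why it enters as a conditional probability.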

Independence and Conditional Probability:

Generically, if P(A|B) = P(A), then events A and B are said to be independent.

Conceptually, being given B doesn’t tell us anything about A.

Mathematically, if events A and B are independent, P(A and B) = P(A) * P(B). Then,

P(A|B) = P(A and B) / P(B) = P(A) * P(B) / P(B) = P(A)

Previously we stated the rule P(A|B) = P(A) for independent events. Using Bayes’ theorem, we can now show why this is the case mathematically.
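
The identity can be verified on a concrete case by enumeration. The sketch below takes two tosses of a fair coin, with A as "second toss is a head" and B as "first toss is a head"; these event choices and the helper names are assumptions made here for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))  # four equally likely outcomes

def prob(event):
    """Probability of `event` over the equally likely two-toss outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_head(outcome):
    return outcome[0] == "H"

def second_is_head(outcome):
    return outcome[1] == "H"

p_a = prob(second_is_head)
p_b = prob(first_is_head)
p_a_and_b = prob(lambda o: first_is_head(o) and second_is_head(o))
p_a_given_b = p_a_and_b / p_b

# Both characterisations of independence hold for the two tosses.
assert p_a_and_b == p_a * p_b
assert p_a_given_b == p_a
print(p_a, p_a_given_b)
```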

Probability Trees:

Probability trees are very useful when P(A|B) is given in a question and you are asked for P(B|A).

You have 100 emails in your mail box. 60 are spam, 40 are not. Of the 60 spam emails, 35 contain the word “free”. Of the rest, 3 contain the word “free”. If an email contains the word “free”, what is the probability that it is spam?

We are trying to find P(spam | “free”). A probability tree organizes the information: the first branches split the emails into spam and not spam, and the second branches split each group by whether the email contains the word “free”.

Since we are interested only in emails containing the word “free”, 35 come from the spam branch and 3 come from the non-spam branch. The total number of emails that contain the word “free” is 35 + 3 = 38.

P(spam | “free”) = 35 / (35 + 3) = 35/38 ≈ 0.92

Here we have implicitly made use of Bayes’ theorem. The numerator holds the joint probability, and the denominator holds the marginal probability of what we are conditioning on, the word “free”.
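
The tree for the spam example can be written out branch by branch. This is a sketch using the counts from the example; the variable names, including `not_spam` for the emails that are not spam, are introduced here.

```python
from fractions import Fraction

spam_total, not_spam_total = 60, 40
spam_with_free, not_spam_with_free = 35, 3

# First-level branches: spam or not spam.
p_spam = Fraction(spam_total, spam_total + not_spam_total)
p_not_spam = 1 - p_spam

# Second-level branches: contains "free", given the first-level branch.
p_free_given_spam = Fraction(spam_with_free, spam_total)
p_free_given_not_spam = Fraction(not_spam_with_free, not_spam_total)

# Multiply along each path to get the joint probabilities.
p_spam_and_free = p_spam * p_free_given_spam
p_not_spam_and_free = p_not_spam * p_free_given_not_spam

# Add the paths that end in "free" to get the marginal, then divide.
p_free = p_spam_and_free + p_not_spam_and_free
p_spam_given_free = p_spam_and_free / p_free

print(p_spam_given_free, round(float(p_spam_given_free), 2))
```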

Consider another example:

As of 2009, Swaziland had the highest HIV prevalence in the world. 25.9% of this country’s population is infected with HIV. The ELISA test is one of the first and most accurate tests for HIV. For those who carry HIV, the ELISA test is 99.7% accurate. For those who do not carry HIV, the test is 92.6% accurate. If an individual from Swaziland has tested positive, what is the probability that he carries HIV?

P(HIV) = 0.259

P(+ | HIV) = 0.997 and P(– | no HIV) = 0.926

Now find P(HIV | +) = ?

Here the conditional probability asked for is the reverse of the one given, i.e. P(+ | HIV) = 0.997, so we should work through a tree diagram.
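
The tree calculation for this example can be sketched as follows. It reads the 92.6% figure as the chance of a negative result for someone who does not carry HIV, as the problem statement describes; the function name `posterior` and the parameter names `sensitivity` and `specificity` are labels chosen here, not terms used in the original.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) from a two-branch probability tree."""
    # Path 1: has the condition and tests positive.
    true_positive = prior * sensitivity
    # Path 2: does not have the condition but still tests positive.
    false_positive = (1 - prior) * (1 - specificity)
    # Joint probability of path 1 over the total probability of a positive.
    return true_positive / (true_positive + false_positive)

p_hiv_given_positive = posterior(prior=0.259, sensitivity=0.997, specificity=0.926)
print(round(p_hiv_given_positive, 2))
```

Under these inputs the result comes out at roughly 0.82, noticeably lower than the 99.7% accuracy among carriers, because false positives among the larger group without HIV also contribute to the pool of positive tests.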
