Probabilistic Modeling

Probabilistic modeling refers to models of data generation i.e. generative models. They model where the data comes from.

Consider spam classification. We would learn the probability distribution for the examples in the spam class as well as the probability distribution for the examples in the non-spam (ham) class.

Given a new sample x, we must then calculate P(spam|x). By default, we label x as spam if P(spam|x) >= 0.5 i.e. if P(spam|x) >= P(ham|x).

More generally put, we must compute P(C|X) i.e. the probability of a class C given an example X i.e. the probability of X belonging to C.

Bayes' Rule

According to Bayes' Rule,

P(C|X) = \frac{P(C)P(X|C)}{P(X)}

P(C) is the prior probability, P(X|C) is the likelihood i.e. the probability of X being generated from C, and P(X) is known as the evidence. P(C|X) is called the posterior probability.
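Returning to the spam example, here is a rough sketch of how these pieces combine in code. All numbers and the classify helper are hypothetical, chosen only to illustrate the computation; in practice the prior and the likelihoods would be estimated from training data.

```python
# Hypothetical sketch of computing the posterior P(C|x) via Bayes' Rule.
# All numbers are made up for illustration; in practice the priors and
# class-conditional likelihoods would be estimated from training data.

def classify(p_spam, p_x_given_spam, p_ham, p_x_given_ham):
    """Return the class with the larger posterior P(C|x)."""
    evidence = p_spam * p_x_given_spam + p_ham * p_x_given_ham  # P(x)
    p_spam_given_x = p_spam * p_x_given_spam / evidence
    return "spam" if p_spam_given_x >= 0.5 else "ham"

# e.g. P(spam) = 0.4, P(x|spam) = 0.02, P(ham) = 0.6, P(x|ham) = 0.001
print(classify(0.4, 0.02, 0.6, 0.001))  # -> spam
```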

C is the hypothesis and X is the data.

The prior probability is computed without having seen the data X.

For example:

X - I am late to class
C_1 - I was kidnapped by Martians
C_2 - I was thinking about research and I lost track of time

Say, P(C_1) = 0.00000...1. We assume that these are the only two hypotheses. Therefore, P(C_2) = 0.999999...9. The probability that I was late to class given that the Martians kidnapped me is pretty high. So, say P(X|C_1) = 0.97. Now, the probability that I was late to class given that I was thinking about research and lost track of time is also pretty high. So, say P(X|C_2) = 0.5.

Bayes' Rule aims to find the most probable hypothesis. In other words, given X, we aim to choose the hypothesis that maximizes P(C|X) i.e. argmax_C P(C|X). This is also called the maximum a posteriori (MAP) hypothesis, where 'a posteriori' means 'after' seeing the data.

argmax_C P(C|X) = argmax_C \frac{P(C)P(X|C)}{P(X)} = argmax_C P(C)P(X|C), since the denominator does not depend on C.

In our example, P(C_1|X) \propto P(C_1)P(X|C_1) = 0.0000...1 * 0.97 and P(C_2|X) \propto P(C_2)P(X|C_2) = 0.999...9 * 0.5.

Therefore, we choose hypothesis C2C_2.
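A quick numeric check of this comparison (a sketch only: the tiny value 1e-9 below stands in for the elided prior 0.0000...1, which is left unspecified above):

```python
# MAP hypothesis for the lateness example.
# 1e-9 is an illustrative stand-in for the elided prior 0.0000...1.
priors = {"C1 (Martians)": 1e-9, "C2 (research)": 1 - 1e-9}
likelihoods = {"C1 (Martians)": 0.97, "C2 (research)": 0.5}

# Score each hypothesis by P(C) * P(X|C); the evidence P(X) is the same
# for both hypotheses, so it drops out of the argmax.
scores = {c: priors[c] * likelihoods[c] for c in priors}
print(max(scores, key=scores.get))  # -> C2 (research)
```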

The Maximum Likelihood (ML) hypothesis is given by argmax_C P(X|C).

Note that if we have a uniform prior (distribution) i.e. all the hypotheses are equally likely, the MAP and ML hypotheses are the same.
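To see why: with K equally likely hypotheses, P(C) = 1/K for every C, so the prior is a constant that factors out of the argmax:

argmax_C P(C)P(X|C) = argmax_C \frac{1}{K} P(X|C) = argmax_C P(X|C)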

Continuous Distributions and Bayes' Rule

For continuous distributions, i.e. for a continuous random variable X, we cannot compute P(X) directly: X does not take values in a fixed discrete set, and the probability of it taking any single exact value is zero.

Instead, we compute pdf(X) i.e. the probability density function of X. We denote it as p(X). It is, visually speaking, the height of the density curve at X.
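For instance, a density height can be evaluated directly, here with scipy.stats.norm (the mean and standard deviation are assumed values for illustration):

```python
from scipy.stats import norm

# Height of the N(175, 7^2) density curve at x = 180.
# The parameters (mean 175 cm, std 7 cm) are illustrative assumptions.
p_x = norm.pdf(180, loc=175, scale=7)
print(p_x)  # ~0.044 -- a density, not a probability, so it can exceed 1
```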

Say we have two probability distributions, one for men's heights and the other for women's heights. They are assumed to be Normal/Gaussian distributions.

Say, C_1: man, C_2: woman; P(C_1) = P(C_2) = 0.5. So, ML hypothesis = MAP hypothesis.

For a continuous random variable X, Bayes' Rule is as follows:

P(C|X) = \frac{P(C)p(X|C)}{p(X)}

The MAP hypothesis = argmax_C P(C|X) = argmax_C P(C)p(X|C)
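Putting this together, here is a minimal sketch of the MAP classifier for the heights example. Only the equal priors come from the text; the Gaussian parameters are assumptions for illustration:

```python
from scipy.stats import norm

# Class-conditional Gaussians for height in cm.
# The means and standard deviations are illustrative assumptions.
params = {"man": (175, 7), "woman": (162, 6)}
priors = {"man": 0.5, "woman": 0.5}  # equal priors, as in the text

def map_class(x):
    """Return argmax_C P(C) * p(x|C); p(x) cancels out of the argmax."""
    scores = {c: priors[c] * norm.pdf(x, loc=mu, scale=sigma)
              for c, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)

print(map_class(180))  # -> man
print(map_class(160))  # -> woman
```

With equal priors the prior factor is the same for both classes, so this reduces to picking the class whose density curve is taller at x, i.e. the ML hypothesis.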
