Probabilistic Modeling
This refers to models of data generation, i.e. generative models: they model where the data comes from.
Consider spam classification. We would learn the probability distribution for the examples in the spam class as well as the probability distribution for the examples in the non-spam (ham) class.
Given a new sample x, we must then calculate P(x is spam). By default, we label it as spam if P(x is spam)>=0.5 i.e. if P(x is spam)>=P(x is ham).
More generally put, we must compute P(C|X), i.e. the probability of a class C given an example X, i.e. the probability of X belonging to C.
According to Bayes' Rule,

P(C|X) = P(X|C) P(C) / P(X)
P(C) is the prior probability, P(X|C) is the likelihood probability of X being generated from C and P(X) is known as the evidence. P(C|X) is called the posterior probability.
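Bayes' Rule can be sketched directly in code. The numbers below (a spam prior of 0.4 and the two likelihoods) are illustrative assumptions, not values from the notes; the evidence P(x) is expanded over both classes using the law of total probability.

```python
# Bayes' Rule on the spam example, with made-up numbers.
def posterior(prior_c, likelihood_x_given_c, likelihood_x_given_other):
    """P(C|x) = P(x|C) P(C) / P(x), with the evidence P(x)
    expanded over both classes: P(x) = P(x|C)P(C) + P(x|~C)P(~C)."""
    evidence = (likelihood_x_given_c * prior_c
                + likelihood_x_given_other * (1 - prior_c))
    return likelihood_x_given_c * prior_c / evidence

# Hypothetical numbers: P(spam) = 0.4, P(x|spam) = 0.02, P(x|ham) = 0.001.
p_spam_given_x = posterior(0.4, 0.02, 0.001)
print(round(p_spam_given_x, 2))  # 0.93 -> label as spam, since >= 0.5
```

Since the posterior exceeds 0.5, the default rule from above labels this message as spam.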
C is the hypothesis and X is the data.
The prior probability is computed without having seen the data X.
For example:
X: I am late to class
Hypothesis 1 (Martians): I was kidnapped by Martians
Hypothesis 2 (research): I was thinking about research and I lost track of time
Say, P(Martians) = 0.00000...1. We assume that these are the only two hypotheses. Therefore, P(research) = 0.999999....9. The probability that I was late to class given that the Martians kidnapped me is pretty high. So, say P(X|Martians) = 0.97. Now, the probability that I was late to class given that I was thinking about research and lost track of time is also fairly high. So, say P(X|research) = 0.5.
Bayes' Rule aims to find the most probable hypothesis. In other words, given X, we choose the hypothesis that maximizes P(C|X):

C_MAP = argmax_C P(C|X)

This is also called the Maximum a-posteriori (MAP) hypothesis, where a-posteriori means 'after' seeing the data.
C_MAP = argmax_C P(X|C) P(C) / P(X) = argmax_C P(X|C) P(C)

since the denominator P(X) is a constant.
In our example, P(X|Martians) P(Martians) = 0.97 × 0.00000...1 ≈ 0 and P(X|research) P(research) = 0.5 × 0.999999....9 ≈ 0.5. Therefore, the MAP hypothesis is that I was thinking about research and lost track of time.
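The MAP comparison above can be sketched as follows. The notes leave the tiny Martian prior elided, so 1e-9 stands in for it here; the likelihoods 0.97 and 0.5 are the ones from the example.

```python
# MAP over the two late-to-class hypotheses.
# 1e-9 is a stand-in for the elided tiny prior P(Martians).
priors      = {"martians": 1e-9, "research": 1 - 1e-9}
likelihoods = {"martians": 0.97, "research": 0.5}   # P(X|C) from the notes

# C_MAP = argmax_C P(X|C) P(C); the evidence P(X) is dropped (constant).
unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
c_map = max(unnormalized, key=unnormalized.get)
print(c_map)  # research
```

However improbable-sounding the likelihood numbers, the tiny prior on the Martian hypothesis dominates the comparison.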
Note that if we have a uniform prior (distribution), i.e. all the hypotheses are equally likely, the MAP and ML (Maximum Likelihood) hypotheses are the same, since the prior P(C) then scales every likelihood by the same constant.
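A quick check of this fact, with hypothetical likelihood values: multiplying every likelihood by the same uniform prior cannot change which hypothesis attains the maximum.

```python
# With a uniform prior, MAP and ML pick the same hypothesis.
likelihoods = {"c1": 0.97, "c2": 0.5, "c3": 0.2}  # hypothetical P(X|C) values
uniform_prior = 1 / len(likelihoods)

c_ml  = max(likelihoods, key=likelihoods.get)                       # argmax P(X|C)
c_map = max(likelihoods, key=lambda c: likelihoods[c] * uniform_prior)  # argmax P(X|C)P(C)
print(c_ml == c_map)  # True
```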
For continuous distributions, i.e. for a continuous random variable X, we cannot compute P(X) directly, because X does not take values from a fixed discrete set; the probability of any single exact value is zero.
Instead, we compute pdf(X), i.e. the probability density function of X, denoted p(X). Visually speaking, it is the height of the curve at X.
Say we have two probability distributions, one for men's heights and the other for women's heights. Both are assumed to be Normal/Gaussian distributions.
For a continuous random variable X, Bayes' Rule uses the density in place of the probability:

P(C|x) = p(x|C) P(C) / p(x)

The MAP hypothesis = argmax_C p(x|C) P(C), since the denominator p(x) is a constant.

The Maximum Likelihood (ML) hypothesis is given by C_ML = argmax_C p(x|C).

Say, P(man) = P(woman) = 0.5. So, ML hypothesis = MAP hypothesis.

Therefore, given a new height x, we choose the hypothesis (man or woman) whose weighted density p(x|C) P(C) is higher at x.
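The heights example can be sketched as below. The Gaussian means and standard deviations, and the 50/50 priors, are illustrative assumptions rather than values from the notes.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Height of the Normal curve at x (a density, not a probability)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Illustrative class-conditional parameters (mu, sigma) in cm, and priors.
params = {"man": (178.0, 7.0), "woman": (165.0, 6.5)}
priors = {"man": 0.5, "woman": 0.5}  # uniform, so MAP == ML here

def classify(height_cm):
    # C_MAP = argmax_C p(x|C) P(C); the evidence p(x) is the same for both classes.
    scores = {c: gaussian_pdf(height_cm, *params[c]) * priors[c] for c in params}
    return max(scores, key=scores.get)

print(classify(181.0))  # man
print(classify(160.0))  # woman
```

The classifier simply compares the heights of the two (prior-weighted) curves at the observed value x.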