Naive Bayes Classifier

Say our data has $d$ attributes/features, and that these attributes are binary. Also, we have a finite set of classes $C$. Given an instance $X = [x_1, x_2, ..., x_d]$, the classifier predicts the maximum a posteriori (MAP) class:

$$C_{MAP} = \arg\max_{c \in C} P(X|c)\,P(c)$$
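This follows from Bayes' rule: the MAP class maximizes the posterior $P(c|X)$, and since the evidence $P(X)$ is the same for every class, it can be dropped from the $\arg\max$:

$$C_{MAP} = \arg\max_{c \in C} P(c|X) = \arg\max_{c \in C} \frac{P(X|c)\,P(c)}{P(X)} = \arg\max_{c \in C} P(X|c)\,P(c)$$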

The main issue here is that we do not know $P(X|C)$ in advance, and it is difficult, if not impossible, to model the joint distribution of all $d$ attributes. Therefore, we use the simplifying assumption that the attributes are conditionally independent, i.e., the attributes are independent given the class. This assumption is also called class-conditional independence.

So,

$$P([x_1, x_2, ..., x_d]\,|\,C) = P(x_1|C) \cdot P(x_2|C) \cdots P(x_d|C)$$

Therefore, we calculate $C_{MAP}$ as:

$$C_{MAP} = \arg\max_{c \in C} P(x_1|c) \cdot P(x_2|c) \cdots P(x_d|c) \cdot P(c)$$
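As a concrete illustration, here is a minimal Python sketch of this decision rule. It assumes the priors and per-attribute conditional probabilities have already been estimated from training data; the names `predict`, `prior`, and `cond_prob`, along with the toy numbers, are made up for the example:

```python
# Minimal Naive Bayes prediction sketch (names and numbers are illustrative).
# cond_prob[c][i][v] holds P(x_i = v | C = c); prior[c] holds P(C = c).

def predict(x, classes, prior, cond_prob):
    """Return the MAP class for instance x = [x_1, ..., x_d]."""
    best_class, best_score = None, -1.0
    for c in classes:
        score = prior[c]
        for i, v in enumerate(x):
            score *= cond_prob[c][i][v]  # class-conditional independence
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy example: d = 2 binary attributes, classes {0, 1}.
classes = [0, 1]
prior = {0: 0.6, 1: 0.4}
cond_prob = {
    0: [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}],  # P(x_i = v | C = 0)
    1: [{0: 0.1, 1: 0.9}, {0: 0.6, 1: 0.4}],  # P(x_i = v | C = 1)
}
print(predict([1, 0], classes, prior, cond_prob))  # -> 1 (0.216 vs. 0.036)
```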

Since we need to compute a product of probabilities, we must prevent any of the individual factors from being 0: a single zero probability would nullify the effect of all the other factors.

To do so, we use smoothing.

If $N$ is the total number of training examples having a certain class $C$, and $t$ is the number of those examples in which $x_i = v$, then the non-smoothed estimate is given by:

$$P(x_i = v|C) = \frac{t}{N}$$

If $t = 0$, this probability becomes 0!

So, we perform add-$m$ smoothing:

$$P(x_i = v|C) = \frac{t + m}{N + ms}$$

where $s$ is the number of possible values for the attribute $x_i$. Sometimes we use $m = 1$; this is called add-1 smoothing. In most cases, we limit $m$ to the range $(0, 1]$.
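For instance, here is a small sketch of this estimator (the function name `smoothed_estimate` and the sample counts are hypothetical, chosen only to show how a zero count is handled):

```python
def smoothed_estimate(t, N, s, m=1.0):
    """Add-m smoothed estimate of P(x_i = v | C).

    t: count of class-C examples with x_i = v
    N: total count of class-C examples
    s: number of possible values of attribute x_i
    m: smoothing constant, usually in (0, 1]
    """
    return (t + m) / (N + m * s)

# Binary attribute (s = 2), 10 examples of class C, value v never observed (t = 0).
print(smoothed_estimate(0, 10, 2))         # add-1 smoothing -> 1/12 ≈ 0.083, not 0
print(smoothed_estimate(0, 10, 2, m=0.5))  # add-0.5 smoothing -> 0.5/11 ≈ 0.045
```

Note that the smoothed estimates over all $s$ values of $x_i$ still sum to 1, since their numerators add up to exactly $N + ms$.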
