# Naive Bayes Classifier

Suppose our data has d attributes (features), each of which is binary, and that we have a finite set of classes C.

$$C_{MAP} = \arg\max_{c \in C} P(X|C)\,P(C)$$

The main issue here is that we do not know P(X|C) in advance, and it is difficult, if not impossible, to model the joint distribution of all the d attributes. Therefore, we use the simplifying assumption that the attributes are **conditionally independent** i.e. the attributes are independent, *given the class*. This assumption is also called **class-conditional independence**.

So, $$P(x_1, x_2, \dots, x_d \mid C) = P(x_1|C) \cdot P(x_2|C) \cdots P(x_d|C)$$
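
To see what this assumption buys us, consider a small illustrative case: with $$d = 3$$ binary attributes, estimating the full joint distribution $$P(x_1, x_2, x_3|C)$$ would require $$2^3 - 1 = 7$$ probabilities per class, whereas under class-conditional independence we only need the three per-attribute probabilities $$P(x_1|C), P(x_2|C), P(x_3|C)$$. In general, the parameter count drops from exponential in d to linear in d.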

Therefore, we calculate $$C_{MAP}$$ as:

$$C_{MAP} = \arg\max_{c \in C} P(x_1|C) \cdot P(x_2|C) \cdots P(x_d|C) \cdot P(C)$$
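
The following is a minimal sketch of this decision rule in Python. The class names and probability tables are hypothetical placeholders; in practice they would be estimated from training counts (see the smoothing discussion below).

```python
# Minimal sketch of the Naive Bayes MAP decision rule for d = 3 binary
# attributes. The probability tables are made-up values for illustration.

prior = {"spam": 0.4, "ham": 0.6}   # P(C)
likelihood = {                      # P(x_i = 1 | C) for each attribute
    "spam": [0.8, 0.1, 0.5],
    "ham":  [0.2, 0.3, 0.4],
}

def predict(x):
    """Return argmax_C P(x_1|C) * ... * P(x_d|C) * P(C)."""
    best_class, best_score = None, -1.0
    for c, p_c in prior.items():
        score = p_c
        for p, xi in zip(likelihood[c], x):
            # P(x_i = 1 | C) = p, hence P(x_i = 0 | C) = 1 - p
            score *= p if xi == 1 else 1 - p
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict([1, 0, 1]))  # picks the class with the larger posterior score
```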

Since we compute a product of probabilities, we must prevent any individual probability from being 0: a single zero factor would nullify the effect of all the other probabilities.

To do so, we use **smoothing**.

If N is the total number of examples having a certain class C, and t is the number of those examples in which $$x_i = v$$, then the non-smoothed estimate is given by:

$$P(x_i=v|C) = \frac{t}{N}$$

If t=0, this probability becomes 0!

So, we perform **add-m smoothing**:

$$P(x_i=v|C) = \frac{t+m}{N+ms}$$ (where s is the number of possible values for the attribute $$x_i$$). Sometimes, we use m=1. This is called **add-1 smoothing**. In most cases, we limit m to the range (0,1].
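
As a concrete sketch of this estimate (the counts below are made up for illustration), the following shows how add-m smoothing keeps a never-observed attribute value from producing a zero probability:

```python
from collections import Counter

def smoothed_estimate(values, v, m=1.0, s=2):
    """Add-m smoothed estimate of P(x_i = v | C), where `values` are the
    observed values of attribute x_i over the N examples of one class and
    `s` is the number of possible values of the attribute (2 if binary)."""
    N = len(values)
    t = Counter(values)[v]          # how many of those examples have x_i = v
    return (t + m) / (N + m * s)

# Hypothetical data: 5 examples of some class in which x_i is never 1 (t = 0).
observed = [0, 0, 0, 0, 0]
print(smoothed_estimate(observed, 1, m=0))  # 0.0 -> would zero out the product
print(smoothed_estimate(observed, 1, m=1))  # (0 + 1) / (5 + 2) ~= 0.14
```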
