CS-GY 6923: Machine Learning
1.0.0
  • Introduction
  • What is Machine Learning?
  • Types of Machine Learning
    • Supervised Learning
      • Notations
      • Probabilistic Modeling
        • Naive Bayes Classifier
      • Linear Regression
      • Nearest Neighbor
      • Evaluating a Classifier
      • Parametric Estimation
        • Bayesian Approach to Parameter Estimation
        • Parametric Estimation for Simple Linear Regression
        • Parametric Estimation for Multivariate Linear Regression
        • Parametric Estimation for Simple Polynomial Regression
        • Parametric Estimation for Multivariate Polynomial Regression
      • Bias and Variance of an Estimator
      • Bias and Variance of a Regression Algorithm
        • Model Selection
      • Logistic Regression
      • Decision Trees
        • Using Decision Trees for Regression
        • Bias and Variance
      • Dimensionality Reduction
      • Neural Networks
        • Training a Neuron
        • MLP
          • Regression with Multiple Outputs
          • Advice/Tricks and Issues to Train a Neural Network
        • Deep Learning
      • Support Vector Machines
      • Ensemble Learning
    • Unsupervised Learning
      • K-Means Clustering
      • Probabilistic Clustering
    • Reinforcement Learning

Naive Bayes Classifier

Say our data has d attributes/features, and that these attributes are binary. Also, we have a finite set of classes C. Given an example $X = [x_1, x_2, \dots, x_d]$, we want to predict its most probable class, i.e. the maximum a posteriori (MAP) class:

$$C_{MAP} = \arg\max_{c \in C} P(X \mid c)\, P(c)$$

The main issue here is that we do not know P(X|C) in advance, and it is difficult, if not impossible, to model the joint distribution of all d attributes. Therefore, we use the simplifying assumption that the attributes are conditionally independent, i.e. the attributes are independent given the class. This assumption is also called class-conditional independence.

So,

$$P([x_1, x_2, \dots, x_d] \mid C) = P(x_1 \mid C)\, P(x_2 \mid C) \cdots P(x_d \mid C)$$

Therefore, we calculate $C_{MAP}$ as:

$$C_{MAP} = \arg\max_{c \in C} P(x_1 \mid c)\, P(x_2 \mid c) \cdots P(x_d \mid c)\, P(c)$$
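As a concrete illustration, here is a minimal sketch of this argmax computation in Python. It is not from the notes: the function name, the dictionary layout, and the spam/ham numbers at the bottom are all illustrative, and it assumes the class priors P(C) and per-attribute conditionals P(x_i = 1 | C) have already been estimated.

```python
def predict_map(x, priors, conditionals):
    """Return the MAP class for a binary attribute vector x.

    priors:       dict mapping class c -> P(c)
    conditionals: dict mapping (c, i) -> P(x_i = 1 | c)
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior                                  # start from P(c)
        for i, value in enumerate(x):
            p_one = conditionals[(c, i)]               # P(x_i = 1 | c)
            score *= p_one if value == 1 else 1.0 - p_one
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical two-class, two-attribute example:
priors = {"spam": 0.4, "ham": 0.6}
conditionals = {("spam", 0): 0.8, ("spam", 1): 0.3,
                ("ham", 0): 0.1, ("ham", 1): 0.5}
print(predict_map([1, 0], priors, conditionals))       # -> "spam"
```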

Since we need to compute a product of probabilities, we must prevent any individual probability from being 0: a single zero factor would nullify the effect of all the other probabilities.

To do so, we use smoothing.

If N is the total number of training examples having a certain class C, and t is the number of those examples in which $x_i = v$, then the non-smoothed estimate is given by:

$$P(x_i = v \mid C) = \frac{t}{N}$$

If t = 0, this probability becomes 0!

So, we perform add-m smoothing:

$$P(x_i = v \mid C) = \frac{t + m}{N + ms}$$

where s is the number of possible values of the attribute $x_i$ (s = 2 for binary attributes) and m is the smoothing constant. Sometimes, we use m = 1; this is called add-1 (Laplace) smoothing. In most cases, we limit m to the range (0, 1].
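Below is a minimal sketch of the add-m estimate, assuming we have a labeled training set as a list of attribute vectors and a list of class labels. The function name and argument layout are illustrative, not from the notes; it simply computes t and N from the data and applies the formula above.

```python
def smoothed_conditional(examples, labels, i, v, c, m=1.0, s=2):
    """Add-m estimate of P(x_i = v | C = c).

    examples: list of attribute vectors; labels: list of class labels;
    i: attribute index; v: attribute value; c: class;
    m: smoothing constant; s: number of possible values of attribute x_i
       (s = 2 for binary attributes).
    """
    in_class = [x for x, y in zip(examples, labels) if y == c]
    N = len(in_class)                              # examples with class c
    t = sum(1 for x in in_class if x[i] == v)      # of those, how many have x_i = v
    return (t + m) / (N + m * s)
```

For instance, with m = 1, s = 2, N = 10, and t = 0, the estimate is (0 + 1)/(10 + 2) ≈ 0.083 instead of 0, so a single unseen attribute value no longer zeroes out the whole product.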
