Logistic Regression
This is a classification model and is used for real-valued attributes only, like all other linear discriminant methods.
Consider a problem with 2 classes, $+$ and $-$.
We assume that the log likelihood ratio (log odds) is linear in the attributes:
$$\log\frac{P(+ \mid x)}{P(- \mid x)} = w_0 + w_1 x_1 + \dots + w_d x_d = \mathbf{w}^\top\mathbf{x}$$
where $\mathbf{w} = (w_0, w_1, \dots, w_d)$ and $\mathbf{x} = (1, x_1, \dots, x_d)$ includes a constant attribute $x_0 = 1$ for the bias term.
This assumption holds, for example, if the class-conditional distributions $P(x \mid +)$ and $P(x \mid -)$ are Gaussian with a shared covariance matrix.
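As a sketch of why the shared-covariance Gaussian case gives linear log odds (the means $\mu_+$, $\mu_-$, shared covariance $\Sigma$ and priors $P(+)$, $P(-)$ are notation introduced here, not taken from the notes):
$$\log\frac{P(+ \mid x)}{P(- \mid x)} = \log\frac{P(x \mid +)\,P(+)}{P(x \mid -)\,P(-)} = -\tfrac{1}{2}(x-\mu_+)^\top\Sigma^{-1}(x-\mu_+) + \tfrac{1}{2}(x-\mu_-)^\top\Sigma^{-1}(x-\mu_-) + \log\frac{P(+)}{P(-)}$$
$$= (\mu_+ - \mu_-)^\top\Sigma^{-1}x - \tfrac{1}{2}\left(\mu_+^\top\Sigma^{-1}\mu_+ - \mu_-^\top\Sigma^{-1}\mu_-\right) + \log\frac{P(+)}{P(-)}$$
which is linear in $x$: the quadratic term $x^\top\Sigma^{-1}x$ and the Gaussian normalizing constants cancel because both classes share the same $\Sigma$.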
We want to classify $x$ into the more probable class.
Put $x$ in class $+$ if $P(+ \mid x) > P(- \mid x)$, i.e. if $\log\frac{P(+ \mid x)}{P(- \mid x)} > 0$. Using the property that the inverse of the logit is the sigmoid, we can show that $P(+ \mid x) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$.
This is the posterior probability that $x$ belongs to class $+$.
Proof:
By the linearity assumption, $\log\frac{P(+ \mid x)}{P(- \mid x)} = \mathbf{w}^\top\mathbf{x}$, and since $P(- \mid x) = 1 - P(+ \mid x)$ this means $\operatorname{logit}\big(P(+ \mid x)\big) = \mathbf{w}^\top\mathbf{x}$, where $\operatorname{logit}(p) = \log\frac{p}{1-p}$.
Since the sigmoid is the inverse of the logit, we get: $P(+ \mid x) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$.
We predict $+$ if $P(+ \mid x) = \sigma(\mathbf{w}^\top\mathbf{x}) > 0.5$ and predict $-$ otherwise.
Note that $\sigma(z) > 0.5$ when $z > 0$ and $\sigma(z) < 0.5$ when $z < 0$.
So we can predict $+$ if $\mathbf{w}^\top\mathbf{x} > 0$ and $-$ otherwise.
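As a quick illustration, here is a minimal sketch of this decision rule in Python; the weight and attribute values are made up for the example:

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps w.x to P(+ | x)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predict '+' if w.x > 0 (equivalently sigmoid(w.x) > 0.5), else '-'.
    Both vectors include the bias: x[0] is fixed to 1 and w[0] is w_0."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return "+" if z > 0 else "-"

# Hypothetical weights and a single example with x_0 = 1.
w = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, -0.3]
print(sigmoid(sum(wj * xj for wj, xj in zip(w, x))))  # P(+ | x), about 0.61
print(predict(w, x))                                   # '+'
```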
Consider the data below: a training set of examples $\mathbf{x}^{(t)}$, each labeled $+$ or $-$.
To learn the weights from this data, we use an iterative technique known as Gradient Descent.
Note that the error function we will minimize (defined below) has a single minimum (i.e. no local minima, just one global minimum).
Therefore, gradient descent converges to the globally optimal weights.
We need to learn the weights to be able to compute $P(+ \mid x) = \sigma(\mathbf{w}^\top\mathbf{x})$.
That is, we need to compute $\mathbf{w} = (w_0, w_1, \dots, w_d)$.
Suppose $p_t$ is the true probability that example $t$ is in class $+$; then the label of example $t$ can be viewed as a (Bernoulli) random variable that equals $+$ with probability $p_t$.
Assuming that the labelings of the examples are independent, the probability that example 1 is labeled $+$ and example 2 is labeled $+$ and example 3 is labeled $-$ is given by $p_1\, p_2\, (1 - p_3)$.
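For instance, with the made-up values $p_1 = 0.9$, $p_2 = 0.8$ and $p_3 = 0.3$, this joint probability would be $0.9 \times 0.8 \times 0.7 = 0.504$.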
We want to choose $\mathbf{w}$ ($\mathbf{w}$ is a vector that contains $w_0, w_1, \dots, w_d$) such that the estimates $\hat{p}_t$ "match" the labels in the dataset as well as possible, i.e. the weights must maximize:
$$L(\mathbf{w}) = \prod_t \hat{p}_t^{\,y_t}\,(1 - \hat{p}_t)^{1 - y_t}$$
where $\hat{p}_t = \sigma(\mathbf{w}^\top\mathbf{x}^{(t)})$ is the estimate of the probability that $\mathbf{x}^{(t)}$ is labeled 1, and $y_t \in \{0, 1\}$ is the actual label of example $t$ (1 for $+$, 0 for $-$).
We can simplify the above by using logs. Let
$$\ell(\mathbf{w}) = \log L(\mathbf{w}) = \sum_t \left[\, y_t \log \hat{p}_t + (1 - y_t) \log(1 - \hat{p}_t) \,\right]$$
We need to find
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \ell(\mathbf{w})$$
Let us define the error $E(\mathbf{w}) = -\ell(\mathbf{w})$, so $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$; in other words, we need to find the weights that minimize this error.
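A minimal sketch of this error function in Python, assuming 0/1 labels, the bias convention $x_0 = 1$, and the sigmoid defined earlier (function and variable names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, X, y):
    """Negative log likelihood: E(w) = -sum_t [ y_t log p_t + (1 - y_t) log(1 - p_t) ].
    Each row of X includes the bias attribute x_0 = 1; y holds 0/1 labels."""
    E = 0.0
    for x_t, y_t in zip(X, y):
        p_t = sigmoid(sum(wj * xj for wj, xj in zip(w, x_t)))
        E -= y_t * math.log(p_t) + (1 - y_t) * math.log(1 - p_t)
    return E
```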
We start with random weights. Then, we compute the derivative of the error function w.r.t. the weights. Then, we take a step in the direction of steepest descent, i.e. in the negative direction of the gradient (since the gradient points in the direction of steepest ascent). This is done iteratively until we can no longer move in a direction that further minimizes the error.
For $E(\mathbf{w}) = -\sum_t \left[\, y_t \log \hat{p}_t + (1 - y_t) \log(1 - \hat{p}_t) \,\right]$ with $\hat{p}_t = \sigma(\mathbf{w}^\top\mathbf{x}^{(t)})$, the derivative with respect to a weight $w_j$ is
$$\frac{\partial E}{\partial w_j} = -\sum_t \left( y_t - \hat{p}_t \right) x_j^{(t)}$$
where $x_j^{(t)}$ is the $j$-th attribute of example $t$, and using $x_0^{(t)} = 1$ for the bias weight $w_0$.
The iterative update rule for $w_0$ is given by:
$$w_0 \leftarrow w_0 - \eta \frac{\partial E}{\partial w_0}$$
where $\eta$ is the learning rate. This can be re-written as: $w_0 \leftarrow w_0 + \eta \sum_t (y_t - \hat{p}_t)$. Similarly, $w_1 \leftarrow w_1 + \eta \sum_t (y_t - \hat{p}_t)\, x_1^{(t)}$.
Put generally,
$$w_j \leftarrow w_j + \eta \sum_t (y_t - \hat{p}_t)\, x_j^{(t)}$$
for every weight $w_j$ (with $x_0^{(t)} = 1$).
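Putting the pieces together, here is a minimal sketch of the training loop in Python under the assumptions above (0/1 labels, a bias attribute $x_0 = 1$); the dataset, learning rate and iteration count are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X: list of examples, each with x[0] = 1 (bias); y: list of 0/1 labels."""
    d = len(X[0])
    w = [0.0] * d  # starting weights (the notes start from random weights; zeros also work here)
    for _ in range(n_iters):
        # Current estimates p_t = sigmoid(w . x_t) for all examples.
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, x_t))) for x_t in X]
        # Update rule: w_j <- w_j + eta * sum_t (y_t - p_t) * x_j^(t)
        for j in range(d):
            w[j] += eta * sum((y_t - p_t) * x_t[j] for x_t, y_t, p_t in zip(X, y, p))
    return w

# Tiny made-up dataset: the label is 1 when the single attribute is large.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = train(X, y)
print(w)  # learned [w_0, w_1]
print([round(sigmoid(sum(wj * xj for wj, xj in zip(w, x_t))), 2) for x_t in X])  # estimated P(+ | x)
```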