This is a classification model and, like all other linear discriminant methods, it applies to real-valued attributes only.
Consider a problem with 2 classes.
We assume that the log likelihood ratio (log odds) is linear.
$\log \frac{p(x \mid C_1)}{p(x \mid C_2)} = w^T x + w_0$
This assumption holds, in particular, when the class-conditional distributions for $C_1$ and $C_2$ are Gaussian with a shared covariance matrix.
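The shared-covariance Gaussian case can be worked out directly (here $\mu_1, \mu_2$ are the class means and $\Sigma$ the shared covariance; these symbols are not used elsewhere in these notes). The normalizing constants cancel and so do the quadratic terms in $x$, leaving a linear function:

$$\log \frac{p(x \mid C_1)}{p(x \mid C_2)} = -\tfrac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T \Sigma^{-1} (x-\mu_2) = \underbrace{\left[\Sigma^{-1}(\mu_1-\mu_2)\right]^T}_{w^T}\, x \; \underbrace{-\, \tfrac{1}{2}(\mu_1+\mu_2)^T \Sigma^{-1} (\mu_1-\mu_2)}_{w_0}$$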
We want to classify $x$ into $C_1$ (+) or $C_2$ (−).
Put $x$ in class $C_1$ if $\frac{p(x \mid C_1)}{p(x \mid C_2)} > 1$, i.e. if $\log \frac{p(x \mid C_1)}{p(x \mid C_2)} > 0$.
Using the property that the inverse of the logit is the sigmoid, we can show that $P(C_1 \mid x) = \frac{1}{1 + e^{-(w^T x + w_0)}}$.
This is the posterior probability that $x$ belongs to $C_1$.
We predict + if $P(C_1 \mid x) > 0.5$ and predict − otherwise.
Note that when $(w^T x + w_0) \to +\infty$, $P(C_1 \mid x) \to 1$, and when $(w^T x + w_0) \to -\infty$, $P(C_1 \mid x) \to 0$.
So we can predict + if $w^T x + w_0 > 0$ and − otherwise.
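A minimal sketch of this decision rule in Python; the weights `w`, `w0` here are made-up values purely for illustration:

```python
import math

def sigmoid(a):
    """Logistic function: the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-a))

# hypothetical weights, just to illustrate the rule
w = [0.5, -0.25]
w0 = 0.1

def posterior(x):
    """P(C1 | x) = sigmoid(w^T x + w0)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w0)

def predict(x):
    # thresholding P(C1 | x) at 0.5 is equivalent to thresholding
    # w^T x + w0 at 0, since sigmoid(0) = 0.5
    return '+' if posterior(x) > 0.5 else '-'
```

For example, `predict([1.0, 0.0])` gives `'+'` because $w^T x + w_0 = 0.6 > 0$.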
Learning the Weights
We need to learn the weights to be able to compute $P(C_1 \mid x) = \frac{1}{1 + e^{-(w^T x + w_0)}}$.
Consider the data below:
| $x_1$ | $x_2$ | $r$ |
|---|---|---|
| 3 | 2 | 1 (i.e. +) |
| 6 | 15 | 1 (i.e. +) |
| 2 | 9 | 0 (i.e. −) |
We need to compute $w_2, w_1, w_0$.
Suppose $y^t$ is the true probability that example $t$ is in class $C_1$; then the label of example $t$ can be viewed as a random variable.
Assuming that the labelings of the examples are independent, the probability that example 1 is labeled +, example 2 is labeled +, and example 3 is labeled − is given by $P(C_1 \mid x^1) \cdot P(C_1 \mid x^2) \cdot (1 - P(C_1 \mid x^3)) = y^1 \cdot y^2 \cdot (1 - y^3)$
We want to choose $w, w_0$ (where $w$ is the vector containing $w_2, w_1$) such that the estimates $y^t$ "match" the labels in the dataset as well as possible, i.e. they must maximize:
$\prod_{t=1}^{N} \hat{P}(C_1 \mid x^t)^{r^t} \left(1 - \hat{P}(C_1 \mid x^t)\right)^{1 - r^t}$, where $\hat{P}(C_1 \mid x^t)$ is the estimate of the probability that $x^t$ is labeled 1.
We can simplify the above by using logs. Let $y^t = \frac{1}{1 + e^{-(w^T x^t + w_0)}}$.
We need to find $\operatorname{argmax}_{w, w_0} \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}$
$= \operatorname{argmax}_{w, w_0} \sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]$
$= \operatorname{argmin}_{w, w_0} -\sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]$
Let us define $\mathrm{Err}(w, w_0 \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]$; in other words, we need to find the weights $w, w_0$ that minimize this error.
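The likelihood and its negative log (the error just defined) can be sketched as follows; the helper names are mine, not from the notes:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def y_hat(w, w0, x):
    """Estimate of P(C1 | x) under the current weights."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w0)

def likelihood(w, w0, X, r):
    """Product over examples of y^t (label 1) or 1 - y^t (label 0)."""
    p = 1.0
    for xt, rt in zip(X, r):
        yt = y_hat(w, w0, xt)
        p *= yt if rt == 1 else (1.0 - yt)
    return p

def error(w, w0, X, r):
    """Err(w, w0 | X) = -sum_t [r^t log y^t + (1 - r^t) log(1 - y^t)]."""
    total = 0.0
    for xt, rt in zip(X, r):
        yt = y_hat(w, w0, xt)
        total -= rt * math.log(yt) + (1 - rt) * math.log(1.0 - yt)
    return total
```

Since `error` is exactly `-log(likelihood)`, maximizing the likelihood and minimizing the error pick out the same weights.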
To do so, we use an iterative technique known as Gradient Descent.
Gradient Descent
Note that the error function mentioned above has a single minimum (i.e. no local minima, just one global minimum).
For $j \neq 0$,
$\frac{\partial E}{\partial w_j} = -\sum_t \left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t) x_j^t$, where $y^t = \frac{1}{1 + e^{-(w_2 x_2^t + w_1 x_1^t + w_0)}}$, using $\frac{dy}{da} = y(1 - y)$ for $y = \frac{1}{1 + e^{-a}}$.
Therefore, we get:
$\frac{\partial E}{\partial w_j} = -\sum_t (r^t - y^t) x_j^t$
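The simplification is just algebra on the bracketed factor:

$$\left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t) = r^t (1 - y^t) - (1 - r^t) y^t = r^t - y^t$$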
The iterative update rule for $w_j$ ($j \neq 0$) is given by:
$w_j \leftarrow w_j + \eta \left( -\frac{\partial E}{\partial w_j} \right)$, where $\eta$ is the learning rate. This can be re-written as:
$w_j \leftarrow w_j + \eta \sum_t (r^t - y^t) x_j^t$
Similarly, $w_0 \leftarrow w_0 + \eta \sum_t (r^t - y^t)$
Put generally, $w_j \leftarrow w_j + \eta \Delta w_j$, where $\Delta w_j = \sum_t (r^t - y^t) x_j^t$.
Pseudocode for Gradient Descent for Logistic Regression
# assume x_0^t = 1 for every example t
for j = 0...d:
    w_j = rand(-0.01, 0.01)
repeat:
    for j = 0...d:
        delta_w_j = 0
    for t = 1...N:
        theta = 0
        for j = 0...d:
            theta = theta + w_j * x_j^t   # theta = w^T x^t + w_0 since x_0^t = 1
        y = sigmoid(theta)
        for j = 0...d:
            delta_w_j = delta_w_j + (r^t - y) * x_j^t
    for j = 0...d:
        w_j = w_j + eta * delta_w_j
until convergence
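Putting it together, here is a runnable sketch of the pseudocode in Python, trained on a small dataset in the spirit of the table above (the exact feature values are assumed; for simplicity a fixed number of epochs stands in for a convergence check):

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_logistic(X, r, eta=0.01, epochs=20000, seed=0):
    """Batch gradient descent for two-class logistic regression.
    Returns weights [w_0, w_1, ..., w_d]; x_0 = 1 is implicit."""
    random.seed(seed)
    d = len(X[0])
    # start with small random weights
    w = [random.uniform(-0.01, 0.01) for _ in range(d + 1)]
    for _ in range(epochs):
        delta = [0.0] * (d + 1)
        for xt, rt in zip(X, r):
            theta = w[0] + sum(w[j + 1] * xt[j] for j in range(d))
            y = sigmoid(theta)
            delta[0] += rt - y                 # x_0^t = 1
            for j in range(d):
                delta[j + 1] += (rt - y) * xt[j]
        for j in range(d + 1):
            w[j] += eta * delta[j]             # w_j <- w_j + eta * delta_w_j
    return w

# three training examples (feature values assumed for illustration)
X = [[3.0, 2.0], [6.0, 15.0], [2.0, 9.0]]
r = [1, 1, 0]
w = train_logistic(X, r)
```

After training, `sigmoid(w[0] + w[1]*x[0] + w[2]*x[1]) > 0.5` gives the predicted class for a new `x`; in practice the fixed epoch count would be replaced by a test on how much the error (or the weights) changed between iterations.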
We start with random weights. We then compute the derivative of the error function w.r.t. the weights and take a step in the direction of steepest descent, i.e. in the negative direction of the gradient (since the gradient points in the direction of steepest ascent). This is repeated until we can no longer move in a direction that further reduces the error.