Logistic Regression

This is a classification model and, like other linear discriminant methods, it is used for real-valued attributes only.

Consider a problem with 2 classes.

We assume that the log likelihood ratio (log odds) is linear.

$$\log \frac{p(x|C_1)}{p(x|C_2)} = w^Tx + w_0^0$$

This assumption holds, for example, if the class-conditional densities for $C_1$ and $C_2$ are Gaussian with a shared covariance matrix.
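As a quick check of that claim, if $p(x|C_i) \sim \mathcal{N}(\mu_i, \Sigma)$ with the same $\Sigma$ for both classes, the normalization constants and the quadratic terms in $x$ cancel and the log odds are indeed linear in $x$:

$$\log \frac{p(x|C_1)}{p(x|C_2)} = -\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) = \underbrace{(\mu_1-\mu_2)^T\Sigma^{-1}}_{w^T}x \underbrace{-\tfrac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T\Sigma^{-1}\mu_2}_{w_0^0}$$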

We want to classify $x$ into $C_1$ (+) or $C_2$ (-).

Put $x$ in class $C_1$ if $\frac{p(x|C_1)}{p(x|C_2)} > 1$, i.e. if $\log \frac{p(x|C_1)}{p(x|C_2)} > 0$. Using the property that the inverse of the logit is the sigmoid, we can show that $P(C_1|x)=\frac{1}{1+e^{-(w^Tx+w_0)}}$.

This is the posterior probability that $x$ belongs to $C_1$.

Proof:

$$\begin{aligned} \text{logit}(P(C_1|x)) &= \log\frac{P(C_1|x)}{1-P(C_1|x)} = \log \frac{P(C_1|x)}{P(C_2|x)} = \log\frac{P(C_1)\,p(x|C_1)}{P(C_2)\,p(x|C_2)} \\ &= \log \frac{p(x|C_1)}{p(x|C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^Tx + w_0^0 + \log\frac{P(C_1)}{P(C_2)} \end{aligned}$$

$$= w^Tx + w_0$$

where $w_0 = w_0^0 + \log \frac{P(C_1)}{P(C_2)}$.

Since sigmoid is the inverse of logit, we get:

$$P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}$$
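To see why the sigmoid is the inverse of the logit, write $a = w^Tx + w_0$ and $P = P(C_1|x)$, then solve $\text{logit}(P) = a$ for $P$:

$$\log\frac{P}{1-P} = a \;\Rightarrow\; \frac{P}{1-P} = e^{a} \;\Rightarrow\; P = \frac{e^{a}}{1+e^{a}} = \frac{1}{1+e^{-a}}$$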

We predict + if $P(C_1|x) > 0.5$ and predict - otherwise.

Note that when $(w^Tx+w_0) \rightarrow +\infty$, $P(C_1|x) \rightarrow 1$, and when $(w^Tx+w_0) \rightarrow -\infty$, $P(C_1|x) \rightarrow 0$.

So we can predict + if $w^Tx+w_0 > 0$ and - otherwise.
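A minimal sketch of this posterior and decision rule in Python (the weight and input values below are made up purely for illustration):

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x, w, w0):
    """P(C_1 | x) = sigmoid(w^T x + w_0)."""
    return sigmoid(np.dot(w, x) + w0)

def predict(x, w, w0):
    """Predict + (1) if w^T x + w_0 > 0, i.e. P(C_1|x) > 0.5, else - (0)."""
    return 1 if np.dot(w, x) + w0 > 0 else 0

# hypothetical weights, just to show the calls
w, w0 = np.array([0.5, -0.2]), 0.1
x = np.array([3.0, 21.0])
print(posterior_c1(x, w, w0), predict(x, w, w0))
```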

Learning the Weights

We need to learn the weights to be able to compute $P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}$.

Consider the data below:

$$\begin{array}{c|c|c} x_1 & x_2 & r\\ \hline 3 & 21 & 1\ (\text{i.e. } +)\\ 6 & 5 & 1\ (\text{i.e. } +)\\ 2 & 9 & 0\ (\text{i.e. } -)\end{array}$$

We need to compute $w_2, w_1, w_0$.

Suppose $y^t$ is the true probability that example $t$ is in class $C_1$; then the label of example $t$ can be viewed as a (Bernoulli) random variable.

Assuming that the labelings of the examples are independent, the probability that example 1 is labeled +, example 2 is labeled +, and example 3 is labeled - is given by $P(C_1|x^1) \cdot P(C_1|x^2) \cdot (1-P(C_1|x^3)) = y^1 \cdot y^2 \cdot (1-y^3)$.
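For concreteness, this probability could be computed for the three examples in the table with some hypothetical weights (a sketch, not the fitted values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# the three training examples from the table: (x_1, x_2) and label r
X = np.array([[3.0, 21.0], [6.0, 5.0], [2.0, 9.0]])
r = np.array([1, 1, 0])

# hypothetical weights (w_1, w_2) and bias w_0, not the learned ones
w, w0 = np.array([0.3, 0.1]), -1.0

y = sigmoid(X @ w + w0)                            # y^t = P(C_1 | x^t) for each example
likelihood = np.prod(np.where(r == 1, y, 1 - y))   # y^1 * y^2 * (1 - y^3)
print(y, likelihood)
```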

We want to choose $w, w_0$ ($w$ is a vector that contains $w_2, w_1$) such that the estimates $y^t$ "match" the labels in the dataset as well as possible, i.e. they must maximize:

$$\prod_{t=1}^N (\hat{P}(C_1|x^t))^{r^t} (1-\hat{P}(C_1|x^t))^{(1-r^t)}$$

where $\hat{P}(C_1|x^t)$ is the estimate of the probability that $x^t$ is labeled 1.

We can simplify the above by using logs. Let $y^t = \frac{1}{1+e^{-(w^Tx^t+w_0)}}$.

We need to find

$$\arg\max_{w,w_0}\,\prod_t (y^t)^{r^t}(1-y^t)^{(1-r^t)}$$

$$= \arg\max_{w,w_0} \sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$$

$$= \arg\min_{w,w_0} -\sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$$

Let us define $Err(w,w_0|X) = -\sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$; in other words, we need to find the weights $w, w_0$ that minimize this error.
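This error (the cross-entropy) can be written as a small function; a sketch, where `eps` is only there to guard against $\log(0)$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, w0, X, r, eps=1e-12):
    """Err(w, w_0 | X) = -sum_t [r^t log y^t + (1 - r^t) log(1 - y^t)]."""
    y = sigmoid(X @ w + w0)
    y = np.clip(y, eps, 1 - eps)   # avoid log(0)
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))
```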

To do so, we use an iterative technique known as Gradient Descent.

Gradient Descent

Note that the error function mentioned above has a single minimum (i.e. no local minima, just one global minimum).

For $j \neq 0$,

$$\frac{\partial E}{\partial w_j} = -\sum_t \left(\frac{r^t}{y^t}-\frac{1-r^t}{1-y^t}\right)y^t(1-y^t)x_j^t$$

where $y^t=\frac{1}{1+e^{-(w_2x_2^t+w_1x_1^t+w_0)}}$, and using $\frac{dy}{da} = y(1-y)$ for $y=\frac{1}{1+e^{-a}}$.
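The bracketed factor simplifies once the terms are put over a common denominator:

$$\left(\frac{r^t}{y^t}-\frac{1-r^t}{1-y^t}\right)y^t(1-y^t) = r^t(1-y^t) - (1-r^t)y^t = r^t - y^t$$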

Therefore, we get:

$$\frac{\partial E}{\partial w_j} = -\sum_t (r^t-y^t) x_j^t$$

The iterative update rule for $w_j$ $(j\neq 0)$ is given by:

$$w_j \leftarrow w_j + \eta\left(-\frac{\partial E}{\partial w_j}\right)$$

where $\eta$ is the learning rate. This can be re-written as:

$$w_j \leftarrow w_j + \eta \sum_t (r^t-y^t)x_j^t$$

Similarly,

$$w_0 \leftarrow w_0 + \eta \sum_t (r^t-y^t)$$

Put generally, $w_j \leftarrow w_j + \eta \, \Delta w_j$, where $\Delta w_j = \sum_t (r^t-y^t)x_j^t$ (with $x_0^t = 1$).
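In vectorized form, one batch update of all the weights might look like this (a sketch, assuming `X1` is the data matrix with a leading column of ones standing in for $x_0$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_step(w, X1, r, eta=0.1):
    """One batch update: w_j <- w_j + eta * sum_t (r^t - y^t) x_j^t.

    X1 is the N x (d+1) data matrix whose first column is all ones,
    so w[0] plays the role of w_0.
    """
    y = sigmoid(X1 @ w)          # y^t for every example
    delta_w = X1.T @ (r - y)     # Delta w_j = sum_t (r^t - y^t) x_j^t
    return w + eta * delta_w
```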

Pseudocode for Gradient Descent for Logistic Regression

# assume x_0^t = 1 for all t, so w_0 is updated like any other weight

for j = 0...d:
    w_j = rand(-0.01, 0.01)
repeat:
    for j = 0...d:
        delta_w_j = 0
    for t = 1...N:
        theta = 0
        for j = 0...d:
            theta = theta + w_j * x_j^t   # theta = w^T x^t + w_0 since x_0^t = 1
        y = sigmoid(theta)
        for j = 0...d:
            delta_w_j = delta_w_j + (r^t - y) * x_j^t
    for j = 0...d:
        w_j = w_j + eta * delta_w_j       # eta is the learning rate
until convergence
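A direct, runnable translation of the pseudocode in NumPy, applied to the three examples from the table (the learning rate, iteration cap, and convergence threshold below are assumed values, not prescribed ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# training data from the table: columns x_1, x_2 and labels r
X = np.array([[3.0, 21.0], [6.0, 5.0], [2.0, 9.0]])
r = np.array([1.0, 1.0, 0.0])

# prepend x_0 = 1 so that w[0] is the bias w_0
X1 = np.column_stack([np.ones(len(X)), X])

w = rng.uniform(-0.01, 0.01, size=X1.shape[1])   # w_j = rand(-0.01, 0.01)
eta = 0.05                                        # assumed learning rate

for _ in range(10000):                            # "repeat ... until convergence"
    y = sigmoid(X1 @ w)                           # y^t for every example
    delta_w = X1.T @ (r - y)                      # sum_t (r^t - y^t) x_j^t
    w = w + eta * delta_w
    if np.linalg.norm(eta * delta_w) < 1e-6:      # simple convergence test
        break

print("weights (w_0, w_1, w_2):", w)
print("P(C_1 | x^t):", sigmoid(X1 @ w))
```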
