Logistic Regression

This is a classification model and, like all other linear discriminant methods, it is used for real-valued attributes only.

Consider a problem with 2 classes.

We assume that the log likelihood ratio (log odds) is linear.

$$\log \frac{p(x|C_1)}{p(x|C_2)} = w^T x + w_0^0$$

This assumption holds, for example, when the class-conditional distributions of $C_1$ and $C_2$ are Gaussian with a shared covariance matrix.
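As a quick sanity check, the 1-D case can be verified numerically: for two Gaussians with a shared variance, the log odds reduce to an exactly linear function of $x$. A small sketch (the means and variance below are arbitrary illustrative values):

```python
import math

# two 1-D Gaussian class-conditionals with a shared variance (illustrative values)
mu1, mu2, var = 2.0, -1.0, 1.5

def log_gauss(x, mu):
    # log density of N(mu, var) at x
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# expanding the squares shows the log odds are w*x + w0 with:
w = (mu1 - mu2) / var
w0 = -(mu1 ** 2 - mu2 ** 2) / (2 * var)

# the log odds match the linear form at every test point
for x in [-3.0, 0.0, 1.7, 4.2]:
    log_odds = log_gauss(x, mu1) - log_gauss(x, mu2)
    assert abs(log_odds - (w * x + w0)) < 1e-9
```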

We want to classify $x$ into $C_1$ (+) or $C_2$ (−).

Put $x$ in class $C_1$ if $\frac{p(x|C_1)}{p(x|C_2)} > 1$, i.e. if $\log \frac{p(x|C_1)}{p(x|C_2)} > 0$. Using the property that the inverse of logit is sigmoid, we can show that $P(C_1|x) = \frac{1}{1+e^{-(w^T x + w_0)}}$.

This is the posterior probability that $x$ belongs to $C_1$.

Proof:

$$\begin{aligned} \text{logit}(P(C_1|x)) &= \log\frac{P(C_1|x)}{1-P(C_1|x)} = \log \frac{P(C_1|x)}{P(C_2|x)} = \log\frac{P(C_1)\,p(x|C_1)}{P(C_2)\,p(x|C_2)} \\ &= \log \frac{p(x|C_1)}{p(x|C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^T x + w_0^0 + \log\frac{P(C_1)}{P(C_2)} \\ &= w^T x + w_0 \end{aligned}$$

where $w_0 = w_0^0 + \log \frac{P(C_1)}{P(C_2)}$

Since sigmoid is the inverse of logit, we get:

$$P(C_1|x) = \frac{1}{1+e^{-(w^T x + w_0)}}$$

We predict + if $P(C_1|x) > 0.5$ and predict − otherwise.

Note that when $(w^T x + w_0) \rightarrow +\infty$, $P(C_1|x) \rightarrow 1$, and when $(w^T x + w_0) \rightarrow -\infty$, $P(C_1|x) \rightarrow 0$.

So we can predict + if $w^T x + w_0 > 0$ and − otherwise.
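A minimal sketch of this decision rule in Python: thresholding the posterior at 0.5 and thresholding $w^T x + w_0$ at 0 always agree, since the sigmoid is monotonic and $\text{sigmoid}(0) = 0.5$. The weight values below are made up for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# hypothetical weights for illustration (not learned from data)
w, w0 = [1.0, 0.1], -4.0

def posterior(x):
    # P(C1 | x) = sigmoid(w.x + w0)
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return sigmoid(a)

# the two decision rules agree on every test point
for x in [[3, 21], [6, 5], [2, 9]]:
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    assert (posterior(x) > 0.5) == (a > 0)
```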

Learning the Weights

We need to learn the weights to be able to compute $P(C_1|x) = \frac{1}{1+e^{-(w^T x + w_0)}}$.

Consider the data below:

$$\begin{array}{c|c|c} x_1 & x_2 & r \\ \hline 3 & 21 & 1 \text{ (i.e. +)} \\ 6 & 5 & 1 \text{ (i.e. +)} \\ 2 & 9 & 0 \text{ (i.e. −)} \end{array}$$

We need to compute $w_2, w_1, w_0$.

Suppose $y^t$ is the true probability that example $t$ is in class $C_1$; then the label of example $t$ can be viewed as a random variable.

Assuming that the labelings of the examples are independent, the probability that example 1 is labeled +, example 2 is labeled +, and example 3 is labeled − is given by $P(C_1|x^1) \cdot P(C_1|x^2) \cdot (1-P(C_1|x^3)) = y^1 \, y^2 \, (1-y^3)$.

We want to choose $w, w_0$ (where $w$ is a vector containing $w_2, w_1$) such that the estimates $y^t$ "match" the labels in the dataset as well as possible, i.e. we must maximize:

$$\prod_{t=1}^N (\hat{P}(C_1|x^t))^{r^t} \, (1-\hat{P}(C_1|x^t))^{(1-r^t)}$$ where $\hat{P}(C_1|x^t)$ is the estimate of the probability that $x^t$ is labeled 1.

We can simplify the above by using logs. Let $y^t = \frac{1}{1+e^{-(w^T x^t + w_0)}}$

We need to find $\arg\max_{w,w_0} \prod_t (y^t)^{r^t} (1-y^t)^{(1-r^t)}$

$$= \arg\max_{w,w_0} \sum_t [r^t \log y^t + (1-r^t) \log (1-y^t)]$$

$$= \arg\min_{w,w_0} -\sum_t [r^t \log y^t + (1-r^t) \log (1-y^t)]$$

Let us define $Err(w,w_0|X) = -\sum_t [r^t \log y^t + (1-r^t) \log (1-y^t)]$; in other words, we need to find the weights $w, w_0$ that minimize this error.
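This error (the cross-entropy) can be computed directly for the three-example dataset above. A sketch in Python, where the candidate weights passed to `err` are arbitrary illustrative values:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# the toy dataset from the table above: (x_1, x_2) with label r
X = [(3, 21), (6, 5), (2, 9)]
r = [1, 1, 0]

def err(w1, w2, w0):
    # cross-entropy error: -sum_t [r^t log y^t + (1 - r^t) log (1 - y^t)]
    e = 0.0
    for (x1, x2), rt in zip(X, r):
        y = sigmoid(w2 * x2 + w1 * x1 + w0)
        e -= rt * math.log(y) + (1 - rt) * math.log(1 - y)
    return e

# with all-zero weights every y^t = 0.5, so the error is 3 * log 2;
# weights that separate the data (illustrative values) give a lower error
assert err(1.0, 0.1, -4.0) < err(0.0, 0.0, 0.0)
```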

To do so, we use an iterative technique known as Gradient Descent.

Gradient Descent

We start with random weights. Then, we compute the derivative of the error function w.r.t. the weights. Then, we take a step in the direction of steepest descent i.e. in the negative direction of the gradient (since the gradient points to the direction of steepest ascent). This is done iteratively until we can no longer move in a direction that further minimizes the error.

Note that the error function above is convex: it has a single global minimum and no local minima.

For $j \neq 0$,

$$\frac{\partial E}{\partial w_j} = -\sum_t \left(\frac{r^t}{y^t} - \frac{1-r^t}{1-y^t}\right) y^t (1-y^t) x_j^t$$ where $y^t = \frac{1}{1+e^{-(w_2 x_2^t + w_1 x_1^t + w_0)}}$, and using $\frac{d}{da} y = y(1-y)$ for $y = \frac{1}{1+e^{-a}}$

Therefore, we get:

$$\frac{\partial E}{\partial w_j} = -\sum_t (r^t - y^t) \, x_j^t$$
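The simplified gradient can be checked numerically against finite differences of the error function. This is a standard gradient check; the weight values below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, X, r):
    # cross-entropy error; X carries a leading column of 1s for w_0
    y = sigmoid(X @ w)
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))

X = np.array([[1, 3, 21], [1, 6, 5], [1, 2, 9]], dtype=float)  # x_0 = 1
r = np.array([1, 1, 0], dtype=float)
w = np.array([0.005, 0.01, -0.02])  # arbitrary test point

# analytic gradient: dE/dw_j = -sum_t (r^t - y^t) x_j^t
y = sigmoid(X @ w)
analytic = -X.T @ (r - y)

# central finite differences of the error, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (error(w + eps * e, X, r) - error(w - eps * e, X, r)) / (2 * eps)
    for e in np.eye(len(w))
])
```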

The iterative update rule for $w_j \; (j \neq 0)$ is given by:

$$w_j \leftarrow w_j + \eta \left(-\frac{\partial E}{\partial w_j}\right)$$ where $\eta$ is the learning rate. This can be re-written as: $$w_j \leftarrow w_j + \eta \sum_t (r^t - y^t) x_j^t$$ Similarly, $$w_0 \leftarrow w_0 + \eta \sum_t (r^t - y^t)$$

Put generally, $w_j \leftarrow w_j + \eta \, \Delta w_j$

Pseudocode for Gradient Descent for Logistic Regression

# assume x_0^t = 1 for all t, so that theta = w^T x^t + w_0

for j = 0...d:
    w_j = rand(-0.01, 0.01)
repeat:
    for j = 0...d:
        delta_w_j = 0
    for t = 1...N:
        theta = 0
        for j = 0...d:
            theta = theta + w_j * x_j^t
        y = sigmoid(theta)
        for j = 0...d:
            delta_w_j = delta_w_j + (r^t - y) * x_j^t
    for j = 0...d:
        w_j = w_j + eta * delta_w_j
until convergence
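The pseudocode above can be sketched as a runnable NumPy version, trained on the three-example dataset from the worked example; the batch update is vectorized as $X^T(r-y)$. The learning rate and iteration count are illustrative choices, not from the notes:

```python
import numpy as np

# toy dataset from the worked example: (x_1, x_2) with label r
X = np.array([[3, 21], [6, 5], [2, 9]], dtype=float)
r = np.array([1, 1, 0], dtype=float)

Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend x_0 = 1 for w_0

def sigmoid(z):
    # clip to avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

rng = np.random.default_rng(0)
w = rng.uniform(-0.01, 0.01, size=Xb.shape[1])  # random initial weights

eta = 0.005                        # learning rate (illustrative choice)
for _ in range(50_000):
    y = sigmoid(Xb @ w)            # y^t for all t at once
    w += eta * (Xb.T @ (r - y))    # w_j <- w_j + eta * sum_t (r^t - y^t) x_j^t

# the data is linearly separable, so the learned weights classify it correctly
pred = (sigmoid(Xb @ w) > 0.5).astype(float)
```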
