Logistic Regression

This is a classification model and, like other linear discriminant methods, it is used for real-valued attributes only.

Consider a problem with 2 classes.

We assume that the log likelihood ratio (log odds) is linear.

$$\log \frac{p(x|C_1)}{p(x|C_2)} = w^Tx + w_0^0$$

This assumption holds, for example, if the class-conditional densities for $C_1$ and $C_2$ are Gaussian with a shared covariance matrix.
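As a quick check of that claim, if $p(x|C_i) \sim \mathcal{N}(\mu_i, \Sigma)$ with the same $\Sigma$ for both classes, the normalization constants and the quadratic terms in $x$ cancel and the log odds are indeed linear in $x$:

$$\log \frac{p(x|C_1)}{p(x|C_2)} = -\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) = \underbrace{(\mu_1-\mu_2)^T\Sigma^{-1}}_{w^T}x \underbrace{-\tfrac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T\Sigma^{-1}\mu_2}_{w_0^0}$$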

We want to classify $x$ into $C_1$ (+) or $C_2$ (-).

Put $x$ in class $C_1$ if $\frac{p(x|C_1)}{p(x|C_2)} > 1$, i.e. if $\log \frac{p(x|C_1)}{p(x|C_2)} > 0$. Using the property that the inverse of the logit is the sigmoid, we can show that $P(C_1|x)=\frac{1}{1+e^{-(w^Tx+w_0)}}$.

This is the posterior probability that $x$ belongs to $C_1$.

Proof:

$$\begin{aligned} \text{logit}(P(C_1|x)) &= \log\frac{P(C_1|x)}{1-P(C_1|x)} = \log \frac{P(C_1|x)}{P(C_2|x)} = \log\frac{P(C_1)\,p(x|C_1)}{P(C_2)\,p(x|C_2)} \\ &= \log \frac{p(x|C_1)}{p(x|C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^Tx + w_0^0 + \log\frac{P(C_1)}{P(C_2)} \end{aligned}$$

$$= w^Tx + w_0$$

where $w_0 = w_0^0 + \log \frac{P(C_1)}{P(C_2)}$.

Since sigmoid is the inverse of logit, we get:

$$P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}$$
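To see why the sigmoid is the inverse of the logit, write $a = w^Tx + w_0$ and $P = P(C_1|x)$, then solve $\text{logit}(P) = a$ for $P$:

$$\log\frac{P}{1-P} = a \;\Rightarrow\; \frac{P}{1-P} = e^{a} \;\Rightarrow\; P = \frac{e^{a}}{1+e^{a}} = \frac{1}{1+e^{-a}}$$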

We predict + if $P(C_1|x) > 0.5$ and predict - otherwise.

Note that when $(w^Tx+w_0) \rightarrow +\infty$, $P(C_1|x) \rightarrow 1$, and when $(w^Tx+w_0) \rightarrow -\infty$, $P(C_1|x) \rightarrow 0$.

So we can predict + if $w^Tx+w_0 > 0$ and - otherwise.
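A minimal sketch of this posterior and decision rule in Python (the weight and input values below are made up purely for illustration):

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x, w, w0):
    """P(C_1 | x) = sigmoid(w^T x + w_0)."""
    return sigmoid(np.dot(w, x) + w0)

def predict(x, w, w0):
    """Predict + (1) if w^T x + w_0 > 0, i.e. P(C_1|x) > 0.5, else - (0)."""
    return 1 if np.dot(w, x) + w0 > 0 else 0

# hypothetical weights, just to show the calls
w, w0 = np.array([0.5, -0.2]), 0.1
x = np.array([3.0, 21.0])
print(posterior_c1(x, w, w0), predict(x, w, w0))
```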

Learning the Weights

We need to learn the weights to be able to compute $P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}$.

Consider the data below:

$$\begin{array}{c|c|c} x_1 & x_2 & r\\ \hline 3 & 21 & 1\ (\text{i.e. } +)\\ 6 & 5 & 1\ (\text{i.e. } +)\\ 2 & 9 & 0\ (\text{i.e. } -)\end{array}$$

We need to compute $w_2, w_1, w_0$.

Suppose $y^t$ is the true probability that example $t$ is in class $C_1$; then the label of example $t$ can be viewed as a (Bernoulli) random variable.

Assuming that the labelings of the examples are independent, the probability that example 1 is labeled +, example 2 is labeled +, and example 3 is labeled - is given by $P(C_1|x^1) \cdot P(C_1|x^2) \cdot (1-P(C_1|x^3)) = y^1 \cdot y^2 \cdot (1-y^3)$.
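For concreteness, this probability could be computed for the three examples in the table with some hypothetical weights (a sketch, not the fitted values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# the three training examples from the table: (x_1, x_2) and label r
X = np.array([[3.0, 21.0], [6.0, 5.0], [2.0, 9.0]])
r = np.array([1, 1, 0])

# hypothetical weights (w_1, w_2) and bias w_0, not the learned ones
w, w0 = np.array([0.3, 0.1]), -1.0

y = sigmoid(X @ w + w0)                            # y^t = P(C_1 | x^t) for each example
likelihood = np.prod(np.where(r == 1, y, 1 - y))   # y^1 * y^2 * (1 - y^3)
print(y, likelihood)
```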

We want to choose $w, w_0$ ($w$ is a vector that contains $w_2, w_1$) such that the estimates $y^t$ "match" the labels in the dataset as well as possible, i.e. they must maximize:

$$\prod_{t=1}^N (\hat{P}(C_1|x^t))^{r^t} (1-\hat{P}(C_1|x^t))^{(1-r^t)}$$

where $\hat{P}(C_1|x^t)$ is the estimate of the probability that $x^t$ is labeled 1.

We can simplify the above by using logs. Let $y^t = \frac{1}{1+e^{-(w^Tx^t+w_0)}}$.

We need to find

$$\arg\max_{w,w_0}\,\prod_t (y^t)^{r^t}(1-y^t)^{(1-r^t)}$$

$$= \arg\max_{w,w_0} \sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$$

$$= \arg\min_{w,w_0} -\sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$$

Let us define $Err(w,w_0|X) = -\sum_t \left[r^t \log y^t + (1-r^t) \log (1-y^t)\right]$; in other words, we need to find the weights $w, w_0$ that minimize this error.
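This error (the cross-entropy) can be written as a small function; a sketch, where `eps` is only there to guard against $\log(0)$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, w0, X, r, eps=1e-12):
    """Err(w, w_0 | X) = -sum_t [r^t log y^t + (1 - r^t) log(1 - y^t)]."""
    y = sigmoid(X @ w + w0)
    y = np.clip(y, eps, 1 - eps)   # avoid log(0)
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))
```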

To do so, we use an iterative technique known as Gradient Descent.

Gradient Descent

Note that the error function mentioned above has a single minimum (i.e. no local minima, just one global minimum).

For $j \neq 0$,

$$\frac{\partial E}{\partial w_j} = -\sum_t \left(\frac{r^t}{y^t}-\frac{1-r^t}{1-y^t}\right)y^t(1-y^t)x_j^t$$

where $y^t=\frac{1}{1+e^{-(w_2x_2^t+w_1x_1^t+w_0)}}$, and using $\frac{dy}{da} = y(1-y)$ for $y=\frac{1}{1+e^{-a}}$.
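The bracketed factor simplifies once the terms are put over a common denominator:

$$\left(\frac{r^t}{y^t}-\frac{1-r^t}{1-y^t}\right)y^t(1-y^t) = r^t(1-y^t) - (1-r^t)y^t = r^t - y^t$$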

Therefore, we get:

$$\frac{\partial E}{\partial w_j} = -\sum_t (r^t-y^t) x_j^t$$

The iterative update rule for $w_j$ $(j\neq 0)$ is given by:

$$w_j \leftarrow w_j + \eta\left(-\frac{\partial E}{\partial w_j}\right)$$

where $\eta$ is the learning rate. This can be re-written as:

$$w_j \leftarrow w_j + \eta \sum_t (r^t-y^t)x_j^t$$

Similarly,

$$w_0 \leftarrow w_0 + \eta \sum_t (r^t-y^t)$$

Put generally, $w_j \leftarrow w_j + \eta \, \Delta w_j$, where $\Delta w_j = \sum_t (r^t-y^t)x_j^t$ (with $x_0^t = 1$).
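In vectorized form, one batch update of all the weights might look like this (a sketch, assuming `X1` is the data matrix with a leading column of ones standing in for $x_0$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_step(w, X1, r, eta=0.1):
    """One batch update: w_j <- w_j + eta * sum_t (r^t - y^t) x_j^t.

    X1 is the N x (d+1) data matrix whose first column is all ones,
    so w[0] plays the role of w_0.
    """
    y = sigmoid(X1 @ w)          # y^t for every example
    delta_w = X1.T @ (r - y)     # Delta w_j = sum_t (r^t - y^t) x_j^t
    return w + eta * delta_w
```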

Pseudocode for Gradient Descent for Logistic Regression

# assume x_0^t = 1 for all t, so w_0 is updated like any other weight

for j = 0...d:
    w_j = rand(-0.01, 0.01)
repeat:
    for j = 0...d:
        delta_w_j = 0
    for t = 1...N:
        theta = 0
        for j = 0...d:
            theta = theta + w_j * x_j^t   # theta = w^T x^t + w_0 since x_0^t = 1
        y = sigmoid(theta)
        for j = 0...d:
            delta_w_j = delta_w_j + (r^t - y) * x_j^t
    for j = 0...d:
        w_j = w_j + eta * delta_w_j       # eta is the learning rate
until convergence
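A direct, runnable translation of the pseudocode in NumPy, applied to the three examples from the table (the learning rate, iteration cap, and convergence threshold below are assumed values, not prescribed ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# training data from the table: columns x_1, x_2 and labels r
X = np.array([[3.0, 21.0], [6.0, 5.0], [2.0, 9.0]])
r = np.array([1.0, 1.0, 0.0])

# prepend x_0 = 1 so that w[0] is the bias w_0
X1 = np.column_stack([np.ones(len(X)), X])

w = rng.uniform(-0.01, 0.01, size=X1.shape[1])   # w_j = rand(-0.01, 0.01)
eta = 0.05                                        # assumed learning rate

for _ in range(10000):                            # "repeat ... until convergence"
    y = sigmoid(X1 @ w)                           # y^t for every example
    delta_w = X1.T @ (r - y)                      # sum_t (r^t - y^t) x_j^t
    w = w + eta * delta_w
    if np.linalg.norm(eta * delta_w) < 1e-6:      # simple convergence test
        break

print("weights (w_0, w_1, w_2):", w)
print("P(C_1 | x^t):", sigmoid(X1 @ w))
```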
