CS-GY 6923: Machine Learning
1.0.0
1.0.0
  • Introduction
  • What is Machine Learning?
  • Types of Machine Learning
    • Supervised Learning
      • Notations
      • Probabilistic Modeling
        • Naive Bayes Classifier
      • Linear Regression
      • Nearest Neighbor
      • Evaluating a Classifier
      • Parametric Estimation
        • Bayesian Approach to Parameter Estimation
        • Parametric Estimation for Simple Linear Regression
        • Parametric Estimation for Multivariate Linear Regression
        • Parametric Estimation for Simple Polynomial Regression
        • Parametric Estimation for Multivariate Polynomial Regression
      • Bias and Variance of an Estimator
      • Bias and Variance of a Regression Algorithm
        • Model Selection
      • Logistic Regression
      • Decision Trees
        • Using Decision Trees for Regression
        • Bias and Variance
      • Dimensionality Reduction
      • Neural Networks
        • Training a Neuron
        • MLP
          • Regression with Multiple Outputs
          • Advice/Tricks and Issues to Train a Neural Network
        • Deep Learning
      • Support Vector Machines
      • Ensemble Learning
    • Unsupervised Learning
      • K-Means Clustering
      • Probabilistic Clustering
    • Reinforcement Learning
Powered by GitBook
On this page
  • Learning the Weights
  • Gradient Descent

Was this helpful?

  1. Types of Machine Learning
  2. Supervised Learning

Logistic Regression

This is a classification model and is used for real values attributes only, like all other linear discriminant methods.

Consider a problem with 2 classes.

We assume that the log likelihood ratio (log odds) is linear.

logp(x∣C1)p(x∣C2)=wTx+w00log \frac{p(x|C_1)}{p(x|C_2)} = w^Tx+w_0^0logp(x∣C2​)p(x∣C1​)​=wTx+w00​

This assumption only holds if the distributions for C1C_1C1​ and C2C_2C2​ are Gaussian, with a shared covariance matrix.

We want to classify x into C1(+)  or  C2(−)C_1 (+) \,\, or \,\, C_2 (-)C1​(+)orC2​(−).

Put in class C1C_1C1​ if p(x∣C1)p(x∣C2)>1\frac{p(x|C_1)}{p(x|C_2)} > 1p(x∣C2​)p(x∣C1​)​>1 i.e. if logp(x∣C1)p(x∣C2)>0log \frac{p(x|C_1)}{p(x|C_2)} > 0logp(x∣C2​)p(x∣C1​)​>0. Using the property that the inverse of logit is sigmoid, we can show that P(C1∣x)=11+e−(wTx+w0)P(C_1|x)=\frac{1}{1+e^{-(w^Tx+w_0)}}P(C1​∣x)=1+e−(wTx+w0​)1​.

This is the posterior probability that x belongs to C1C_1C1​.

Proof:

logit(P(C1∣x))=logP(C1∣x)1−P(C1∣x)=logP(C1∣x)P(C2∣x)=logP(C1)p(x∣C1)P(C2)p(x∣C2)=logp(x∣C1)p(x∣C2)+logP(C1)P(C2)=wTx+w00+logP(C1)P(C2)logit(P(C_1|x)) = log\frac{P(C_1|x)}{1-P(C_1|x)} = log \frac{P(C_1|x)}{P(C_2|x)} = log\frac{P(C_1)p(x|C_1)}{P(C_2)p(x|C_2)} = log \frac{p(x|C_1)}{p(x|C_2)} + log \frac{P(C_1)}{P(C_2)} = w^Tx+w_0^0 + log\frac{P(C_1)}{P(C_2)}logit(P(C1​∣x))=log1−P(C1​∣x)P(C1​∣x)​=logP(C2​∣x)P(C1​∣x)​=logP(C2​)p(x∣C2​)P(C1​)p(x∣C1​)​=logp(x∣C2​)p(x∣C1​)​+logP(C2​)P(C1​)​=wTx+w00​+logP(C2​)P(C1​)​

=wTx+w0= w^Tx + w_0=wTx+w0​

where w0=w00+logP(C1)P(C2)w_0 = w_0^0 + log \frac{P(C_1)}{P(C_2)}w0​=w00​+logP(C2​)P(C1​)​

Since sigmoid is the inverse of logit, we get:

P(C1∣x)=11+e−(wTx+w0)P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}P(C1​∣x)=1+e−(wTx+w0​)1​

We predict + if P(C1∣x)>0.5P(C_1|x) \gt 0.5P(C1​∣x)>0.5 and predict - otherwise.

Note that when (wTx+w0)→+∞,  P(C1∣x)→1(w^Tx+w_0) \rightarrow +\infty, \,\, P(C_1|x) \rightarrow 1(wTx+w0​)→+∞,P(C1​∣x)→1 and when (wTx+w0)→−∞,  P(C1∣x)→0(w^Tx+w_0) \rightarrow -\infty, \,\, P(C_1|x) \rightarrow 0(wTx+w0​)→−∞,P(C1​∣x)→0

So we can predict + if wTx+w0>0w^Tx+w_0 \gt 0wTx+w0​>0 and - otherwise.

Learning the Weights

We need to learn weights to be able to compute P(C1∣x)=11+e−(wTx+w0)P(C_1|x) = \frac{1}{1+e^{-(w^Tx+w_0)}}P(C1​∣x)=1+e−(wTx+w0​)1​

Consider the data below:

x1x2r3211(i.e.+)651(i.e.+)290(i.e.−)\begin{array}{c|c|c} x_1 & x_2 & r\\3 & 21 & 1 (i.e. +)\\6 & 5 & 1 (i.e. +)\\ 2 & 9 & 0 (i.e. -)\end{array}x1​362​x2​2159​r1(i.e.+)1(i.e.+)0(i.e.−)​

We need to compute w2,w1,w0w_2, w_1, w_0w2​,w1​,w0​.

Suppose yty^tyt is the real probability that example t is in class C1C_1C1​, then the label of example t can be viewed as a random variable.

Assuming that the labelings of the examples are independent, the probability that example 1 is labeled + and example 2 is labeled + and example 3 is labeled - can be given by P(C1∣x1)∗P(C1∣x2)∗(1−P(C1∣x3))=y1∗y2∗(1−y3)P(C_1|x^1)*P(C_1|x^2)*(1-P(C_1|x^3)) = y^1*y^2*(1-y^3)P(C1​∣x1)∗P(C1​∣x2)∗(1−P(C1​∣x3))=y1∗y2∗(1−y3)

We want to choose w,w0w, w_0w,w0​ (w is a vector that contains w2,w1w_2, w_1w2​,w1​) such that the estimates yty^tyt "match" the labels in the dataset as well as possible, i.e. the must maximize:

Πt=1N(P^(C1∣xt))rt∗(1−P^(C1∣xt))(1−rt)\Pi_{t=1}^N (\hat{P}(C_1|x^t))^{r^t} * (1-\hat{P}(C_1|x^t))^{(1-r^t)}Πt=1N​(P^(C1​∣xt))rt∗(1−P^(C1​∣xt))(1−rt) where P^(C1∣xt)\hat{P}(C_1|x^t)P^(C1​∣xt) is the estimate of the probability that xtx^txt is labeled 1.

We can simplify the above by using logs. Let yt=11+e−(wtx+w0)y^t = \frac{1}{1+e^{-(w^tx+w_0)}}yt=1+e−(wtx+w0​)1​

We need to find argmaxw,w0 Πt(yt)rt(1−yt)(1−rt)argmax_{w,w_0}\,\Pi_t (y^t)^{r^t}(1-y^t)^{(1-r^t)}argmaxw,w0​​Πt​(yt)rt(1−yt)(1−rt)

=argmaxw,w0∑t[rt log yt + (1−rt) log (1−yt)]=argmax_{w,w_0} \sum_t [r^t\, log\, y^t \, + \, (1-r^t)\, log\, (1-y^t)]=argmaxw,w0​​∑t​[rtlogyt+(1−rt)log(1−yt)]

=argminw,w0−∑t[rt log yt + (1−rt) log (1−yt)]= argmin_{w,w_0} -\sum_t [r^t\, log\, y^t \, + \, (1-r^t)\, log\, (1-y^t)]=argminw,w0​​−∑t​[rtlogyt+(1−rt)log(1−yt)]

Let us define Err(w,w0∣X)=−∑t[rt log yt + (1−rt) log (1−yt)]Err(w,w_0|X) = -\sum_t [r^t\, log\, y^t \, + \, (1-r^t)\, log\, (1-y^t)]Err(w,w0​∣X)=−∑t​[rtlogyt+(1−rt)log(1−yt)], so in other words, we need to find the weights w,w0w, w_0w,w0​ that minimize this error.

To do so, we use an iterative technique known as Gradient Descent.

Gradient Descent

Note that the error function mentioned above has a single minimum (i.e. no local minima, just one global minimum).

For j≠0,j \neq 0,j=0,

∂E∂j=−∑t (rtyt−(1−rt)1−yt)yt(1−yt)xjt\frac{\partial E}{\partial j} = - \sum_t \, (\frac{r^t}{y^t}-\frac{(1-r^t)}{1-y^t})y^t(1-y^t)x_j^t∂j∂E​=−∑t​(ytrt​−1−yt(1−rt)​)yt(1−yt)xjt​, where yt=11+e−(w2x2t+w1x1t+w0)y^t=\frac{1}{1+e^{-(w_2x_2^t+w_1x_1^t+w_0)}}yt=1+e−(w2​x2t​+w1​x1t​+w0​)1​ and using dday=y(1−y)\frac{d}{da}y = y(1-y)dad​y=y(1−y) for y=11+e−ay=\frac{1}{1+e^{-a}}y=1+e−a1​

Therefore, we get:

∂E∂j=−∑t(rt−yt)xjt\frac{\partial E}{\partial j} = - \sum_t (r^t-y^t) x_j^t∂j∂E​=−∑t​(rt−yt)xjt​

The iterative update rule for wj  (j≠0)w_j \,\, (j\neq 0)wj​(j=0) is given by:

wj←wj+η(−∂E∂wj)w_j \leftarrow w_j + \eta(-\frac{\partial E}{\partial w_j})wj​←wj​+η(−∂wj​∂E​) where η\etaη is the learning rate. This can be re-written as: wj←wj+η∑t(rt−yt)xjtw_j \leftarrow w_j + \eta \sum_t (r^t-y^t)x_j^twj​←wj​+η∑t​(rt−yt)xjt​ Similarly, w0←w0+η∑t(rt−yt)w_0 \leftarrow w_0 + \eta \sum_t (r^t-y^t)w0​←w0​+η∑t​(rt−yt)

Put generally, wj←wj+η Δwjw_j \leftarrow w_j + \eta \, \Delta w_jwj​←wj​+ηΔwj​

Pseudocode for Gradient Descent for Logistic Regression

# assume x_0 = 1

for j = 0...d:
    w_j = rand(-0.01, 0.01)
    repeat:
        for j = 0...d
            delta_w_j = 0
        for j = 1...N
            theta = 0
            for j = 0...d
                theta = theta + w_jx_j^t  # theta = wTx^t + w_0x_0 where x_0=1
            y = sigmoid(theta)
            delta_w_j = delta_w_j + (r^t-y)x_j^t
        for j = 0...d
            w_j = w_j + n * delta_w_j
    until convergence
PreviousModel SelectionNextDecision Trees

Last updated 5 years ago

Was this helpful?

We start with random weights. Then, we compute the derivative of the error function w.r.t. the weights. Then, we take a step in the direction of steepest descent i.e. in the negative direction of the gradient (since the gradient points to the direction of steepest ascent). This is done iteratively until we can no longer move in a direction that further minimizes the error.