CS-GY 6923: Machine Learning
MLP

MLP (multilayer perceptron) is another name for a feedforward neural network.

Advantages

  • Good accuracy even on data that is far from linearly separable

  • Can learn complicated functions or concepts

Disadvantages

  • Danger of overfitting

  • Slow to train

Some Notation

Consider a simple 2-layer neural network (the input layer is not counted as a layer):

1 output unit, H hidden nodes (+1 dummy bias unit), and d input nodes (+1 bias unit). Fully connected: every node in a layer is connected to every node in the previous layer.

This gives $(H+1) + H(d+1)$ weights to learn: $H+1$ from the hidden layer (with its bias unit) to the output, and $d+1$ into each of the $H$ hidden units.

$w_{hj}$ is the weight on the edge from input node $x_j$ to hidden node $h$; $v_h$ is the weight on the edge from hidden node $h$ to the output node.

$Z_0, Z_1, \dots, Z_H$ (with $Z_0 = 1$) are the activations of the hidden layer, usually sigmoid, i.e. $Z_h = \frac{1}{1 + e^{-w_h^T x}}$
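To make the notation concrete, here is a minimal NumPy sketch (an illustration, not part of the original notes) of the forward pass of this network, assuming sigmoid hidden units and a linear output; `W` has shape (H, d+1) and `v` has shape (H+1,), matching the weight count above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, v):
    """Forward pass of a 2-layer MLP: d inputs, H sigmoid hidden units, 1 linear output."""
    x_aug = np.concatenate(([1.0], x))   # prepend bias input x_0 = 1
    Z = sigmoid(W @ x_aug)               # hidden activations Z_1, ..., Z_H
    Z_aug = np.concatenate(([1.0], Z))   # prepend dummy bias unit Z_0 = 1
    y = v @ Z_aug                        # linear output y = v^T Z
    return y, Z_aug

# Example: d = 4 inputs, H = 3 hidden units -> (3+1) + 3*(4+1) = 19 weights
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 5))
v = rng.normal(scale=0.1, size=4)
y, Z = forward(rng.normal(size=4), W, v)
```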

The Error Function for Regression is given by:

$$E(W, v) = \frac{1}{2}\sum_t (r^t - y^t)^2$$

i.e. half the sum of squared errors, with $W$ = the weights $w_{hj}$, $v$ = the weights $v_h$, and $y = v^T Z$, i.e. a _linear_ output activation: $y = v_0 \cdot 1 + v_1 Z_1 + \dots + v_H Z_H$

The Error Function for Classification is given by:

$$E(W, v) = -\sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]$$

i.e. the cross-entropy loss,

where $y = \frac{1}{1 + e^{-v^T Z}}$, i.e. the sigmoid function.
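As an illustration (the helper names below are made up, not from the notes), both error functions over a training set with targets `r` and network outputs `y` stored as NumPy arrays:

```python
import numpy as np

def regression_error(r, y):
    # E(W, v) = 1/2 * sum_t (r^t - y^t)^2
    return 0.5 * np.sum((r - y) ** 2)

def classification_error(r, y, eps=1e-12):
    # Cross-entropy: E(W, v) = -sum_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]
    # y is the sigmoid output; clip to avoid log(0)
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(r * np.log(y) + (1.0 - r) * np.log(1.0 - y))
```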

Batch Gradient Descent

We must find W, v that minimize the error.

1. Initialize W and v (e.g. to small random values)
2. Repeat until convergence:
       compute y^t for each x^t in the training set (forward pass)
       update each v_h and w_hj by doing:
           v_h  = v_h  - eta * dE/dv_h
           w_hj = w_hj - eta * dE/dw_hj

We have:

$$\partial E/\partial v_h = -\sum_t (r^t - y^t) Z_h^t$$

$$\partial E/\partial w_{hj} = -\sum_t (r^t - y^t)\, v_h Z_h^t (1 - Z_h^t)\, x_j^t$$

(computed using the chain rule, i.e. $\partial E/\partial w_{hj} = \sum_t \frac{\partial E}{\partial y^t} \cdot \frac{\partial y^t}{\partial Z_h^t} \cdot \frac{\partial Z_h^t}{\partial w_{hj}}$)

This technique is called backpropagation.
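Putting the pieces together, here is a minimal batch gradient descent / backpropagation sketch for the regression case (illustrative only, not the course's reference implementation; `eta`, `epochs`, and the random initialization scale are assumed hyperparameters):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H, eta=0.01, epochs=1000, seed=0):
    """Batch gradient descent for a 2-layer MLP regressor.

    X : (N, d) training inputs, r : (N,) targets, H : number of hidden units.
    """
    N, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(H, d + 1))   # input -> hidden weights w_hj
    v = rng.normal(scale=0.1, size=H + 1)        # hidden -> output weights v_h

    X_aug = np.hstack([np.ones((N, 1)), X])      # prepend bias input x_0 = 1

    for _ in range(epochs):
        # Forward pass over the whole batch
        Z = sigmoid(X_aug @ W.T)                 # (N, H) hidden activations
        Z_aug = np.hstack([np.ones((N, 1)), Z])  # prepend Z_0 = 1
        y = Z_aug @ v                            # (N,) linear outputs

        err = r - y                              # (r^t - y^t) for each t

        # Backpropagation: gradients of E = 1/2 sum_t (r^t - y^t)^2
        dE_dv = -(Z_aug.T @ err)                                    # -sum_t (r-y) Z_h
        dE_dW = -((err[:, None] * v[1:] * Z * (1 - Z)).T @ X_aug)   # -sum_t (r-y) v_h Z_h (1-Z_h) x_j

        # Gradient descent updates
        v -= eta * dE_dv
        W -= eta * dE_dW

    return W, v
```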
