Deep Learning Specialization - Coursera

Beam Search

As discussed, the aim of machine translation is to translate from one language (say French) to another language (say English) using a many-to-many encoder-decoder RNN.

The translated sentence is generated word by word.

Our goal is to find the most probable translation overall, not just the most likely next word at each step.

The Beam Search algorithm helps us do so. It has a parameter called the beam width (say b).

First, the algorithm picks the b most likely words as candidates for the first word of the translation. Then, for each of these b partial sentences, it considers every possible next word and keeps only the b most probable partial sentences overall, repeating until the sentences end. (If b = 1, Beam Search reduces to greedy search, which commits to the single most likely word at each step and often misses better overall translations.) Larger values of b tend to give more accurate translations, but each step becomes more computationally expensive.

Also, note that unlike exact search algorithms such as Breadth-First Search (BFS) or Depth-First Search (DFS), Beam Search is an approximate search algorithm and does not always find the exact maximum, so it does not always output the most likely sentence.
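
To make the procedure concrete, here is a minimal, self-contained Python sketch (not code from the course). It assumes a hypothetical model interface log_probs(prefix) that returns log P(next word | x, prefix) for each candidate next word, with "<eos>" marking the end of a sentence; a toy probability table stands in for the decoder RNN.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
                beam_width: int = 3,
                max_len: int = 20) -> Tuple[str, ...]:
    """Return the highest-scoring sentence found with the given beam width."""
    # Each beam entry is (sum of log probabilities, partial sentence).
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((score, prefix))   # sentence already finished
                continue
            for word, lp in log_probs(prefix).items():
                candidates.append((score + lp, prefix + (word,)))
        # Keep only the beam_width most probable partial sentences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(p and p[-1] == "<eos>" for _, p in beams):
            break
    return max(beams, key=lambda c: c[0])[1]


# Toy "model": a fixed probability table standing in for the decoder's softmax.
toy_table = {
    (): {"Jane": 0.5, "In": 0.3, "September": 0.2},
    ("Jane",): {"visits": 0.8, "is": 0.2},
    ("Jane", "visits"): {"Africa": 0.9, "<eos>": 0.1},
    ("Jane", "visits", "Africa"): {"<eos>": 1.0},
    ("Jane", "is"): {"visiting": 1.0},
    ("Jane", "is", "visiting"): {"Africa": 1.0},
    ("Jane", "is", "visiting", "Africa"): {"<eos>": 1.0},
    ("In",): {"September": 1.0},
    ("In", "September"): {"<eos>": 1.0},
    ("September",): {"<eos>": 1.0},
}


def toy_log_probs(prefix):
    return {w: math.log(p)
            for w, p in toy_table.get(tuple(prefix), {"<eos>": 1.0}).items()}


print(beam_search(toy_log_probs, beam_width=3))
# ('Jane', 'visits', 'Africa', '<eos>')
```

Note that this sketch scores sentences by the raw sum of log probabilities; the length normalization described next divides that sum by the sentence length.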

Optimizing Beam Search - Length Normalization

This is one way to optimize the Beam Search algorithm.

Since Beam Search internally maximizes a product of many probabilities (each less than 1), the result is a tiny number that is hard for a computer to store accurately. So, instead of maximizing this product, we maximize the sum of the logs of the probabilities. Because the log function is monotonically increasing, this gives the same result, but it is numerically more stable and less prone to rounding errors.

Another unintended effect of the product is that it gets smaller for longer sentences, since there are more probabilities to multiply, which biases the algorithm toward shorter sentences even when a longer translation would be better. To correct this, we normalize the sum of log probabilities by the number of words in the output sentence, say n; in practice, dividing by $n^{0.7}$ instead of $n$ is a softer normalization that works well empirically.
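
Written out with the output words $y^{\langle 1 \rangle}, \ldots, y^{\langle n \rangle}$ (following the course's notation), the length-normalized objective that Beam Search maximizes is:

$$\hat{y} = \arg\max_{y} \; \frac{1}{n^{0.7}} \sum_{t=1}^{n} \log P\!\left(y^{\langle t \rangle} \mid x,\, y^{\langle 1 \rangle}, \ldots, y^{\langle t-1 \rangle}\right)$$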

Error Analysis in Beam Search

Say the actual translation is y_actual, but Beam Search produced y_predicted (x is the input sentence).

If P(y_actual | x) > P(y_predicted | x), then Beam Search is at fault: the RNN assigns a higher probability to the correct sentence, but the search failed to find it. In such a case, increasing the beam width may help.

If P(y_actual | x) <= P(y_predicted | x), then the RNN is at fault, because it assigns the wrong sentence a higher probability. In such a case, it may help to get more training data, add regularization, or try a different network architecture.
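
As a rough illustration (not code from the course), this check can be written as a small helper; sentence_log_prob is a hypothetical function assumed to return log P(y | x) for a full sentence under the trained encoder-decoder model.

```python
def attribute_error(sentence_log_prob, x, y_actual, y_predicted):
    """Decide whether Beam Search or the RNN is to blame for one mistranslation."""
    lp_actual = sentence_log_prob(x, y_actual)        # log P(y_actual | x)
    lp_predicted = sentence_log_prob(x, y_predicted)  # log P(y_predicted | x)
    if lp_actual > lp_predicted:
        # The model prefers the correct sentence, but the search never found it.
        return "beam search at fault: try a larger beam width"
    # The model itself gives the wrong sentence the higher probability.
    return "RNN at fault: more data, regularization, or another architecture"
```

Running this check over many dev-set errors and counting how often each case occurs tells you whether to spend effort on the search or on the model.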
