Deep Learning Specialization - Coursera
Speech Recognition

Speech recognition is the conversion of an audio input to a textual output (transcript). Spectrogram features of the audio are used as the input.
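As a minimal sketch of this feature extraction (assuming a 16 kHz mono recording already loaded as a NumPy array; the window and hop sizes are illustrative choices, not from the course):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                      # assumed sample rate (16 kHz)
audio = np.random.randn(10 * fs)  # placeholder for 10 s of real audio

# 25 ms windows with a 10 ms hop; each column of Sxx is the
# spectral feature vector for one time step of the network input.
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)
features = np.log(Sxx + 1e-10).T  # shape: (time_steps, freq_bins)
```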

One way to perform speech recognition is to use an attention model.

Speech recognition can also be performed using the CTC (Connectionist Temporal Classification) cost. In this method, an RNN with the same number of output time steps as input time steps is used (a unidirectional RNN for simplicity, though in practice a bidirectional one is common). Because the audio input always has far more time steps than the transcript has characters, the network's output contains runs of repeating characters separated by a special blank symbol.

Say the input speech was "the quick brown fox".

The output would be:

ttt_hh_ee<space>qq_uu_... (where "_" denotes the blank symbol and <space> an actual space)

This is an acceptable output, because the target transcript can be recovered by collapsing repeated characters that are not separated by blanks and then removing the blanks.
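A minimal sketch of this collapsing rule (the function name ctc_collapse and the "_" blank symbol are illustrative choices, not part of the course material):

```python
def ctc_collapse(output, blank="_"):
    """Collapse a CTC output string: merge runs of repeated
    characters, then remove the blank symbols."""
    collapsed = []
    prev = None
    for ch in output:
        if ch != prev:           # keep only the first character of each run
            collapsed.append(ch)
        prev = ch
    # blanks separated genuine double letters; drop them at the end
    return "".join(ch for ch in collapsed if ch != blank)

print(ctc_collapse("ttt_hh_ee"))  # -> "the"
```

Note that a genuine double letter survives because the blank breaks the run: "bb_oo_oo_kk" collapses to "book".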

Although speech recognition requires a large amount of training data, trigger word detection can be performed with smaller amounts of data.

Trigger Word Detection

Trigger words are words used to wake up smart devices, such as "Ok Google" and "Alexa".

While creating a training set, we set the labels for a few time steps to 1 right after the trigger word has been said in the audio, and the labels for all remaining time steps to 0. An RNN can then be trained on these labeled sequences to detect the trigger word, as in the sketch below.
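A minimal sketch of this labeling scheme in NumPy (the sequence length, trigger positions, and the 50-step label window are illustrative assumptions):

```python
import numpy as np

def make_labels(n_steps, trigger_end_steps, ones_per_trigger=50):
    """Build a 0/1 label sequence: 1 for a short window of time
    steps right after each point where the trigger word ends,
    0 everywhere else."""
    y = np.zeros(n_steps, dtype=np.float32)
    for end in trigger_end_steps:
        y[end : end + ones_per_trigger] = 1.0
    return y

# e.g. the trigger word ends at steps 500 and 3200 of a 10,000-step clip
y = make_labels(10000, [500, 3200])
```

Setting a window of 1s rather than a single 1 per trigger word makes the labels slightly less imbalanced, since almost all time steps would otherwise be labeled 0.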