Reinforcement Learning

In this kind of learning, the computer learns to choose the best course of action in a given situation, with the aim of maximizing a reward. It makes use of continual feedback to judge the correctness of its decisions.

One issue is that this feedback is not instant: it can be delayed.

Credit Assignment is another issue: how do we determine which action should be given the credit (or blame) for the outcome?

State changes are an issue as well: the state of the problem may (and usually will) change after every action is performed.

Reinforcement Learning is commonly used for learning to play games such as Chess, Checkers, and Go.

Usually, a grid world is used with $(s_t, a_t)$ pairs, i.e. the state at time $t$ and the action taken at time $t$. A policy tells the learner what action to perform in a given state. The best policy is one that maximizes the reward. Therefore, the aim of Reinforcement Learning is to learn this optimal policy.
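As a rough illustration, a policy for a tiny deterministic grid world can be written as a plain mapping from states to actions (the grid, state names, and actions below are made up for this sketch):

```python
# A minimal sketch of a deterministic policy for a 2x2 grid world.
# States are (row, col) cells; the actions steer the agent toward the goal at (1, 1).
policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 0): "right",
    (1, 1): "stay",   # goal state
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act((0, 0)))  # -> "right"
```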

The Exploration-Exploitation Trade-off: a cost is incurred while exploring for better alternatives, since the learner gives up the reward it could have earned by exploiting what it already knows. This trade-off is classically illustrated by the (multi-armed) Bandit Problem.
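As a rough sketch of the trade-off, here is the commonly used epsilon-greedy rule applied to a toy multi-armed bandit (the payout probabilities and the value of epsilon below are invented for illustration):

```python
import random

# Toy 3-armed bandit: each arm pays reward 1 with the given (hidden) probability.
true_payout = [0.2, 0.5, 0.8]   # illustrative values, unknown to the learner
q_estimate = [0.0, 0.0, 0.0]    # running average reward per arm
pulls = [0, 0, 0]
epsilon = 0.1                    # fraction of steps spent exploring

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                 # explore: try a random arm
    else:
        arm = q_estimate.index(max(q_estimate))   # exploit: pick the best-looking arm
    reward = 1 if random.random() < true_payout[arm] else 0
    pulls[arm] += 1
    # Incremental average keeps an estimate of each arm's mean reward.
    q_estimate[arm] += (reward - q_estimate[arm]) / pulls[arm]

print(q_estimate)  # should roughly recover [0.2, 0.5, 0.8]
```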

Note that it isn't necessary to maximize the reward at each individual step; what matters is maximizing the total (final) reward.

$Q^*(s,a)$ is the maximum discounted (cumulative) reward that can be obtained if the learner starts in state $s$ and takes action $a$ as its first step (the $*$ denotes optimality).
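In a deterministic world this definition can be unrolled one step (writing $\delta(s,a)$ for the state reached by doing $a$ in $s$; this notation is introduced here only for this equation):

$$Q^*(s,a) = r(s,a) + \gamma \max_{a'} Q^*(\delta(s,a), a')$$

i.e. the best that can be done from $(s,a)$ is the immediate reward plus the discounted best that can be done from the resulting state. This is the relation the update rule below tries to satisfy.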

Steps (Q-Learning for Deterministic Worlds)

For each $(s,a)$, initialize $\hat{Q}(s,a) = 0$. Observe the current state $s$. Do forever:

  • select an action $a$ and execute it

  • receive the reward $r$

  • observe the new state $s'$

  • update the estimate $\hat{Q}(s,a)$ as follows: $\hat{Q}(s,a) = r + \gamma \max_{a'} \hat{Q}(s',a')$

    ($\gamma$ discounts the reward; the max term is the maximum discounted reward obtainable starting from $s'$, using the current estimate $\hat{Q}$ in place of $Q^*$)

  • set $s \leftarrow s'$ (a runnable sketch of the full loop follows this list)
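Here is a minimal Python sketch of this loop on a tiny deterministic world (the one-dimensional grid, its reward of 100 at the goal, the random action selection, and the episode structure are all assumptions made for this example; only the update rule itself comes from the steps above):

```python
import random

# Tiny deterministic 1-D world: states 0..4, reaching state 4 gives reward 100.
N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = [-1, +1]  # move left or right

def step(s, a):
    """Deterministic transition: clamp to the grid, reward only on reaching the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 100 if s2 == GOAL else 0
    return s2, r

# For each (s, a), initialize Q_hat(s, a) = 0.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = random.randrange(N_STATES)        # observe the current state
    while s != GOAL:
        a = random.choice(ACTIONS)        # select an action (here: at random) and execute it
        s2, r = step(s, a)                # receive reward r, observe new state s'
        # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2                            # s <- s'

# The greedy policy should now move right (+1) from every non-goal state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```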
