# Reinforcement Learning

In reinforcement learning, the computer learns to choose the best course of action in a given situation, with the aim of maximizing a **reward**. It relies on continual feedback to judge the correctness of its decisions.

One issue is that this feedback is not instant: it can be **delayed**.

**Credit Assignment** is another issue: when a reward (or penalty) eventually arrives, how do we determine which of the preceding actions deserves the credit or blame?

**State changes** are an issue as well, because the state of the problem changes after every action is performed.

Reinforcement Learning is commonly used in learning to play games such as Chess, Checkers, and Go.

Usually, a **grid world** is used with $$(s_t, a_t)$$ pairs, i.e. the state and the action at time $$t$$. A **policy** tells the learner what action to perform in a given state. The best policy is one that maximizes the reward. Therefore, *the aim of Reinforcement Learning is to learn this optimal policy*.
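
To make this concrete, a deterministic policy can be represented as a plain lookup table from states to actions. The sketch below is only an illustration; the grid, state names, and actions are assumptions, not part of these notes:

```python
# A hypothetical 2x2 grid world: states are (row, col) cells.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
actions = ["up", "down", "left", "right"]

# A deterministic policy: for every state, the action to perform there.
policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 0): "right",
    (1, 1): "up",   # assume (1, 1) is the goal cell; the action there is moot
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]
```

Learning the optimal policy then amounts to finding the table entries that maximize the (discounted) reward.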

The **Exploration-Exploitation Trade-off**: a cost is incurred while exploring for better alternatives, so the learner must balance trying out new actions (exploration) against sticking with the best action found so far (exploitation). The classic setting for studying this trade-off is the **Bandit Problem**.
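
One common way to manage this trade-off (standard practice, though not specified in these notes) is an ε-greedy strategy: with a small probability explore a random action, otherwise exploit the best-known one. A minimal sketch for a k-armed bandit, with illustrative names and values:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Pick an arm: explore a random one with probability epsilon,
    otherwise exploit the arm with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))                    # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit

def update_estimate(estimates, counts, arm, reward):
    """Incrementally update the running-average value estimate of an arm."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

With ε = 0 the learner never explores and can get stuck on a suboptimal arm; with ε = 1 it never exploits what it has learned.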

Note that the learner need not maximize the immediate reward at every individual step; what matters is maximizing the total (cumulative) reward in the end.
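
Concretely, the quantity being maximized is usually the cumulative discounted reward. In the standard formulation, with discount factor $$0 \le \gamma < 1$$ (the same $$\gamma$$ that appears in the update rule below):

$$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $$r_{t+i}$$ is the reward received $$i$$ steps after starting in state $$s_t$$ and following policy $$\pi$$.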

$$Q^*(s,a)$$ is the maximum discounted cumulative reward that can be obtained if the learner starts in state $$s$$, performs action $$a$$ as its first step, and acts optimally thereafter (the $$*$$ denotes optimality).
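
In a deterministic world where performing action $$a$$ in state $$s$$ yields reward $$r(s,a)$$ and successor state $$\delta(s,a)$$ (this $$r$$/$$\delta$$ notation is introduced here for clarity; it is not in the notes above), $$Q^*$$ satisfies the standard recurrence

$$Q^*(s,a) = r(s,a) + \gamma \max_{a'} Q^*(\delta(s,a), a')$$

which is exactly the relation the Q-Learning update below approximates.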

## Steps (Q-Learning for Deterministic Worlds)

For each $$(s,a)$$, initialize $$\hat{Q}(s,a)=0$$\
Observe the current state $$s$$\
Do forever:

* select an action $$a$$ and execute it
* receive reward $$r$$
* observe the new state $$s'$$
* update the estimate of $$\hat{Q}(s,a)$$ as follows: $$\hat{Q}(s,a) \leftarrow r+\gamma \max_{a'}\hat{Q}(s',a')$$

  ($$\gamma$$ is the discount factor for future rewards; the max term is the maximum discounted reward obtainable starting from $$s'$$, assuming $$\hat{Q}=Q^*$$)
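
A runnable Python sketch of this loop follows. The tiny chain environment, the `step` function, and all parameter values are illustrative assumptions, not part of the original notes:

```python
import random

def q_learning(states, actions, step, start, gamma=0.9, episodes=1000, max_steps=100):
    # For each (s, a), initialize Q_hat(s, a) = 0
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start                                 # observe the current state s
        for _ in range(max_steps):
            a = random.choice(actions)            # select an action a and execute it
            r, s_next = step(s, a)                # receive reward r, observe new state s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

# Toy deterministic world: states 0..3 on a line; reaching state 3 pays reward 1.
def step(s, a):
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return (1.0 if s_next == 3 else 0.0), s_next

Q = q_learning(states=[0, 1, 2, 3], actions=["left", "right"], step=step, start=0)
print(max(["left", "right"], key=lambda a: Q[(0, a)]))   # expected: "right"
```

Actions are chosen uniformly at random here for simplicity; in practice an exploration strategy such as the ε-greedy one above is used instead.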
