Reinforcement Learning

Unlike standard supervised or unsupervised learning, reinforcement learning aims to maximize the total reward collected in an environment by choosing an action at each timestamp given the state at that timestamp.

Note: The total reward being maximized is summed over all future timestamps.

Notations

We assume that we have an RL agent that takes an action $a_t$ at timestamp $t$ given a state $s_t$ and gets a reward $r_t$.

The total reward is $ R_t = \sum_{i=t}^{n} r_i$.

In practice we apply a discount factor $\gamma$ (with $0 \le \gamma \le 1$) so that rewards far in the future count less, which gives $R_t = \sum_{i=t}^{n} \gamma^{i-t} r_i$.
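As a small worked example of the discounted sum, the sketch below assumes the exponent is counted from the current timestamp, so the reward at timestamp $t$ itself is not discounted. The function name `discounted_return` and the sample rewards are illustrative, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[0] is r_t."""
    total = 0.0
    # Accumulate backwards using R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # r_t, r_{t+1}, r_{t+2}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\gamma = 1$ the function reduces to the plain undiscounted sum of rewards.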

Q function

The Q function gives the expected total reward we can receive by taking an action $a_t$ in the state $s_t$:

$ Q(s_t, a_t) = \mathbb{E}[R_t | s_t, a_t] $

Policy

The policy is another function, $\pi(s)$, that takes a state and determines the action to take in that state.

One option is a greedy policy that simply picks the action with the maximum expected total reward:

$\pi_*(s) = \arg\max_a Q(s, a)$

This leads to two different forms of RL algorithms:

  • Value Learning
    • Find Q function and use greedy policy.
  • Policy Learning
    • Learn the policy directly
    • Sample actions $a \sim \pi(s)$
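The two forms differ mainly in how an action is chosen. The sketch below contrasts them under simple assumptions: Q values and action probabilities are stored in plain dictionaries, and the names `greedy_action` and `sample_action` are hypothetical helpers introduced only for illustration.

```python
import random


def greedy_action(q_row):
    """Value learning: act greedily on the estimated Q values of one state."""
    return max(q_row, key=q_row.get)


def sample_action(policy_row, rng=random):
    """Policy learning: sample a ~ pi(s) from per-action probabilities."""
    actions = list(policy_row)
    weights = [policy_row[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


q_row = {"left": 0.2, "right": 1.3}        # illustrative Q(s, a) values
policy_row = {"left": 0.25, "right": 0.75}  # illustrative pi(a | s)

print(greedy_action(q_row))       # always the same action for this state
print(sample_action(policy_row))  # varies from call to call
```

The greedy choice is deterministic for a fixed Q function, while the sampled choice can differ between calls for the same state.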

Value Learning

A deep neural network can take the state as input and output a Q value $Q_k$ for each possible action $a_k$.

We can design a loss function based on the reward received for action $a_t$, i.e. $r_t$, plus the discounted value of the best action in the next state, $\gamma \max_{a'} Q(s', a')$.

Q Loss = $ \| r_t + \gamma \max_{a'} Q(s', a') - Q(s, a) \|_2 $
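To make the loss concrete, here is a minimal sketch that evaluates it on a tiny hand-made batch. It assumes each transition is given as a tuple of the current estimate $Q(s, a)$, the reward, and the list of next-state Q values; the names `td_error` and `q_loss` are hypothetical. Many implementations use a mean squared error instead of the plain L2 norm used here.

```python
import math


def td_error(q_sa, reward, next_q_values, gamma):
    """(r + gamma * max_a' Q(s', a')) - Q(s, a) for one transition."""
    target = reward + gamma * max(next_q_values)
    return target - q_sa


def q_loss(batch, gamma):
    """L2 norm of the errors over a batch of (q_sa, reward, next_q_values)."""
    errors = [td_error(q, r, nq, gamma) for q, r, nq in batch]
    return math.sqrt(sum(e * e for e in errors))


batch = [
    (1.0, 1.0, [0.0, 2.0]),  # target = 1 + 0.5*2 = 2.0, error = 1.0
    (0.5, 0.0, [1.0, 0.0]),  # target = 0 + 0.5*1 = 0.5, error = 0.0
]
print(q_loss(batch, gamma=0.5))  # 1.0
```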

Deep Q Learning

  • Downsides:
    • Works only with a discrete action space.
    • No stochasticity: the greedy policy derived from Q is deterministic.

Steps in Deep Q Learning:

  • Provide the state of the environment to the agent.
  • The agent uses the Q-Network and the Target Network to get the Q-values of all possible actions in the given state.
    • The Q-Network is used to pick the current action, which is predicted to achieve a Q value $Q$.
    • Taking the current action leads to a new state. The reward observed for the current action plus the discounted Q value of the next state, this time computed with the Target Network ($Q'$), gives the expected total reward.
    • The difference between the expected reward $r_t + \gamma Q'_{t+1}$ and the reward predicted by the Q-Network, $Q$, needs to be minimized.
  • Pick the action a based on the epsilon value: with probability epsilon select a random action (exploration), otherwise select the action with the maximum Q-value (exploitation).
  • Perform action a
  • Observe reward r and the next state s’
  • Store this information in the experience replay memory as a tuple (s, s’, a, r)
  • Sample random batches from experience replay memory and perform training of the Q-Network.
  • Every N iterations, copy the weights from the Q-Network to the Target Network.
  • Repeat steps 2-7 for each episode
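The supporting pieces of the steps above (epsilon-based action selection, experience replay, and the periodic target update) can be sketched without any deep learning library. This is a simplified illustration, not a full Deep Q Learning implementation: the class `ReplayMemory` and the functions `epsilon_greedy` and `maybe_sync_target` are hypothetical names, and network weights are stood in for by plain Python lists.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (s, s_next, a, r) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first

    def store(self, s, s_next, a, r):
        self.buffer.append((s, s_next, a, r))

    def sample(self, batch_size):
        # Random batches break the correlation between consecutive steps.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: list with one Q estimate per action index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation


def maybe_sync_target(q_weights, target_weights, step, every_n):
    """Copy Q-Network weights into the Target Network every `every_n` steps."""
    if step % every_n == 0:
        target_weights[:] = q_weights
        return True
    return False


memory = ReplayMemory(capacity=1000)
memory.store(s=0, s_next=1, a=epsilon_greedy([0.1, 0.9], epsilon=0.1), r=1.0)
print(len(memory), memory.sample(batch_size=32))
```

In a real agent, the sampled batch would be fed to the Q-Network's training step, with targets computed from the Target Network.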

To summarize, the main steps are:

  • Get the current action and calculate the current and expected rewards
    • Use exploration vs exploitation while choosing the current action.
  • Save experiences $(s, s', a, r)$ and perform experience replay
  • Update target network

Policy gradient

Policy gradient methods learn the probability that an action is good given a state.

We learn a distribution $P(a | s)$. For example, if we assume $P$ is a normal distribution, we learn the parameters $\mu$ and $\sigma$ that define this distribution over continuous actions, and then sample an action from it.
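As an illustration of sampling a continuous action from a normal distribution, the sketch below assumes a hypothetical linear policy that maps a state vector to $\mu$ and $\log \sigma$. The weights, the function names `gaussian_policy` and `sample_continuous_action`, and the use of an exponential to keep $\sigma$ positive are all assumptions made for this example; in practice these parameters would typically come from a trained neural network.

```python
import math
import random


def gaussian_policy(state, w_mu, w_log_sigma):
    """Hypothetical linear policy: state vector -> (mu, sigma)."""
    mu = sum(w * x for w, x in zip(w_mu, state))
    log_sigma = sum(w * x for w, x in zip(w_log_sigma, state))
    return mu, math.exp(log_sigma)  # exp keeps sigma strictly positive


def sample_continuous_action(mu, sigma, rng=random):
    """Draw one action a ~ Normal(mu, sigma)."""
    return rng.gauss(mu, sigma)


state = [1.0, 2.0]
mu, sigma = gaussian_policy(state, w_mu=[0.5, 0.25], w_log_sigma=[0.0, 0.0])
print(mu, sigma)                            # 1.0 1.0
print(sample_continuous_action(mu, sigma))  # a random value centered on mu
```

A larger $\sigma$ spreads the sampled actions further from $\mu$, which acts as built-in exploration.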