Deep Reinforcement Learning Notes

Published in 2024

Introduction to Reinforcement Learning

Taxonomy of Machine Learning


Markov Decision Processes

An MDP is defined by:

  • Set of states: $S$
  • Set of actions: $A$
  • Transition function: $P(s' \mid s, a)$
  • Reward function: $R(s, a, s’)$
  • Start state: $s_{0}$
  • Discount factor: $\gamma$
  • Horizon: $H$
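
As a concrete illustration, these ingredients can be stored as plain Python containers. This is a minimal sketch assuming a made-up two-state MDP; the states, transition probabilities, and reward values are illustrative, not taken from the notes.

```python
# A toy 2-state MDP stored as plain Python containers (illustrative values).
states = ["A", "B"]
actions = ["stay", "move"]
gamma = 0.9        # discount factor
horizon = 10       # H

# Transition function P(s' | s, a): dict mapping (s, a) -> {s': prob}
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 0.8, "A": 0.2},   # moving can fail
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}

# Reward function R(s, a, s'): dict mapping (s, a, s') -> reward
# (here: reward 1 for landing in state "B", 0 otherwise)
R = {(s, a, s2): (1.0 if s2 == "B" else 0.0)
     for (s, a), probs in P.items() for s2 in probs}

start_state = "A"  # s_0
```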


\[\text{Goal: } \max_{\pi} \ {\mathbb E}\left[\sum_{t=0}^{H} \gamma^{t} R(s_{t}, a_{t}, s_{t+1}) \,\middle|\, \pi\right]\]
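
To make the objective concrete, here is a quick sanity check of the discounted sum for one hypothetical trajectory; the reward sequence and $\gamma$ below are made up for illustration.

```python
# Discounted return for one illustrative trajectory (rewards made up).
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]            # R(s_t, a_t, s_{t+1}) for t = 0..H
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 0.9**2 * 1 + 0.9**3 * 1 = 0.81 + 0.729 = 1.539
```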

Example MDP: Grid World

Exact Solution Methods


  • The objective in Optimal Control: given an MDP \((S, A, P, R, \gamma, H)\), find the optimal policy $\pi^{*}$.

  • Exact solution methods to solve MDPs:

    • Value Iteration
    • Policy Iteration

Note:

  • Exact solution methods are practical only for small, discrete state-action spaces, since tabular methods must store a value for every state.
  • They only work when the transition function is known (i.e., the dynamics of the environment are given).

Value Iteration

Optimal Value Function $V^{*}$:
\(V^{*}(s) = \max_{\pi} \, {\mathbb E}\left[ \sum_{t=0}^{H} \gamma^{t} R(s_{t}, a_{t}, s_{t+1}) \,\middle|\, \pi , s_{0} = s \right]\) = sum of discounted rewards when starting from state $s$ and acting optimally.

  • $V^{*}_{0}(s) = $ optimal value for state $s$ when $H = 0$
    • $V^{*}_{0}(s) = 0 \ \forall s$
  • $V^{*}_{1}(s) = $ optimal value for state $s$ when $H = 1$
    • $V^{*}_{1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left(R(s,a,s') + \gamma V_{0}^{*}(s')\right)$
  • $V^{*}_{2}(s) = $ optimal value for state $s$ when $H = 2$
    • $V^{*}_{2}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left(R(s,a,s') + \gamma V_{1}^{*}(s')\right)$
  • $V^{*}_{k}(s) = $ optimal value for state $s$ when $H = k$
    • $V^{*}_{k}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left(R(s,a,s') + \gamma V_{k-1}^{*}(s')\right)$
Algorithm:
  • Start with $V_{0}^{*}(s) = 0$ for all $s$
  • For $k = 1, \dots, H$:
    • For all states $s$ in $S$:
      • $V^{*}_{k}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\left(R(s,a,s') + \gamma V_{k-1}^{*}(s')\right)$
      • $\pi^{*}_{k}(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left(R(s,a,s') + \gamma V_{k-1}^{*}(s')\right)$

This is called a value update or Bellman update/backup
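
Below is a minimal Python sketch of this finite-horizon value iteration loop. It assumes the dictionary-based toy MDP from the earlier sketch (`states`, `actions`, `P`, `R`, `gamma`, `horizon`), which is an illustrative assumption rather than part of the notes.

```python
def value_iteration(states, actions, P, R, gamma, H):
    """Finite-horizon value iteration: returns V*_H and the greedy policy pi*_H."""
    V = {s: 0.0 for s in states}              # V*_0(s) = 0 for all s
    pi = {}
    for k in range(1, H + 1):
        V_new, pi_new = {}, {}
        for s in states:
            # Bellman backup: evaluate every available action against V*_{k-1}
            q = {a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                 for a in actions if (s, a) in P}
            best_a = max(q, key=q.get)        # argmax over actions
            V_new[s], pi_new[s] = q[best_a], best_a
        V, pi = V_new, pi_new                 # synchronous update for step k
    return V, pi

# Using the toy MDP defined in the earlier sketch:
V_star, pi_star = value_iteration(states, actions, P, R, gamma, horizon)
print(V_star, pi_star)
```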

Policy Iteration

Sampling Based Approximations and Function Fitting

Deep Q Networks

Policy Gradient Methods

Advanced Policy Gradient Methods

Stochastic Value Gradients

Model Based RL Methods

Inverse Reinforcement Learning