{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Io_4iovMTlzT"
},
"source": [
"# Tutorial - Value Iteration and Q-Learning\n",
"---------------------------------\n",
"\n",
"In this tutorial, you will:\n",
"\n",
"* Implement the value iteration algorithm to approximate the value function when *a model of the environment is available*.\n",
"* Implement the Q-Learning algorithm to approximate the value function when *the model is unknown*, that is, the agent must learn through interactions.\n",
"\n",
"We start with a short review of these algorithms.\n",
"\n",
"\n",
"## Markov decision processes and value functions\n",
"\n",
"In reinforcement learning, an agent interacts with an enviroment by taking actions and observing rewards. Its goal is to learn a *policy*, that is, a mapping from states to actions, that maximizes the amount of reward it gathers.\n",
"\n",
"The enviroment is modeled as a __Markov decision process (MDP)__, defined by a set of states $\\mathcal{S}$, a set of actions $\\mathcal{A}$, a reward function $r(s, a)$ and transition probabilities $P(s'|s,a)$. When an agent takes action $a$ in state $s$, it receives a random reward with mean $r(s,a)$ and makes a transion to a state $s'$ distributed according to $P(s'|s,a)$.\n",
"\n",
"A __policy__ $\\pi$ is such that $\\pi(a|s)$ gives the probability of choosing an action $a$ in state $s$. __If the policy is deterministic__, we denote by $\\pi(s)$ the action that it chooses in state $s$. We are interested in finding a policy that maximizes the value function $V^\\pi$, defined as \n",
"\n",
"$$\n",
"V^\\pi(s) = \\sum_{a\\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a), \n",
"\\quad \\text{where} \\quad \n",
"Q^\\pi(s, a) = \\mathbf{E}\\left[ \\sum_{t=0}^\\infty \\gamma^t r(S_t, A_t) \\Big| S_0 = s, A_0 = a\\right].\n",
"$$\n",
"and represents the mean of the sum of discounted rewards gathered by the policy $\\pi$ in the MDP, where $\\gamma \\in [0, 1[$ is a discount factor ensuring the convergence of the sum. \n",
"\n",
"The __action-value function__ $Q^\\pi$ is the __fixed point of the Bellman operator $T^\\pi$__:\n",
"\n",
"$$ \n",
"Q^\\pi(s, a) = T^\\pi Q^\\pi(s, a)\n",
"$$\n",
"where, for any function $f: \\mathcal{S}\\times\\mathcal{A} \\to \\mathbb{R}$\n",
"$$\n",
"T^\\pi f(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s'|s,a) \\left(\\sum_{a'}\\pi(a'|s')f(s',a')\\right) \n",
"$$\n",
"\n",
"\n",
"The __optimal value function__, defined as $V^*(s) = \\max_\\pi V^\\pi(s)$ can be shown to satisfy $V^*(s) = \\max_a Q^*(s, a)$, where $Q^*$ is the __fixed point of the optimal Bellman operator $T^*$__: \n",
"\n",
"$$ \n",
"Q^*(s, a) = T^* Q^*(s, a)\n",
"$$\n",
"where, for any function $f: \\mathcal{S}\\times\\mathcal{A} \\to \\mathbb{R}$\n",
"$$\n",
"T^* f(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s'|s,a) \\max_{a'} f(s', a')\n",
"$$\n",
"and there exists an __optimal policy__ which is deterministic, given by $\\pi^*(s) \\in \\arg\\max_a Q^*(s, a)$.\n",
"\n",
"\n",
"## Value iteration\n",
"\n",
"If both the reward function $r$ and the transition probablities $P$ are known, we can compute $Q^*$ using value iteration, which proceeds as follows:\n",
"\n",
"1. Start with arbitrary $Q_0$, set $t=0$.\n",
"2. Compute $Q_{t+1}(s, a) = T^*Q_t(s,a)$ for every $(s, a)$.\n",
"3. If $\\max_{s,a} | Q_{t+1}(s, a) - Q_t(s,a)| \\leq \\varepsilon$, return $Q_{t}$. Otherwise, set $t \\gets t+1$ and go back to 2. \n",
"\n",
"The convergence is guaranteed by the contraction property of the Bellman operator, and $Q_{t+1}$ can be shown to be a good approximation of $Q^*$ for small epsilon. \n",
"\n",
"__Question__: Can you bound the error $\\max_{s,a} | Q^*(s, a) - Q_t(s,a)|$ as a function of $\\gamma$ and $\\varepsilon$?\n",
"\n",
"## Q-Learning\n",
"\n",
"In value iteration, we need to know $r$ and $P$ to implement the Bellman operator. When these quantities are not available, we can approximate $Q^*$ using *samples* from the environment with the Q-Learning algorithm.\n",
"\n",
"Q-Learning with __$\\varepsilon$-greedy exploration__ proceeds as follows:\n",
"\n",
"1. Start with arbitrary $Q_0$, get starting state $s_0$, set $t=0$.\n",
"2. Choosing action $a_t$: \n",
" * With probability $\\varepsilon$ choose $a_t$ randomly (uniform distribution) \n",
" * With probability $1-\\varepsilon$, choose $a_t \\in \\arg\\max_a Q_t(s_t, a)$.\n",
"3. Take action $a_t$, observe next state $s_{t+1}$ and reward $r_t$.\n",
"4. Compute error $\\delta_t = r_t + \\gamma \\max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$.\n",
"5. Update \n",
" * $Q_{t+1}(s, a) = Q_t(s, a) + \\alpha_t(s,a) \\delta_t$, __if $s=s_t$ and $a=a_t$__\n",
" * $Q_{t+1}(s, a) = Q_{t}(s, a)$ otherwise.\n",
"\n",
"Here, $\\alpha_t(s,a)$ is a learning rate that can depend, for instance, on the number of times the algorithm has visited the state-action pair $(s, a)$. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KYq9-63OR8RW"
},
"source": [
"# Colab setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AxepTGrNR3DX",
"outputId": "42376421-d387-42a8-a943-0d1c5b5b3db0"
},
"outputs": [],
"source": [
"if 'google.colab' in str(get_ipython()):\n",
" print(\"Installing packages, please wait a few moments. Restart the runtime after the installation.\")\n",
"\n",
" # install rlberry library\n",
" !pip install git+https://github.com/rlberry-py/rlberry.git@v0.3.0#egg=rlberry[default] > /dev/null 2>&1\n",
"\n",
" # packages required to show video\n",
" !pip install pyvirtualdisplay > /dev/null 2>&1\n",
" !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3_bPhqKlSiF0",
"outputId": "959689cb-1e62-41f3-c1ac-71741bd5bb48"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create directory for saving videos\n",
"!mkdir videos > /dev/null 2>&1\n",
"\n",
"# The following code is will be used to visualize the environments.\n",
"import base64\n",
"from pyvirtualdisplay import Display\n",
"from IPython import display as ipythondisplay\n",
"from IPython.display import clear_output\n",
"from pathlib import Path\n",
"\n",
"def show_video(filename=None, directory='./videos'):\n",
" \"\"\"\n",
" Either show all videos in a directory (if filename is None) or \n",
" show video corresponding to filename.\n",
" \"\"\"\n",
" html = []\n",
" if filename is not None:\n",
" files = Path('./').glob(filename)\n",
" else:\n",
" files = Path(directory).glob(\"*.mp4\")\n",
" for mp4 in files:\n",
" print(mp4)\n",
" video_b64 = base64.b64encode(mp4.read_bytes())\n",
" html.append(''''''.format(mp4, video_b64.decode('ascii')))\n",
" ipythondisplay.display(ipythondisplay.HTML(data=\" \".join(html)))\n",
" \n",
"from pyvirtualdisplay import Display\n",
"display = Display(visible=0, size=(800, 800))\n",
"display.start()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "ZYZCXMpisE_O"
},
"outputs": [],
"source": [
"# other required libraries\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zOPiAupGmkxh"
},
"source": [
"# Warm up: interacting with a reinforcement learning environment"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 578
},
"id": "6IZ0bVAlTjpZ",
"outputId": "60cf10f4-8f13-4264-c281-1194beff4c1d"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[INFO] OpenGL_accelerate module loaded \n",
"[INFO] Using accelerated ArrayDatatype \n",
"[INFO] Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt \n",
"[INFO] Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of states = 13\n",
"number of actions = 4\n",
"transition probabilities from state 0 by taking action 1: [0. 0.9 0. 0. 0.1 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
"mean reward in state 0 for action 1 = 0.0\n",
"videos/random_policy.mp4\n"
]
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from rlberry.envs import GridWorld\n",
"\n",
"# A GridWorld is an environment where an agent moves in a 2d grid and aims to reach the state which gives a reward.\n",
"env = GridWorld(nrows=3, ncols=5, walls=((0,2),(1, 2)), success_probability=0.9)\n",
"\n",
"# Number of states and actions\n",
"print(\"number of states = \", env.observation_space.n)\n",
"print(\"number of actions = \", env.action_space.n)\n",
"\n",
"# Transitions probabilities, env.P[s, a, s'] = P(s'|s, a)\n",
"print(\"transition probabilities from state 0 by taking action 1: \", env.P[0, 1, :])\n",
"\n",
"# Reward function: env.R[s, a] = r(s, a)\n",
"print(\"mean reward in state 0 for action 1 = \", env.R[0, 1])\n",
"\n",
"# Following a random policy \n",
"state = env.reset() # initial state \n",
"env.enable_rendering() # save states for visualization\n",
"for tt in range(100): # interact for 100 time steps\n",
" action = env.action_space.sample() # random action, a good RL agent must have a better strategy!\n",
" next_state, reward, is_terminal, info = env.step(action)\n",
" if is_terminal:\n",
" break\n",
" state = next_state\n",
"\n",
"# save video \n",
"env.save_video('./videos/random_policy.mp4', framerate=10)\n",
"# clear rendering data\n",
"env.clear_render_buffer()\n",
"env.disable_rendering()\n",
"# see video\n",
"show_video(filename='./videos/random_policy.mp4')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "snmFW5Bzqpwj"
},
"source": [
"# Implementing Value Iteration\n",
"\n",
"1. Write a function ``bellman_operator`` that takes as input a function $Q$ and returns $T^* Q$.\n",
"2. Write a function ``value_iteration`` that returns a function $Q$ such that $||Q-T^* Q||_\\infty \\leq \\varepsilon$\n",
"3. Evaluate the performance of the policy $\\pi(s) = \\arg\\max_a Q(s, a)$, where Q is returned by ``value_iteration``."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "RPIOmpjkq0YX"
},
"outputs": [],
"source": [
"def bellman_operator(Q, env, gamma=0.99):\n",
" S = env.observation_space.n\n",
" A = env.action_space.n \n",
" TQ = np.zeros((S, A))\n",
"\n",
" # to complete...\n",
"\n",
" return TQ"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "tEKAtA1LsYFx"
},
"outputs": [],
"source": [
"def value_iteration(env, gamma=0.99, epsilon=1e-6):\n",
" S = env.observation_space.n\n",
" A = env.action_space.n \n",
" Q = np.zeros((S, A))\n",
"\n",
" # to complete...\n",
"\n",
" return Q"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 440
},
"id": "rZ7k-rDLssSk",
"outputId": "7731f953-093d-4c3b-e84f-1b356eb892c3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"videos/value_iteration_policy.mp4\n"
]
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"Q_vi = value_iteration(env)\n",
"\n",
"# Following value iteration policy \n",
"state = env.reset() \n",
"env.enable_rendering() \n",
"for tt in range(100): \n",
" action = Q_vi[state, :].argmax()\n",
" next_state, reward, is_terminal, info = env.step(action)\n",
" if is_terminal:\n",
" break\n",
" state = next_state\n",
"\n",
"# save video (run last cell to visualize it!)\n",
"env.save_video('./videos/value_iteration_policy.mp4', framerate=10)\n",
"# clear rendering data\n",
"env.clear_render_buffer()\n",
"env.disable_rendering()\n",
"# see video\n",
"show_video(filename='./videos/value_iteration_policy.mp4')"
]
},
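{
"cell_type": "markdown",
"metadata": {},
"source": [
"For step 3 of the value-iteration exercise, one way to evaluate the policy beyond watching the video is to estimate its average discounted return with Monte-Carlo rollouts. The sketch below assumes `Q_vi` has been filled in by your `value_iteration`, and it only uses the `reset`/`step` interface already used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_policy(env, Q, gamma=0.99, n_episodes=50, horizon=100):\n",
"    \"\"\"Estimate the average discounted return of the greedy policy w.r.t. Q.\"\"\"\n",
"    returns = np.zeros(n_episodes)\n",
"    for ep in range(n_episodes):\n",
"        state = env.reset()\n",
"        discount = 1.0\n",
"        for tt in range(horizon):\n",
"            action = Q[state, :].argmax()\n",
"            next_state, reward, is_terminal, info = env.step(action)\n",
"            returns[ep] += discount * reward\n",
"            discount *= gamma\n",
"            if is_terminal:\n",
"                break\n",
"            state = next_state\n",
"    return returns.mean()\n",
"\n",
"print(\"estimated value of the value-iteration policy:\", evaluate_policy(env, Q_vi))"
]
},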
{
"cell_type": "markdown",
"metadata": {
"id": "1Uw6LVyVulOX"
},
"source": [
"# Implementing Q-Learning\n",
"\n",
"Implement a function ``q_learning`` that takes as input an environment, runs Q learning for $T$ time steps and returns $Q_T$. \n",
"\n",
"Test different learning rates:\n",
" * $\\alpha_t(s, a) = \\frac{1}{\\text{number of visits to} (s, a)}$\n",
" * $\\alpha_t(s, a) =$ constant in $]0, 1[$\n",
" * others?\n",
"\n",
"Test different initializations of the Q function and try different values of $\\varepsilon$ in the $\\varepsilon$-greedy exploration!\n",
"\n",
"It might be very useful to plot the difference between the Q-learning approximation and the output of value iteration above, as a function of time.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "OrhUOlrfv6xp"
},
"outputs": [],
"source": [
"def q_learning(env, gamma=0.99, T=5000, Q_vi=None):\n",
" \"\"\"\n",
" Q_vi is the output of value iteration.\n",
" \"\"\"\n",
" S = env.observation_space.n\n",
" A = env.action_space.n \n",
" error = np.zeros(T)\n",
" Q = np.zeros((S, A)) # can we improve this initialization? \n",
"\n",
" state = env.reset()\n",
" # to complete...\n",
" for tt in range(T):\n",
" # choose action a_t\n",
" # ...\n",
" # take action, observe next state and reward \n",
" # ...\n",
" # compute delta_t\n",
" # ...\n",
" # update Q\n",
" # ...\n",
"\n",
" error[tt] = np.abs(Q-Q_vi).max()\n",
" \n",
" plt.plot(error)\n",
" plt.xlabel('iteration')\n",
" plt.title('Q-Learning error')\n",
" plt.show()\n",
" \n",
" return Q "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 718
},
"id": "fOetdWM4xhLt",
"outputId": "f755ca3f-86f1-4c48-ffe7-fa88d1dc68b3"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVq0lEQVR4nO3df5BlZX3n8fdnGX5lMcOvAYFhHFwmaw27WbQ6qFG3KEV+ZGPGctkNmionSpaYLO5G19VRawUxZcCYELPRWCzqsmoEQ0KcJBuRH7IxKkgPgjAiMAI64PBzAEEEBL77x30aL23Pz+7pO93P+1V1q895znPP/T5dt++nz3NOn05VIUnq1z8bdQGSpNEyCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSDMgyZIkjyTZZdS1SNvKINBOJclvJrk+yaNJ7krysSQLt/CcK5L81mzVOJWq+n5V7VVVT42yDml7GATaaST5b8BZwH8HFgIvAZYCX0qy6whLI8mCUb7+1pqqzm2tfa6MVTPHINBOIcnPA+8H3lpVX6yqn1TV7cB/BJ4PvGE79/vmJDcmeSDJxUmeN7TtI0nWJ/lhkjVJXjG07fQkFyb5TJIfAr/Zjjw+kOSrSR5O8qUk+7f+S5PUxIfo5vq27W9M8r0k9yf5H0luT3LMJsawe5IPJ/l+kruTfDzJnm3b0UnuSPKuJHcBn9pE7QcnWZ1kY5J1Sf7T5sa6Pd9rzV0GgXYWvwzsAfz1cGNVPQL8X+DYbd1hkhXAe4DXAYuArwCfG+pyNXAksC/wF8BfJtljaPsK4EJgb+Czre0NwJuAA4DdgHdspoQp+yZZDnwM+A3gIAZHP4dsZj9nAr/Qaj289X3f0PbntjE8DzhlE7WfD9wBHAycCHwwySu3MFZ1wiDQzmJ/4L6qenKKbRsYfJBvq7cAf1BVN7b9fhA4cuKooKo+U1X3V9WTVfVHwO7Avxx6/ter6m+q6umq+nFr+1RV3dzWP8/gw3lTNtX3ROBvq+qfquoJBh/qU970K0kYfLi/rao2VtXDbRwnDXV7Gjitqh4fqvOZ2hl8b18GvKuqHquqa4FzgTduYazqhEGgncV9wP6bmJ8+qG2nTYs80h7v2cI+nwd8JMmDSR4ENgKh/fad5B1t2uihtn0hgw/NCeun2OddQ8uPAntt5vU31ffg4X1X1aPA/ZvYxyLg54A1Q+P4Is8Oxnur6rFJzxuu/WBgIkQmfI9nH4VMNVZ1wiDQzuLrwOMMpnGekWQv4ATgCoCqeku7OmevqvrgFva5Hvjtqtp76LFnVX2tnQ94J4NzEPtU1d7AQwyCYsKOujXvBmDxxEqb799vE33vA34MHDE0hoVVNRxAU9U53PYDYN8kzxlqWwLcuYV9qBMGgXYKVfUQg5PF/zPJ8Ul2TbKUwZTKfWx53npBkj2GHrsCHwfeneQIgCQLk/yH1v85wJPAve257wN+fsYHNrULgdck+eUkuwGn8+wAekab2vlfwNlJDgBIckiS47b2xapqPfA14A/a9+YXgZOBz0xvGJovDALtNKrqQwxO7n4YeBi4jcG0yDFV9aMtPP3PGfzmPPH4VFVdxOBy1PPb1TA3MDi6ALiYwRTLzQymSR5jlqZHqmot8FYGJ3A3AI8A9zA4IprKu4B1wJVtHJfy7HMZW+P1DC7F/QFwEYNzCpduc/Gal+I/ptHOKsmbgDOAl1XV90ddz47Spr8eBJZV1W2jrkf98Q9HtNOqqk8leZLBpaXzKgiSvAa4jMGU0IeB64HbR1mT+uURgTQCSc5lcBlpgHHgd6vqptFWpV4ZBJLUOU8WS1Ln5uQ5gv3337+WLl066jIkaU5Zs2bNfVX1M3+lPyeDYOnSpYyPj4+6DEmaU5J8b6p2p4YkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMzEgRJjk9yU5J1SVZNsX33JBe07VclWTpp+5IkjyR5x0zUI0naetMOgiS7AB8FTgCWA69PsnxSt5OBB6rqcOBs4KxJ2/8Y+Ifp1iJJ2nYzcURwFLCuqm6tqieA84EVk/qsAM5ryxcCr0oSgCSvBW4D1s5ALZKkbTQTQXAIsH5o/Y7WNmWfqnoSeAjYL8lewLuA92/pRZKckmQ8yfi99947A2VLkmD0J4tPB86uqke21LGqzqmqsaoaW7Ro0Y6vTJI6sWAG9nEncOjQ+uLWNlWfO5IsABYC9wMvBk5M8iFgb+DpJI9V1Z/NQF2SpK0wE0FwNbAsyWEMPvBPAt4wqc9qYCXwdeBE4PKqKuAVEx2SnA48YghI0uyadhBU1ZNJTgUuBnYBPllVa5OcAYxX1WrgE8Cnk6wDNjIIC0nSTiCDX8znlrGxsRofHx91GZI0pyRZU1Vjk9tHfbJYkjRiBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUudmJAiSHJ/kpiTrkqyaYvvuSS5o269KsrS1vzrJmiTXt6+vnIl6JElbb9pBkGQX4KPACcBy4PVJlk/qdjLwQFUdDpwNnNXa7wNeU1X/GlgJfHq69UiSts1MHBEcBayrqlur6gngfGDFpD4rgPPa8oXAq5Kkqr5ZVT9o7WuBPZPsPgM1SZK20kwEwSHA+qH1O1rblH2q6kngIWC/SX3+PXBNVT0+AzVJkrbSglEXAJDkCAbTRcdups8pwCkAS5YsmaXKJGn+m4kjgjuBQ4fWF7e2KfskWQAsBO5v64uBi4A3VtV3N/UiVXVOVY1V1diiRYtmoGxJEsxMEFwNLEtyWJLdgJOA1ZP6rGZwMhjgRODyqqokewN/D6yqqq/OQC2SpG007SBoc/6nAhcDNwKfr6q1Sc5I8mut2yeA/ZKsA94OTFxieipwOPC+JNe2xwHTrUmStPVSVaOuYZuNjY3V+Pj4qMuQpDklyZqqGpvc7l8WS1LnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUuRkJgiTHJ7kpybokq6bYvnuSC9r2q5IsHdr27tZ+U5LjZqIeSdLWm3YQJNkF+ChwArAceH2S5ZO6nQw8UFWHA2cDZ7XnLgdOAo4Ajgc+1vYnSZolC2ZgH0cB66rqVoAk5wMrgG8P9VkBnN6WLwT+LEla+/lV9
ThwW5J1bX9fn4G6fsb7/3Ytdz302I7YtSTNio+c9EJ2WzCzs/ozEQSHAOuH1u8AXrypPlX1ZJKHgP1a+5WTnnvIVC+S5BTgFIAlS5ZsV6HrN/6Y72/80XY9V5J2BkXN+D5nIghmRVWdA5wDMDY2tl3fiXNXjs1oTZI0H8zE8cWdwKFD64tb25R9kiwAFgL3b+VzJUk70EwEwdXAsiSHJdmNwcnf1ZP6rAZWtuUTgcurqlr7Se2qosOAZcA3ZqAmSdJWmvbUUJvzPxW4GNgF+GRVrU1yBjBeVauBTwCfbieDNzIIC1q/zzM4sfwk8J+r6qnp1iRJ2noZ/GI+t4yNjdX4+Pioy5CkOSXJmqr6mZOl/mWxJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6ty0giDJvkkuSXJL+7rPJvqtbH1uSbKytf1ckr9P8p0ka5OcOZ1aJEnbZ7pHBKuAy6pqGXBZW3+WJPsCpwEvBo4CThsKjA9X1QuAFwIvS3LCNOuRJG2j6QbBCuC8tnwe8Nop+hwHXFJVG6vqAeAS4PiqerSqvgxQVU8A1wCLp1mPJGkbTTcIDqyqDW35LuDAKfocAqwfWr+jtT0jyd7AaxgcVUiSZtGCLXVIcinw3Ck2vXd4paoqSW1rAUkWAJ8D/rSqbt1Mv1OAUwCWLFmyrS8jSdqELQZBVR2zqW1J7k5yUFVtSHIQcM8U3e4Ejh5aXwxcMbR+DnBLVf3JFuo4p/VlbGxsmwNHkjS16U4NrQZWtuWVwBem6HMxcGySfdpJ4mNbG0l+H1gI/N4065AkbafpBsGZwKuT3AIc09ZJMpbkXICq2gh8ALi6Pc6oqo1JFjOYXloOXJPk2iS/Nc16JEnbKFVzb5ZlbGysxsfHR12GJM0pSdZU1djkdv+yWJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzk0rCJLsm+SSJLe0r/tsot/K1ueWJCun2L46yQ3TqUWStH2me0SwCrisqpYBl7X1Z0myL3Aa8GLgKOC04cBI8jrgkWnWIUnaTtMNghXAeW35POC1U/Q5DrikqjZW1QPAJcDxAEn2At4O/P4065AkbafpBsGBVbWhLd8FHDhFn0OA9UPrd7Q2gA8AfwQ8uqUXSnJKkvEk4/fee+80SpYkDVuwpQ5JLgWeO8Wm9w6vVFUlqa194SRHAv+iqt6WZOmW+lfVOcA5AGNjY1v9OpKkzdtiEFTVMZvaluTuJAdV1YYkBwH3TNHtTuDoofXFwBXAS4GxJLe3Og5IckVVHY0kadZMd2poNTBxFdBK4AtT9LkYODbJPu0k8bHAxVX151V1cFUtBV4O3GwISNLsm24QnAm8OsktwDFtnSRjSc4FqKqNDM4FXN0eZ7Q2SdJOIFVzb7p9bGysxsfHR12GJM0pSdZU1djkdv+yWJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1LlU1ahr2GZJ7gW+t51P3x+4bwbLmQsccx96G3Nv44Xpj/l5VbVocuOcDILpSDJeVWOjrmM2OeY+9Dbm3sYLO27MTg1JUucMAknqXI9BcM6oCxgBx9yH3sbc23hhB425u3MEkqRn6/GIQJI0xCCQpM51EwRJjk9yU5J1SVaNup7pSPLJJPckuWGobd8klyS5pX3dp7UnyZ+2cX8ryYuGnrOy9b8lycpRjGVrJTk0yZeTfDvJ2iT/tbXP23En2SPJN5Jc18b8/tZ+WJKr2tguSLJba9+9ra9r25cO7evdrf2mJMeNZkRbJ8kuSb6Z5O/a+rweL0CS25Ncn+TaJOOtbfbe21U17x/ALsB3gecDuwHXActHXdc0xvNvgRcBNwy1fQhY1ZZXAWe15V8B/gEI8BLgqta+L3Br+7pPW95n1GPbzJgPAl7Ulp8D3Awsn8/jbrXv1ZZ3Ba5qY/k8cFJr/zjwO235d4GPt+WTgAva8vL2nt8dOKz9LOwy6vFtZtxvB/4C+Lu2Pq/H22q+Hdh/Utusvbd7OSI4ClhXVbdW1RPA+cCKEde03arqH4GNk5pXAOe15fOA1w61/58auBLYO8lBwHHAJVW1saoeAC4Bjt/x1W+fqtpQVde05YeBG4FDmMfjbrU/0lZ3bY8CXglc2Nonj3nie3Eh8Kokae3nV9XjVXUbsI7Bz8ROJ8li4N8B57b1MI/HuwWz9t7uJQgOAdYPrd/R2uaTA6tqQ1u+CziwLW9q7HP2e9KmAF7I4DfkeT3uNk1yLXAPgx/s7wIPVtWTrctw/c+MrW1/CNiPuTXmPwHeCTzd1vdjfo93QgFfSrImySmtbdbe2wu2t2rtvKqqkszL64KT7AX8FfB7VfXDwS+AA/Nx3FX1FHBkkr2Bi4AXjLikHSbJrwL3VNWaJEePup5Z9vKqujPJAcAlSb4zvHFHv7d7OSK4Ezh0aH1xa5tP7m6Hh7Sv97T2TY19zn1PkuzKIAQ+W1V/3Zrn/bgBqupB4MvASxlMBUz8Ejdc/zNja9sXAvczd8b8MuDXktzOYPr2lcBHmL/jfUZV3dm+3sMg8I9iFt/bvQTB1cCydvXBbgxOLK0ecU0zbTUwcZXASuALQ+1vbFcavAR4qB1uXgwcm2SfdjXCsa1tp9Tmfj8B3FhVfzy0ad6OO8midiRAkj2BVzM4N/Jl4MTWbfKYJ74XJwKX1+As4mrgpHaVzWHAMuAbszOKrVdV766qxVW1lMHP6OVV9RvM0/FOSPLPkzxnYpnBe/IGZvO9Peqz5bP1YHCm/WYGc6zvHXU90xzL54ANwE8YzAOezGBu9DLgFuBSYN/WN8BH27ivB8aG9vNmBifS1gFvGvW4tjDmlzOYR/0WcG17/Mp8Hjfwi8A325hvAN7X2p/P4INtHfCXwO6tfY+2vq5tf/7Qvt7bvhc3ASeMemxbMfaj+elVQ/N6vG1817XH2onPp9l8b3uLCUnqXC9TQ5KkTTAIJKlzBoEkdc4gkKTOGQSS1DmDQF1L8rX2dWmSN8zwvt8z1WtJOxsvH5WAdkuDd1TVr27DcxbUT++BM9X2R6pqr5moT9qRPCJQ15JM3N3zTOAV7X7wb2s3e/vDJFe3e77/dut/dJKvJFkNfLu1/U27WdjaiRuGJTkT2LPt77PDr9X+IvQPk9zQ7kH/60P7viLJhUm+k+Sz
Gb6ZkrSDeNM5aWAVQ0cE7QP9oar6pSS7A19N8qXW90XAv6rBLY4B3lxVG9ttIK5O8ldVtSrJqVV15BSv9TrgSODfAPu35/xj2/ZC4AjgB8BXGdx/559mfrjST3lEIE3tWAb3c7mWwe2u92NwzxqAbwyFAMB/SXIdcCWDm34tY/NeDnyuqp6qqruB/wf80tC+76iqpxncRmPpjIxG2gyPCKSpBXhrVT3rpl3tXMKPJq0fA7y0qh5NcgWDe+Bsr8eHlp/Cn1HNAo8IpIGHGfwLzAkXA7/Tbn1Nkl9od4acbCHwQAuBFzD414ETfjLx/Em+Avx6Ow+xiMG/Ht1p746p+c/fNqSBbwFPtSme/83gPvhLgWvaCdt7+em/Chz2ReAtSW5kcKfLK4e2nQN8K8k1Nbid8oSLGPxfgesY3FH1nVV1VwsSadZ5+agkdc6pIUnqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOvf/AQ/Xfo538TV8AAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"videos/q_learning_policy.mp4\n"
]
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"Q_ql = q_learning(env, Q_vi=Q_vi)\n",
"\n",
"# Following Q-Learning policy \n",
"state = env.reset() \n",
"env.enable_rendering() \n",
"for tt in range(100): \n",
" action = Q_ql[state, :].argmax()\n",
" next_state, reward, is_terminal, info = env.step(action)\n",
" if is_terminal:\n",
" break\n",
" state = next_state\n",
"\n",
"# save video (run last cell to visualize it!)\n",
"env.save_video('./videos/q_learning_policy.mp4', framerate=10)\n",
"# clear rendering data\n",
"env.clear_render_buffer()\n",
"env.disable_rendering()\n",
"# see video\n",
"show_video(filename='./videos/q_learning_policy.mp4')"
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyM+8H1rbTADo1Hh3m1E+mXQ",
"collapsed_sections": [],
"include_colab_link": true,
"name": "Tutorial - Value Iteration and Q-Learning.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 1
}