{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "Io_4iovMTlzT" }, "source": [ "# Tutorial - Value Iteration and Q-Learning\n", "---------------------------------\n", "\n", "In this tutorial, you will:\n", "\n", "* Implement the value iteration algorithm to approximate the value function when *a model of the environment is available*.\n", "* Implement the Q-Learning algorithm to approximate the value function when *the model is unknown*, that is, the agent must learn through interactions.\n", "\n", "We start with a short review of these algorithms.\n", "\n", "\n", "## Markov decision processes and value functions\n", "\n", "In reinforcement learning, an agent interacts with an enviroment by taking actions and observing rewards. Its goal is to learn a *policy*, that is, a mapping from states to actions, that maximizes the amount of reward it gathers.\n", "\n", "The enviroment is modeled as a __Markov decision process (MDP)__, defined by a set of states $\\mathcal{S}$, a set of actions $\\mathcal{A}$, a reward function $r(s, a)$ and transition probabilities $P(s'|s,a)$. When an agent takes action $a$ in state $s$, it receives a random reward with mean $r(s,a)$ and makes a transion to a state $s'$ distributed according to $P(s'|s,a)$.\n", "\n", "A __policy__ $\\pi$ is such that $\\pi(a|s)$ gives the probability of choosing an action $a$ in state $s$. __If the policy is deterministic__, we denote by $\\pi(s)$ the action that it chooses in state $s$. We are interested in finding a policy that maximizes the value function $V^\\pi$, defined as \n", "\n", "$$\n", "V^\\pi(s) = \\sum_{a\\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a), \n", "\\quad \\text{where} \\quad \n", "Q^\\pi(s, a) = \\mathbf{E}\\left[ \\sum_{t=0}^\\infty \\gamma^t r(S_t, A_t) \\Big| S_0 = s, A_0 = a\\right].\n", "$$\n", "and represents the mean of the sum of discounted rewards gathered by the policy $\\pi$ in the MDP, where $\\gamma \\in [0, 1[$ is a discount factor ensuring the convergence of the sum. \n", "\n", "The __action-value function__ $Q^\\pi$ is the __fixed point of the Bellman operator $T^\\pi$__:\n", "\n", "$$ \n", "Q^\\pi(s, a) = T^\\pi Q^\\pi(s, a)\n", "$$\n", "where, for any function $f: \\mathcal{S}\\times\\mathcal{A} \\to \\mathbb{R}$\n", "$$\n", "T^\\pi f(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s'|s,a) \\left(\\sum_{a'}\\pi(a'|s')f(s',a')\\right) \n", "$$\n", "\n", "\n", "The __optimal value function__, defined as $V^*(s) = \\max_\\pi V^\\pi(s)$ can be shown to satisfy $V^*(s) = \\max_a Q^*(s, a)$, where $Q^*$ is the __fixed point of the optimal Bellman operator $T^*$__: \n", "\n", "$$ \n", "Q^*(s, a) = T^* Q^*(s, a)\n", "$$\n", "where, for any function $f: \\mathcal{S}\\times\\mathcal{A} \\to \\mathbb{R}$\n", "$$\n", "T^* f(s, a) = r(s, a) + \\gamma \\sum_{s'} P(s'|s,a) \\max_{a'} f(s', a')\n", "$$\n", "and there exists an __optimal policy__ which is deterministic, given by $\\pi^*(s) \\in \\arg\\max_a Q^*(s, a)$.\n", "\n", "\n", "## Value iteration\n", "\n", "If both the reward function $r$ and the transition probablities $P$ are known, we can compute $Q^*$ using value iteration, which proceeds as follows:\n", "\n", "1. Start with arbitrary $Q_0$, set $t=0$.\n", "2. Compute $Q_{t+1}(s, a) = T^*Q_t(s,a)$ for every $(s, a)$.\n", "3. If $\\max_{s,a} | Q_{t+1}(s, a) - Q_t(s,a)| \\leq \\varepsilon$, return $Q_{t}$. Otherwise, set $t \\gets t+1$ and go back to 2. 
\n", "\n", "The convergence is guaranteed by the contraction property of the Bellman operator, and $Q_{t+1}$ can be shown to be a good approximation of $Q^*$ for small epsilon. \n", "\n", "__Question__: Can you bound the error $\\max_{s,a} | Q^*(s, a) - Q_t(s,a)|$ as a function of $\\gamma$ and $\\varepsilon$?\n", "\n", "## Q-Learning\n", "\n", "In value iteration, we need to know $r$ and $P$ to implement the Bellman operator. When these quantities are not available, we can approximate $Q^*$ using *samples* from the environment with the Q-Learning algorithm.\n", "\n", "Q-Learning with __$\\varepsilon$-greedy exploration__ proceeds as follows:\n", "\n", "1. Start with arbitrary $Q_0$, get starting state $s_0$, set $t=0$.\n", "2. Choosing action $a_t$: \n", " * With probability $\\varepsilon$ choose $a_t$ randomly (uniform distribution) \n", " * With probability $1-\\varepsilon$, choose $a_t \\in \\arg\\max_a Q_t(s_t, a)$.\n", "3. Take action $a_t$, observe next state $s_{t+1}$ and reward $r_t$.\n", "4. Compute error $\\delta_t = r_t + \\gamma \\max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$.\n", "5. Update \n", " * $Q_{t+1}(s, a) = Q_t(s, a) + \\alpha_t(s,a) \\delta_t$, __if $s=s_t$ and $a=a_t$__\n", " * $Q_{t+1}(s, a) = Q_{t}(s, a)$ otherwise.\n", "\n", "Here, $\\alpha_t(s,a)$ is a learning rate that can depend, for instance, on the number of times the algorithm has visited the state-action pair $(s, a)$. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "KYq9-63OR8RW" }, "source": [ "# Colab setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AxepTGrNR3DX", "outputId": "42376421-d387-42a8-a943-0d1c5b5b3db0" }, "outputs": [], "source": [ "if 'google.colab' in str(get_ipython()):\n", " print(\"Installing packages, please wait a few moments. Restart the runtime after the installation.\")\n", "\n", " # install rlberry library\n", " !pip install git+https://github.com/rlberry-py/rlberry.git@v0.3.0#egg=rlberry[default] > /dev/null 2>&1\n", "\n", " # packages required to show video\n", " !pip install pyvirtualdisplay > /dev/null 2>&1\n", " !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3_bPhqKlSiF0", "outputId": "959689cb-1e62-41f3-c1ac-71741bd5bb48" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create directory for saving videos\n", "!mkdir videos > /dev/null 2>&1\n", "\n", "# The following code is will be used to visualize the environments.\n", "import base64\n", "from pyvirtualdisplay import Display\n", "from IPython import display as ipythondisplay\n", "from IPython.display import clear_output\n", "from pathlib import Path\n", "\n", "def show_video(filename=None, directory='./videos'):\n", " \"\"\"\n", " Either show all videos in a directory (if filename is None) or \n", " show video corresponding to filename.\n", " \"\"\"\n", " html = []\n", " if filename is not None:\n", " files = Path('./').glob(filename)\n", " else:\n", " files = Path(directory).glob(\"*.mp4\")\n", " for mp4 in files:\n", " print(mp4)\n", " video_b64 = base64.b64encode(mp4.read_bytes())\n", " html.append(''''''.format(mp4, video_b64.decode('ascii')))\n", " ipythondisplay.display(ipythondisplay.HTML(data=\"
\".join(html)))\n", " \n", "from pyvirtualdisplay import Display\n", "display = Display(visible=0, size=(800, 800))\n", "display.start()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "ZYZCXMpisE_O" }, "outputs": [], "source": [ "# other required libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "zOPiAupGmkxh" }, "source": [ "# Warm up: interacting with a reinforcement learning environment" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 578 }, "id": "6IZ0bVAlTjpZ", "outputId": "60cf10f4-8f13-4264-c281-1194beff4c1d" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[INFO] OpenGL_accelerate module loaded \n", "[INFO] Using accelerated ArrayDatatype \n", "[INFO] Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt \n", "[INFO] Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "number of states = 13\n", "number of actions = 4\n", "transition probabilities from state 0 by taking action 1: [0. 0.9 0. 0. 0.1 0. 0. 0. 0. 0. 0. 0. 0. ]\n", "mean reward in state 0 for action 1 = 0.0\n", "videos/random_policy.mp4\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from rlberry.envs import GridWorld\n", "\n", "# A GridWorld is an environment where an agent moves in a 2d grid and aims to reach the state which gives a reward.\n", "env = GridWorld(nrows=3, ncols=5, walls=((0,2),(1, 2)), success_probability=0.9)\n", "\n", "# Number of states and actions\n", "print(\"number of states = \", env.observation_space.n)\n", "print(\"number of actions = \", env.action_space.n)\n", "\n", "# Transitions probabilities, env.P[s, a, s'] = P(s'|s, a)\n", "print(\"transition probabilities from state 0 by taking action 1: \", env.P[0, 1, :])\n", "\n", "# Reward function: env.R[s, a] = r(s, a)\n", "print(\"mean reward in state 0 for action 1 = \", env.R[0, 1])\n", "\n", "# Following a random policy \n", "state = env.reset() # initial state \n", "env.enable_rendering() # save states for visualization\n", "for tt in range(100): # interact for 100 time steps\n", " action = env.action_space.sample() # random action, a good RL agent must have a better strategy!\n", " next_state, reward, is_terminal, info = env.step(action)\n", " if is_terminal:\n", " break\n", " state = next_state\n", "\n", "# save video \n", "env.save_video('./videos/random_policy.mp4', framerate=10)\n", "# clear rendering data\n", "env.clear_render_buffer()\n", "env.disable_rendering()\n", "# see video\n", "show_video(filename='./videos/random_policy.mp4')" ] }, { "cell_type": "markdown", "metadata": { "id": "snmFW5Bzqpwj" }, "source": [ "# Implementing Value Iteration\n", "\n", "1. Write a function ``bellman_operator`` that takes as input a function $Q$ and returns $T^* Q$.\n", "2. Write a function ``value_iteration`` that returns a function $Q$ such that $||Q-T^* Q||_\\infty \\leq \\varepsilon$\n", "3. Evaluate the performance of the policy $\\pi(s) = \\arg\\max_a Q(s, a)$, where Q is returned by ``value_iteration``." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "RPIOmpjkq0YX" }, "outputs": [], "source": [ "def bellman_operator(Q, env, gamma=0.99):\n", " S = env.observation_space.n\n", " A = env.action_space.n \n", " TQ = np.zeros((S, A))\n", "\n", " # to complete...\n", "\n", " return TQ" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "tEKAtA1LsYFx" }, "outputs": [], "source": [ "def value_iteration(env, gamma=0.99, epsilon=1e-6):\n", " S = env.observation_space.n\n", " A = env.action_space.n \n", " Q = np.zeros((S, A))\n", "\n", " # to complete...\n", "\n", " return Q" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 440 }, "id": "rZ7k-rDLssSk", "outputId": "7731f953-093d-4c3b-e84f-1b356eb892c3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "videos/value_iteration_policy.mp4\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Q_vi = value_iteration(env)\n", "\n", "# Following value iteration policy \n", "state = env.reset() \n", "env.enable_rendering() \n", "for tt in range(100): \n", " action = Q_vi[state, :].argmax()\n", " next_state, reward, is_terminal, info = env.step(action)\n", " if is_terminal:\n", " break\n", " state = next_state\n", "\n", "# save video (run last cell to visualize it!)\n", "env.save_video('./videos/value_iteration_policy.mp4', framerate=10)\n", "# clear rendering data\n", "env.clear_render_buffer()\n", "env.disable_rendering()\n", "# see video\n", "show_video(filename='./videos/value_iteration_policy.mp4')" ] }, { "cell_type": "markdown", "metadata": { "id": "1Uw6LVyVulOX" }, "source": [ "# Implementing Q-Learning\n", "\n", "Implement a function ``q_learning`` that takes as input an environment, runs Q learning for $T$ time steps and returns $Q_T$. \n", "\n", "Test different learning rates:\n", " * $\\alpha_t(s, a) = \\frac{1}{\\text{number of visits to} (s, a)}$\n", " * $\\alpha_t(s, a) =$ constant in $]0, 1[$\n", " * others?\n", "\n", "Test different initializations of the Q function and try different values of $\\varepsilon$ in the $\\varepsilon$-greedy exploration!\n", "\n", "It might be very useful to plot the difference between the Q-learning approximation and the output of value iteration above, as a function of time.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "OrhUOlrfv6xp" }, "outputs": [], "source": [ "def q_learning(env, gamma=0.99, T=5000, Q_vi=None):\n", " \"\"\"\n", " Q_vi is the output of value iteration.\n", " \"\"\"\n", " S = env.observation_space.n\n", " A = env.action_space.n \n", " error = np.zeros(T)\n", " Q = np.zeros((S, A)) # can we improve this initialization? 
\n", "\n", " state = env.reset()\n", " # to complete...\n", " for tt in range(T):\n", " # choose action a_t\n", " # ...\n", " # take action, observe next state and reward \n", " # ...\n", " # compute delta_t\n", " # ...\n", " # update Q\n", " # ...\n", "\n", " error[tt] = np.abs(Q-Q_vi).max()\n", " \n", " plt.plot(error)\n", " plt.xlabel('iteration')\n", " plt.title('Q-Learning error')\n", " plt.show()\n", " \n", " return Q " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 718 }, "id": "fOetdWM4xhLt", "outputId": "f755ca3f-86f1-4c48-ffe7-fa88d1dc68b3" }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVq0lEQVR4nO3df5BlZX3n8fdnGX5lMcOvAYFhHFwmaw27WbQ6qFG3KEV+ZGPGctkNmionSpaYLO5G19VRawUxZcCYELPRWCzqsmoEQ0KcJBuRH7IxKkgPgjAiMAI64PBzAEEEBL77x30aL23Pz+7pO93P+1V1q895znPP/T5dt++nz3NOn05VIUnq1z8bdQGSpNEyCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSDMgyZIkjyTZZdS1SNvKINBOJclvJrk+yaNJ7krysSQLt/CcK5L81mzVOJWq+n5V7VVVT42yDml7GATaaST5b8BZwH8HFgIvAZYCX0qy6whLI8mCUb7+1pqqzm2tfa6MVTPHINBOIcnPA+8H3lpVX6yqn1TV7cB/BJ4PvGE79/vmJDcmeSDJxUmeN7TtI0nWJ/lhkjVJXjG07fQkFyb5TJIfAr/Zjjw+kOSrSR5O8qUk+7f+S5PUxIfo5vq27W9M8r0k9yf5H0luT3LMJsawe5IPJ/l+kruTfDzJnm3b0UnuSPKuJHcBn9pE7QcnWZ1kY5J1Sf7T5sa6Pd9rzV0GgXYWvwzsAfz1cGNVPQL8X+DYbd1hkhXAe4DXAYuArwCfG+pyNXAksC/wF8BfJtljaPsK4EJgb+Czre0NwJuAA4DdgHdspoQp+yZZDnwM+A3gIAZHP4dsZj9nAr/Qaj289X3f0PbntjE8DzhlE7WfD9wBHAycCHwwySu3MFZ1wiDQzmJ/4L6qenKKbRsYfJBvq7cAf1BVN7b9fhA4cuKooKo+U1X3V9WTVfVHwO7Avxx6/ter6m+q6umq+nFr+1RV3dzWP8/gw3lTNtX3ROBvq+qfquoJBh/qU970K0kYfLi/rao2VtXDbRwnDXV7Gjitqh4fqvOZ2hl8b18GvKuqHquqa4FzgTduYazqhEGgncV9wP6bmJ8+qG2nTYs80h7v2cI+nwd8JMmDSR4ENgKh/fad5B1t2uihtn0hgw/NCeun2OddQ8uPAntt5vU31ffg4X1X1aPA/ZvYxyLg54A1Q+P4Is8Oxnur6rFJzxuu/WBgIkQmfI9nH4VMNVZ1wiDQzuLrwOMMpnGekWQv4ATgCoCqeku7OmevqvrgFva5Hvjtqtp76LFnVX2tnQ94J4NzEPtU1d7AQwyCYsKOujXvBmDxxEqb799vE33vA34MHDE0hoVVNRxAU9U53PYDYN8kzxlqWwLcuYV9qBMGgXYKVfUQg5PF/zPJ8Ul2TbKUwZTKfWx53npBkj2GHrsCHwfeneQIgCQLk/yH1v85wJPAve257wN+fsYHNrULgdck+eUkuwGn8+wAekab2vlfwNlJDgBIckiS47b2xapqPfA14A/a9+YXgZOBz0xvGJovDALtNKrqQwxO7n4YeBi4jcG0yDFV9aMtPP3PGfzmPPH4VFVdxOBy1PPb1TA3MDi6ALiYwRTLzQymSR5jlqZHqmot8FYGJ3A3AI8A9zA4IprKu4B1wJVtHJfy7HMZW+P1DC7F/QFwEYNzCpduc/Gal+I/ptHOKsmbgDOAl1XV90ddz47Spr8eBJZV1W2jrkf98Q9HtNOqqk8leZLBpaXzKgiSvAa4jMGU0IeB64HbR1mT+uURgTQCSc5lcBlpgHHgd6vqptFWpV4ZBJLUOU8WS1Ln5uQ5gv3337+WLl066jIkaU5Zs2bNfVX1M3+lPyeDYOnSpYyPj4+6DEmaU5J8b6p2p4YkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMzEgRJjk9yU5J1SVZNsX33JBe07VclWTpp+5IkjyR5x0zUI0naetMOgiS7AB8FTgCWA69PsnxSt5OBB6rqcOBs4KxJ2/8Y+Ifp1iJJ2nYzcURwFLCuqm6tqieA84EVk/qsAM5ryxcCr0oSgCSvBW4D1s5ALZKkbTQTQXAIsH5o/Y7WNmWfqnoSeAjYL8lewLuA92/pRZKckmQ8yfi99947A2VLkmD0J4tPB86uqke21LGqzqmqsaoaW7Ro0Y6vTJI6sWAG9nEncOjQ+uLWNlWfO5IsABYC9wMvBk5M8iFgb+DpJI9V1Z/NQF2SpK0wE0FwNbAsyWEMPvBPAt4wqc9qYCXwdeBE4PKqKuAVEx2SnA48YghI0uyadhBU1ZNJTgUuBnYBPllVa5OcAYxX1WrgE8Cnk6wDNjIIC0nSTiCDX8znlrGxsRofHx91GZI0pyRZU1Vjk9tHfbJYkjRiBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUudmJAiSHJ/kpiTrkqyaYvvuSS5o269KsrS1vzrJmiTXt6+vnIl6JElbb9pBkGQX4KPACcBy4PVJlk/qdjLwQFUdDpwNnNXa7wNeU1X/GlgJfHq69UiSts1MHBEcBayrqlur6gngfGDFpD4rgPPa8oXAq5Kkqr5ZVT9o7WuBPZPsPgM
1SZK20kwEwSHA+qH1O1rblH2q6kngIWC/SX3+PXBNVT0+AzVJkrbSglEXAJDkCAbTRcdups8pwCkAS5YsmaXKJGn+m4kjgjuBQ4fWF7e2KfskWQAsBO5v64uBi4A3VtV3N/UiVXVOVY1V1diiRYtmoGxJEsxMEFwNLEtyWJLdgJOA1ZP6rGZwMhjgRODyqqokewN/D6yqqq/OQC2SpG007SBoc/6nAhcDNwKfr6q1Sc5I8mut2yeA/ZKsA94OTFxieipwOPC+JNe2xwHTrUmStPVSVaOuYZuNjY3V+Pj4qMuQpDklyZqqGpvc7l8WS1LnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUuRkJgiTHJ7kpybokq6bYvnuSC9r2q5IsHdr27tZ+U5LjZqIeSdLWm3YQJNkF+ChwArAceH2S5ZO6nQw8UFWHA2cDZ7XnLgdOAo4Ajgc+1vYnSZolC2ZgH0cB66rqVoAk5wMrgG8P9VkBnN6WLwT+LEla+/lV9ThwW5J1bX9fn4G6fsb7/3Ytdz302I7YtSTNio+c9EJ2WzCzs/ozEQSHAOuH1u8AXrypPlX1ZJKHgP1a+5WTnnvIVC+S5BTgFIAlS5ZsV6HrN/6Y72/80XY9V5J2BkXN+D5nIghmRVWdA5wDMDY2tl3fiXNXjs1oTZI0H8zE8cWdwKFD64tb25R9kiwAFgL3b+VzJUk70EwEwdXAsiSHJdmNwcnf1ZP6rAZWtuUTgcurqlr7Se2qosOAZcA3ZqAmSdJWmvbUUJvzPxW4GNgF+GRVrU1yBjBeVauBTwCfbieDNzIIC1q/zzM4sfwk8J+r6qnp1iRJ2noZ/GI+t4yNjdX4+Pioy5CkOSXJmqr6mZOl/mWxJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6ty0giDJvkkuSXJL+7rPJvqtbH1uSbKytf1ckr9P8p0ka5OcOZ1aJEnbZ7pHBKuAy6pqGXBZW3+WJPsCpwEvBo4CThsKjA9X1QuAFwIvS3LCNOuRJG2j6QbBCuC8tnwe8Nop+hwHXFJVG6vqAeAS4PiqerSqvgxQVU8A1wCLp1mPJGkbTTcIDqyqDW35LuDAKfocAqwfWr+jtT0jyd7AaxgcVUiSZtGCLXVIcinw3Ck2vXd4paoqSW1rAUkWAJ8D/rSqbt1Mv1OAUwCWLFmyrS8jSdqELQZBVR2zqW1J7k5yUFVtSHIQcM8U3e4Ejh5aXwxcMbR+DnBLVf3JFuo4p/VlbGxsmwNHkjS16U4NrQZWtuWVwBem6HMxcGySfdpJ4mNbG0l+H1gI/N4065AkbafpBsGZwKuT3AIc09ZJMpbkXICq2gh8ALi6Pc6oqo1JFjOYXloOXJPk2iS/Nc16JEnbKFVzb5ZlbGysxsfHR12GJM0pSdZU1djkdv+yWJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1DmDQJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzk0rCJLsm+SSJLe0r/tsot/K1ueWJCun2L46yQ3TqUWStH2me0SwCrisqpYBl7X1Z0myL3Aa8GLgKOC04cBI8jrgkWnWIUnaTtMNghXAeW35POC1U/Q5DrikqjZW1QPAJcDxAEn2At4O/P4065AkbafpBsGBVbWhLd8FHDhFn0OA9UPrd7Q2gA8AfwQ8uqUXSnJKkvEk4/fee+80SpYkDVuwpQ5JLgWeO8Wm9w6vVFUlqa194SRHAv+iqt6WZOmW+lfVOcA5AGNjY1v9OpKkzdtiEFTVMZvaluTuJAdV1YYkBwH3TNHtTuDoofXFwBXAS4GxJLe3Og5IckVVHY0kadZMd2poNTBxFdBK4AtT9LkYODbJPu0k8bHAxVX151V1cFUtBV4O3GwISNLsm24QnAm8OsktwDFtnSRjSc4FqKqNDM4FXN0eZ7Q2SdJOIFVzb7p9bGysxsfHR12GJM0pSdZU1djkdv+yWJI6ZxBIUucMAknqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOmcQSFLnDAJJ6pxBIEmdMwgkqXMGgSR1ziCQpM4ZBJLUOYNAkjpnEEhS5wwCSeqcQSBJnTMIJKlzBoEkdc4gkKTOGQSS1LlU1ahr2GZJ7gW+t51P3x+4bwbLmQsccx96G3Nv44Xpj/l5VbVocuOcDILpSDJeVWOjrmM2OeY+9Dbm3sYLO27MTg1JUucMAknqXI9BcM6oCxgBx9yH3sbc23hhB425u3MEkqRn6/GIQJI0xCCQpM51EwRJjk9yU5J1SVaNup7pSPLJJPckuWGobd8klyS5pX3dp7UnyZ+2cX8ryYuGnrOy9b8lycpRjGVrJTk0yZeTfDvJ2iT/tbXP23En2SPJN5Jc18b8/tZ+WJKr2tguSLJba9+9ra9r25cO7evdrf2mJMeNZkRbJ8kuSb6Z5O/a+rweL0CS25Ncn+TaJOOtbfbe21U17x/ALsB3gecDuwHXActHXdc0xvNvgRcBNwy1fQhY1ZZXAWe15V8B/gEI8BLgqta+L3Br+7pPW95n1GPbzJgPAl7Ulp8D3Awsn8/jbrXv1ZZ3Ba5qY/k8cFJr/zjwO235d4GPt+WTgAva8vL2nt8dOKz9LOwy6vFtZtxvB/4C+Lu2Pq/H22q+Hdh/Utusvbd7OSI4ClhXVbdW1RPA+cCKEde03arqH4GNk5pXAOe15fOA1w61/58auBLYO8lBwHHAJVW1saoeAC4Bjt/x1W+fqtpQVde05YeBG4FDmMfjbrU/0lZ3bY8CXglc2Nonj3nie3Eh8Kokae3nV9XjVXUbsI7Bz8ROJ8li4N8B57b1MI/HuwWz9t7uJQgOAdYPrd/R2uaTA6tqQ1u+CziwLW9q7HP2e9KmAF7I4DfkeT3uNk1yLXAPgx/s7wIPVtWTrctw/c+MrW1/CNiPuTXmPwHeCTzd1vdjfo93QgFfSrImySmtbdbe2wu2t2rtvKqqkszL64KT7AX8FfB7VfXDwS+AA/Nx3FX1FHBkkr2Bi4AXjLikHSbJrwL3VNWaJEePup5Z9vKqujPJAcAlSb4zvHFHv7d7OSK4Ezh0aH1xa5tP7m6Hh7Sv97T2TY19zn1PkuzKIAQ+W1V/3Zrn/bgBqupB4MvASxlMBUz8Ejdc/z
Nja9sXAvczd8b8MuDXktzOYPr2lcBHmL/jfUZV3dm+3sMg8I9iFt/bvQTB1cCydvXBbgxOLK0ecU0zbTUwcZXASuALQ+1vbFcavAR4qB1uXgwcm2SfdjXCsa1tp9Tmfj8B3FhVfzy0ad6OO8midiRAkj2BVzM4N/Jl4MTWbfKYJ74XJwKX1+As4mrgpHaVzWHAMuAbszOKrVdV766qxVW1lMHP6OVV9RvM0/FOSPLPkzxnYpnBe/IGZvO9Peqz5bP1YHCm/WYGc6zvHXU90xzL54ANwE8YzAOezGBu9DLgFuBSYN/WN8BH27ivB8aG9vNmBifS1gFvGvW4tjDmlzOYR/0WcG17/Mp8Hjfwi8A325hvAN7X2p/P4INtHfCXwO6tfY+2vq5tf/7Qvt7bvhc3ASeMemxbMfaj+elVQ/N6vG1817XH2onPp9l8b3uLCUnqXC9TQ5KkTTAIJKlzBoEkdc4gkKTOGQSS1DmDQF1L8rX2dWmSN8zwvt8z1WtJOxsvH5WAdkuDd1TVr27DcxbUT++BM9X2R6pqr5moT9qRPCJQ15JM3N3zTOAV7X7wb2s3e/vDJFe3e77/dut/dJKvJFkNfLu1/U27WdjaiRuGJTkT2LPt77PDr9X+IvQPk9zQ7kH/60P7viLJhUm+k+SzGb6ZkrSDeNM5aWAVQ0cE7QP9oar6pSS7A19N8qXW90XAv6rBLY4B3lxVG9ttIK5O8ldVtSrJqVV15BSv9TrgSODfAPu35/xj2/ZC4AjgB8BXGdx/559mfrjST3lEIE3tWAb3c7mWwe2u92NwzxqAbwyFAMB/SXIdcCWDm34tY/NeDnyuqp6qqruB/wf80tC+76iqpxncRmPpjIxG2gyPCKSpBXhrVT3rpl3tXMKPJq0fA7y0qh5NcgWDe+Bsr8eHlp/Cn1HNAo8IpIGHGfwLzAkXA7/Tbn1Nkl9od4acbCHwQAuBFzD414ETfjLx/Em+Avx6Ow+xiMG/Ht1p746p+c/fNqSBbwFPtSme/83gPvhLgWvaCdt7+em/Chz2ReAtSW5kcKfLK4e2nQN8K8k1Nbid8oSLGPxfgesY3FH1nVV1VwsSadZ5+agkdc6pIUnqnEEgSZ0zCCSpcwaBJHXOIJCkzhkEktQ5g0CSOvf/AQ/Xfo538TV8AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "videos/q_learning_policy.mp4\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Q_ql = q_learning(env, Q_vi=Q_vi)\n", "\n", "# Following Q-Learning policy \n", "state = env.reset() \n", "env.enable_rendering() \n", "for tt in range(100): \n", " action = Q_ql[state, :].argmax()\n", " next_state, reward, is_terminal, info = env.step(action)\n", " if is_terminal:\n", " break\n", " state = next_state\n", "\n", "# save video (run last cell to visualize it!)\n", "env.save_video('./videos/q_learning_policy.mp4', framerate=10)\n", "# clear rendering data\n", "env.clear_render_buffer()\n", "env.disable_rendering()\n", "# see video\n", "show_video(filename='./videos/q_learning_policy.mp4')" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyM+8H1rbTADo1Hh3m1E+mXQ", "collapsed_sections": [], "include_colab_link": true, "name": "Tutorial - Value Iteration and Q-Learning.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 1 }