{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Function Approximation (1) \n", "## Tile Coding in the Mountain Car problem\n", "\n", "In this notebook we will show the benefits of function approximation (FA) in the Mountain Car problem.\n", "\n", "The goal of the Mountain Car problem is to reach the top of a hill. The obvious solution of simply accelerating towards the goal does not work when starting from the bottom of the valley, because the engine is not strong enough: the car has to build up momentum by swinging back and forth.\n", "\n", "https://gym.openai.com/envs/MountainCar-v0/" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2018-04-12 18:01:49,008] Making new env: MountainCar-v0\n" ] } ], "source": [ "import gym\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "env = gym.make(\"MountainCar-v0\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in the previous notebook, we can obtain the action and state spaces:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actions : Discrete(3)\n", "Variables: Box(2,)\n", "Max. var: [ 0.6 0.07]\n", "Min. var: [-1.2 -0.07]\n" ] } ], "source": [ "print('Actions : ',env.action_space)\n", "print('Variables: ',env.observation_space)\n", "print('Max. var: ',env.observation_space.high)\n", "print('Min. var: ',env.observation_space.low)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variables correspond to the position on the x axis and the velocity. 
The actions correspond to pushing left (backward acceleration), not pushing, and pushing right (forward acceleration), respectively.\n", "\n", "Let's see the performance of random behaviour:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "observation = env.reset()\n", "for _ in range(300):\n", "    env.render()\n", "    action = env.action_space.sample() # this takes random actions\n", "    observation, reward, done, info = env.step(action)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For this problem we will define tile coding. The following class defines a tile coder. The state space is discretized using *numTilings* overlapping grids (tilings), each one with *tilesPerTiling x tilesPerTiling* tiles. The tilings overlap in the usual way: tiling *i* is shifted by *i/numTilings* of a tile width along every dimension.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class Tilecoder:\n", "\n", "    def __init__(self, numTilings, tilesPerTiling):\n", "        # Set max value for normalization of inputs\n", "        self.maxNormal = 1\n", "        self.maxVal = env.observation_space.high\n", "        self.minVal = env.observation_space.low\n", "        self.numTilings = numTilings\n", "        self.tilesPerTiling = tilesPerTiling\n", "        self.dim = len(self.maxVal)\n", "        self.numTiles = (self.tilesPerTiling**self.dim) * self.numTilings\n", "        self.actions = env.action_space.n\n", "        self.n = self.numTiles * self.actions\n", "        self.tileSize = np.divide(np.ones(self.dim)*self.maxNormal, self.tilesPerTiling-1)\n", "\n", "    def getFeatures(self, variables):\n", "        # Returns the index of the active tile in each tiling\n", "        # Ensures range is always between 0 and self.maxNormal\n", "        values = np.zeros(self.dim)\n", "        for i in range(self.dim):\n", "            values[i] = self.maxNormal * ((variables[i] - self.minVal[i])/(self.maxVal[i]-self.minVal[i]))\n", "        tileIndices = np.zeros(self.numTilings)\n", "        matrix = np.zeros([self.numTilings,self.dim])\n", "        for i in range(self.numTilings):\n", "            for i2 in range(self.dim):\n", "                # Tiling i is shifted by i/numTilings of a tile width\n", "                matrix[i,i2] = int(values[i2] / self.tileSize[i2] + i / self.numTilings)\n", "        for i in range(1,self.dim):\n", "            matrix[:,i] *= self.tilesPerTiling**i\n", "        for i in range(self.numTilings):\n", "            tileIndices[i] = (i * (self.tilesPerTiling**self.dim) + sum(matrix[i,:]))\n", "        return tileIndices\n", "\n", "    def oneHotVector(self, features, action):\n", "        # Binary vector of size self.n with a 1 for each active tile of the given action\n", "        oneHot = np.zeros(self.n)\n", "        for i in features:\n", "            index = int(i + (self.numTiles*action))\n", "            oneHot[index] = 1\n", "        return oneHot\n", "\n", "    def getVal(self, theta, features, action):\n", "        # Q-value of an action: sum of the weights of its active tiles\n", "        val = 0\n", "        for i in features:\n", "            index = int(i + (self.numTiles*action))\n", "            val += theta[index]\n", "        return val\n", "\n", "    def getQ(self, features, theta):\n", "        Q = np.zeros(self.actions)\n", "        for i in range(self.actions):\n", "            Q[i] = self.getVal(theta, features, i)\n", "        return Q\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q-learning with TileCoding\n", "\n", "Let's start by defining a function that implements the epsilon-greedy procedure." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def e_greedy_policy(Qs):\n", "    # With probability epsilon take a random action, otherwise the greedy one\n", "    return env.action_space.sample() if (np.random.random() <= epsilon) else np.argmax(Qs)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Definition of a function to collect the scores of an episode with a completely greedy policy. 
It is only used to compare scores." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def rollout(niter):\n", "    # Average return of niter episodes following the greedy policy\n", "    G = 0\n", "    for i in range(niter):\n", "        state = env.reset()\n", "        for _ in range(1000):\n", "            F = tile.getFeatures(state)\n", "            Q = tile.getQ(F, theta)\n", "            action = np.argmax(Q)\n", "            state, reward, done, info = env.step(action)\n", "            G += reward\n", "            if done: break\n", "    return G/niter\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we define a TileCoder with 7 tilings of 14x14 tiles and apply the Q-learning procedure" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average reward = -193.345\n", "Average reward = -160.68\n", "Average reward = -142.11\n", "Average reward = -129.25\n", "Average reward = -125.67\n", "Average reward = -131.57\n", "Average reward = -128.69\n", "Average reward = -122.05\n", "Average reward = -123.705\n", "Average reward = -125.395\n", "Average reward = -123.175\n", "Average reward = -122.74\n", "Average reward = -127.015\n", "Average reward = -122.975\n", "Average reward = -120.465\n" ] } ], "source": [ "tile = Tilecoder(7,14) # Definition of tiles: 7 tilings of 14x14 tiles\n", "theta = np.random.uniform(-0.001, 0, size=(tile.n)) # Parameters for FA: 7 x (14x14) = 1,372 tiles per action, 4,116 parameters in total\n", "\n", "# Parameters of learning\n", "alpha = 0.05\n", "gamma = 1\n", "numEpisodes = 3000\n", "epsilon = 0.05\n", "\n", "# Variables to collect scores\n", "rewardTracker = []\n", "rewardTracker2 = []\n", "episodeSum = 0\n", "counter = 0\n", "\n", "for episodeNum in range(1,numEpisodes+1):\n", "    G = 0\n", "    state = env.reset()\n", "    while True:\n", "        F = tile.getFeatures(state) # Indices of the active tiles (one per tiling) representing the state\n", "        Q = tile.getQ(F, theta) # Q-values of all actions for the given state\n", "        action = e_greedy_policy(Q) # select action with epsilon-greedy procedure\n", "        Qs = Q[action]\n", "        state2, reward, done, info = env.step(action)\n", "        G += reward\n", "        if done:\n", "            # Terminal state: the target is just the reward\n", "            theta += np.multiply((alpha*(reward - Qs)), tile.oneHotVector(F,action))\n", "            episodeSum += G\n", "            rewardTracker.append(G) # Store reward collected\n", "            rewardTracker2.append(rollout(1)) # Store reward collected TESTING with epsilon = 0\n", "            if episodeNum % 200 == 0:\n", "                print('Average reward = {}'.format(episodeSum / 200))\n", "                episodeSum = 0\n", "            break\n", "        Q = tile.getQ(tile.getFeatures(state2), theta)\n", "        # Q-learning update: the target is reward + gamma * max_a Q(state2, a)\n", "        theta += np.multiply((alpha*(reward - Qs+gamma*np.max(Q))), tile.oneHotVector(F,action))\n", "        state = state2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the learnt behaviour" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "state = env.reset()\n", "for _ in range(200):\n", "    env.render()\n", "    F = tile.getFeatures(state)\n", "    Q = tile.getQ(F, theta)\n", "    action = np.argmax(Q)\n", "    state, reward, done, info = env.step(action)\n", "    if done: break" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(rewardTracker)\n", "plt.plot(rewardTracker2)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def smoothen(data,window_width):\n", "    # Moving average over window_width episodes, computed with a cumulative sum\n", "    cumsum_vec = np.cumsum(np.insert(data, 0, 0))\n", "    return (cumsum_vec[window_width:] - cumsum_vec[:-window_width]) / window_width\n", "\n", "plt.plot(smoothen(rewardTracker2,100),label='greedy')\n", "plt.plot(smoothen(rewardTracker,100),label='e-greedy')\n", "\n", "plt.legend()\n", "plt.show()\n", "\n", "tiles_rew = rewardTracker2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just to be sure of the benefits of the approach, let's compare the learning performance without tile coding, using state aggregation on a single grid. To make the comparison fair, we will use approximately the same number of parameters: 1 tiling of 40x40 tiles, that is, 1,600 parameters per action (compared with 1,372 per action for the tile coder above)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -200.0\n", "Average reward = -199.835\n", "Average reward = -199.995\n" ] } ], "source": [ "tile = Tilecoder(1,40)\n", "theta = np.random.uniform(-0.001, 0, size=(tile.n))\n", "alpha = 0.05\n", "gamma = 1\n", "numEpisodes = 3000\n", "rewardTracker = []\n", "rewardTracker2 = []\n", "episodeSum = 0\n", "counter = 0\n", "epsilon = 0.05\n", "\n", "for episodeNum in range(1,numEpisodes+1):\n", "    G = 0\n", "    state = env.reset()\n", "    while True:\n", "        #env.render()\n", "        F = tile.getFeatures(state)\n", "        Q = tile.getQ(F, theta)\n", "        action = e_greedy_policy(Q)\n", "        #action = np.argmax(Q)\n", "        Qs = Q[action]\n", "        state2, reward, done, info = env.step(action)\n", "        G += reward\n", "        if done:\n", "            theta += np.multiply((alpha*(reward 
- Qs)), tile.oneHotVector(F,action))\n", "            episodeSum += G\n", "            rewardTracker.append(G)\n", "            rewardTracker2.append(rollout(1))\n", "            if episodeNum % 200 == 0:\n", "                print('Average reward = {}'.format(episodeSum / 200))\n", "                #rewardTracker.append(episodeSum/ 100)\n", "                episodeSum = 0\n", "            break\n", "        Q = tile.getQ(tile.getFeatures(state2), theta)\n", "        theta += np.multiply((alpha*(reward - Qs+gamma*np.max(Q))), tile.oneHotVector(F,action))\n", "        state = state2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No learning at all. Let's try reducing the number of parameters (a coarser 14x14 grid) to allow more generalization" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average reward = -200.0\n", "Average reward = -199.7\n", "Average reward = -198.73\n", "Average reward = -197.785\n", "Average reward = -197.905\n", "Average reward = -194.21\n", "Average reward = -192.71\n", "Average reward = -178.28\n", "Average reward = -191.21\n", "Average reward = -184.51\n", "Average reward = -177.005\n", "Average reward = -181.715\n", "Average reward = -176.215\n", "Average reward = -164.04\n", "Average reward = -157.57\n" ] } ], "source": [ "tile = Tilecoder(1,14)\n", "theta = np.random.uniform(-0.001, 0, size=(tile.n))\n", "alpha = 0.05\n", "gamma = 1\n", "numEpisodes = 3000\n", "rewardTracker = []\n", "rewardTracker2 = []\n", "episodeSum = 0\n", "counter = 0\n", "epsilon = 0.05\n", "\n", "for episodeNum in range(1,numEpisodes+1):\n", "    G = 0\n", "    state = env.reset()\n", "    while True:\n", "        #env.render()\n", "        F = tile.getFeatures(state)\n", "        Q = tile.getQ(F, theta)\n", "        action = e_greedy_policy(Q)\n", "        #action = np.argmax(Q)\n", "        Qs = Q[action]\n", "        state2, reward, done, info = env.step(action)\n", "        G += reward\n", "        if done:\n", "            theta += np.multiply((alpha*(reward - Qs)), tile.oneHotVector(F,action))\n", "            episodeSum += G\n", "            rewardTracker.append(G)\n", "            rewardTracker2.append(rollout(1))\n", "            if episodeNum % 200 == 0:\n", "                print('Average reward = {}'.format(episodeSum / 200))\n", "                #rewardTracker.append(episodeSum/ 100)\n", "                episodeSum = 0\n", "            break\n", "        Q = tile.getQ(tile.getFeatures(state2), theta)\n", "        theta += np.multiply((alpha*(reward - Qs+gamma*np.max(Q))), tile.oneHotVector(F,action))\n", "        state = state2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "aggre_rew = rewardTracker2\n", "plt.plot(smoothen(tiles_rew,100),label='Tiles')\n", "plt.plot(smoothen(aggre_rew,100),label='Aggreg.')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises \n", "\n", "1. Try to find better parameters for the aggregation approach\n", "2. Play with the parameters of the tile coder\n", "3. Use the same approximation with Sarsa or Monte Carlo\n", "4. **Try n-step backups**\n", "5. Use other exploration strategies and play with functions that reduce alpha and/or epsilon over time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }