Simple Reinforcement Learning Library

reinforcement, learning, RL
pip install kyoka==0.2.1


kyoka : Simple Reinforcement Learning Library

Build Status Coverage Status PyPI license

Implemented algorithmes

  • MonteCarlo
  • Sarsa
  • QLearning
  • SarsaLambda
  • QLambda
  • deep Q-network (DQN)


Getting Started


RL(Reinforcement Learning) algorithms learns which action is good or bad through trial-and-error.
So what we need to do is making our learning task in RL format.

This library provides two template classes to make your task in RL format.

  • BaseDomain class which represents our learning task
  • ValueFunction class which RL algorithm uses to save trial-and-error result

So let's see how to use these template classes through simple maze example.

Example. Find the best policy to escape from the maze

Here we will find the best policy to escape from the below maze by using RL algorithm.

S: start, G: goal, X: wall


Step1. Create MazeDomain class

BaseDomain class requires you to implement 5 methods

  • generate_initial_state()
    • returns initial state that RL algorithms starts simulation from.
  • generate_possible_actions(state)
    • returns valid actions in passed state. RL algorithms choose next action from these actions.
  • transit_state(state, action)
    • returns next state after applied the passed action on the passed state.
  • calculate_reward(state)
    • returns how good the passed state is.
  • is_terminal_state(state)
    • returns if passed state is terminal state or not.
from kyoka.domain.base_domain import BaseDomain

class MazeDomain(BaseDomain):


  # we use current position of the maze as "state". So here we return start position of the maze.
  def generate_initial_state(self):
    return (0, 0)

  # the position of the goal is (row=0, column=8)
  def is_terminal_state(self, state):
    return (0, 8) == state

  # we can always move to 4 directions.
  def generate_possible_actions(self, state):
    return [self.ACTION_UP, self.ACTION_DOWN, self.ACTION_RIGHT, self.ACTION_LEFT]

  # RL algorithm can get reward only when he reaches to the goal.
  def calculate_reward(self, state):
    return 1 if self.is_terminal_state(state) else 0

  def transit_state(self, state, action):
    row, col = state
    wall_position = [(1,2), (2,2), (3,2), (4,5), (0,7), (1,7), (2,7)]
    height, width = 6, 9
    if action == self.ACTION_UP:
      row = max(0, row-1)
    elif action == self.ACTION_DOWN:
      row = min(height-1, row+1)
    elif action == self.ACTION_RIGHT:
      col= min(width-1, col+1)
    elif action == self.ACTION_LEFT:
      col = max(0, col-1)
    if (row, col) not in wall_position:
      return (row, col)
      return state # If destination is the wall or edge of the maze then position does not change.

Ok! next is ValueFunction!!

Step2. Create MazeActionValueFunction class

BaseActionValueFunction class requires you to implement 2 methods.

  • calculate_value(state, action)
    • fetch current value of state and action pair.
  • update_function(state, action, new_value)
    • update value of passed state and action pair by passed value.

The state space of this example is very small (state space = |state| x |action| = (6 x 9) x 4 = 216).
So we prepare the table (3-dimentional array) and save value on it.

from kyoka.value_function.base_action_value_function import BaseActionValueFunction

class MazeActionValueFunction(BaseActionValueFunction):

  # call this method before start learning
  def setUp(self):
    maze_width, maze_height, action_num = 6, 9, 4
    self.table = [[[0 for k in range(action_num)] for j in range(maze_height)] for i in range(maze_width)]

  # just take value from the table
  def calculate_value(self, state, action):
    row, col = state
    return self.table[row][col][action]

  # just insert value into the table
  def update_function(self, state, action, new_value):
    row, col = state
    self.table[row][col][action] = new_value

Step3. Running RL algorithm and see its result

OK, let's try QLearning for our maze task.

from kyoka.policy.epsilon_greedy_policy import EpsilonGreedyPolicy
from kyoka.algorithm.td_learning.q_learning import QLearning

domain = MazeDomain()
policy = EpsilonGreedyPolicy(eps=0.1)
value_function = MazeActionValueFunction()

# You can easily replace algorithm like "rl_algo = Sarsa(alpha=0.1, gamma=0.7)"
rl_algo = QLearning(alpha=0.1, gamma=0.7)
rl_algo.setUp(domain, policy, value_function)

That's all !! Let's visualize the value function which QLearning learned.
(If you interested in MazeHelper utility class, Please checkout complete code.)

>>> print MazeHelper.visualize_policy(domain, value_function)

 S -> v-X-vvvX^

Great!! QLearning found the policy which leads us to goal in 14 steps. (14 step is minimum step to the goal !!)

Sample code

In sample directory, we prepared more practical sample code as jupyter notebook and script.
You can also checkout another RL task example tick-tack-toe .


You can use pip like this.

pip install kyoka