efficient-rl

Reimplementation of Diuk's "object-oriented representation for efficient RL".


Keywords
Reinforcement, Learning, Efficient, RL, OO-MDP, efficient-rl, model-based-rl, reinforcement-learning, reproducible-research
License
MIT
Install
pip install efficient-rl==1.0.1

Documentation

Efficient Reinforcement Learning


Motivation | Summary | Results | How to use this repository


This is a Python reimplementation, for the Taxi domain, of Diuk et al.'s "An Object-Oriented Representation for Efficient Reinforcement Learning".

In the Taxi domain the goal is to navigate the taxi (initially the yellow box) to the
passenger (blue letter), perform a Pickup action, and then drive the taxi with the
passenger inside (green box) to the destination (magenta letter) and perform
a Dropoff action. A reward of -1 is obtained for every time step it takes until delivery.
A successful Dropoff yields a reward of +20, while an unsuccessful Dropoff or Pickup is
penalized with -10. This task was introduced by Dietterich.
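
For a quick feel of these dynamics, the domain can be explored via the gym Taxi environment, which this repository extends (see the environment folder mentioned below). A minimal interaction sketch, assuming a classic gym API (the environment id and the return signatures vary slightly between gym versions):

```python
import gym

env = gym.make("Taxi-v3")   # 5x5 Taxi domain by Dietterich; older gym versions use "Taxi-v2"
state = env.reset()         # flat state index in {0, ..., 499}

# gym action encoding: 0=south, 1=north, 2=east, 3=west, 4=Pickup, 5=Dropoff
next_state, reward, done, info = env.step(1)  # try to move one cell north
print(reward)  # -1 per time step; illegal Pickup/Dropoff yields -10, successful Dropoff +20
```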

Motivation

It is a well-known empirical fact in reinforcement learning that model-based approaches (e.g., Rmax) are more sample-efficient than model-free algorithms (e.g., Q-learning). One of the main reasons may be that model-based learning tackles the exploration-exploitation dilemma in a smarter way by using the accumulated experience to build an approximate model of the environment. Furthermore, it has been shown that rich state representations, such as those of factored MDPs, can make model-based learning even more sample-efficient. Factored MDPs enable an effective parametrization of transition and reward dynamics by using dynamic Bayesian networks (DBNs) to represent partial dependency relations between state variables; as a result, the environment dynamics can be learned from fewer samples. A major downside of these approaches is that the DBNs need to be provided as prior knowledge, which may not always be possible.

Motivated by human intelligence, Diuk et al. introduce a new framework, propositional object-oriented MDPs (OO-MDPs), to model environments and their dynamics. As it turns out, humans are far more sample-efficient than state-of-the-art algorithms when playing games such as Taxi (Diuk actually performed an experiment). Diuk argues that humans must use some prior knowledge when playing this game; he further speculates that this knowledge might come in the form of object representations, e.g., identifying horizontal lines as walls after observing that the taxi cannot move through them.

Diuk et al. provide a learning algorithm for deterministic OO-MDPs (DOORmax) which outperforms factored Rmax. As prior knowledge, DOORmax needs the objects and the relations to consider, which seems more natural since these may also be reused across different games. Furthermore, this approach may also be used to incorporate human biases.


Summary

This section gives an overview of the reimplemented algorithms, which can be divided into model-free and model-based approaches.

Model-free Approaches

In model-free algorithms, the agent learns the optimal action-value function (or value function or policy) directly from experience, without building an explicit model of the environment. Probably the most famous model-free algorithm is Q-learning, which also forms the basis of the (perhaps even more famous) DQN paper.

Q-learning

Q-learning aims to approximate the optimal action-value function from which the optimal policy can be inferred. In the simplest case, a table (Q-table) is used as a function approximator.

The basic idea is to start with a random action-value function and iteratively update it towards the optimal action-value function. The update is applied after each action a, using the observed reward r and the new state s'. The update rule is very simple and is derived from Bellman's optimality equation:

Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s', a')\right]

where α is the learning rate and γ the discount factor. To allow for exploration, Q-learning commonly uses ε-greedy exploration or the Greedy in the Limit with Infinite Exploration (GLIE) approach (see David Silver's lecture notes, p. 13 ff.).

Diuk uses two variants of Q-learning:

  • Q-learning: the standard Q-learning approach with ε-greedy exploration, where the parameters α=0.1 and ε=0.6 were found via parameter search.
  • Q-learning with optimistic initialization: instead of a random initialization of the Q-table, a smart initialization to an optimistic value (the maximum possible value of any state-action pair) is used. This makes unvisited state-action pairs more likely to be visited. Here, α was set to 1 (deterministic environment) and ε to 0 (exploration is ensured via the initialization). A minimal sketch of both variants is shown below.
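
As a rough illustration (hypothetical helper names, not the classes used in this package), here is a tabular sketch of the update rule and the two initialization variants, assuming a discount factor γ = 0.95:

```python
import numpy as np

n_states, n_actions, gamma, r_max = 500, 6, 0.95, 20

def q_update(Q, s, a, r, s_next, alpha):
    """One tabular Q-learning update, i.e. a Bellman optimality backup towards the sample."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

def epsilon_greedy(Q, s, eps):
    """Random action with probability eps, greedy action otherwise."""
    return np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))

# variant 1: standard Q-learning with alpha = 0.1 and eps = 0.6
Q_standard = np.zeros((n_states, n_actions))

# variant 2: optimistic initialization with alpha = 1 and eps = 0;
# one common optimistic value is the maximum possible return r_max / (1 - gamma)
Q_optimistic = np.full((n_states, n_actions), r_max / (1 - gamma))
```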

Model-based Approaches

In model-based approaches, the agent learns a model of the environment by accumulating experience. An optimal action-value function (or value function or policy) is then obtained through planning, which can be done exactly or approximately. In their experiments, Diuk et al. use exact planning, more precisely value iteration. The following three algorithms differ in the way they learn the environment dynamics.
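
For reference, a minimal value iteration sketch over a tabular model (T and R are assumed to be dense numpy arrays; the planner in this package may differ in its details):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Exact planning on a (learned) tabular model.

    T: transition probabilities of shape (n_states, n_actions, n_states)
    R: expected rewards of shape (n_states, n_actions)
    Returns the action-value function Q of shape (n_states, n_actions).
    """
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (T @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q
        V = V_new
```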

Rmax

Rmax is a provably efficient state-of-the-art algorithm that tackles the exploration-exploitation dilemma with an intuitive approach: Rmax divides state-action pairs into known (visited often enough to build an accurate transition/reward function) and unknown. For known state-action pairs, the algorithm uses the empirical transition and reward function for planning. For unknown ones, Rmax assumes a transition to a fictitious state from which maximum reward can be obtained consistently (hence the name) and uses that for planning. Therefore, actions that have not been tried out (often enough) in the current state are preferred, unless an already known action also leads to the maximal return. The parameter M defines the number of observations the agent has to make before it considers a transition/reward to be known; in a deterministic setting such as the Taxi domain, it can be set to 1. Rmax is guaranteed to find a near-optimal action-value function in polynomial time.
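
A simplified sketch of how the optimistic model for planning can be assembled (illustrative names and a deterministic-friendly default M = 1, not the exact implementation of this package); planning on this model, e.g. with value iteration as sketched above, makes unknown state-action pairs look maximally rewarding and thereby drives exploration:

```python
import numpy as np

def build_optimistic_model(counts, T_emp, R_emp, r_max, M=1):
    """Assemble transition/reward arrays for planning, using a fictitious max-reward state.

    counts: visit counts per (state, action), shape (n_states, n_actions)
    T_emp:  empirical transition probabilities, shape (n_states, n_actions, n_states)
    R_emp:  empirical rewards, shape (n_states, n_actions)
    """
    n_states, n_actions = counts.shape
    # one additional fictitious, absorbing state that always yields r_max
    T = np.zeros((n_states + 1, n_actions, n_states + 1))
    R = np.full((n_states + 1, n_actions), float(r_max))
    T[-1, :, -1] = 1.0                        # fictitious state is absorbing
    known = counts >= M                       # has (s, a) been visited often enough?
    for s in range(n_states):
        for a in range(n_actions):
            if known[s, a]:                   # use the empirical model
                T[s, a, :n_states] = T_emp[s, a]
                R[s, a] = R_emp[s, a]
            else:                             # pretend (s, a) leads to the max-reward state
                T[s, a, -1] = 1.0
    return T, R
```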

Learning transition and reward dynamics

The 5x5 Taxi domain has 500 different states:

  • 5 x positions for taxi
  • 5 y positions for taxi
  • 5 passenger locations (4 designated locations plus in-taxi)
  • 4 destinations

In the standard Rmax approach without any domain knowledge (except for the maximum possible reward Rmax, the number of states |S| and the number of actions |A|), the states are simply enumerated and the agent cannot transfer knowledge across the domain. E.g., assume the agent performs action North at some location on the grid and learns the state transition (more precisely, it learns something like: picking action 1 at state 200 results in ending up in state 220). At the same location but with a different passenger location or destination, the agent cannot predict the outcome of action North. It takes the agent at least 3000 (|S| · |A|) steps to fully learn the 5x5 Taxi transition dynamics. Furthermore, the learned transition and reward dynamics are rather difficult to interpret. To address these shortcomings, the agent needs a different representation and some prior knowledge.
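
To make this lack of generalization concrete, here is one possible flat enumeration of the 500 states (the ordering is illustrative; the gym Taxi encoding follows the same idea but is not necessarily identical):

```python
def encode_state(taxi_x, taxi_y, passenger_loc, destination):
    """Flatten (x, y, passenger, destination) into a single index in {0, ..., 499}.

    taxi_x, taxi_y in {0, ..., 4}; passenger_loc in {0, ..., 4} (4 = in taxi);
    destination in {0, ..., 3}.
    """
    return ((taxi_x * 5 + taxi_y) * 5 + passenger_loc) * 4 + destination

# the same taxi position maps to completely unrelated flat indices once the
# passenger location or destination changes, so a transition learned for one
# index tells the agent nothing about the other:
print(encode_state(2, 3, 0, 1))  # 261
print(encode_state(2, 3, 4, 2))  # 278
```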

Factored Rmax

Factored Rmax is an Rmax adaptation that builds on a factored MDP representation of the environment. In a factored MDP, a state is represented as a tuple (hence factored state); in the Taxi domain, for example, the state can be represented as the 4-tuple

[taxi x location, taxi y location, passenger location, passenger destination]

(Note that passenger location actually enumerates the different (x, y) passenger start locations plus whether the passenger is in the taxi.) This representation makes it possible to encode partial dependency relations between state variables for the environment dynamics using dynamic Bayesian networks (DBNs). E.g., for action North we know that each state variable at time t+1 only depends on its own value at time t, i.e., the x location at time t+1 under action North is independent of the y location, passenger location and passenger destination at time t. This knowledge is encoded in a DBN (each action may have a different DBN) and enables much more sample-efficient learning for Factored Rmax. The downside of this approach is that this kind of prior knowledge may not be available and that it lacks some generalization: although Factored Rmax knows that the x location is independent of all other state variables, it still needs to perform action North at each x location to learn the outcome.
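
The kind of prior knowledge Factored Rmax requires can be written down as per-action parent sets. An illustrative (not this package's actual) encoding for action North, following the independence statement above:

```python
# state variables of the factored Taxi state
VARIABLES = ["taxi_x", "taxi_y", "passenger_loc", "passenger_dest"]

# DBN structure for action North: each variable at time t+1 depends only on
# the listed parent variables at time t (here, only on its own previous value)
DBN_NORTH = {
    "taxi_x": ["taxi_x"],
    "taxi_y": ["taxi_y"],
    "passenger_loc": ["passenger_loc"],
    "passenger_dest": ["passenger_dest"],
}
```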

DOORmax

DOORmax is an Rmax adaptation that builds on a deterministic (propositional) object-oriented MDP (OO-MDP) representation of the environment. This representation is based on objects and their interactions; a state is represented as the union of all (object) attribute values. Additionally, each state has an associated boolean vector describing which relations are enabled in that state and which are not. During a transition, each attribute of the state may be affected by some kind of effect, resulting in an attribute change. There are some limitations on the effects that can occur, which are well explained in Diuk's dissertation. The basic idea of DOORmax is to recover the deterministic OO-MDP using condition-effect learners (the conditions are essentially the relations that need to hold for an effect to occur). The paper results show that DOORmax transfers knowledge across the domain much better than the other algorithms, indicating better generalization. Another feature is that the learned transition dynamics are easy to interpret, e.g., DOORmax will learn that action North has the effect of incrementing taxi.y by 1 when the relation touch_north(taxi, wall) is False, and that taxi.y does not change when touch_north(taxi, wall) is True.
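
A heavily simplified condition-effect learner for a single (action, attribute, effect type) combination might look as follows. Diuk's full algorithm additionally tracks failure conditions and lets several effect types compete, so treat this only as a sketch of the core idea:

```python
class SimpleEffectLearner:
    """Learn one deterministic effect and the condition under which it occurs.

    A condition is a tuple of booleans, i.e. the relations that hold in a state
    (e.g. touch_north(taxi, wall)). The learned condition is the conjunction of
    all relations that were True every time the effect was observed.
    """

    def __init__(self):
        self.effect = None      # e.g. "increment taxi.y by 1"
        self.condition = None   # relations that always held when the effect occurred

    def update(self, condition, observed_effect):
        if observed_effect is None:            # attribute did not change, nothing to learn here
            return
        if self.effect is None:                # first observation of this effect
            self.effect, self.condition = observed_effect, tuple(condition)
        elif self.effect == observed_effect:   # keep only relations that held every time
            self.condition = tuple(c1 and c2 for c1, c2 in zip(self.condition, condition))

    def predict(self, condition):
        """Return the learned effect if the condition matches, otherwise None (unknown)."""
        if self.condition is None:
            return None
        matches = all(c for c_required, c in zip(self.condition, condition) if c_required)
        return self.effect if matches else None
```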


Results

Experimental Setup

The experimental setup is described on p. 31 of Diuk's dissertation and on p. 7 of the paper. It consists of testing against six probe states and reporting the number of steps the agent took until it reached the optimal policy for these six start states. Since there is some randomness in the trials, each algorithm is run 100 times and the results are averaged.

Differences between Reimplementation and Diuk

There are some differences between this reimplementation and Diuk's approach which are listed below:

  1. For educational purposes, the reward function is also learned in this reimplementation (always in the simplest possible way). Note that Diuk mainly focused on learning the transition model:

    I will focus on learning dynamics and assume the reward function is available as a black box function.

  2. It is unknown whether in Diuk's setting the passenger's start location and destination could coincide during training. The original definition by Dietterich states:

    To keep things uniform, the taxi must pick up and drop off the passenger even if he/she is already at the destination.

    Therefore, this was also possible during training in this reimplementation. While the results for Rmax and its adaptations indicate that Diuk used the same setting, there is a discrepancy for Q-learning. When the setting was changed such that the passenger's start location and destination could not be the same (these are the results in brackets), results similar to Diuk's were obtained.

  3. Some implementation details differ, such as the update procedure for the empirical transition and reward functions or the condition-effect learners, which were either not documented in enough detail or did not fit into the structure of this reimplementation.

Dissertation Results (p.49)

The dissertation results align with the reimplementation results. Clearly, DOORmax outperforms the other algorithms in terms of sample-efficiency.

For the differences in Q-Learning and the values in brackets, refer to 2) of Differences between Reimplementation and Diuk.

The results were obtained on a Lenovo ThinkPad Yoga 260 (i7-6500 CPU @ 2.50 GHz x 4).

| Domain knowledge | Algorithm | Diuk: # steps | Diuk: time/step | Reimplementation: # steps | Reimplementation: time/step | Reimplementation: total time |
| --- | --- | --- | --- | --- | --- | --- |
| \|S\|, \|A\| | Q-learning | 106859 | <1ms | 120148 (119941) | <1ms (<1ms) | 4.9s (4.9s) |
| \|S\|, \|A\|, Rmax | Q-learning with optimistic initialization | 29350 | <1ms | 75289 (28989) | <1ms (<1ms) | 4.2s (1.7s) |
| \|S\|, \|A\|, Rmax | Rmax | 4151 | 74ms | 4080 | 2.9ms | 11.8s |
| Rmax, DBN structure | Factored Rmax | 1676 | 97.7ms | 1686 | 30ms | 51.4s |
| Objects, relations to consider, Rmax | DOORmax | 529 | 48.2ms | 498 | 36ms | 17.9s |
| \|A\|, visualization of game | Humans (non-videogamers) | 101 | NA | NA | NA | NA |
| \|A\|, visualization of game | Humans (videogamers) | 48.8 | NA | NA | NA | NA |

Paper Results (p.7)

The paper results align with the reimplementation results. They show that DOORmax not only outperforms Factored Rmax in terms of sample-efficiency, but also scales much better to larger problems. Note that the number of states increases by a factor of more than 14 (from 500 to 7200).

The results were obtained on a cluster whose CPU specifics I do not know (this is not too important since the focus lies on the comparison). Note that Diuk et al. used a more powerful machine for the paper results: their average step times are notably smaller than in the dissertation results.

| | Diuk: Taxi 5x5 | Diuk: Taxi 10x10 | Diuk: ratio | Reimplementation: Taxi 5x5 | Reimplementation: Taxi 10x10 | Reimplementation: ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Number of states | 500 | 7200 | 14.40 | 500 | 7200 | 14.40 |
| Factored Rmax: # steps | 1676 | 19866 | 11.85 | 1687 | 21868 | 12.96 |
| Factored Rmax: time per step | 43.59ms | 306.71ms | 7.03 | 24.61ms | 1.02s | 41.45 |
| DOORmax: # steps | 529 | 821 | 1.55 | 502 | 1086 | 2.16 |
| DOORmax: time per step | 13.88ms | 293.72ms | 21.16 | 22.12ms | 1.22s | 55.15 |

How to use this repository

Installation

Building from Source
git clone --depth 1 https://github.com/borea17/efficient_rl/
cd efficient_rl
python setup.py install
Via Pip
pip install efficient_rl

Reproduce results

After successful installation, download dissertation_script.py and paper_script.py (which are in folder efficient_rl), then run

python dissertation_script.py 
python paper_script.py

By default, each agent runs only once. To increase the number of repetitions, change n_repetitions in the scripts.

WARNING: It is not recommended to run paper_script.py on a standard computer as it may take several hours.

Contributions

If you want to use this repository for a different environment, you may want to have a look at the efficient_rl/environment folder. It contains a self-written environment, TaxiEnvironmentClass.py, as well as extensions to the gym Taxi environment in the corresponding folders.

Contributions are welcome and if needed, I will provide a more detailed documentation.