This package provides utilities for generating, filtering, solving, visualizing, and processing mazes for training ML systems. It was primarily built for the maze-transformer interpretability project. Our paper on it is available here: http://arxiv.org/abs/2309.10498
This package includes a variety of maze generation algorithms, including randomized depth-first search, Wilson's algorithm for uniform spanning trees, and percolation. Datasets can be filtered to select mazes of a certain length or complexity, to remove duplicates, and to enforce custom properties. A variety of output formats for visualization and training ML models are provided.
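To make the first of these algorithms concrete, here is a minimal pure-Python sketch of randomized depth-first search on an n x n lattice, followed by a breadth-first solver whose path length could drive a length-based filter. It is an illustration only and not the package's implementation: the names `neighbors`, `gen_dfs_sketch`, and `solve_bfs`, and the choice to represent a maze as a set of open connections between cells, are assumptions made for this example.

```python
import random
from collections import deque

Coord = tuple[int, int]


def neighbors(cell: Coord, n: int) -> list[Coord]:
    """Return the in-bounds lattice neighbors of `cell` on an n x n grid."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n and 0 <= j < n]


def gen_dfs_sketch(n: int, seed: int = 0) -> set[frozenset[Coord]]:
    """Randomized DFS: return the set of open connections between cells."""
    rng = random.Random(seed)
    start = (rng.randrange(n), rng.randrange(n))
    visited = {start}
    stack = [start]
    connections: set[frozenset[Coord]] = set()
    while stack:
        current = stack[-1]
        unvisited = [nb for nb in neighbors(current, n) if nb not in visited]
        if not unvisited:
            stack.pop()  # dead end: backtrack to the previous cell
            continue
        nxt = rng.choice(unvisited)
        connections.add(frozenset((current, nxt)))  # open the wall between them
        visited.add(nxt)
        stack.append(nxt)
    return connections


def solve_bfs(
    connections: set[frozenset[Coord]], n: int, start: Coord, end: Coord
) -> list[Coord]:
    """Breadth-first search along open connections; returns the path of cells."""
    parents: dict[Coord, Coord | None] = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            break
        for nb in neighbors(current, n):
            if nb not in parents and frozenset((current, nb)) in connections:
                parents[nb] = current
                queue.append(nb)
    # walk back from the end cell to the start cell
    path: list[Coord] = []
    node: Coord | None = end
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


edges = gen_dfs_sketch(5, seed=1)
path = solve_bfs(edges, 5, (0, 0), (4, 4))
print(f"{len(edges)} connections, solution visits {len(path)} cells")
```

Because the search visits every cell exactly once, the result is a spanning tree of the lattice, so any two cells are joined by exactly one path. A filter on path length then amounts to keeping only mazes whose `len(path)` falls in a chosen range.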
This package is available on PyPI, and can be installed via
pip install maze-dataset
The full hosted documentation is available at https://understanding-search.github.io/maze-dataset/.
Additionally:
- our notebooks serve as a good starting point for understanding the package:
- combined, single page docs are available as:
- test coverage reports are available on the coverage page or the coverage/ folder
- generation benchmark results are available on the benchmarks page or the benchmarks/ folder
To create a MazeDataset, which inherits from torch.utils.data.Dataset, you first create a MazeDatasetConfig:
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators
cfg: MazeDatasetConfig = MazeDatasetConfig(
    name="test",  # name is only for you to keep track of things
    grid_n=5,  # number of rows/columns in the lattice
    n_mazes=4,  # number of mazes to generate
    maze_ctor=LatticeMazeGenerators.gen_dfs,  # algorithm to generate the maze
    maze_ctor_kwargs=dict(do_forks=False),  # additional parameters to pass to the maze generation algorithm
)
and then pass this config to the MazeDataset.from_config method:
dataset: MazeDataset = MazeDataset.from_config(cfg)
This method can check whether a dataset with a matching config hash already exists on your filesystem in the expected location and, if so, load it. It can also generate a dataset on the fly if needed.
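The load-or-generate behavior described above follows a common caching pattern, which can be sketched in plain Python as below. This is a hypothetical illustration and not the package's code: the helper names `config_hash` and `load_or_generate`, the use of SHA-256 over sorted JSON, the file naming scheme, and the stand-in generator are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(cfg: dict) -> str:
    """Stable short hash of a config dict (keys sorted so order is irrelevant)."""
    payload = json.dumps(cfg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def load_or_generate(cfg: dict, base_dir: Path, generate) -> tuple[list, bool]:
    """Return (data, loaded_from_disk) for the given config."""
    path = base_dir / f"{cfg['name']}-{config_hash(cfg)}.json"
    if path.exists():
        # a dataset with a matching config hash is already on disk: reuse it
        return json.loads(path.read_text()), True
    data = generate(cfg)  # nothing cached: build the dataset now
    path.write_text(json.dumps(data))  # save it for the next call
    return data, False


def fake_generate(cfg: dict) -> list:
    """Stand-in generator (hypothetical): one placeholder record per maze."""
    return [[i, cfg["grid_n"]] for i in range(cfg["n_mazes"])]


with tempfile.TemporaryDirectory() as tmp:
    cfg_sketch = {"name": "test", "grid_n": 5, "n_mazes": 4}
    first, hit1 = load_or_generate(cfg_sketch, Path(tmp), fake_generate)
    second, hit2 = load_or_generate(cfg_sketch, Path(tmp), fake_generate)
    print(hit1, hit2)
```

The first call finds nothing on disk and generates; the second call finds the file written by the first and loads it. Any change to the config changes the hash, so a modified config never picks up a stale dataset.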
The elements of the dataset are SolvedMaze objects:
>>> m = dataset[0]
>>> type(m)
maze_dataset.maze.lattice_maze.SolvedMaze
These can be converted to a variety of formats:
# visual representation as ascii art
m.as_ascii()
# RGB image, optionally without solution or endpoints, suitable for CNNs
m.as_pixels()
# text format for autoregressive transformers
from maze_dataset.tokenization import MazeTokenizerModular, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizerModular(
    tokenization_mode=TokenizationMode.AOTP_UT_rasterized, max_grid_size=100,
))
# advanced visualization with many features
from maze_dataset.plotting import MazePlot
MazePlot(m).plot()
This project uses Poetry for development. To install with dev requirements, run
poetry install --with dev
A makefile is included to simplify common development tasks:
- make help will print all available commands
- all tests via make test
  - unit tests via make unit
  - notebook tests via make test_notebooks
- formatter (black, pycln, and isort) via make format
  - formatter in check-only mode via make check-format
If you use this code in your research, please cite our paper:
@misc{maze-dataset,
    title={A Configurable Library for Generating and Manipulating Maze Datasets},
    author={Michael Igorevich Ivanitskiy and Rusheb Shah and Alex F. Spies and Tilman Räuker and Dan Valentine and Can Rager and Lucia Quirke and Chris Mathwin and Guillaume Corlouer and Cecilia Diniz Behn and Samy Wu Fung},
    year={2023},
    eprint={2309.10498},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={http://arxiv.org/abs/2309.10498}
}