lazy_dataset

lazy_dataset is a helper for dealing with large datasets that do not fit into memory. It lets you define transformations that are applied lazily (e.g. a mapping function that reads data from the HDD). The transformations are only applied when someone iterates over the dataset.
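
The idea of lazy application can be illustrated with a minimal sketch in plain Python. The class LazyMap and the function load below are hypothetical and are not lazy_dataset's implementation; they only show that the mapping function runs when an example is requested, not when the transformation is defined.

```python
class LazyMap:
    """Hypothetical illustration: wraps a sequence and applies fn on access."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __len__(self):
        return len(self.source)

    def __getitem__(self, index):
        # The function is applied only now, when one example is requested.
        return self.fn(self.source[index])

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)


calls = []


def load(path):
    # Stand-in for an expensive read from disk; records that it was called.
    calls.append(path)
    return {'path': path, 'data': len(path)}


paths = ['a.wav', 'bb.wav', 'ccc.wav']
lazy = LazyMap(paths, load)
assert calls == []          # nothing was loaded at definition time
first = lazy[0]
assert calls == ['a.wav']   # only the requested example was loaded
```

Only the file names are held in memory up front; the (simulated) loading cost is paid per example during indexing or iteration.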

Supported transformations:

dataset.map(map_fn): Apply the function map_fn to each example (builtins.map)
dataset[2]: Get example at index 2.
dataset['example_id']: Get the example that has the example id 'example_id'.
dataset[10:20]: Get a sub-dataset that contains only the examples in the slice 10 to 20.
dataset.filter(filter_fn, lazy=True): Drops examples where filter_fn(example) is false (builtins.filter).
dataset.concatenate(*others): Concatenates two or more datasets (numpy.concatenate)
dataset.intersperse(*others): Combine two or more datasets such that examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
dataset.zip(*others): Zip two or more datasets
dataset.shuffle(reshuffle=False): Shuffles the dataset. When reshuffle is True, it shuffles each time you iterate over the data.
dataset.tile(reps, shuffle=False): Repeats the dataset reps times and concatenates it (numpy.tile)
dataset.groupby(group_fn): Groups examples together. In contrast to itertools.groupby, prior sorting is not necessary, similar to pandas (itertools.groupby, pandas.DataFrame.groupby)
dataset.sort(key_fn, sort_fn=sorted): Sorts the examples depending on the values key_fn(example) (list.sort)
dataset.batch(batch_size, drop_last=False): Batches batch_size examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch)
dataset.random_choice(): Get a random example (numpy.random.choice)
dataset.cache(): Cache in RAM (similar to ESPnet's keep_all_data_on_mem)
dataset.diskcache(): Cache to a cache directory on the local filesystem (useful on clusters with slow network filesystems)
...
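
The remark on groupby, that no prior sort is needed, can be made concrete with the standard library alone. The sketch below does not call lazy_dataset; the example data and group_fn are made up for illustration. It contrasts itertools.groupby, which only merges consecutive items, with a dict-based grouping that collects every example sharing a key.

```python
import itertools
from collections import defaultdict

# Hypothetical, deliberately unsorted examples.
examples = [
    {'example_id': 'a', 'label': 1},
    {'example_id': 'b', 'label': 2},
    {'example_id': 'c', 'label': 1},
]


def group_fn(example):
    return example['label']


# itertools.groupby only merges consecutive items, so with unsorted
# input the key 1 shows up in two separate groups.
consecutive = [
    (key, [e['example_id'] for e in group])
    for key, group in itertools.groupby(examples, key=group_fn)
]

# Collecting into a dict groups all examples with the same key,
# regardless of their order (comparable to pandas-style grouping).
groups = defaultdict(list)
for example in examples:
    groups[group_fn(example)].append(example['example_id'])
groups = dict(groups)
```

Here consecutive contains three groups, while groups contains two: one per distinct label.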

>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
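
The repr above shows a MapDataset(_pickle.loads) stage directly after the DictDataset. A plausible reading, stated here as an assumption rather than as documented behavior, is that each example is kept in serialized form and deserialized on every access, so in-place changes such as example['label'] *= 10 in transform never modify the stored data. The sketch below reproduces that effect with the standard library only; stored and get are hypothetical names.

```python
import pickle

examples = {'example_id_1': {'observation': [1, 2, 3], 'label': 1}}

# Assumption: keep a serialized copy of every example.
stored = {key: pickle.dumps(value) for key, value in examples.items()}


def get(example_id):
    # Each access deserializes a fresh, independent copy.
    return pickle.loads(stored[example_id])


def transform(example):
    example['label'] *= 10  # mutates its argument in place
    return example


first = transform(get('example_id_1'))
second = transform(get('example_id_1'))
```

Both first and second carry the label 10, because transform always works on a new copy; without the copy, the second call would multiply an already modified label.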

Comparison with PyTorch's DataLoader

See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.

Installation

If you just want to use it, install it directly with pip:

pip install lazy_dataset

If you want to make changes or need the most recent version, clone the repository and install it as follows:

git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .

lazy-dataset
Release 0.0.14
