Package renamed to torchdatasets!

Use map, apply, reduce or filter directly on Dataset objects
Cache data in RAM, on disk, or via your own method (partial caching supported)
Full support for PyTorch's Dataset and IterableDataset
General torchdatasets.maps like Flatten or Select
Extensible interface (your own cache methods, cache modifiers, maps etc.)
Useful torchdatasets.datasets classes designed for general tasks (e.g. file reading)
Support for torchvision datasets (e.g. ImageFolder, MNIST, CIFAR10) via td.datasets.WrapDataset
Minimal overhead (single call to super().__init__())

💡 Examples

Check documentation here: https://szymonmaszke.github.io/torchdatasets

General example

Create image dataset, convert it to Tensors, cache and concatenate with smoothed labels:

import pathlib

import torchvision
from PIL import Image

import torchdatasets as td

class Images(td.Dataset): # Different inheritance
    def __init__(self, path: str):
        super().__init__() # This is the only change
        self.files = list(pathlib.Path(path).glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
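To make the chained `.map(...).cache()` call above more concrete, here is a minimal pure-Python sketch of the idea: the mapped function runs lazily when a sample is read, and the cache stores each mapped sample after its first read. `MiniDataset` and everything inside it are hypothetical names used for illustration only; this is not how torchdatasets is implemented.

```python
class MiniDataset:
    """Hypothetical stand-in illustrating a lazy map and an in-memory cache."""

    def __init__(self, items, functions=(), cached=False):
        self._items = list(items)
        self._functions = tuple(functions)
        self._cache = {} if cached else None

    def map(self, function):
        # Return a new dataset; the function runs only when a sample is read
        return MiniDataset(self._items, self._functions + (function,))

    def cache(self):
        # Return a new dataset that remembers samples after the first read
        return MiniDataset(self._items, self._functions, cached=True)

    def __getitem__(self, index):
        if self._cache is not None and index in self._cache:
            return self._cache[index]
        sample = self._items[index]
        for function in self._functions:
            sample = function(sample)
        if self._cache is not None:
            self._cache[index] = sample
        return sample

    def __len__(self):
        return len(self._items)


calls = []


def expensive(x):
    calls.append(x)  # Record every real computation
    return x * 10


dataset = MiniDataset([1, 2, 3]).map(expensive).cache()
first = [dataset[i] for i in range(len(dataset))]
second = [dataset[i] for i in range(len(dataset))]  # Served from the cache
```

In this sketch the second pass never calls `expensive` again, which is the behaviour caching is meant to provide for costly transformations such as image decoding.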

You can concatenate the above dataset with another (say, labels) and iterate over them as usual:

for data, label in images | labels:
    # Do whatever you want with your data
    ...
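The `|` operator above pairs samples from two datasets index by index. A rough pure-Python sketch of that behaviour follows; `Samples` and `Paired` are hypothetical classes written for this illustration, and the equal-length check is an assumption of the sketch, not a documented rule of torchdatasets.

```python
class Paired:
    """Hypothetical helper: item i is the pair (left[i], right[i])."""

    def __init__(self, left, right):
        # Assumption of this sketch: both sides must have the same length
        if len(left) != len(right):
            raise ValueError("Both datasets must have the same length")
        self.left, self.right = left, right

    def __getitem__(self, index):
        return self.left[index], self.right[index]

    def __len__(self):
        return len(self.left)


class Samples:
    """Hypothetical list-backed dataset supporting the | operator."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)

    def __or__(self, other):
        return Paired(self, other)


images = Samples(["img0", "img1", "img2"])
labels = Samples([0, 1, 0])
pairs = [(data, label) for data, label in images | labels]
```

Iteration works here through `__getitem__` alone: Python requests indices 0, 1, 2, ... until an `IndexError` is raised by the underlying list.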

Cache first 1000 samples in memory, save the rest on disk in folder ./cache:

images = (
    Images("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from index 1000 to the end are pickled to disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)
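The partial caching above routes each index to a different cacher. The sketch below shows one way such routing could work in plain Python, assuming indices below a boundary go to memory and the rest are pickled to disk. `MemoryCacher`, `PickleCacher`, `pick_cacher` and `cached_get` are hypothetical names for this illustration and do not mirror the internals of `td.cachers` or `td.modifiers`.

```python
import pathlib
import pickle
import tempfile


class MemoryCacher:
    """Hypothetical cacher keeping samples in a dict."""

    def __init__(self):
        self._store = {}

    def __contains__(self, index):
        return index in self._store

    def __setitem__(self, index, sample):
        self._store[index] = sample

    def __getitem__(self, index):
        return self._store[index]


class PickleCacher:
    """Hypothetical cacher writing one pickle file per sample."""

    def __init__(self, folder):
        self._folder = pathlib.Path(folder)
        self._folder.mkdir(parents=True, exist_ok=True)

    def _path(self, index):
        return self._folder / f"{index}.pkl"

    def __contains__(self, index):
        return self._path(index).exists()

    def __setitem__(self, index, sample):
        with open(self._path(index), "wb") as file:
            pickle.dump(sample, file)

    def __getitem__(self, index):
        with open(self._path(index), "rb") as file:
            return pickle.load(file)


def pick_cacher(index, boundary, memory, disk):
    # Indices below the boundary go to memory, the rest to disk
    return memory if index < boundary else disk


def cached_get(source, index, boundary, memory, disk):
    cacher = pick_cacher(index, boundary, memory, disk)
    if index not in cacher:
        cacher[index] = source[index]  # Compute once, then reuse
    return cacher[index]


with tempfile.TemporaryDirectory() as folder:
    memory, disk = MemoryCacher(), PickleCacher(folder)
    source = [x * x for x in range(5)]
    values = [cached_get(source, i, 2, memory, disk) for i in range(5)]
    on_disk = sorted(path.name for path in pathlib.Path(folder).iterdir())
```

With a boundary of 2, samples 0 and 1 stay in memory while samples 2 to 4 end up as pickle files, which matches the "first N in memory, rest on disk" split described above.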

To see what else you can do, please check the torchdatasets documentation.

Integration with `torchvision`

With torchdatasets you can split torchvision datasets and apply augmentation only to the training part of the data:

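import torch
import torchvision

import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset 60/20/20; the test split takes the remainder so the
# lengths always sum to len(dataset), as random_split requires
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset, (train_size, validation_size, test_size)
)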

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply this transformation to element 0 of each sample (the image);
    # element 1 is the label and is left untouched
    0,
)

Note that you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset instance to give it additional caching and mapping capabilities.
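The trailing `0` in the `map` call above selects which element of each `(image, label)` sample is transformed. As a plain-Python illustration of that idea, the hypothetical helper `map_element` below applies a function to one position of a tuple and passes the others through unchanged. It is only a sketch of the concept and is not the implementation of `td.maps.To`.

```python
def map_element(sample, function, index):
    """Apply function to sample[index] only, leaving other elements untouched."""
    return tuple(
        function(element) if position == index else element
        for position, element in enumerate(sample)
    )


# Each sample mimics an (image, label) pair; strings stand in for images
samples = [("cat.png", 0), ("dog.png", 1)]
augmented = [map_element(sample, str.upper, 0) for sample in samples]
```

Only the first element of each pair changes, so labels stay aligned with their transformed images.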

🔧 Installation

🐍 pip

Latest release:

pip install --user torchdatasets

Nightly:

pip install --user torchdatasets-nightly

🐋 Docker

A standalone CPU image and several GPU-enabled images are available on Docker Hub.

For a CPU quickstart, run:

docker pull szymonmaszke/torchdatasets:18.04

Nightly builds are also available; just prefix the tag with nightly_. If you want a GPU image, make sure you have nvidia/docker installed and its runtime set.

❓ Contributing

If you find an issue, or you think some functionality would be useful to others and fits this library, please open a new issue or create a pull request.

For an overview of things you can do to help this project, see the Roadmap.
