FFCV Dataloader with Pytorch Lightning
FFCV is a fast dataloader for neural networks training: https://github.com/libffcv/ffcv
In this repository, all the steps to install and configure it with pytorch-lightning are presented.
The idea is to provide very generic methods and utils, while letting the user decide and configure anything.
Installation
Tested with:
Ubuntu 22.04.2 LTS
python 3.11
ffcv==1.0.2
pytorch==2.0.1
pytorch-lightning==2.0.4
Dependencies
You can install dependencies (FFCV, Pytorch) with the provided environment.yml
file:
conda env create --file environment.yml
conda activate ffcv-pl
This should correctly create a conda environment named ffcv-pl
.
Note: Modify the pytorch-cuda version to the one compatible with your system.
Note: Solving environment can take quite a long time. I suggest to use libmamba solver to speed up the process.
If the above does not work, then another option is manual installation:
-
create conda environment
conda create --name ffcv-pl conda activate ffcv-pl
-
install pytorch according to official website
# in my environment the command is the following conda install pytorch torchvision torchaudio pytorch-cuda=[your-version] -c pytorch -c nvidia
-
install ffcv dependencies and pytorch-lightning
# can take some time for solving, but should not create conflicts conda install cupy pkg-config libjpeg-turbo">=2.1.4" opencv numba pytorch-lightning">=2.0.0" -c pytorch -c conda-forge
-
install ffcv
pip install ffcv
For further help, check out FFCV installation guidelines: ffcv official page
Package
Once dependencies are installed, it is safe to install the package:
pip install ffcv_pl
Dataset Creation
You need to save your dataset in ffcv format (.beton
).
Official FFCV docs.
This package provides you the create_beton_wrapper
method, which allows to easily create
a .beton
dataset from a torch
dataset.
Example from the dataset_creation.py
script:
from ffcv_pl.generate_dataset import create_beton_wrapper
from torch.utils.data.dataset import Dataset
import numpy as np
from PIL import Image
class ToyImageLabelDataset(Dataset):
def __init__(self, n_samples: int):
self.samples = [Image.fromarray((np.random.rand(32, 32, 3) * 255).astype('uint8')).convert('RGB')
for _ in range(n_samples)]
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
return (self.samples[idx], int(idx))
def main():
# 1. Instantiate the torch dataset that you want to create
# Important: the __get_item__ dataset must return tuples! (This depends on FFCV library)
image_label_dataset = ToyImageLabelDataset(n_samples=256)
# 2. call the method, and it will automatically create the .beton dataset for you.
create_beton_wrapper(image_label_dataset, "./data/image_label.beton")
if __name__ == '__main__':
main()
Dataloader and Datamodule
Merge the PL Datamodule with the FFCV Loader object.
Official FFCV Loader docs.
Official Pytorch-Lightning DataModule docs.
In main.py
a complete example on how to use the FFCVDataModule
method and train a
Lightning Model is given.
The main steps to follow are:
- create
FFCVPipelineManager
object, which needs the path to a previously created.beton
file, a list of operations to perform on each item returned by your dataset and an ordering option for Loading. - create the
FFCVDataModule
object, which is a Lightning Module with FFCV Loader. - Pass the data module to Pytorch Lightning trainer, and run!
Suggestion : read FFCV performance guide to better understand which options fit your needs.
Complete Example from the main.py
script:
import pytorch_lightning as pl
import torch
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import RandomResizedCropRGBImageDecoder, CenterCropRGBImageDecoder
from ffcv.loader import OrderOption
from ffcv.transforms import ToTensor, ToTorchImage
from pytorch_lightning.strategies.ddp import DDPStrategy
from torch import nn
from torch.optim import Adam
from torchvision.transforms import RandomHorizontalFlip
from ffcv_pl.data_loading import FFCVDataModule
from ffcv_pl.ffcv_utils.augmentations import DivideImage255
from ffcv_pl.ffcv_utils.utils import FFCVPipelineManager
# define the LightningModule
class LitAutoEncoder(pl.LightningModule):
def __init__(self):
super().__init__()
self.encoder = nn.Sequential(nn.Linear(32 * 32 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32 * 32 * 3))
def training_step(self, batch, batch_idx):
x = batch[0]
b, c, h, w = x.shape
x = x.reshape(b, -1)
z = self.encoder(x)
x_hat = self.decoder(z)
loss = nn.functional.mse_loss(x_hat, x)
# Logging to TensorBoard by default
self.log("train_loss", loss)
return loss
def validation_step(self, batch, batch_idx):
pass
def configure_optimizers(self):
optimizer = Adam(self.parameters(), lr=1e-3)
return optimizer
def main():
seed = 1234
pl.seed_everything(seed, workers=True)
batch_size = 16
gpus = 2
nodes = 1
workers = 8
# image label dataset
train_manager = FFCVPipelineManager("./data/image_label.beton", # previously defined using dataset_creation.py
pipeline_transforms=[
# image pipeline
[RandomResizedCropRGBImageDecoder((32, 32)),
ToTensor(),
ToTorchImage(),
DivideImage255(dtype=torch.float32),
RandomHorizontalFlip(p=0.5)],
# label (int) pipeline
[IntDecoder(),
ToTensor()
]
],
ordering=OrderOption.RANDOM) # random ordering for training
val_manager = FFCVPipelineManager("./data/image_label.beton",
pipeline_transforms=[
# image pipeline (different from train)
[CenterCropRGBImageDecoder((32, 32), ratio=1.),
ToTensor(),
ToTorchImage(),
DivideImage255(dtype=torch.float32)],
# label (int) pipeline
None # if None, uses default
],
ordering=OrderOption.SEQUENTIAL) # sequential ordering for validation
# datamodule creation
# ignore test and predict steps, since managers are not defined.
data_module = FFCVDataModule(batch_size, workers, train_manager=train_manager, val_manager=val_manager,
is_dist=True, seed=seed)
# define model
model = LitAutoEncoder()
# trainer
trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
accelerator='gpu', devices=gpus, num_nodes=nodes, max_epochs=5, logger=False)
# start training!
trainer.fit(model, data_module)
if __name__ == '__main__':
main()
Code Citations
-
Pytorch-Lightning:
@software{Falcon_PyTorch_Lightning_2019, author = {Falcon, William and {The PyTorch Lightning team}}, doi = {10.5281/zenodo.3828935}, license = {Apache-2.0}, month = mar, title = {{PyTorch Lightning}}, url = {https://github.com/Lightning-AI/lightning}, version = {1.4}, year = {2019} }
-
FFCV:
@misc{leclerc2022ffcv, author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry}, title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks}, year = {2022}, howpublished = {\url{https://github.com/libffcv/ffcv/}}, note = {commit 2544abdcc9ce77db12fecfcf9135496c648a7cd5} }