ml-dataloader
ml-dataloader is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
Install
pip install ml-dataloader
Examples (similar to Pytorch-dataloader)
- suppose data store in python list
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor # or to_pt_tensor
data = list(range(10))
dataset = Dataset(data, kind=DataKind.MEM_SEQ)
dl = DataLoader(dataset, batch_size=2, shuffle=False, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
- suppose
train.tsv
storing the data
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor
filename = 'train.tsv'
dataset = Dataset(filename, kind=DataKind.FILE)
dl = DataLoader(dataset, batch_size=2, shuffle=True, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
- suppose
train.tsv
storing the data and usingmmap
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor
filename = 'train.tsv'
dataset = Dataset(filename, kind=DataKind.MMAP_FILE)
dl = DataLoader(dataset, batch_size=2, shuffle=True, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
NOTES:
- if transform is slow, the dataloader will be stuck while num_workers > 0
Examples with Pipeline (similar to Tensorpack-dataflow)
- suppose data store in python list
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor
data = list(range(10))
dataset = Dataset(data, kind=DataKind.MEM_SEQ)
dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
- suppose
train.tsv
storing the data
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor
filename = 'train.tsv'
dataset = Dataset(filename, kind=DataKind.FILE)
dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
- suppose
train.tsv
storing the data and usingmmap
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.kind import DataKind
from dataloader.util.to_tf_tensor import to_tf_tensor
filename = 'train.tsv'
dataset = Dataset(filename, kind=DataKind.MMAP_FILE)
dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20, fn_to_tensor=to_tf_tensor)
for batch in dl:
print(batch)
NOTES:
- the fully supported parameters, pls ref to DataLoader definition
- with MultiThreadMapData/MultiProcessMapDataZMQ, the order won't be kept as defined in dataset
- in order to keep order as defined in
Dataset
, MapData can be used, but it could be slow compare with MultiThreadMapData and MultiProcessMapDataZMQ. Another way, process the data with pool_transform, then pass the processed data asDataKind.MEM_SEQ
kind intoDataset
, i.e.,dataset = Dataset(processed, DataKind.MEM_SEQ)
, and avoid usingMultiThreadMapData/MultiProcessMapDataZMQ
Refs:
FAQ
-
[__NSPlaceholderDate initialize] may have been in progress in another thread when fork()
if one meets such error, please set
OBJC_DISABLE_INITIALIZE_FORK_SAFETY
as yes:export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
-
publish to pypi (on mac)
pip install twine python setup.py sdist twine upload --verbose --repository-url "https://upload.pypi.org/legacy/" -u "pypi-username" -p "pypi-password" dist/*