Drop-in replacement for pandas DataFrames that allows to store data on an hdf5 file and manipulate data directly from that hdf5 file without loading it in memory.
This is very much a work in progress, some features might not work yet or cause bugs. Save your data elsewhere before converting it to an H5DataFrame.
If you miss a feature from pandas DataFrames, please fill an issue or feel free to contribute.
This library provides the H5DataFrame
object, replacing the regular pandas.DataFrame
.
An H5DataFrame
can be created from a pandas.DataFrame
or from a dictionnary of (column_name -> column_values).
>>> import pandas as pd
>>> from h5dataframe import H5DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
index=['r1', 'r2', 'r3'])
>>> hdf = H5DataFrame(df)
>>> hdf
a b
r1 1 4
r2 2 5
r3 3 6
[RAM]
[3 rows x 2 columns]
At this point, all the data is still loaded in RAM, as indicated by the second-to-last line. To write the data to an hdf5 file, use the H5DataFrame.write()
method.
>>> hdf.write('path/to/file.h5')
>>> hdf
a b
r1 1 4
r2 2 5
r3 3 6
[FILE]
[3 rows x 2 columns]
The H5DataFrame
is now backed on an hdf5 file, only loading data in RAM when requested.
Alternatively, an H5DataFrame
can be read directly from an previously created hdf5 file with the H5DataFrame.read()
method.
>>> from h5dataframe import H5Mode
>>> H5DataFrame.read('path/to/file.h5', mode=H5Mode.READ)
a b
r1 1 4
r2 2 5
r3 3 6
[FILE]
[3 rows x 2 columns]
The default mode is READ
('r'
) which creates a read-only H5DataFrame
. To modify the data, use mode=H5Mode.READ_WRITE
('r+'
).
From pip:
pip install h5dataframe
From source:
git clone git@github.com:Vidium/h5dataframe.git