loompy
★ This repository is under construction, and not yet ready for public use. Be patient.
.loom is an efficient file format for very large omics datasets,
consisting of a main matrix and a variable number of row and column
annotations. We use loom files to store single-cell gene expression
data: the main matrix contains the actual expression values (one
column per cell, one row per gene); row and column annotations
contain metadata for genes and cells, such as Name, Chromosome,
Position (for genes), and Strain, Sex, Age (for cells).
Loom files (.loom) are created in the HDF5 file format, which
supports an internal collection of numerical multidimensional datasets.
HDF5 is supported by many computer languages, including Java, MATLAB,
Mathematica, Python, R, and Julia. .loom files are accessible from
any language that supports HDF5.
Installation
Use pip from your terminal:
pip install loompy
Note: there are some prerequisites, which will be installed along with loompy. If you use the popular Anaconda Python distribution, all prerequisites will have already been installed.
Getting started
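import loompy
# Connect to an existing loom file; the data stays on disk
ds = loompy.connect("cortex.loom")
# List the names of the row (gene) attributes
print(list(ds.row_attrs.keys()))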
This prints the names of all the row attributes in the file.
Understanding the semantics of loom files
Connecting, not loading and saving
Loom files are stored on disk and are never loaded entirely. They are more like databases: you connect, retrieve some subset of the data, maybe update some attributes.
Reading and writing
Loom files are based on HDF5, a file format suitable for large multidimensional datasets. They are designed to be mostly created once, then used as read-only. They do not support writing and reading concurrently. They also do not support journalling, so if a write is interrupted (for example by a crash), the entire file can be lost. Therefore, do not use loom files as your primary data storage. They are for working with data, not keeping it safe.
Loom files are great for distribution of large datasets, which are then used as read-only for analytical purposes.
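One way to act on this advice is to keep the master copy of a file elsewhere and only ever write to a scratch copy. The sketch below uses only the Python standard library; the helper name working_copy and the scratch-directory prefix are hypothetical and not part of loompy.

```python
import shutil
import tempfile
from pathlib import Path


def working_copy(master_path, work_dir=None):
    """Copy a file into a scratch directory and return the path of the copy.

    Hypothetical helper (not part of loompy): the master file is only read,
    so an interrupted write can damage the copy but not the original.
    """
    master = Path(master_path)
    # Use the given directory, or create a fresh temporary one.
    target_dir = Path(work_dir) if work_dir else Path(tempfile.mkdtemp(prefix="loom_work_"))
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / master.name
    shutil.copy2(master, target)  # copies contents and file metadata
    return target
```

You would then connect to the returned path rather than to the master file, and copy the result back only once the write has completed.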
Efficient indexing
The main matrix is stored in chunked format. That is, instead of being stored by rows or by columns, it is stored as a sequence of little rectangles. As a consequence, both rows and columns (as well as submatrices) can be efficiently accessed.
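To make the trade-off concrete, the following sketch counts how many values must be read from disk to fetch one full row or one full column under three layouts. It is a plain-Python illustration, not loompy code: the matrix shape and the 64 x 64 chunk size are assumptions chosen for the example, and it assumes that a chunk is always read in full.

```python
def chunks_touched(chunk_shape, rows, cols):
    """Count the chunks overlapped by a rectangular selection.

    rows and cols are half-open index ranges (start, stop).
    """
    (r0, r1), (c0, c1) = rows, cols
    chunk_rows, chunk_cols = chunk_shape
    n_down = (r1 - 1) // chunk_rows - r0 // chunk_rows + 1
    n_across = (c1 - 1) // chunk_cols - c0 // chunk_cols + 1
    return n_down * n_across


# Hypothetical matrix: 20,000 genes (rows) x 3,000 cells (columns)
n_rows, n_cols = 20000, 3000
one_row = ((5, 6), (0, n_cols))
one_col = ((0, n_rows), (5, 6))

layouts = {
    "stored by rows": (1, n_cols),      # each chunk is one whole row
    "stored by columns": (n_rows, 1),   # each chunk is one whole column
    "chunked (64 x 64)": (64, 64),      # assumed chunk size
}

for name, shape in layouts.items():
    chunk_size = shape[0] * shape[1]
    row_cost = chunks_touched(shape, *one_row) * chunk_size
    col_cost = chunks_touched(shape, *one_col) * chunk_size
    print(f"{name}: {row_cost} values read for a row, {col_cost} for a column")
```

Under these assumptions, the row-wise and column-wise layouts are each ideal in one direction and must read the entire matrix in the other, while the chunked layout reads a modest amount of extra data in both directions.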
Documentation
The loompy API is documented at loompy.readthedocs.org.