pybda

Analysis of big biological data sets on distributed HPC clusters.


Keywords
bigdata, analysis, pipeline, workflow, spark, pyspark, machinelearning, apache-spark, big-data, machine-learning, python, snakemake
License
GPL-3.0
Install
pip install pybda==0.1.0


PyBDA


A command-line tool for the analysis of big biological data sets on distributed HPC clusters.

About

PyBDA is a Python library and command line tool for big data analytics and machine learning that scales to terabyte-sized data sets.

To make PyBDA scale to big data sets, it builds on Apache Spark's DataFrame API, which automatically distributes data across the nodes of a high-performance cluster and computes expensive machine learning tasks in parallel. For scheduling, PyBDA uses Snakemake to execute pipelines of jobs automatically: it first builds a DAG of the methods/jobs you want to run in succession (e.g., dimensionality reduction followed by clustering) and then computes every method by traversing the DAG. When a job completes successfully, PyBDA writes its results and plots, and creates statistics. If a job fails, PyBDA reports which method failed and where (owing to Snakemake's scheduling), so the same pipeline can effortlessly be resumed from the point of failure.
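To make the parallelism concrete, here is a minimal, hypothetical PySpark sketch, not PyBDA's internal code, of the kind of computation PyBDA schedules. The expensive fit is distributed over the executors of the cluster rather than run on a single machine; the feature column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("pybda-sketch").getOrCreate()

# read a tab-separated data set into a distributed DataFrame
df = spark.read.csv("data/single_cell_imaging_data.tsv",
                    sep="\t", header=True, inferSchema=True)

# assemble (hypothetical) feature columns into a single vector column
df = VectorAssembler(inputCols=["feat_1", "feat_2"],
                     outputCol="features").transform(df)

# the expensive part: fitting runs in parallel on the cluster's executors
model = KMeans(k=50, featuresCol="features").fit(df)

spark.stop()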

For instance, if you want to first reduce your data set to a lower-dimensional space, then cluster it using several numbers of cluster centers, and finally fit a random forest, you would specify a config file similar to this:

$ cat data/pybda-usecase.config

spark: spark-submit
infile: data/single_cell_imaging_data.tsv
predict: data/single_cell_imaging_data.tsv
outfolder: data/results
meta: data/meta_columns.tsv
features: data/feature_columns.tsv
dimension_reduction: pca
n_components: 5
clustering: kmeans
n_centers: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200
regression: forest
family: binomial
response: is_infected
sparkparams:
  - "--driver-memory=3G"
  - "--executor-memory=6G"
debug: true
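Note that n_centers takes a comma-separated list of values, so PyBDA fits one k-means model per number of cluster centers. The methods are also composable: if you only wanted the dimensionality reduction, the config would presumably reduce to something like the following (a hedged sketch modeled on the example above; consult the documentation for the exact set of required fields):

spark: spark-submit
infile: data/single_cell_imaging_data.tsv
outfolder: data/results
meta: data/meta_columns.tsv
features: data/feature_columns.tsv
dimension_reduction: pca
n_components: 5
sparkparams:
  - "--driver-memory=3G"
  - "--executor-memory=6G"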

Executing PyBDA with the methods configured above is then as easy as this:

$ pybda run data/pybda-usecase.config local
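Here, local tells Spark to run on the local machine. To run on a cluster instead, the last argument should presumably be the URL of your running Spark master (a sketch; see the documentation for the exact form):

$ pybda run data/pybda-usecase.config spark://<master-ip>:7077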

Installation

I recommend installing PyBDA from Bioconda:

$ conda install -c bioconda pybda

You can, however, also install it directly from PyPI:

$ pip install pybda

Alternatively, you can download the latest release and install it yourself.

Documentation

Check out the documentation here. The documentation will walk you through

  • the installation process,
  • setting up Apache Spark (see the sketch after this list),
  • using pybda.
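
As a rough orientation for the Spark setup, assuming a standalone Spark distribution, a small cluster can typically be started with the scripts below (Spark 3.x script names; older releases call the worker script start-slave.sh). The master prints the spark:// URL that workers, and PyBDA, connect to:

$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/sbin/start-worker.sh spark://<master-host>:7077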

Author

Simon Dirmeier simon.dirmeier@bsse.ethz.ch