chopin2

Supervised Classification with Hyperdimensional Computing


Keywords
apache-spark, backward-elimination, feature-selection, gpgpu, hd-computing, machine-learning, supervised-learning, vsa
License
MIT-feh
Install
pip install chopin2==1.0.9.post1

Documentation

This repository includes some Python 3.8 utilities to build a Hyperdimensional Computing classification model according to the architecture originally introduced in https://doi.org/10.1109/DAC.2018.8465708

The src/generators folder contains two Python 3.8 scripts that create training and test datasets with randomly selected samples from:

Due to their size, the datasets are not included in this repository, but they can be retrieved from:

The isolet dataset comes from the originally forked repository and has been kept to provide a simple toy model for testing purposes only.

Install

We deployed chopin2 as a Python 3.8 package that can be installed through pip and conda, and it is also available as a Docker image.

Please use one of the following commands to start playing with chopin2:

# Install chopin2 with pip
pip install chopin2

# Install chopin2 with conda
conda install -c conda-forge chopin2

# Build and run the Docker image
docker build -t chopin2 .
docker run -it chopin2

Please note that chopin2 is also available as a Galaxy tool. Its wrapper is available in the official Galaxy ToolShed at https://toolshed.g2.bx.psu.edu/view/fabio/chopin2

Usage

Once installed, you are ready to start playing with chopin2.

Run the following command to try chopin2 on the isolet dataset:

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --nproc 4 \
        --verbose

To run it on Apache Spark, a few additional arguments must be specified:

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --spark \
        --slices 10 \
        --master local \
        --memory 2048m \
        --verbose

List of standard arguments:

--dimensionality    -- Dimensionality of the HD model (default 10000)
--levels            -- Number of level hypervectors (default 2)
--retrain           -- Number of retraining iterations (default 0)
--stop              -- Stop retraining if the error rate does not change (default False)
--dataset           -- Path to the dataset file
--fieldsep          -- Field separator (default ",")
--psplit_training   -- Percentage of observations that will be used to train the model. 
                       The remaining percentage will be used to test the classification model
--crossv_k          -- Number of folds for cross validation.
                       Cross-validate the HD models if --crossv_k is greater than 1
                       (see the example after this list)
--seed              -- Seed for reproducing the random sampling of the observations in the dataset
                       and for building both the training and test sets (default 0)
--pickle            -- Path to the pickle file. If specified, the "--dataset" and "--fieldsep" parameters are not used
--dump              -- Build a summary and log files (default False)
--cleanup           -- Delete the classification model as soon as it produces the prediction accuracy (default False)
--keep_levels       -- Do not delete the level hypervectors. It works in conjunction with --cleanup only (default True)
--nproc             -- Number of parallel jobs for the creation of the HD model.
                       This argument is ignored if --spark is enabled (default 1)
--verbose           -- Print results in real time (default False)
--cite              -- Print references and exit
-v, --version       -- Print the current chopin2.py version and exit
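
For example, a 5-fold cross-validated run on a comma-separated dataset could be sketched as follows (all flags are documented above; the dataset path is a placeholder):

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --stop \
        --dataset /path/to/dataset.csv \
        --fieldsep "," \
        --crossv_k 5 \
        --seed 0 \
        --nproc 4 \
        --verbose

Here --stop ends the retraining phase early once the error rate stops changing, while --seed makes the random sampling of the observations reproducible.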

List of arguments to enable backward variable selection:

--features                     -- Path to a file with a single column containing the whole set or a subset of features
--select_features              -- Trigger the backward variable selection method for the identification of the most
                                  significant features (see the example after this list).
                                  Warning: computationally intensive!
--group_min                    -- Minimum number of features among those specified with the --features argument (default 1)
--accuracy_threshold           -- Stop the execution if the best accuracy achieved during the previous group of runs is lower than this number (default 60.0)
--accuracy_uncertainty_perc    -- Take a run into account even if its accuracy is lower than the best accuracy
                                  achieved in the same group, as long as it is within this percentage of the best accuracy
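
As a sketch, assuming a features file with one feature name per line (the path below is a placeholder), a backward variable selection run on the isolet dataset could look like this:

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --features /path/to/features.txt \
        --select_features \
        --group_min 1 \
        --accuracy_threshold 60.0 \
        --nproc 4 \
        --verbose

Since each group of runs builds one model per candidate feature subset, expect the runtime to grow quickly with the number of features.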

List of arguments for the execution of the classifier on an Apache Spark distributed environment:

--spark     -- Build the classification model in an Apache Spark distributed environment
--slices    -- Number of slices in case --spark argument is enabled. 
               This argument is ignored if --gpu is enabled
--master    -- Master node address
--memory    -- Executor memory

List of arguments for the execution of the classifier on NVIDIA-powered GPUs (see the example after the list):

--gpu       -- Build the classification model on an NVIDIA-powered GPU. 
               This argument is ignored if --spark is specified
--tblock    -- Number of threads per block in case --gpu argument is enabled. 
               This argument is ignored if --spark is enabled
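
For instance, assuming a CUDA-capable NVIDIA GPU is available, the isolet example above could be sketched on the GPU as follows (the value of --tblock is an arbitrary example):

chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --gpu \
        --tblock 32 \
        --verbose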

Credits

Please credit our work in your manuscript by citing:

Fabio Cumbo, Eleonora Cappelli, and Emanuel Weitschek, "A brain-inspired hyperdimensional computing approach for classifying massive DNA methylation data of cancer", MDPI Algorithms, 2020 https://doi.org/10.3390/a13090233

Fabio Cumbo, Emanuel Weitschek, and Daniel Blankenberg, "hdlib: A Python library for designing Vector-Symbolic Architectures", Journal of Open Source Software, 2023 https://doi.org/10.21105/joss.05704

Do not forget to also cite the following paper, from which this work takes inspiration:

Mohsen Imani, Chenyu Huang, Deqian Kong, and Tajana Rosing, "Hierarchical Hyperdimensional Computing for Energy Efficient Classification", IEEE/ACM Design Automation Conference (DAC), 2018 https://doi.org/10.1109/DAC.2018.8465708