BootstrapCCpy

CC library for python3


Keywords
clustering
License
MIT
Install
pip install BootstrapCCpy==0.2.2

Documentation

BootstrapCCpy

The Bootstrap Consensus Clustering method is a faster and simpler implementation of the well known resampling-based method for class discovery and visualization developed by Monti et al. In particular, the BootstrapCCpy package diminishes the number of required parameters on the original implementation, that requires to define the proportion of items and/or features to sample in each iteration. In BootstrapCCpy, the item/feature sample is applied over a bootstrap technique diminishing the number of parameters and avoiding user specific parameter selection. Another drawback of the original implementation is its secuencial implementation, which make it impractical for Big Data Analytics approaches. The aim of this work is to improve a Pyhton library implementation, BootstrapCCpy, in order to reduce execution time by paralelizing critical secuencial steps, as well as the proposal of a bootstrap sampling approach that eliminates user defined parameters. It also provides visualization facilities out of the box, such as heatmaps.

Note

We have also developed a version in R: BootstrapCC

Getting started

Download this repository

git clone https://github.com/NNelo/BootstrapCCpy.git

Please check out dependencies section in case you are having trouble.

Import the library

from BootstrapCCpy import BootstrapCCpy as bcc

Instance Consensus Clustering

CC = bcc.BootstrapCCpy(cluster=clusteringAlgoritm, K=number, B=number, n_cores=number)

Please refer to method section for further explanation of the parameters.

Methods

constructor

BootstrapCCpy(cluster, K, B, n_cores)

Parameters

  • cluster

    The class of a clustering algorithm implementation (Mandatory)

    For example, you could head to scikit-learn to pick the one of your preference. Let's use KMeans and do it properly

      cluster=KMeans().__class__
    
  • K

    Positive Integer (Mandatory)

    Refers to the maximum number of clusters to try

    For example, if it's set to 4, the algorithm will process the data in 2, 3, and 4 clusters.

  • B

    Positive Integer (Mandatory)

    Amount of bootstrap samples to be performed by the algorithm for each cluster number.

  • n_cores

    Integer (Optional, default: -1)

    The number of CPU cores to be used by the algorithm to fit the data. If it's set to -1, all available cores will be used.

fit

fit(data, verbose)

Trains the algorithm with the provided data to discover the optimal number of clusters. This function can be called just once per object instance.

⚠️ Take into account that this method is CPU and memory intensive, it may take a long time to be completed. ⚠️

Parameters

  • data

    ndarray (Mandatory)

  • 🚧 verbose

    boolean (Optional, default: False)

    Determines if it should print messages when fitting

    This method is not completely developed, please refer to this issue

get_best_k

get_best_k()

This returns the optimal number of clusters discovered by analytical methods

Returns

  • k

    Positive Integer

plot_consensus_distribution

plot_consensus_distribution()

plot_consensus_heatmap

plot_consensus_heatmap()

predict

predict()

predict_data

predict_data(data)

get_areas

get_areas()

Tips

Dependencies: kneed

Next steps

Authors

*both authors must be considered as the first author