SpaGoG

Sparse data classification using Graph of Graphs


Keywords
GoG, Missing values, Graphs
License
MIT
Install
pip install SpaGoG==0.26

Documentation

SpaGoG

An implementation of "SpaGoG: Graph of Graphs to classify tabular data with large fractions of missing data". SpaGoG (Sparse data classification using Graph of Graphs) is a model for classifying tabular data with high missing rates. It represents the tabular data as a graph of graphs and applies several graph-data classification techniques to classify the samples from different points of view. This implementation is written in Python 3.8 using PyTorch.

Scheme Figure

How to use?

Installation

SpaGoG source code is available as a PyPI package (see https://pypi.org/project/SpaGoG/):

pip install spagog

Usage Example

Given a training set (both train_X and train_y) and a test set (at least test_X), each of type pandas.DataFrame, SpaGoG can be run, for example, with the following code:

from spagog.gog_model import gog_model

test_y = gog_model(train_X=train_X, train_Y=train_Y, test_X=test_X, model="gc", verbosity=1, to_numpy=False, evaluate_metrics=False)

Argument List

Here are all the arguments accepted by spagog.gog_model.gog_model:

  • model: str

The SpaGoG model to run. Options: ["gc", "gnc", "gc+nc"].

  • train_X: pandas.DataFrame

The features of the training set.

  • train_y: pandas.DataFrame

The labels of the training set.

  • test_X: pandas.DataFrame

The features of the test set.

  • test_y: pandas.DataFrame

The labels of the test set. If set to None, evaluate_metrics should be set to False. Default: None.

  • val_X: pandas.DataFrame

The features of the validation set. If set to None, the validation set is split off from the training set with an 80:20 (train:validation) ratio. Default: None.

  • val_y: pandas.DataFrame

The labels of the validation set. Default: None.
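When val_X is None, the validation set is carved out of the training set. The following is a minimal sketch of an 80:20 split over row positions, assuming a shuffled, non-stratified split. The function name split_train_val and its seed parameter are hypothetical, and the package's own procedure may differ.

```python
import random


def split_train_val(n_rows, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) row positions for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_val = int(round(n_rows * val_fraction))  # 20% of the rows go to validation
    val_idx = sorted(indices[:n_val])
    train_idx = sorted(indices[n_val:])
    return train_idx, val_idx


train_idx, val_idx = split_train_val(10)
```

With pandas, these positions could then be applied as train_X.iloc[train_idx] and train_X.iloc[val_idx], and likewise for the labels.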

  • evaluate_metrics: bool

Whether to evaluate and return the accuracy score on the data sets. If set to True, the test_y argument must not be None. Default: True.
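The constraints stated for model, test_y and evaluate_metrics can be checked before a long run starts. The helper below is hypothetical (check_gog_args is not part of the package) and is a sketch based only on the argument descriptions in this list.

```python
VALID_MODELS = ("gc", "gnc", "gc+nc")  # options listed for the model argument


def check_gog_args(model, test_y=None, evaluate_metrics=True):
    """Raise ValueError if the arguments contradict the documented constraints."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    if evaluate_metrics and test_y is None:
        raise ValueError("evaluate_metrics=True requires test_y")


# Prediction-only run: no test labels, so metrics are switched off.
check_gog_args(model="gc", test_y=None, evaluate_metrics=False)
```

Because evaluate_metrics defaults to True, a call without test labels needs evaluate_metrics=False set explicitly, as in the usage example.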

  • dataset_name: str

The name of the data set to run, for a cleaner output text. Default: "".

  • feature_selection: int

Number of significant features to run the model on. The feature selection process is executed only if 1 <= feature_selection <= num_features. Default: 100.

  • edges: pandas.DataFrame

Edge list between the different samples (train, val and test), if there are any. If set to None, the edges are calculated as a K-Nearest-Neighbors graph. Default: None.
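As a rough illustration of a K-Nearest-Neighbors graph on data with missing values, the sketch below compares two samples only on the features observed in both and links each sample to its k closest neighbours. The distance measure, the value of k and the helper names are assumptions made for illustration; how the package itself builds the graph is not specified here.

```python
import math


def masked_distance(a, b):
    """Root-mean-square difference over features present (not None) in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common feature to compare on
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_edges(rows, k=1):
    """Return a directed edge list [(i, j), ...] from each row to its k nearest rows."""
    edges = []
    for i, row in enumerate(rows):
        others = [(masked_distance(row, other), j)
                  for j, other in enumerate(rows) if j != i]
        others.sort()  # closest first; ties broken by row index
        edges.extend((i, j) for _, j in others[:k])
    return edges


# Three made-up samples; None marks a missing value.
rows = [
    [1.0, None, 2.0],
    [1.1, 5.0, None],
    [9.0, 5.2, 2.1],
]
edges = knn_edges(rows, k=1)
```

A list like this could be wrapped in a pandas.DataFrame before being passed as edges; the column layout the package expects is not documented in this section.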

  • probs: bool

Whether to return soft labels for the test set predictions. Default: False.

  • to_numpy: bool

Whether to return the test set predictions as a numpy.ndarray. If set to False, the predictions will be of type torch.Tensor. Default: False.
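The relation between soft labels (probs=True) and hard predictions can be illustrated independently of the package: a soft label is a vector of per-class scores, and the hard label is the index of its largest entry. The values and the helper name below are made up for illustration.

```python
def hard_labels(soft_labels):
    """Convert per-class score rows into class indices via argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft_labels]


# Made-up soft labels for three test samples and two classes.
soft = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
preds = hard_labels(soft)  # ties resolve to the lowest class index
```

For predictions returned as a torch.Tensor, the equivalent step would be an argmax over the class dimension, assuming the soft labels are shaped (n_samples, n_classes).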

  • verbosity: int

Verbosity level of the running process. Set 0 for no output, 1 for evaluation metrics and a timing report, and 2 to track the full learning process. Options: [0, 1, 2]. Default: 0.