biclustering

Biclustering and association rule mining using Suffix Tree and Frequent Closed Itemset based algorithm.


License
MIT
Install
pip install biclustering==2023.5.31.5

Documentation

FIST-Biclustering-Python

Implementation of FIST algorithm for generating Biclusters, Frequent Closed Patterns, Association Rules using Python

Abstract

Association rule mining and biclustering are two popular techniques in data mining that can be used to uncover interesting patterns and relationships in large datasets. However, these techniques are often computationally expensive and can be challenging to apply to large da-tasets. This paper presents a novel approach that combines association rule mining and bi-clustering using a suffix tree data structure. It is based on the frequent closed itemsets framework and requires a unique scan of the database. This data structure is used to reduce memory usage and improve the extraction efficiency, allowing parallel processing of the tree branches. Experimental results show that the proposed algorithm (Frequent Itemset Suffix Tree: FIST) is effective in uncovering meaningful patterns and relationships in large datasets and outperforms existing state-of-the-art algorithms in terms of efficiency and scalability.

USER GUIDE

In this section of the chapter, we will discuss how to use this python implementation of the algorithm. We will be briefly discussing the following topics:

  • Environment & python installation
  • Installation of external Libraries
  • Using the source code or the package
  • Transforming input dataset to suitable form
  • Integrating the dataset with the python program and generating outputs
  • Results

1. Environment & python installation:

I have used an windows PC with 64-bit operating system, x64-based processor for running the python program.

To run the program, a suitable python installation is required. I have used Python 3.10.6.

To install Python 3 on your machine, Visit the official Python website https://www.python.org and navigate to the Downloads section. Download a version of Python 3.10 or onwards.

2. Installation of external libraries:

In this project we are using 2 external libraries on top of default python installation:

  • Pandas
  • PyDot

We are using pandas for the useful functionalities it provides for handling CSV files and DataFrames. PyDot is an interface to GraphViz which helps in creating graph based diagrams using python script.

To install Pandas use: python -m pip install pandas

To install PyDot: python -m pip install pydot

You may additionally need to install a GraphViz driver for PyDot to work properly.

3. Using the source code or the package

The source code of the program can be cloned from this git repository.

Source Code: https://github.com/Mijanur08/biclusturing-using-suffixTree

The program is also uploaded in pypi.org as a python package. The latest version is 2023.5.31.3

Package link: https://pypi.org/project/biclustering/

Instead of cloning the git repository, the package can be directly installed using the following pip install command.

python -m pip install biclustering --user

4. Transforming Input Dataset into Suitable Form

The input dataset must be in csv file and have the following 2 attributes:

  • An ID attribute which is unique for each of the rows
  • An itemsets attribute which contains comma separated names or values ā€“ each name or value representing an item.

An example of the input dataset is ā€“

Exampledata.csv

ID,Itemsets

O1,"P1,P3,A2,A4"

O2,"P1,P3,A1,A2"

O3,"P2,P5,A3"

O4,"P1,P3,P4,A2,A4"

O5,"P1,P2,P3,P5,A2,A4"

5. Integrating the dataset with our python program and generating output

Our python implementation has module named fist.py. This file acts as an interface between the user and the complete algorithmic process.

First import the class FIST from the fist module and initialize a FIST class.

If we are using the source code from git repository, we need to specify the fist.py file while importing FIST class.

from fist import FIST

fist = FIST()

Otherwise, if we are directly using the package after installing it by pip, then we need to specify the package name ā€˜biclusteringā€™.

from biclustering import FIST

fist = FIST()

Next step is to call the fist.process function. This function has 9 parameters. The function signature and description of all parameters are given below-

def process(self, input_file_dir: str, input_dataset_name:str,

id_attribute: str, itemset_attribute: str, max_entries = 10000,

min_supp_percent=1.0, min_conf_percent=0.0,

min_supp_count_outputs = 1,produce_final_img=False)->none

Parameters
Input_file_dir Datatype String Literal
Optional No
Default Value NA
Description Path to the directory where the input CSV file (dataset) is located and the outputs will be generated in an output folder in that directory.
input_dataset_name Datatype String Literal
Optional No
Default Value NA
Description Filename of the input CSV file including its file extension
id_attribute Datatype String Literal
Optional No
Default Value NA
Description Name of the ID column of the dataset. The ID column must have unique value for each row.
itemset_attribute Datatype String Literal
Optional No
Default Value NA
Description Name of the Item List column of the dataset. This column should contain comma separated item names. These item names will form itemsets.
max_entries Datatype Integer
Optional Yes
Default Value 10000
Description It specifies maximum number of rows are to be read in a dataset. If a dataset has total number of rows greater than max_entries, the first max_entries number of rows will be read during the execution.
min_support_percent Datatype Float
Optional Yes
Default Value 1\.0
Description

The provided percentage of the total number of rows will be used as minimum support count for constructing the number table.

This value will also be embedded within the file name of generated files.

min_conf_percent Datatype Float
Optional Yes
Default Value 0\.0
Description

Minimum confidence filters the generated association rules. If confidence of an association rule is greater than min_conf_percent, only then it will be written in the output.

This value will also be embedded within the file name of generated files.

min_supp_count_outputs Datatype Integer
Optional Yes
Default Value 1
Description To extract the FCPs and Bi-clusters we can use a different support count than the previous minimum support. This parameter carries the value of minimum support count which is used for generating the outputs.
produce_final_img Datatype Boolean
Optional Yes
Default Value False
Description This parameter decides whether to generate the image of the suffix tree or not.

Here is the sample code for execution in python :

fist.process(ā€œ./ā€,ā€sample.csvā€,ā€IDā€,ā€Itemsetsā€,min_support_percent=30,min_conf_percent=40,produce_final_img=True)

6. Results

To avoid error during the execution, ensure that output folder exists in the input file directory. All the output files will be generated in this output folder. After execution, total 24 files including the image of the suffix tree is generated. They are as follows:

Number Table NumberTable.dataset={dataset name}.minSupport={value}.csv
SFD SFD.dataset={dataset name}.minSupport={value}.csv
Suffix Tree suffixTree.dataset={dataset name}.minSupport={value}.csv
suffixTree.dataset={dataset name}.minSupport={value}.json
FCPs FCP.dataset={dataset name}.minSupport={value}.csv
FCP.dataset={dataset name}.minSupport={value}.json
Generators Generators.dataset={dataset name}.minSupport={value}.csv
Generators.dataset={dataset name}.minSupport={value}.json
Bi-clusters biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.csv
biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.json
Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.csv
Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.json

Association Rules

- Exact

- PB

- SB

rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json
rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json

7. References

1. Kartick Chandra Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Ujjwal Maulik, and Sanghamitra Bandhopadyay: A New Approach for Association Rule Mining and Bi-clustering Using Formal Concept Analysis. (2012) [Link]

2. Katick Chandra Mondal: Algorithms for Data Mining and Bio-informatics. (2016) [Link]