FIST-Biclustering-Python

Implementation of FIST algorithm for generating Biclusters, Frequent Closed Patterns, Association Rules using Python

Abstract

Association rule mining and biclustering are two popular techniques in data mining that can be used to uncover interesting patterns and relationships in large datasets. However, these techniques are often computationally expensive and can be challenging to apply to large da-tasets. This paper presents a novel approach that combines association rule mining and bi-clustering using a suffix tree data structure. It is based on the frequent closed itemsets framework and requires a unique scan of the database. This data structure is used to reduce memory usage and improve the extraction efficiency, allowing parallel processing of the tree branches. Experimental results show that the proposed algorithm (Frequent Itemset Suffix Tree: FIST) is effective in uncovering meaningful patterns and relationships in large datasets and outperforms existing state-of-the-art algorithms in terms of efficiency and scalability.

USER GUIDE

In this section of the chapter, we will discuss how to use this python implementation of the algorithm. We will be briefly discussing the following topics:

Environment & python installation
Installation of external Libraries
Using the source code or the package
Transforming input dataset to suitable form
Integrating the dataset with the python program and generating outputs
Results

1. Environment & python installation:

I have used an windows PC with 64-bit operating system, x64-based processor for running the python program.

To run the program, a suitable python installation is required. I have used Python 3.10.6.

To install Python 3 on your machine, Visit the official Python website https://www.python.org and navigate to the Downloads section. Download a version of Python 3.10 or onwards.

2. Installation of external libraries:

In this project we are using 2 external libraries on top of default python installation:

Pandas
PyDot

We are using pandas for the useful functionalities it provides for handling CSV files and DataFrames. PyDot is an interface to GraphViz which helps in creating graph based diagrams using python script.

To install Pandas use: python -m pip install pandas

To install PyDot: python -m pip install pydot

You may additionally need to install a GraphViz driver for PyDot to work properly.

3. Using the source code or the package

The source code of the program can be cloned from this git repository.

Source Code: https://github.com/Mijanur08/biclusturing-using-suffixTree

The program is also uploaded in pypi.org as a python package. The latest version is 2023.5.31.3

Package link: https://pypi.org/project/biclustering/

Instead of cloning the git repository, the package can be directly installed using the following pip install command.

python -m pip install biclustering --user

4. Transforming Input Dataset into Suitable Form

The input dataset must be in csv file and have the following 2 attributes:

An ID attribute which is unique for each of the rows
An itemsets attribute which contains comma separated names or values – each name or value representing an item.

An example of the input dataset is –

Exampledata.csv

ID,Itemsets

O1,"P1,P3,A2,A4"

O2,"P1,P3,A1,A2"

O3,"P2,P5,A3"

O4,"P1,P3,P4,A2,A4"

O5,"P1,P2,P3,P5,A2,A4"

5. Integrating the dataset with our python program and generating output

Our python implementation has module named fist.py. This file acts as an interface between the user and the complete algorithmic process.

First import the class FIST from the fist module and initialize a FIST class.

If we are using the source code from git repository, we need to specify the fist.py file while importing FIST class.

from fist import FIST

fist = FIST()

Otherwise, if we are directly using the package after installing it by pip, then we need to specify the package name ‘biclustering’.

from biclustering import FIST

fist = FIST()

Next step is to call the fist.process function. This function has 9 parameters. The function signature and description of all parameters are given below-

def process(self, input_file_dir: str, input_dataset_name:str,

id_attribute: str, itemset_attribute: str, max_entries = 10000,

min_supp_percent=1.0, min_conf_percent=0.0,

min_supp_count_outputs = 1,produce_final_img=False)->none

Parameters
Input_file_dir	Datatype	String Literal
	Optional	No
	Default Value	NA
	Description	Path to the directory where the input CSV file (dataset) is located and the outputs will be generated in an output folder in that directory.
input_dataset_name	Datatype	String Literal
	Optional	No
	Default Value	NA
	Description	Filename of the input CSV file including its file extension
id_attribute	Datatype	String Literal
	Optional	No
	Default Value	NA
	Description	Name of the ID column of the dataset. The ID column must have unique value for each row.
itemset_attribute	Datatype	String Literal
	Optional	No
	Default Value	NA
	Description	Name of the Item List column of the dataset. This column should contain comma separated item names. These item names will form itemsets.
max_entries	Datatype	Integer
	Optional	Yes
	Default Value	10000
	Description	It specifies maximum number of rows are to be read in a dataset. If a dataset has total number of rows greater than max_entries, the first max_entries number of rows will be read during the execution.
min_support_percent	Datatype	Float
	Optional	Yes
	Default Value	1\.0
	Description	The provided percentage of the total number of rows will be used as minimum support count for constructing the number table. This value will also be embedded within the file name of generated files.
min_conf_percent	Datatype	Float
	Optional	Yes
	Default Value	0\.0
	Description	Minimum confidence filters the generated association rules. If confidence of an association rule is greater than min_conf_percent, only then it will be written in the output. This value will also be embedded within the file name of generated files.
min_supp_count_outputs	Datatype	Integer
	Optional	Yes
	Default Value	1
	Description	To extract the FCPs and Bi-clusters we can use a different support count than the previous minimum support. This parameter carries the value of minimum support count which is used for generating the outputs.
produce_final_img	Datatype	Boolean
	Optional	Yes
	Default Value	False
	Description	This parameter decides whether to generate the image of the suffix tree or not.

Here is the sample code for execution in python :

fist.process(“./”,”sample.csv”,”ID”,”Itemsets”,min_support_percent=30,min_conf_percent=40,produce_final_img=True)

6. Results

To avoid error during the execution, ensure that output folder exists in the input file directory. All the output files will be generated in this output folder. After execution, total 24 files including the image of the suffix tree is generated. They are as follows:

Number Table	NumberTable.dataset={dataset name}.minSupport={value}.csv
SFD	SFD.dataset={dataset name}.minSupport={value}.csv
Suffix Tree	suffixTree.dataset={dataset name}.minSupport={value}.csv
Suffix Tree	suffixTree.dataset={dataset name}.minSupport={value}.json
FCPs	FCP.dataset={dataset name}.minSupport={value}.csv
FCPs	FCP.dataset={dataset name}.minSupport={value}.json
Generators	Generators.dataset={dataset name}.minSupport={value}.csv
Generators	Generators.dataset={dataset name}.minSupport={value}.json
Bi-clusters	biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.csv
	biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.json
	Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.csv
	Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.json
Association Rules - Exact - PB - SB	rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
	rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json
	rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
	rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json

7. References

1. Kartick Chandra Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Ujjwal Maulik, and Sanghamitra Bandhopadyay: A New Approach for Association Rule Mining and Bi-clustering Using Formal Concept Analysis. (2012) [Link]

2. Katick Chandra Mondal: Algorithms for Data Mining and Bio-informatics. (2016) [Link]

biclustering
Release 2023.5.31.5

Release 2023.5.31.5

1.0.5

2023.5.31.5

2023.5.31.4

2023.5.31.3

2023.5.31.2

2023.5.31.1

2023.5.31.0

Documentation

FIST-Biclustering-Python

Stats

Development practices

Releases

Contributors

biclustering Release 2023.5.31.5

Release 2023.5.31.5 Toggle Dropdown 1.0.5 2023.5.31.5 2023.5.31.4 2023.5.31.3 2023.5.31.2 2023.5.31.1 2023.5.31.0

Documentation

FIST-Biclustering-Python

Stats

Development practices

Releases

Contributors

biclustering
Release 2023.5.31.5

Release 2023.5.31.5

1.0.5

2023.5.31.5

2023.5.31.4

2023.5.31.3

2023.5.31.2

2023.5.31.1

2023.5.31.0