cocoa-system

COCOA: COrrelation COefficient-Aware Data Augmentation


License
Apache-2.0
Install
pip install cocoa-system==0.1.0

Documentation

COCOA

Using the system

The project contains three main Python files: index_generation.py, COCOA.py, and SBE.py. index_generation.py generates the inverted index and the order index from the tables stored in the DB. These index structures are used to find joinable external tables efficiently and to calculate the non-linear correlation between the external columns and the target column of the ML task. COCOA calculates the non-linear correlation in linear time. SBE.py contains methods that find the joinable tables using the inverted index generated by the "generate_inverted_index()" function in index_generation.py. The "enrich_SBE()" function enriches the given dataset according to the provided parameters. The "enrich_COCOA()" function in COCOA.py enriches the input dataset in the same way, except that AugX leverages the order index generated by the "generate_order_index()" function in index_generation.py.
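To make the role of the inverted index concrete, here is a minimal in-memory sketch of how such an index can be used to find joinable columns. It is not the code in index_generation.py or SBE.py: the names build_inverted_index and find_joinable_columns, the dictionary layout, and the toy tables are all assumptions made for illustration, and the real system keeps its index in the DB.

```python
from collections import Counter, defaultdict


def build_inverted_index(tables):
    """Map each cell value to the set of (table, column) pairs that contain it."""
    index = defaultdict(set)
    for table_name, columns in tables.items():
        for column_name, values in columns.items():
            for value in values:
                index[value].add((table_name, column_name))
    return index


def find_joinable_columns(index, query_values, top_n):
    """Rank external columns by how many distinct query values they contain."""
    overlap = Counter()
    for value in set(query_values):
        # .get avoids creating empty entries for values missing from the index.
        for location in index.get(value, ()):
            overlap[location] += 1
    return overlap.most_common(top_n)


# Hypothetical toy data: table name -> column name -> list of values.
tables = {
    "ratings": {"title": ["Alien", "Heat", "Up"], "stars": [5, 4, 5]},
    "budgets": {"film": ["Heat", "Big"], "usd_millions": [60, 18]},
}
index = build_inverted_index(tables)
matches = find_joinable_columns(index, ["Alien", "Heat", "Jaws"], top_n=2)
print(matches)
```

Each query value is looked up once, so the cost depends on the size of the query column and the length of the matching posting lists rather than on the total number of external tables.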

Call the functions in this order, so that the inverted index is stored in the DB before the later steps use it:

generate_inverted_index()

generate_order_index()

enrich_COCOA('movie', 'movie_title', 'imdb_score', 50, 10000)

The parameters of the enrich_COCOA function are, in order: the name of the dataset, the query column, the target column (the column that the ML model predicts), the number of columns to enrich the input dataset with, and the number of external tables to fetch and search for the best-correlating columns.
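The fourth parameter, the number of columns to keep, can be illustrated with a small sketch that ranks candidate columns by their correlation with the target and keeps the top k. This is only an illustration under stated assumptions: it uses Spearman's rank correlation as the non-linear measure, which the text above does not name, and it computes ranks by sorting, so it does not reproduce the linear-time computation that COCOA obtains from the order index. The helper names and the toy columns are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def top_k_columns(candidates, target, k):
    """Keep the k columns with the largest absolute rank correlation."""
    scored = [(name, abs(spearman(col, target))) for name, col in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical joined columns, already aligned row by row with the target.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "cubed": [1, 8, 27, 64, 125],     # monotone increasing
    "negated": [-1, -2, -3, -4, -5],  # monotone decreasing
    "zigzag": [3, 1, 4, 1, 5],        # weakly related
}
print(top_k_columns(candidates, target, k=2))
```

The absolute value is used so that strongly negative relationships rank as highly as strongly positive ones; "cubed" is non-linear in the target but still perfectly rank-correlated with it.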