This project uses machine learning to infer and classify data types from raw values. It is designed for data preprocessing and analysis pipelines, where it automates the tedious and error-prone task of identifying data types by hand.
To get started quickly, install the package and follow the notebook below.
Important
An openai_api_key is required to run L2 model inference.
- The pipeline has two models. The L1 model (a classifier) identifies the base datatypes: integer, float, alphanumeric, range_type, date & time, open_ended_text, and close_ended_text.
- The L2 model refines the L1 result. Values classified as integer or float are further labeled measure, dimension, or unknown (if they cannot be classified) using an LLM. Values classified as date & time are mapped to one of 41 date-time formats, such as YYYY-MM-DDTHH:MM:SS, YYYY/MM/DD, or MM-DD-YYYY HH:MM AM/PM, using RegEx.
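The regex step of the L2 stage can be illustrated with a minimal sketch. This is not the package's actual implementation or API: the names DATE_FORMAT_PATTERNS, detect_date_format, and l2_strategy are hypothetical, and only the three formats named above are covered rather than all 41.

```python
import re

# Hypothetical subset of the date-time formats; the real tool covers 41.
DATE_FORMAT_PATTERNS = {
    "YYYY-MM-DDTHH:MM:SS": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
    "YYYY/MM/DD": re.compile(r"\d{4}/\d{2}/\d{2}"),
    "MM-DD-YYYY HH:MM AM/PM": re.compile(r"\d{2}-\d{2}-\d{4} \d{2}:\d{2} (?:AM|PM)"),
}


def detect_date_format(values):
    """Return the first format label that fully matches every non-empty value.

    Returns None when the input is empty or no single pattern fits all values.
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    for label, pattern in DATE_FORMAT_PATTERNS.items():
        if all(pattern.fullmatch(v) for v in cleaned):
            return label
    return None


def l2_strategy(l1_type):
    """Assumed routing: which L2 technique applies to a given L1 datatype."""
    if l1_type in ("integer", "float"):
        return "llm"    # measure / dimension / unknown
    if l1_type == "date & time":
        return "regex"  # date-time format detection
    return None         # other L1 types are not refined further


if __name__ == "__main__":
    print(l2_strategy("date & time"))
    print(detect_date_format(["2021-03-04T10:20:30", "2020-01-01T00:00:00"]))
```

Requiring a full match on every value keeps the sketch conservative: a sample with mixed formats yields None rather than a guess. These patterns only check the shape of a value, so they do not reject impossible dates such as month 13.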
Binary installers for the latest released version are available from the Python Package Index (PyPI).
# PyPI
> pip install legoai
Note
Source Ecommerce: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Total Tables: 9, Total Columns: 52
Source Healthcare: https://mitre.box.com/shared/static/aw9po06ypfb9hrau4jamtvtz0e5ziucz.zip
Total Tables: 18, Total Columns: 249
The project is released under the MIT License.
Contributions to this project are welcome. To contribute, follow the steps below:
- Fork the repository.
- Create a new branch named feature/* (git checkout -b feature/<name>)
- Make your changes.
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/<name>)
- Create a new Pull Request.