dforge

A package for efortless data manager.


License
MIT
Install
pip install dforge==2.1.5

Documentation

Data Forge

Data Forge is a Python package designed to provide a comprehensive suite of tools specialized in managing and analyzing data. It offers a user-friendly toolkit suitable for individuals of all levels of expertise. This initial release encompasses the following features:

  • Effortless reading of datafiles with support for Pandas-compatible extensions.
  • Simplified dataframe creation by specifying column names and corresponding values.
  • Automatic generation of hexadecimal and numeric identifiers.
  • Provision of common statistical analyses for specified columns.
  • Histogram plotting functionality.
  • Easy visualization of class distribution within the database.
  • Data standardization capabilities.
  • Feature discretization support.
  • Creation of fold columns for conducting stratified k-fold analyses.

Components

'PDBuilder' Class

This class enables the creation and manipulation of Pandas DataFrames. It supports the following functionalities:

  • Reading data from various file formats: CSV, Excel, JSON, Parquet, Feather, and Pickle.
  • Generating unique identifiers for rows.
  • Adding new data to the DataFrame.
  • Displaying the DataFrame.
  • Retrieving column names.
  • Printing specific columns.

'PDNumPro' Class

This class extends the PDBuilder class and provides additional functionalities for numerical analysis and preprocessing. It includes the following features:

  • Computing statistics for numerical columns.
  • Plotting histograms.
  • Analyzing data balance.
  • Standardizing numerical data.
  • Reconstructing data.
  • Converting non-numeric columns to numeric values.
  • Creating folds for cross-validation.

Dependencies

This toolkit requires the following dependencies:

  • pandas
  • numpy
  • matplotlib
  • scikit-learn

Installation

To install the dependencies, run:

pip install dforge

Usage

You can use this toolkit in your Python projects by importing the necessary classes and functions. Here's an example of how to use the PDBuilder class:

import dforge as df

# Create a DataFrame from data
data = [[1, 'A', 10], [2, 'B', 20], [3, 'C', 30]]
columns = ['ID', 'Category', 'Value']
data_pd = df.PDBuilder(data=data, columns=columns)

# Display DataFrame
data_pd.show_dataset()

# Add new data
new_data = [[4, 'D', 40], [5, 'E', 50]]
data_pd.add_data(new_data)

# Display DataFrame with added data
data_pd.show_dataset()

Work in progress...

Stay tuned for upcoming releases, which will incorporate the following enhancements:

  • Missing data completion functionality.
  • Conversion of dataframes into datasets suitable for machine learning inputs.
  • Introduction of PDTextAnalysis for handling features related to text manipulation.