Data Forge is a Python package designed to provide a comprehensive suite of tools specialized in managing and analyzing data. It offers a user-friendly toolkit suitable for individuals of all levels of expertise. This initial release encompasses the following features:
- Effortless reading of datafiles with support for Pandas-compatible extensions.
- Simplified dataframe creation by specifying column names and corresponding values.
- Automatic generation of hexadecimal and numeric identifiers.
- Provision of common statistical analyses for specified columns.
- Histogram plotting functionality.
- Easy visualization of class distribution within the database.
- Data standardization capabilities.
- Feature discretization support.
- Creation of fold columns for conducting stratified k-fold analyses.
This class enables the creation and manipulation of Pandas DataFrames. It supports the following functionalities:
- Reading data from various file formats: CSV, Excel, JSON, Parquet, Feather, and Pickle.
- Generating unique identifiers for rows.
- Adding new data to the DataFrame.
- Displaying the DataFrame.
- Retrieving column names.
- Printing specific columns.
This class extends the PDBuilder class and provides additional functionalities for numerical analysis and preprocessing. It includes the following features:
- Computing statistics for numerical columns.
- Plotting histograms.
- Analyzing data balance.
- Standardizing numerical data.
- Reconstructing data.
- Converting non-numeric columns to numeric values.
- Creating folds for cross-validation.
This toolkit requires the following dependencies:
- pandas
- numpy
- matplotlib
- scikit-learn
To install the dependencies, run:
pip install dforge
You can use this toolkit in your Python projects by importing the necessary classes and functions. Here's an example of how to use the PDBuilder class:
import dforge as df
# Create a DataFrame from data
data = [[1, 'A', 10], [2, 'B', 20], [3, 'C', 30]]
columns = ['ID', 'Category', 'Value']
data_pd = df.PDBuilder(data=data, columns=columns)
# Display DataFrame
data_pd.show_dataset()
# Add new data
new_data = [[4, 'D', 40], [5, 'E', 50]]
data_pd.add_data(new_data)
# Display DataFrame with added data
data_pd.show_dataset()
Stay tuned for upcoming releases, which will incorporate the following enhancements:
- Missing data completion functionality.
- Conversion of dataframes into datasets suitable for machine learning inputs.
- Introduction of PDTextAnalysis for handling features related to text manipulation.