Text-Preprocessing-Toolkit

A package that automates text preprocessing


Keywords
text, preprocessing, toolkit, automation, python3
License
MIT
Install
pip install Text-Preprocessing-Toolkit==0.0.2

Documentation

Text Preprocessing Toolkit

This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.

Features

  • Lowercasing: Convert all text to lowercase.
  • Punctuation Removal: Remove punctuation marks from text.
  • Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
  • Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
  • Spell Correction: Correct misspelled words in the text.
  • URL and HTML Tag Removal: Clean URLs and HTML tags from text.
  • Special Character Removal: Remove non-alphanumeric characters.

Requirements

  • Python 3.8 or higher
  • flake8 for linting
  • pytest for testing
  • Any dependencies defined in requirements.txt

Installation

To install the package, clone the repository and install the necessary dependencies.

Clone the repository:

git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit

Install dependencies:

pip install -r requirements.txt

Alternatively, if you want to install the package globally:

pip install .

Usage

You can use this toolkit in your Python project by importing the preprocessing functions:

from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings"
])

print(cleaned_text)

Available Preprocessing Steps:

  • lowercase: Convert text to lowercase.
  • remove_punctuation: Remove punctuation characters.
  • remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
  • lemmatize_text: Lemmatize words (reduce to base form).
  • remove_special_characters: Remove special characters from text.
  • remove_url: Remove URLs from text.
  • remove_html_tags: Remove HTML tags.
  • correct_spellings: Correct common spelling mistakes.

Running Tests

This repository includes unit and integration tests using pytest. To run the tests:

  1. Install pytest if you haven't already:
pip install pytest
  1. Run the tests:
pytest

Tests are located in the tests/ directory.

Code Linting

This project uses flake8 for linting. To check the code for style issues:

flake8 text_preprocessing_toolkit

CI/CD

This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a new commit is pushed or a pull request is created to the main branch, the following steps will be automatically performed:

  • Linting: Code will be checked for style issues using flake8.
  • Testing: Unit tests will be run using pytest.
  • Build: The package will be built using python -m build.
  • Publish: The package will be uploaded to PyPI (if a release is created).

Contributing

We welcome contributions! If you'd like to contribute to the project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-name).
  3. Make your changes and commit them (git commit -m 'Add feature').
  4. Push to your forked repository (git push origin feature-name).
  5. Create a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Notes:

  • Replace the repository URL in the git clone command with your actual GitHub repository URL.
  • Update any project-specific features or configurations that might be necessary.

________________________________________________________________________________________________