Text Preprocessing Toolkit

This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.

Features

Lowercasing: Convert all text to lowercase.
Punctuation Removal: Remove punctuation marks from text.
Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
Spell Correction: Correct misspelled words in the text.
URL and HTML Tag Removal: Clean URLs and HTML tags from text.
Special Character Removal: Remove non-alphanumeric characters.

Requirements

Python 3.8 or higher
flake8 for linting
pytest for testing
Any dependencies defined in requirements.txt

Installation

To install the package, clone the repository and install the necessary dependencies.

Clone the repository:

git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit

Install dependencies:

pip install -r requirements.txt

Alternatively, if you want to install the package globally:

pip install .

Usage

You can use this toolkit in your Python project by importing the preprocessing functions:

from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings"
])

print(cleaned_text)

Available Preprocessing Steps:

lowercase: Convert text to lowercase.
remove_punctuation: Remove punctuation characters.
remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
lemmatize_text: Lemmatize words (reduce to base form).
remove_special_characters: Remove special characters from text.
remove_url: Remove URLs from text.
remove_html_tags: Remove HTML tags.
correct_spellings: Correct common spelling mistakes.

Running Tests

This repository includes unit and integration tests using pytest. To run the tests:

Install pytest if you haven't already:

pip install pytest

Run the tests:

pytest

Tests are located in the tests/ directory.

Code Linting

This project uses flake8 for linting. To check the code for style issues:

flake8 text_preprocessing_toolkit

CI/CD

This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a new commit is pushed or a pull request is created to the main branch, the following steps will be automatically performed:

Linting: Code will be checked for style issues using flake8.
Testing: Unit tests will be run using pytest.
Build: The package will be built using python -m build.
Publish: The package will be uploaded to PyPI (if a release is created).

Contributing

We welcome contributions! If you'd like to contribute to the project, please follow these steps:

Fork the repository.
Create a new branch (git checkout -b feature-name).
Make your changes and commit them (git commit -m 'Add feature').
Push to your forked repository (git push origin feature-name).
Create a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Notes:

Replace the repository URL in the git clone command with your actual GitHub repository URL.
Update any project-specific features or configurations that might be necessary.

________________________________________________________________________________________________

Text-Preprocessing-Toolkit
Release 0.0.2

Release 0.0.2

0.0.2

0.0.1

Documentation

Text Preprocessing Toolkit

Features

Requirements

Installation

Clone the repository:

Install dependencies:

Usage

Available Preprocessing Steps:

Running Tests

Code Linting

CI/CD

Contributing

License

Notes:

________________________________________________________________________________________________

Stats

Development practices

Releases

Contributors

Text-Preprocessing-Toolkit Release 0.0.2

Release 0.0.2 Toggle Dropdown 0.0.2 0.0.1

Documentation

Text Preprocessing Toolkit

Features

Requirements

Installation

Clone the repository:

Install dependencies:

Usage

Available Preprocessing Steps:

Running Tests

Code Linting

CI/CD

Contributing

License

Notes:

________________________________________________________________________________________________

Stats

Development practices

Releases

Contributors

Text-Preprocessing-Toolkit
Release 0.0.2

Release 0.0.2

0.0.2

0.0.1