This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.
- Lowercasing: Convert all text to lowercase.
- Punctuation Removal: Remove punctuation marks from text.
- Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
- Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
- Spell Correction: Correct misspelled words in the text.
- URL and HTML Tag Removal: Clean URLs and HTML tags from text.
- Special Character Removal: Remove non-alphanumeric characters.
- Python 3.8 or higher
-
flake8
for linting -
pytest
for testing - Any dependencies defined in
requirements.txt
To install the package, clone the repository and install the necessary dependencies.
git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit
pip install -r requirements.txt
Alternatively, if you want to install the package globally:
pip install .
You can use this toolkit in your Python project by importing the preprocessing functions:
from text_preprocessing_toolkit import processor
text = "Your sample text goes here!"
# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
"lowercase",
"remove_punctuation",
"remove_stopwords",
"lemmatize_text",
"remove_special_characters",
"remove_url",
"remove_html_tags",
"correct_spellings"
])
print(cleaned_text)
- lowercase: Convert text to lowercase.
- remove_punctuation: Remove punctuation characters.
- remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
- lemmatize_text: Lemmatize words (reduce to base form).
- remove_special_characters: Remove special characters from text.
- remove_url: Remove URLs from text.
- remove_html_tags: Remove HTML tags.
- correct_spellings: Correct common spelling mistakes.
This repository includes unit and integration tests using pytest
. To run the tests:
- Install
pytest
if you haven't already:
pip install pytest
- Run the tests:
pytest
Tests are located in the tests/
directory.
This project uses flake8
for linting. To check the code for style issues:
flake8 text_preprocessing_toolkit
This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a new commit is pushed or a pull request is created to the main
branch, the following steps will be automatically performed:
-
Linting: Code will be checked for style issues using
flake8
. -
Testing: Unit tests will be run using
pytest
. -
Build: The package will be built using
python -m build
. - Publish: The package will be uploaded to PyPI (if a release is created).
We welcome contributions! If you'd like to contribute to the project, please follow these steps:
- Fork the repository.
- Create a new branch (
git checkout -b feature-name
). - Make your changes and commit them (
git commit -m 'Add feature'
). - Push to your forked repository (
git push origin feature-name
). - Create a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Replace the repository URL in the
git clone
command with your actual GitHub repository URL. - Update any project-specific features or configurations that might be necessary.