Version: 0.0.1
Author: Gaurav Jaiswal
A Python toolkit for preprocessing text, designed to simplify NLP workflows. It provides utilities for stopword removal, punctuation handling, spell correction, lemmatization, stemming, and HTML and URL removal, so that raw text can be cleaned in a few calls.
- Remove Punctuation: Strips punctuation marks from text.
- Remove Stopwords: Removes common stopwords to reduce noise in textual data.
- Remove Special Characters: Cleans text by removing unnecessary symbols.
- Lowercase Conversion: Standardizes text to lowercase.
- Spell Correction: Identifies and corrects misspelled words.
- Lemmatization: Converts words to their base forms.
- Stemming: Reduces words to their root forms using a stemming algorithm.
- HTML Tag Removal: Cleans HTML tags from the text.
- URL Removal: Detects and removes URLs.
- Customizable Pipeline: Allows users to apply preprocessing steps in a specified order.
- Quick Dataset Preview: Provides a summary of text datasets, including word and character counts.
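To make a few of the features above concrete, here is a minimal standard-library sketch of HTML tag removal, URL removal, punctuation stripping, and stopword filtering. It is not the toolkit's own implementation: the helper names, the regular expressions, and the tiny stopword set are all assumptions chosen for illustration.

```python
import re
import string

# Hypothetical stand-ins for illustration; not the toolkit's own code.
TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) and www-style URLs
STOPWORDS = {"this", "is", "an", "a", "with", "the"}  # tiny illustrative list


def strip_html(text: str) -> str:
    """Remove HTML tags but keep the text between them."""
    return TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    """Remove URLs, including any punctuation glued to their end."""
    return URL_RE.sub("", text)


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_stopwords(text: str) -> str:
    """Keep only the words that are not in the stopword set."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


raw = "This is an <b>example</b> sentence with a URL: https://example.com."
cleaned = drop_stopwords(strip_punctuation(strip_urls(strip_html(raw))).lower())
print(cleaned)  # example sentence url
```

A production stopword list (for example, the one shipped with `nltk`) is much larger than the set used here, so real output will usually be shorter.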
Clone the repository or install the package using pip:
pip install Text_Preprocessing_Toolkit
from TPT import TPT
You can add custom stopwords during initialization:
tpt = TPT(custom_stopwords=["example", "custom"])
text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)
custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
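One plausible way a `steps` list like the one above could be dispatched is a registry that maps step names to functions and applies them in order. The sketch below assumes that design; `STEP_REGISTRY` and `run_pipeline` are hypothetical names, and the toolkit may handle unknown step names differently.

```python
import re
import string
from typing import Callable, Dict, List

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical registry: step name -> function taking and returning a string.
STEP_REGISTRY: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_punctuation": lambda text: text.translate(_PUNCT_TABLE),
    "remove_url": lambda text: re.sub(r"https?://\S+", "", text),
}


def run_pipeline(text: str, steps: List[str]) -> str:
    """Apply the named steps to `text` in the order given."""
    for name in steps:
        if name not in STEP_REGISTRY:
            # Failing loudly makes a misspelled step name easy to spot.
            raise ValueError(f"Unknown preprocessing step: {name!r}")
        text = STEP_REGISTRY[name](text)
    return text


print(run_pipeline("Hello, World!", ["lowercase", "remove_punctuation"]))  # hello world
```

Keeping the steps in a dictionary means a new step only needs one new entry, and the caller controls the order entirely through the list.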
texts = [
"This is a sample text.",
"Another <b>example</b> with HTML tags and a URL: https://example.com.",
"Spellngg errors corrected!",
]
tpt.head(texts, n=3)
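The kind of summary a dataset preview reports (word and character counts for the first `n` texts) can be sketched with the standard library alone. The `summarize` function and its output shape are assumptions for illustration; the actual `head` output format is not shown here.

```python
from typing import Dict, List, Union


def summarize(texts: List[str], n: int = 5) -> List[Dict[str, Union[str, int]]]:
    """Return one row per text with whitespace-split word and character counts."""
    rows = []
    for text in texts[:n]:
        rows.append(
            {
                "text": text,
                "word_count": len(text.split()),  # words separated by whitespace
                "char_count": len(text),          # every character, spaces included
            }
        )
    return rows


sample_texts = ["This is a sample text.", "Spellngg errors corrected!"]
rows = summarize(sample_texts, n=2)
for row in rows:
    print(row)
```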
| Method | Description |
|---|---|
| `remove_punctuation` | Removes punctuation from text. |
| `remove_stopwords` | Removes stopwords from text. |
| `remove_special_characters` | Cleans text by removing special characters. |
| `remove_url` | Removes URLs from the text. |
| `remove_html_tags` | Strips HTML tags from text. |
| `correct_spellings` | Corrects spelling mistakes in the text. |
| `lowercase` | Converts text to lowercase. |
| `lemmatize_text` | Lemmatizes text using WordNet. |
| `stem_text` | Applies stemming to reduce words to their root forms. |
| `preprocess` | Applies a series of preprocessing steps to the input text. |
| `head` | Displays a quick summary of a text dataset. |
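When combining these methods in a custom order, the order can change the result. The sketch below uses two hypothetical stand-in functions (`drop_urls` and `drop_punctuation`, not the toolkit's methods) to show why URL removal generally belongs before punctuation removal: once `://` and the dots are stripped, a URL pattern no longer matches.

```python
import re
import string


def drop_urls(text: str) -> str:
    """Stand-in URL remover based on a simple http(s) pattern."""
    return re.sub(r"https?://\S+", "", text)


def drop_punctuation(text: str) -> str:
    """Stand-in punctuation remover for ASCII punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


text = "Docs: https://example.com/guide now"

# URL removal first: the URL is matched and deleted cleanly.
url_first = " ".join(drop_punctuation(drop_urls(text)).split())

# Punctuation removal first: "://", "." and "/" vanish, so the URL pattern
# finds nothing and the mangled URL stays in the text.
punct_first = " ".join(drop_urls(drop_punctuation(text)).split())

print(url_first)    # Docs now
print(punct_first)  # Docs httpsexamplecomguide now
```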
Input: This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!

Output: sample text check spelling errors
- Python >= 3.8
- Libraries: `nltk`, `pandas`, `spellchecker`, `IPython`
To install the dependencies:
pip install -r requirements.txt
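After installing, a quick check like the one below can confirm that the Python version and the listed modules are available. It is only a sketch: `missing_modules` is a hypothetical helper, and the module names are taken from the list above as import names. The `spellchecker` module is commonly provided by the `pyspellchecker` distribution, but `requirements.txt` is the authority on the exact package names.

```python
import importlib.util
import sys
from typing import List

# Import names taken from the requirements list in this README.
REQUIRED_MODULES = ["nltk", "pandas", "spellchecker", "IPython"]


def missing_modules(names: List[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 8):
    print("Python 3.8 or newer is required; found", sys.version.split()[0])

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```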
Contributions are welcome! To contribute:
- Fork this repository.
- Clone your forked repository.
- Create a new branch for your feature.
- Make your changes, write tests, and ensure the code passes.
- Submit a pull request for review.
To test the package locally:
- Install development dependencies:
pip install pytest
- Run tests:
pytest
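For contributors writing tests, a pytest test file is just a module of `test_*` functions containing plain `assert` statements. The sketch below tests a local stand-in function so that it runs on its own; in a real contribution the stand-in would be replaced by the toolkit method under test, whose exact output should be checked rather than assumed.

```python
import string


def remove_punctuation(text: str) -> str:
    """Stand-in for the function under test (illustrative only)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def test_remove_punctuation_strips_marks():
    # Commas and exclamation marks should disappear; words and spaces stay.
    assert remove_punctuation("Hello, world!") == "Hello world"


def test_remove_punctuation_keeps_plain_text():
    # Text without punctuation should come back unchanged.
    assert remove_punctuation("plain text") == "plain text"


def test_remove_punctuation_handles_empty_string():
    assert remove_punctuation("") == ""
```

Saved as a file whose name starts with `test_`, these functions are collected automatically when `pytest` runs.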
This project is licensed under the MIT License. See the `LICENSE` file for details.
- Gaurav Jaiswal (GitHub)