Text Preprocessing Toolkit (TPT)

Version: 0.0.1
Author: Gaurav Jaiswal
A Python toolkit for cleaning and preprocessing text in NLP workflows. It provides utilities for stopword removal, punctuation handling, spell correction, lemmatization, stemming, and HTML tag and URL removal.

Features

Remove Punctuation: Strips punctuation marks from text.
Remove Stopwords: Removes common stopwords to reduce noise in textual data.
Remove Special Characters: Cleans text by removing unnecessary symbols.
Lowercase Conversion: Standardizes text to lowercase.
Spell Correction: Identifies and corrects misspelled words.
Lemmatization: Converts words to their base forms.
Stemming: Reduces words to their root forms using a stemming algorithm.
HTML Tag Removal: Cleans HTML tags from the text.
URL Removal: Detects and removes URLs.
Customizable Pipeline: Allows users to apply preprocessing steps in a specified order.
Quick Dataset Preview: Provides a summary of text datasets, including word and character counts.
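
To make a few of the features above concrete, here is a minimal standard-library sketch of HTML tag removal, URL removal, and punctuation stripping. The helper names and regular expressions are assumptions chosen for illustration; they are not taken from the toolkit's source, and its own implementations may differ.

```python
import re
import string

# Hypothetical patterns for illustration; the toolkit may use different ones.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def strip_html_tags(text: str) -> str:
    """Drop anything that looks like an HTML tag, keeping the inner text."""
    return HTML_TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    """Remove http(s) and www-style URLs."""
    return URL_RE.sub("", text)


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


sample = "Visit <b>our site</b> at https://example.com today!"
cleaned = strip_punctuation(strip_urls(strip_html_tags(sample)))
result = " ".join(cleaned.split())  # collapse the gap left by the removed URL
print(result)
```

In this sketch the URL is removed before punctuation. Reversing the order would strip the `:`, `/`, and `.` characters first, so the URL pattern would no longer match.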

Installation

Clone the repository or install the package using pip:

pip install Text_Preprocessing_Toolkit

Usage

Import the Package

from TPT import TPT

Initialize the Toolkit

You can add custom stopwords during initialization:

tpt = TPT(custom_stopwords=["example", "custom"])
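
One plausible way custom stopwords could combine with a built-in list is a simple set union. The sketch below assumes exactly that; `BASE_STOPWORDS` is a tiny stand-in list and `remove_stopwords` here is a hypothetical standalone function, not the toolkit's method.

```python
# Tiny stand-in list for illustration; a real stopword list is much larger.
BASE_STOPWORDS = {"a", "an", "is", "the", "this", "with"}


def remove_stopwords(text, custom_stopwords=None):
    """Drop tokens found in the base list or the custom list (case-insensitive)."""
    stopwords = BASE_STOPWORDS | {w.lower() for w in (custom_stopwords or [])}
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


print(remove_stopwords("This is an example sentence"))
print(remove_stopwords("This is an example sentence", custom_stopwords=["example", "custom"]))
```

The second call also drops "example" because it was supplied as a custom stopword.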

Preprocess Text with Default Pipeline

text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)

Customize Preprocessing Steps

custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
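
The idea of applying named steps in a given order can be sketched as a lookup from step name to function. This is an illustration under assumptions, not the toolkit's implementation: `STEP_REGISTRY`, `run_pipeline`, and the small `STOPWORDS` set are hypothetical names introduced here.

```python
import string
from typing import Callable, Dict, List

# Stand-in stopword list, for illustration only.
STOPWORDS = {"a", "an", "is", "this", "with"}

# Map each step name to a function that takes and returns a string.
STEP_REGISTRY: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "remove_stopwords": lambda t: " ".join(
        w for w in t.split() if w not in STOPWORDS
    ),
}


def run_pipeline(text: str, steps: List[str]) -> str:
    """Apply the named steps to the text, in the order given."""
    for name in steps:
        if name not in STEP_REGISTRY:
            raise ValueError(f"Unknown preprocessing step: {name}")
        text = STEP_REGISTRY[name](text)
    return text


print(run_pipeline(
    "This is a Sample, with punctuation!",
    ["lowercase", "remove_punctuation", "remove_stopwords"],
))
```

Order matters in this sketch: the stopword check is case-sensitive, so running `remove_stopwords` before `lowercase` would leave "This" in place.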

Quick Dataset Summary

texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
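
The output format of `head` is not shown here. As a rough sketch of the kind of summary described (word and character counts for the first `n` texts), the hypothetical `summarize` function below uses only plain Python. It assumes words are whitespace-separated and that the character count includes spaces and punctuation.

```python
def summarize(texts, n=5):
    """Return word and character counts for the first n texts."""
    rows = []
    for text in texts[:n]:
        rows.append({
            "text": text,
            "word_count": len(text.split()),  # whitespace-separated tokens
            "char_count": len(text),          # includes spaces and punctuation
        })
    return rows


summary = summarize(["This is a sample text.", "Spellngg errors corrected!"], n=2)
for row in summary:
    print(row)
```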

Available Methods

| Method | Description |
| --- | --- |
| `remove_punctuation` | Removes punctuation from text. |
| `remove_stopwords` | Removes stopwords from text. |
| `remove_special_characters` | Cleans text by removing special characters. |
| `remove_url` | Removes URLs from the text. |
| `remove_html_tags` | Strips HTML tags from text. |
| `correct_spellings` | Corrects spelling mistakes in the text. |
| `lowercase` | Converts text to lowercase. |
| `lemmatize_text` | Lemmatizes text using WordNet. |
| `stem_text` | Applies stemming to reduce words to their root forms. |
| `preprocess` | Applies a series of preprocessing steps to the input text. |
| `head` | Displays a quick summary of a text dataset. |

Example Output

Input

This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!

Output (Default Pipeline)

sample text check spelling errors

Requirements

Python >= 3.8
Libraries: nltk, pandas, spellchecker, IPython

To install the dependencies:

pip install -r requirements.txt

Contributing

Contributions are welcome! To contribute:

Fork this repository.
Clone your forked repository.
Create a new branch for your feature.
Make your changes, write tests, and ensure the code passes.
Submit a pull request for review.

Testing

To test the package locally:

Install development dependencies:
```
pip install pytest
```
Run tests:
```
pytest
```
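
As a sketch of what a test module picked up by `pytest` might look like, the example below tests a stand-in `remove_url` function defined in the same file. The stand-in and the test names are assumptions for illustration; a real test would import the method from the package instead.

```python
import re


def remove_url(text: str) -> str:
    """Stand-in for the toolkit method of the same name (illustrative only)."""
    return re.sub(r"https?://\S+", "", text).strip()


def test_remove_url_strips_link():
    assert remove_url("Docs: https://example.com") == "Docs:"


def test_remove_url_leaves_plain_text_alone():
    assert remove_url("no links here") == "no links here"
```

By default, pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`.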

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Gaurav Jaiswal
