Text-Preprocessing-Toolkit

A package that automates text preprocessing


Keywords
text, preprocessing, toolkit, automation, python3
License
MIT
Install
pip install Text-Preprocessing-Toolkit==0.0.1

Documentation

Text Preprocessing Toolkit (TPT)

Version: 0.0.1
Author: Gaurav Jaiswal
A comprehensive Python toolkit for preprocessing text, designed to simplify NLP workflows. This package provides various utilities like stopword removal, punctuation handling, spell-checking, lemmatization, and more to clean and preprocess text effectively.


Features

  • Remove Punctuation: Strips punctuation marks from text.
  • Remove Stopwords: Removes common stopwords to reduce noise in textual data.
  • Remove Special Characters: Cleans text by removing unnecessary symbols.
  • Lowercase Conversion: Standardizes text to lowercase.
  • Spell Correction: Identifies and corrects misspelled words.
  • Lemmatization: Converts words to their base forms.
  • Stemming: Reduces words to their root forms using a stemming algorithm.
  • HTML Tag Removal: Cleans HTML tags from the text.
  • URL Removal: Detects and removes URLs.
  • Customizable Pipeline: Allows users to apply preprocessing steps in a specified order.
  • Quick Dataset Preview: Provides a summary of text datasets, including word and character counts.

Installation

Clone the repository or install the package using pip:

pip install Text_Preprocessing_Toolkit

Usage

Import the Package

from TPT import TPT

Initialize the Toolkit

You can add custom stopwords during initialization:

tpt = TPT(custom_stopwords=["example", "custom"])

Preprocess Text with Default Pipeline

text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)

Customize Preprocessing Steps

custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)

Quick Dataset Summary

texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)

Available Methods

Method Description
remove_punctuation Removes punctuation from text.
remove_stopwords Removes stopwords from text.
remove_special_characters Cleans text by removing special characters.
remove_url Removes URLs from the text.
remove_html_tags Strips HTML tags from text.
correct_spellings Corrects spelling mistakes in the text.
lowercase Converts text to lowercase.
lemmatize_text Lemmatizes text using WordNet.
stem_text Applies stemming to reduce words to their root forms.
preprocess Applies a series of preprocessing steps to the input text.
head Displays a quick summary of a text dataset.

Example Output

Input

This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!

Output (Default Pipeline)

sample text check spelling errors

Requirements

  • Python >= 3.8
  • Libraries: nltk, pandas, spellchecker, IPython

To install the dependencies:

pip install -r requirements.txt

Contributing

Contributions are welcome! To contribute:

  1. Fork this repository.
  2. Clone your forked repository.
  3. Create a new branch for your feature.
  4. Make your changes, write tests, and ensure the code passes.
  5. Submit a pull request for review.

Testing

To test the package locally:

  1. Install development dependencies:
    pip install pytest
  2. Run tests:
    pytest

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author