Easy Text Augmenter

Easy Text Augmenter is a Python package for augmenting text data directly on your pandas dataframe using various NLP techniques. There are only 3 techniques for now :

augment_random_word
augment_random_character
augment_word_bert

Installation

!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()

How to use

augment_random_word

import pandas as pd
from easy_text_augmenter import augment_random_word

df = pd.DataFrame({
    'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
    'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result :

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5             Th is is a te st     A
6                 Another data     A
7   Does not newton like apple     B

augment_random_character

from easy_text_augmenter import augment_random_word

classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result :

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5               This is a estt     A
6            Another te8t data     A
7   Newtun d0e8 not like apple     B

augment_word_bert

from easy_text_augmenter import augment_word_bert

classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)

Result :

                                          text label
0                               This is a test     A
1                           Another test data      A
2                  Of course we need more data     B
3                   Newton does not like apple     B
4                     Hello world I am a human     A
5                         another test of data     A
6                      this term is not a test     A
7  newton does absolutely not like every apple     B

Authors

Contact me at :

Documentation

augment_random_word

Description:

The augment_random_word function augments a specified percentage of samples in given classes of a DataFrame by randomly applying one of three augmentation techniques (swap, delete, split) to the text column.

augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])

Parameters:

df (pandas.DataFrame): The input DataFrame containing the text data and labels.
classes_to_augment (list): A list of class labels that need to be augmented.
augmentation_percentage (float): The percentage of samples to augment from each specified class.
text_column (str): The name of the column in the DataFrame that contains the text data.
random_state (int, optional): A random seed used for specify which rows to augment. Default is 42.
weights (list, optional): A list of weights to determine the probability of selecting each augmentation type. Default is [0.5, 0.3, 0.2] for swap, delete, and split, respectively.

weights techniques :

swap: randomly swap word in text.
delete: randomly delete word in text.
split: randomly split word in text.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.

augment_random_character

Description:

The augment_random_character function performs random character-based augmentations on specific classes of text data within a DataFrame. It uses several augmentation techniques to randomly alter characters in the text, increasing the diversity of the dataset.

augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])

Parameters:

df (pd.DataFrame): The input DataFrame containing text data and their corresponding labels.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The percentage of samples in each class that should be augmented.
text_column (str): The column name in the DataFrame that contains the text data to be augmented.
random_state (int, optional): A random seed used for specify which rows to augment. Default is 42.
weights (list, optional): A list of weights for each augmentation technique, used to determine the probability of choosing each technique. Default is [0.2, 0.2, 0.2, 0.2, 0.2].

weights techniques :

aug_ocr: OCR-based augmentation.
aug_keyboard: Keyboard error simulation.
aug_insert: Random character insertion.
aug_swap: Random character swapping.
aug_delete: Random character deletion.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.

augment_word_bert

Description:

The augment_word_bert function augments text data in a DataFrame using a BERT-based word augmentation technique. It inserts or substitutes words in the specified text column for a given percentage of samples in the specified classes.

def augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])

Parameters:

df (pandas.DataFrame): The DataFrame containing the data to be augmented.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The percentage of samples within each class to augment (e.g., 0.2 for 20%).
text_column (str): The name of the column in the DataFrame that contains the text to be augmented.
model_path (str): The path to the pre-trained BERT model used for augmentation.
random_state (int, optional): A random seed used for specify which rows to augment. Default is 42.
weights (list, optional): The weights for choosing between the insertion and substitution augmentation techniques (default is [0.7, 0.3]).

Returns:

pandas.DataFrame: The original DataFrame with additional augmented samples.

easy-nlp-augmentation
Release 1.4

Release 1.4

1.4

1.3

1.2

1.1

Documentation

Easy Text Augmenter

Installation

How to use

augment_random_word

augment_random_character

augment_word_bert

Authors

Documentation

augment_random_word

augment_random_character

augment_word_bert

Stats

Development practices

Releases

Contributors

easy-nlp-augmentation Release 1.4

Release 1.4 Toggle Dropdown 1.4 1.3 1.2 1.1

Documentation

Easy Text Augmenter

Installation

How to use

augment_random_word

augment_random_character

augment_word_bert

Authors

Documentation

augment_random_word

augment_random_character

augment_word_bert

Stats

Development practices

Releases

Contributors

easy-nlp-augmentation
Release 1.4

Release 1.4

1.4

1.3

1.2

1.1