corruptor

Scientific text noise corruption for language model pretraining


Keywords
corrupt, geco, personal-data
License
MIT
Install
pip install corruptor==0.0.0

Documentation

Corruptor


Want to realistically corrupt your (textual) data? Use Corruptor!

pip install corruptor

Supported types of corruption:

  • OCR variation
  • Phonetic variation
  • Typing error
  • Edit (insert, delete, replace, swap)

Getting started

Corruptor provides three classes: BasicCorruptor, ProbabilisticCorruptor, and DataFrameCorruptor.

BasicCorruptor

The basic corruptor provides one method per type of corruption, each using the default configuration.

>>> from corruptor import BasicCorruptor
>>> basic = BasicCorruptor()
>>> basic.ocr('johnson')
'johnst0n'
>>> basic.phonetic('johnson')
'johnzon'
>>> basic.typo('johnson')
'johhson'
>>> basic.insert('johnson')
'johnsson'
>>> basic.delete('johnson')
'jhnson'
>>> basic.replace('johnson')
'johnsin'
>>> basic.swap('johnson')
'johnsno'
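
To make the four edit operations (insert, delete, replace, swap) concrete, here is a minimal pure-Python sketch. It is not Corruptor's implementation: the function names, the lowercase-only alphabet, and the uniform choice of position are all assumptions made for illustration.

```python
import random
import string

# Illustrative helpers only; these names are hypothetical and not part of Corruptor.

def insert_char(word, rng):
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    return word[:pos] + rng.choice(string.ascii_lowercase) + word[pos:]

def delete_char(word, rng):
    """Remove the character at a random position."""
    if not word:
        return word
    pos = rng.randrange(len(word))
    return word[:pos] + word[pos + 1:]

def replace_char(word, rng):
    """Replace one character with a different random lowercase letter."""
    if not word:
        return word
    pos = rng.randrange(len(word))
    candidates = [c for c in string.ascii_lowercase if c != word[pos]]
    return word[:pos] + rng.choice(candidates) + word[pos + 1:]

def swap_chars(word, rng):
    """Swap two adjacent characters at a random position."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word) - 1)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]

rng = random.Random(0)  # seeded so repeated runs give the same corruptions
for op in (insert_char, delete_char, replace_char, swap_chars):
    print(op.__name__, op('johnson', rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which helps when a corrupted dataset has to be regenerated.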

ProbabilisticCorruptor

This class selects the type of corruption at random on each call, according to the weights you provide. In the example below, the 'none' key evidently leaves the input unchanged.

>>> from corruptor import ProbabilisticCorruptor
>>> prob = ProbabilisticCorruptor({'none': 0.33, 'phonetic': 0.33, 'typo': 0.33})
>>> prob.corrupt('conner')
'conner'
>>> prob.corrupt('conner')
'conneah'
>>> prob.corrupt('conner')
'conber'
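
Weighted selection of this kind can be sketched with the standard library's `random.choices`. This is an illustration of the idea only, not how ProbabilisticCorruptor is implemented; `pick_corruption` is a hypothetical helper.

```python
import random

def pick_corruption(weights, rng):
    """Return one key of `weights`, chosen with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]

weights = {'none': 0.33, 'phonetic': 0.33, 'typo': 0.33}
rng = random.Random(42)  # seeded for reproducibility

# Draw many times to check that each corruption type is picked about equally often.
counts = {name: 0 for name in weights}
for _ in range(3000):
    counts[pick_corruption(weights, rng)] += 1
print(counts)
```

`random.choices` treats the weights as relative, so in this sketch they do not need to sum to 1. Whether Corruptor itself requires normalized weights is not stated here.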

DataFrameCorruptor

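The DataFrameCorruptor randomly corrupts n rows of a pandas DataFrame. It is configured with a dictionary that maps each column name to a tuple of a number and a corruptor; the example below uses 0.5 and the `prob` instance created in the ProbabilisticCorruptor example.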

>>> import pandas as pd
>>> from corruptor import DataFrameCorruptor
>>> df = pd.DataFrame({'firstname': ['frank', 'john'], 'lastname': ['johnson', 'conner']})
>>> dfc = DataFrameCorruptor({'firstname': (0.5, prob), 'lastname': (0.5, prob)})
>>> dfc.corrupt(df, n=2)
  firstname lastname
0     frahk  johnson
1      john   conber
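
The row-sampling idea can be sketched with plain pandas. This is a simplified illustration under stated assumptions, not DataFrameCorruptor's implementation: `corrupt_rows` is a hypothetical helper, it ignores the per-column number from the configuration above, and `str.upper` stands in for a real corruption function.

```python
import pandas as pd

def corrupt_rows(df, column_funcs, n, seed=None):
    """Apply each column's function to n randomly chosen rows of a copy of df."""
    out = df.copy()
    # Pick n distinct rows; random_state makes the choice reproducible.
    rows = out.sample(n=n, random_state=seed).index
    for column, func in column_funcs.items():
        out.loc[rows, column] = out.loc[rows, column].map(func)
    return out

df = pd.DataFrame({'firstname': ['frank', 'john'], 'lastname': ['johnson', 'conner']})
# str.upper is a visible stand-in for a corruption function.
result = corrupt_rows(df, {'lastname': str.upper}, n=1, seed=0)
print(result)
```

Working on a copy leaves the original DataFrame untouched, so the clean and corrupted versions can be compared afterwards.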