zarnitsa package for data augmentation


Keywords
augmentation, NLP, distributions, data-science, ml, pandas, statistics
License
Apache-2.0
Install
pip install zarnitsa==0.0.18

Documentation

GitHub Wheel python

zarnitsa package

Zarnitsa package with data augmentation tools.

  • Internal data augmentation using existed data
  • External data augmentation setting known statistical distributions by yourself
  • NLP augmentation

Principal scheme of project (currently)

Screenshot_20210724_134741

Requirements

  • Python3
  • numpy
  • pandas
  • nlpaug
  • wget
  • scikt-learn

Installation

Install package using PyPI:

pip install zarnitsa

Or using actual github repo:

pip install git+https://github.com/AlexKay28/zarnitsa

Usage

Simple usage examples:

Augmentation internal.

This is type of augmentation which you may use in case of working with numerical features.

>>> from zarnitsa.DataAugmenterInternally import DataAugmenterInternally

>>> daug_comb = DataAugmenterInternally()
>>> aug_types = [
>>>     "normal",
>>>     "uniform",
>>>     "permutations",
>>> ]

>>> # pd Series object example
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> for aug_type in aug_types:
>>>     print(aug_type)
>>>     print(daug_comb.augment_column(s, freq=0.5, return_only_aug=True, aug_type=aug_type))
normal
7     9.958794
3     0.057796
0    -3.135995
6     7.197400
8    13.258087
dtype: float64
uniform
2    10.972232
8     5.335357
9     9.111281
5     5.964971
4    -0.210732
dtype: float64
permutations
4     6
5     4
9    10
3     3
2     5
dtype: int64

Augmentation NLP

This is type of augmentation which you may use in case of working with textual information.

>>> from zarnitsa.DataAugmenterNLP import DataAugmenterNLP

>>> daug = DataAugmenterNLP()

>>> text = "This is sentence example to augment. Thank you for interesting"

>>> daug.augment_column_wordnet(text)
'This be sentence example to augment. Thank you for concern'

>>> daug.augment_column_del(text, reps=1)
'This is sentence example to augment. you for interesting'

>>> daug.augment_column_permut(text, reps=1)
'This sentence is example to augment. Thank you for interesting'

Augmentation External

This is type of augmentation which you may use in case of working with distribution modeling having prior knowlege about it

Doing df[...] = np.nan we imitate data sparsness or misses which we try to fill up using augmentations

>>> size = 500
>>> serial_was = pd.Series(daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size))
>>> serial_new = copy(serial_was)![Screenshot_20210724_133433](https://user-images.githubusercontent.com/55444371/126865853-2e09b4dd-f864-43d8-9e63-741c3862d153.png)

>>> serial_new.loc[serial_new.sample(100).index] = None
>>> serial_new = daug.augment_column(serial_new, aug_type='normal', loc=0, scale=1)

>>> plt.figure(figsize=(12, 8))
>>> serial_was.hist(bins=100)
>>> serial_new.hist(bins=100)

Screenshot_20210724_133404 Screenshot_20210724_133433

>>> size=50
>>> df = pd.DataFrame({
>>>     'data1': daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size),
>>>     'data2': daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size),
>>> })
>>> for col in df.columns:
>>>     df[col].loc[df[col].sample(10).index] = None
>>> plt.figure(figsize=(12, 8))
>>> df.plot()
>>> daug.augment_dataframe(df, aug_type='normal', loc=0, scale=1).plot()

Screenshot_20210724_133635