Using Machine Learning to learn how to Compress


License: MIT
Install: pip install shrynk==0.2.21


You can read the introductory blog post or try it live at https://shrynk.ai

Features

  • ✓ Compresses your data smartly using Machine Learning
  • ✓ Takes user requirements in the form of weights for size, write_time and read_time
  • ✓ Trains & caches a model based on compression methods available in the system, using packaged data
  • CLI for compressing and decompressing
  • ✓ Works with CSV, JSON and Bytes in general

CLI

shrynk compress myfile.json       # will yield e.g. myfile.json.gz or myfile.json.bz2
shrynk decompress myfile.json.gz  # will yield myfile.json

shrynk compress myfile.csv --size 0 --write 1 --read 0

shrynk benchmark myfile.csv                  # shows benchmark results
shrynk benchmark --predict myfile.csv        # will also show the current prediction
shrynk benchmark --save --predict myfile.csv # will add the result to the training data too
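
To make the three benchmark dimensions (size, write_time, read_time) concrete, here is a minimal sketch of such a measurement using only the Python standard library. It is not shrynk's implementation: the `benchmark` function, the `CODECS` table and the choice of codecs are assumptions made for illustration, and shrynk's actual candidates and timing method may differ.

```python
import bz2
import gzip
import lzma
import time

# Standard-library codecs only (an assumption for this sketch).
CODECS = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> dict:
    """Return {codec: {"size", "write_time", "read_time"}} for in-memory bytes."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        # Time compression (stands in for "write").
        start = time.perf_counter()
        packed = compress(data)
        write_time = time.perf_counter() - start

        # Time decompression (stands in for "read").
        start = time.perf_counter()
        unpacked = decompress(packed)
        read_time = time.perf_counter() - start

        if unpacked != data:
            raise ValueError(f"{name} round trip was not lossless")
        results[name] = {
            "size": len(packed),
            "write_time": write_time,
            "read_time": read_time,
        }
    return results


if __name__ == "__main__":
    sample = b"a,b\n" + b"1,2\n" * 10_000
    for name, row in benchmark(sample).items():
        print(name, row)
```

The timings vary from run to run and from machine to machine, which is one reason to collect benchmark results on the system where the files will actually be written and read.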

Usage

Installation:

pip install shrynk

Then in Python:

import pandas as pd
from shrynk import save, load

# save dataframe compressed
my_df = pd.DataFrame({"a": [1]})
file_path = save(my_df, "mypath.csv")
# e.g. mypath.csv.bz2

# load compressed file
loaded_df = load(file_path)
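
How `load` recognises the compression of a saved file is not described here. One plausible approach, shown purely as an assumption, is to dispatch on the file extension that `save` appended. The sketch below does this for raw bytes with the standard library; `read_bytes` and `OPENERS` are hypothetical names, not part of shrynk.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Map a file suffix to the function that opens it (hypothetical table).
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def read_bytes(path) -> bytes:
    """Read a file, decompressing it if its suffix names a known codec."""
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)  # fall back to a plain file
    with opener(str(path), "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "mypath.csv.bz2"
    with bz2.open(str(target), "wb") as handle:
        handle.write(b"a\n1\n")
    print(read_bytes(target))
```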

If you only want the prediction, without writing a file, use infer:

import pandas as pd
from shrynk import infer

infer(pd.DataFrame({"a": [1]}))
# {"engine": "csv", "compression": "bz2"}

Add your own data

For more control, you can run the benchmarks on your own data and retrain the model with your own weights:

import pandas as pd
from shrynk import PandasCompressor

df = pd.DataFrame({"a": [1, 2, 3]})

pdc = PandasCompressor("default")
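pdc.run_benchmarks(df)  # benchmarks df and adds the results to the "default" data

# weights: size counts three times as much as write time and read time
pdc.train_model(size=3, write=1, read=1)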

pdc.predict(df)
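
To illustrate what weights such as size=3, write=1, read=1 can mean, here is a minimal sketch that turns benchmark results into a single choice. It assumes each metric is standardised to z-scores, multiplied by its weight and summed, with the lowest total winning. That scoring rule is an assumption, not something this README confirms, and `pick_best` and the `example` numbers are hypothetical (the numbers are made up, not measured).

```python
from statistics import mean, pstdev


def pick_best(results: dict, size: float = 1, write: float = 1, read: float = 1) -> str:
    """Return the candidate with the lowest weighted sum of z-scores."""
    weights = {"size": size, "write_time": write, "read_time": read}
    scores = dict.fromkeys(results, 0.0)
    for metric, weight in weights.items():
        values = [row[metric] for row in results.values()]
        mu, sigma = mean(values), pstdev(values)
        for name, row in results.items():
            # Standardise so bytes and seconds become comparable.
            z = (row[metric] - mu) / sigma if sigma else 0.0
            scores[name] += weight * z
    return min(scores, key=scores.get)


# Made-up numbers for illustration only, not real measurements.
example = {
    "gz": {"size": 120, "write_time": 0.010, "read_time": 0.004},
    "bz2": {"size": 90, "write_time": 0.030, "read_time": 0.012},
    "none": {"size": 400, "write_time": 0.002, "read_time": 0.001},
}

if __name__ == "__main__":
    print(pick_best(example, size=3, write=1, read=1))
```

With only one non-zero weight the choice reduces to the best candidate on that single metric, which matches the CLI example `--size 0 --write 1 --read 0`, where only write time matters.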