cleaningbenchmark

UNKNOWN


License
MIT
Install
pip install cleaningbenchmark==0.0.1

Documentation

Installation and Usage

Install

    pip install cleaningbenchmark

    # or

    python setup.py install

Test it

    testcleaningbenchmark.py

What is this?

Requires

*numpy *usaddress

Datasets we have

  • police union contributions
  • stock market dataset from Luna
  • Jun Yang's politican voting data
  • movies dataset
  • propublica

What's missing

  • Noise models, model eval
  • Base Data representation and how to corrupt it

Noise

  • Constraint based noise
    • Start with simple functional dependency constraints
    • white does an input file of constraints look like
    • given a constraint how to generate errors
    • gvine a set of constraints, how to generate errors
  • Constraint based noise + synthetic noise
  • Graph of celaning opertations and applying subpaths in tnhe graph
  • Samples of ground truth

What to evaluate

  • Constraint based stuff
  • Chris Re's stuff
  • Active clean
  • Normal models

How to use our benchmark in existing data cleaning system/exepreiment setups

  • walk through

Benchmarks to look at

  • OLTPBench
  • LinearRoad
  • Ask Mike

Wu:

  • does test.py api look sane