datapot

Library for automatic feature extraction from JSON-datasets


License
GPL-3.0
Install
pip install datapot==0.1.3

Documentation

Datapot | Usage | Examples | Features | Authors

Datapot

Build Status Open source tool for machine learning on semi-structured data that creates numeric object-feature matrix from JSON. The idea of Datapot is to make the process of data preparation and feature extraction automatic, easy and effective.

Usage

Install Datapot

Using pip:

$ pip install datapot

Or clone Datapot repo:

$ git clone https://github.com/bashalex/datapot.git
$ cd datapot
$ pip install .

To create a Datapot object simply write the following:

>>> import datapot as dp 
>>> datapot = dp.DataPot()

Datapot has two main methods:

  • detect()
  • fit()
  • transform()

Method detect(data, limit) goes through the first N objects (N = limit), passes the possible features to Transformers. Each Transformer evaluates if a feature from current field or a number of fields can be created. As a result a dict of features and Transformers is created. Method fit(data) trains the detected Transformers on the given set if it is required.

To apply detect() and fit() to JSON Lines file:

>>> data = open('datapot/data/job.jsonlines', 'r')
>>> datapot.detect(data, limit=100)
>>> datapot.fit(data)
DataPot class instance
 - number of features without transformation: 9
 - number of new features: 82
features to transform: 
	('Id', [NumericTransformer])
	('FullDescription', [TfidfTransformer])
	('ContractType', [SVDOneHotTransformer])
	('ContractTime', [SVDOneHotTransformer])
	('Company', [SVDOneHotTransformer])
	('Category', [SVDOneHotTransformer])
	('SalaryNormalized', [NumericTransformer])

Method transform(data) generates a pandas. DataFrame with new features that were detected and trained on the detect() and fit() calls.

>>> df = datapot.transform(data)
num of new features: 82

Examples

Look for more examples of using Datapot with different datasets and more Transformer specific.

Features

Datapot provides many ways of extracting features from JSON-s.

Data types that can be processed:

  • Boolean
  • Numerical
  • Numerical array (transform array to their sum divided by average length of array in training set)
  • Time series (сalculate descriptive statistical properties of a given time series)
  • Timestamp (date, time, day of week, day of month etc.)
  • Text (bag of words tf-idf, word2vec)
  • Categorical (one-hot encoding, dimension reduction)

Manually selected features:

  • Identity (keep the field unchanged)
  • Group Dimensionality Reduce (change the dimensionality of features in the same JSON field)

Authors

  • Alex Bash
  • Yuriy Mokriy
  • Nikita Savelyev
  • Michal Rozenwald
  • Peter Romov

Datapot is a course work project of the Faculty of Computer Science of the Higher School of Economics.