autoparse

Learn parsers from unlabelled data for formatted string mining


Keywords
AUTOMATON, LOG, PARSER, UNSUPERVISED
License
MIT
Install
pip install autoparse==0.1

Documentation

AutoParse

This package exposes an algorithm that automatically builds parsers for formatted documents.

For instance, suppose some documents are generated by the following two (hidden) templates:

  • PAIEMENT PSC <4-numbers> XXXXX CARTE <8-numbers>
  • XXXXX CARTE <8-numbers> PAIEMENT CB <4-numbers> XXXXX

Here XXXXX may take any value, and <n-numbers> stands for any number with n digits. The algorithm will build a parser that extracts the values taken by XXXXX:

from autoparse import AutomatonFitter

docs = [
    "PAIEMENT PSC 3421 BORDEAUX LA PISCINE CARTE 48392017", 
    "PAIEMENT PSC 5261 PARIS PHIE ITALI CARTE 28495719", 
    "PAIEMENT PSC 9468 LYON LE BON TEMPS CARTE 40273819", 
    "RECEIPT BANK T1 CARTE 39284847 PAIEMENT CB 2807 GB LONDON", 
    "NETFLIX CARTE 40129578 PAIEMENT CB 0602 PARIS"
]

aut_fit = AutomatonFitter(docs)
automaton = aut_fit.fit_build()
automaton.execute("PAIEMENT PSC 2341 LAUSANNE EXPEDIA CARTE 12439751")
# ('lausanne expedia', {})
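
For comparison, the two hidden templates above can be written by hand as regular expressions. The sketch below is not how autoparse works internally and does not use its API. It only illustrates the kind of extraction the learned automaton is meant to perform without being given the templates. The names TEMPLATES and extract_free_text are hypothetical, and the patterns assume single spaces between tokens.

```python
import re

# Hand-written equivalents of the two hidden templates (illustration only).
# \d{4} and \d{8} stand for <4-numbers> and <8-numbers>; each (.+) group
# captures one XXXXX slot.
TEMPLATES = [
    re.compile(r"PAIEMENT PSC \d{4} (.+) CARTE \d{8}"),
    re.compile(r"(.+) CARTE \d{8} PAIEMENT CB \d{4} (.+)"),
]


def extract_free_text(doc):
    """Return the XXXXX values of the first template that matches the
    whole document, or None if no template matches."""
    for pattern in TEMPLATES:
        match = pattern.fullmatch(doc)
        if match:
            return list(match.groups())
    return None


print(extract_free_text("PAIEMENT PSC 2341 LAUSANNE EXPEDIA CARTE 12439751"))
print(extract_free_text("NETFLIX CARTE 40129578 PAIEMENT CB 0602 PARIS"))
```

Writing such patterns requires knowing the templates in advance, which is the step the package aims to automate from unlabelled examples.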

To install the package, run

pip install autoparse

More examples here

[Figure: generated automaton]