AutoParse
This package exposes an algorithm that automatically builds parsers for formatted documents.
For instance, suppose some documents are generated by the following two (hidden) templates:
- PAIEMENT PSC <4-numbers> XXXXX CARTE <8-numbers>
- XXXXX CARTE <8-numbers> PAIEMENT CB <4-numbers> XXXXX
Here XXXXX may take any value, and <n-numbers> stands for any number with n digits. The algorithm will build a parser that extracts the values taken by XXXXX:
from autoparse import AutomatonFitter
docs = [
"PAIEMENT PSC 3421 BORDEAUX LA PISCINE CARTE 48392017",
"PAIEMENT PSC 5261 PARIS PHIE ITALI CARTE 28495719",
"PAIEMENT PSC 9468 LYON LE BON TEMPS CARTE 40273819",
"RECEIPT BANK T1 CARTE 39284847 PAIEMENT CB 2807 GB LONDON",
"NETFLIX CARTE 40129578 PAIEMENT CB 0602 PARIS"
]
aut_fit = AutomatonFitter(docs)
automaton = aut_fit.fit_build()
automaton.execute("PAIEMENT PSC 2341 LAUSANNE EXPEDIA CARTE 12439751")
# ('lausanne expedia', {})
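To make the goal concrete, here is a minimal sketch of what a hand-written parser for the two templates above could look like, using only the standard `re` module. It does not use autoparse and is not how the package works internally. The names `TEMPLATE_A`, `TEMPLATE_B` and `extract_free_text` are hypothetical, and lowercasing the extracted values is an assumption made only to mirror the output shown above.

```python
import re
from typing import List

# Hand-written equivalents of the two hidden templates (illustration only;
# autoparse is meant to infer this kind of structure from the documents).
TEMPLATE_A = re.compile(r"^PAIEMENT PSC \d{4} (?P<free>.+) CARTE \d{8}$")
TEMPLATE_B = re.compile(
    r"^(?P<head>.+) CARTE \d{8} PAIEMENT CB \d{4} (?P<tail>.+)$"
)


def extract_free_text(doc: str) -> List[str]:
    """Return the lowercased XXXXX fields of a document, or [] if no template matches."""
    match = TEMPLATE_A.match(doc)
    if match:
        return [match.group("free").lower()]
    match = TEMPLATE_B.match(doc)
    if match:
        # The second template has two free-text slots, before and after the numbers.
        return [match.group("head").lower(), match.group("tail").lower()]
    return []


print(extract_free_text("PAIEMENT PSC 2341 LAUSANNE EXPEDIA CARTE 12439751"))
print(extract_free_text("NETFLIX CARTE 40129578 PAIEMENT CB 0602 PARIS"))
```

Writing such patterns by hand requires knowing the templates in advance; the package aims to build the parser from the example documents alone.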
To install the package, run:
pip install autoparse
More examples here