regexy

Linear time regex matching


Keywords
python, python3, regex, regex-engine, stream-processing
License
MIT
Install
pip install regexy==0.13

Documentation

regexy

Build Status Coverage Status pypi licence

A Python library for parsing, compiling, and executing regular expressions. All searches execute in linear time with respect to the size of the regular expression and search text.

Mostly based on Thompson's NFA.

Be aware! This is nothing more than an experiment for researching purposes.

Status

  • Matcher
  • Basic operators: *, ?, + and |
  • Capturing groups
  • Symbols escaping
  • Shorthands: \w, \d, \s, \W, \D, \S
  • Sets [...] (+ ranges and shorthands)
  • Repetition ranges {n, m}
  • non-capturing groups
  • Greedy and non-greedy match
  • ^ and $ symbols
  • \b word boundary
  • Match any (dot)
  • Sets complement
  • Lookahead assertion (?=...) and (?!...) (limited to a single char)
  • Assertions \A, \z, \B
  • Named capturing groups
  • search
  • full_match
  • Flags
  • User friendly compiling errors
  • ... ?

Compatibility

  • Python +3.5

Install

$ pip install regexy

Usage

Notice regexy returns all capturing groups specified within a repeated sub-expression

import regexy

regexy.match(regexy.compile(r'((a)*b)'), 'aab')
# Match<('aab', ('a', 'a'))>

regexy.match(regexy.compile(r'a'), 'b')
# None

regexy.match(regexy.compile(r'a'), 'a')
# Match<()>

Streams are supported (i.e: network and files)

Note: Capturing may take as much RAM as all of the data in worst case when the full regex is captured

import io
import regexy


def stream_gen():
    stream = io.BytesIO(b'Im a stream')
    stream_wrapper = io.TextIOWrapper(stream, encoding='utf-8', write_through=True)

    while True:
        chars = stream_wrapper.read(5)

        if not chars:
            break

        yield from chars

regexy.match(regexy.compile(r'(\w+| +)*'), stream_gen())
# (('Im', ' ', 'a', ' ', 'stream'),)

Here is a (undocumented) way to print the generated NFA for debugging purposes:

import regexy

str(regexy.compile(r'a*').state)
# ('*', [('a', [('*', [...])]), ('EOF', [])])
# The [...] thing means it's recursive

Docs

Read The Docs

Tests

$ make test

License

MIT