Another language grammar parser. Inspired by modgrammar and pyparsing


Keywords
chomsky, language, grammar, parser
License
BSD-3-Clause
Install
pip install chomsky==1.0.1

Documentation

chomsky

Build Status

I needed a language grammar parser for the plywood project, and modgrammar looked like it would be perfect, except I couldn't get the simplest of grammars to work. pyparsing is excellent, but doesn't give me objects back, only lists and strings - I need more than that. I would recommend pyparsing for your project. Unless you really want objects, or if you are doing a language (chomsky has lots of built-in stuff for making programming language grammars).

Besides, I like writing parsers, and I know how I want this one to work, so screw it, I'll do it myself!

INSTALLATION

$ pip install chomsky
# local development
$ pip install -e .

USAGE

Matchers

Matcher objects are the most basic building blocks. They are not smart, they return only strings and lists, and they make no assumptions about what you might be trying to build. For instance, the Chars Matcher does not assume that you want to consume whitespace.

Matcher objects are great for building a small parsing language for consistent data, where Grammar objects are not needed. But for building a language parser, you will probably use the more heavy-duty Grammar building blocks.

Char

Matches a single letter from a string of accepted letters. There are lots of built-in strings in the string module.

test/matchers/test_letter_matcher.py

matcher = Char('abcde')
matcher('a') => 'a'
matcher('bcd') => 'b'
matcher('f') => raise ParseException
# shorthand:
matcher = A('abcde')

import string
matcher = A(string.ascii_letters + string.digits + '_')

Chars

Matches one or more letters from string of accepted letters.

You can also set min and max options. min will raise a ParseException if the matched word is not long enough. Default is 1. max will stop matching once max characters are matched.

test/matchers/test_word_matcher.py
matcher = Chars('abcde')
matcher('a') => 'a'
matcher('bcd') => 'bcd'
matcher('defg') => 'defg'
matcher('fghi') => ParseException

# max
matcher = Chars('abcde', max=2)
matcher('bcd') => 'bc'

# min
matcher = Chars('abcde', min=3)
matcher('ab') => ParseException

Literal

Matches a literal string.

test/matchers/test_literal_matcher.py
matcher = Literal('abcde')
matcher('a') => 'a'
matcher('bcd') => 'bcd'
matcher('defg') => 'defg'
matcher('fghi') => ParseException

Whitespace

test/matchers/test_whitespace_matcher.py
matcher = Whitespace()  # default is " \t"
matcher("    ") => "    "
matcher(" \t\n ") => " \t"
matcher = Whitespace(" \t\n")
matcher(" \t\n ") => " \t\n "

Regex

These have two options: group and advance.

group says which group or groups to return. Default is 0 (the entire match). A list or tuple of groups will return a list of results. advance indicates what group to advance past. Default is 0 (the entire match). This is a quick way to build a matching system that can parse consistently formatted data, for example.

test/matchers/test_regex_matcher.py
matcher = Regex("([a-zA-Z_][0-9])")
matcher('a1') => 'a1'

# group
matcher = Regex("([a-zA-Z_][0-9])", group=1)
matcher('a1') => 'a'

# to demonstrate `advance`, I will have to add two regex Matchers, which
# returns a list
matcher = Regex("([a-zA-Z_][0-9])", group=1, advance=1) + Regex("([0-9])", group=1)
matcher('a1') => ['a', '1']

Sequence

There are two flavors of Sequence. One you can declare yourself, called Sequence, the other is created automatically when you add or multiply Matcher objects. Don't worry about that one, it "just works" (we saw it above in the Regex example).

test/matchers/test_sequence_matcher.py
matcher = Sequence(Literal('Hello '), Literal('World'), Char('!.'))
matcher('Hello World!') => ['Hello ', 'World', '!']
matcher('Hello World.') => ['Hello ', 'World', '.']
matcher('Hello, World.') => ParseException

The automatic Sequence type is created whenever you use addition or multiplication to repeat a series of Matcher-s.

Addition:

test/matchers/test_matcher_addition.py
matcher = Literal('Hello ') + Literal('World') + Char('!.')
matcher('Hello World!') => ['Hello ', 'World', '!']
matcher('Hello World.') => ['Hello ', 'World', '.']
matcher('Hello, World.') => ParseException

Multiplication:

test/matcher/test_matcher_multiplication.py
import string
matcher = (Chars(string.ascii_letters) + Literal(' ')) * 3
matcher('why hello there ') => [['why', ' '], ['hello', ' '], ['there', ' ']]
matcher('not enough spaces') => ParseException

NMatches

NMatches is not an intuitively named class, but its child classes are, and you'll probably use them a lot.

ZeroOrMore:

test/matcher/test_zero_or_more_matcher.py
matcher = ZeroOrMore(Literal('hi'))
matcher('') => []
matcher('hi') => ['hi']
matcher('hihi') => ['hi', 'hi']

OneOrMore:

test/matcher/test_one_or_more_matcher.py
matcher = OneOrMore(Literal('hi'))
matcher('hi') => ['hi']
matcher('hihi') => ['hi', 'hi']
matcher('') => ParseException

Optional:

test/matcher/test_optional_matcher.py
matcher = Literal('Hello') + Optional(Literal(',')) + Literal(' ') + Literal('World')
matcher('Hello World') => ['Hello', [], ' ', 'World']
matcher('Hello, World') => ['Hello', [','], ' ', 'World']
matcher('Hello, Bozo') => ParseException

NMatches:

test/matcher/test_nmatcher.py
matcher = NMatches(Literal('hi'), min=2, max=3)
matcher('hi') => ParseException
matcher('hihi') => ['hi', 'hi']
matcher('hihihi') => ['hi', 'hi', 'hi']
matcher('hihihihi') => ['hi', 'hi', 'hi']  # only 3 matches

Any

Given a list of Matchers, any of them can match (tested in order left-to-right). The first to match is returned.

test/matcher/test_any_matcher.py
matcher = Any(Literal('Joey'), Literal('Bob'), Literal('Bill'))
matcher('Bob') => 'Bob'
matcher('Jane') => ParseException

Look-ahead and Behind

Looking-ahead is simple and low-cost. The NextIs matcher makes sure that the Matcher would pass, but then rolls back the cursor and does not return a Result. If the Matcher fails, an exception is raised.

Looking behind is much more expensive, because the number of characters to look at is not known before hand. A "best guess" can be made by PrevIs by using `minimum_length` and `maximum_length` methods that the Matcher classes all implement (the base class returns 0 and float('inf')). A Literal, for example, has a definite length that must be present - no more, and no less characters. The other classes also provide this min/max length calculation. But this provides only a modest performance increase.

The PrevIs matcher does not require that the previous token be an instance of the specified matcher, only that the buffer previous to the current location match. The buffer is rolled back until a match is found, or until the beginning of the buffer is reached. Sound resource intensive? Consider PrevIsNot! It looks backwards, hoping that the buffer never matches, no matter how far back it goes.

NextIs:

test/matcher/test_nextis_matcher.py
matcher = '-' + NextIs(Chars('123456789')) + Chars('1234567890')
matcher('1') => [[], '1']
matcher('-1') => [['-'], '1']
matcher('-123') => [['-'], '123']
matcher('-0') => ParseException

NextIsNot:

test/matcher/test_nextis_matcher.py
matcher = '-' + NextIsNot('0') + Chars('1234567890')
matcher('1') => [[], '1']
matcher('-1') => [['-'], '1']
matcher('-123') => [['-'], '123']
matcher('-0') => ParseException

PrevIs:

test/matcher/test_nextis_matcher.py
matcher = Chars('-.') + PrevIs('-') + Chars('1234567890')
matcher('-1') => [['-'], '1']
matcher('.123') => ParseException

PrevIsNot:

test/matcher/test_nextis_matcher.py
matcher = Chars('abc') + PrevIsNot('c') + Chars('abc')
matcher('ab') => ['a', 'b']
matcher('abc') => ['ab', 'c']
matcher('abcabc') => ['abcab', 'c']
matcher('cc') => ParseException

Grammars

Grammar objects are what you will want to work with if you are building a language grammar. They are composed of Mathcer classes (and other Grammar classes), but the objects they return are instances of the Grammar, not simple strings and lists.

The built-in Grammar-s are meant to help you understand how they work, and to use in your own language.

Numbers

Integer:

test/matcher/test_nextis_matcher.py
matcher = '-' + NextIsNot('0') + Chars('1234567890')
matcher('1') => [[], '1']
matcher('-1') => [['-'], '1']
matcher('-123') => [['-'], '123']
matcher('-0') => ParseException

Todo

QuotedString, Number, Integer, Float, Hexadecimal, Octal, Binary
LineComment, BlockComment, Block, IndentedBlock

TEST

$ pip install pytest
$ py.test

LICENSE

Copyright (c) 2012, Colin T.A. Gray All rights reserved.

author: Colin T.A. Gray
copyright: 2012 Colin T.A. Gray <http://colinta.com/>
license: simplified BSD, see LICENSE for more details.

RELEASE

python setup.py sdist
twine upload dist/*