Hypothesis-Grammar
(pre-alpha... everything I've tried works, but it's not well tested yet)
What is it?
Hypothesis-Grammar is a "reverse parser" - given a grammar it will generate examples of that grammar.
It is implemented as a Hypothesis strategy.
(If you are looking to generate text from a grammar for purposes other than testing with Hypothesis then this lib can still be useful, but I strongly recommend looking at the tools provided with NLTK instead.)
Usage
So, how does this look?
First you need a grammar. Our grammar format is based on that used by the Lark parser library. You can see our grammar-parsing grammar here. More details of our grammar format below.
Here is an example of using Hypothesis-Grammar:
```python
from hypothesis_grammar import strategy_from_grammar

st = strategy_from_grammar(
    grammar="""
        DET: "the" | "a"
        N: "man" | "park" | "dog"
        P: "in" | "with"

        s: np vp
        np: DET N
        pp: P np
        vp: "slept" | "saw" np | "walked" pp
    """,
    start="s",
)

st.example()
# ['a', 'dog', 'saw', 'the', 'man']
st.example()
# ['a', 'park', 'saw', 'a', 'man']
st.example()
# ['the', 'man', 'slept']
```
or as a test...
```python
from hypothesis import given
from hypothesis_grammar import strategy_from_grammar


@given(
    strategy_from_grammar(
        grammar="""
            DET: "the" | "a"
            N: "man" | "park" | "dog"
            P: "in" | "with"

            s: np vp
            np: DET N
            pp: P np
            vp: "slept" | "saw" np | "walked" pp
        """,
        start="s",
    )
)
def test_grammar(example):
    nouns = {"man", "park", "dog"}
    assert any(noun in example for noun in nouns)
```
The grammar is taken from an example in the NLTK docs and converted into our "simplified Lark" format.
`start="s"` tells the parser that the start rule is `s`.
As you can see, we have produced a Hypothesis strategy that generates examples matching the grammar (in this case, short sentences which sometimes make sense).
The output will always be a flat list of token strings. If you want a sentence you can just `" ".join(example)`.
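For instance, joining one of the sample outputs from above back into a sentence (a minimal sketch; the token list is copied from the example output):

```python
# One of the flat token lists produced by the strategy above
example = ["the", "man", "slept"]

# Join the tokens into a plain sentence string
sentence = " ".join(example)
print(sentence)  # → the man slept
```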
But the grammar doesn't have to describe text, it might represent a sequence of actions for example. In that case you might want to convert your result tokens into object instances, which could be done via a lookup table.
(But if you're generating action sequences for tests then you should probably check out Hypothesis' stateful testing features first.)
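As a sketch of the lookup-table idea, suppose your grammar emits action tokens; the token names and handler functions below are invented for illustration and are not part of Hypothesis-Grammar:

```python
# Hypothetical handlers for an action-sequence grammar (names are illustrative)
def open_file(state):
    state.append("opened")

def close_file(state):
    state.append("closed")

# Lookup table mapping grammar tokens to callables
ACTIONS = {
    "open": open_file,
    "close": close_file,
}

def run(example):
    """Interpret a flat token list as a sequence of actions."""
    state = []
    for token in example:
        ACTIONS[token](state)
    return state

print(run(["open", "close"]))  # ['opened', 'closed']
```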
Grammar details
- Whitespace is ignored
- 'Terminals' must be named all-caps (terminals only reference literals, not other rules), e.g. `DET`
- 'Rules' must be named all-lowercase, e.g. `np`
- LHS (name) and RHS are separated by `:`
- String literals must be quoted with double-quotes, e.g. `"man"`
- You can also use regex literals, delimited with forward-slashes, e.g. `/the[a-z]{0,2}/`. Content for a regex token is generated using Hypothesis' `from_regex` strategy, with `fullmatch=True`.
- Adjacent tokens are concatenated, i.e. `DET N` means a `DET` followed by an `N`.
- `|` is alternation, so `"in" | "with"` means one-of `"in"` or `"with"`.
- `?` means optional, i.e. `"in"?` means `"in"` is expected zero-or-one times.
- `*` means zero-or-many, i.e. `"in"*` means `"in"` is expected zero-or-many times.
- `+` means one-or-many, i.e. `"in"+` means `"in"` is expected one-or-many times.
- `~ <num>` means exactly-`<num>` times.
- `~ <min>..<max>` is a range, expected between-`<min>`-and-`<max>` times.
- `(` and `)` are for grouping; a group can be quantified using any of the modifiers above.
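To illustrate the `fullmatch=True` semantics for regex literals, here is a quick check using Python's standard `re` module (this only demonstrates which strings a regex literal admits; the actual generation is done by Hypothesis' `from_regex`):

```python
import re

# The regex literal /the[a-z]{0,2}/ admits "the" plus up to two lowercase letters
pattern = re.compile(r"the[a-z]{0,2}")

assert pattern.fullmatch("the")         # zero trailing letters
assert pattern.fullmatch("them")        # one trailing letter
assert pattern.fullmatch("these")       # two trailing letters
assert not pattern.fullmatch("theirs")  # three trailing letters: rejected by fullmatch
```

With plain `match` (no `fullmatch`), `"theirs"` would be accepted because the pattern matches a prefix of it; `fullmatch=True` ensures generated strings match the whole pattern.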