Python wrapper for libparsing, a PEG-based parsing library written in C


Keywords
parsing, PEG, grammar, libparsing
License
BSD-3-Clause
Install
pip install libparsing==0.9.2

Documentation

libparsing

C & Python Parsing Elements Grammar Library

Version :  0.7.0
URL     :  http://github.com/sebastien/parsing
README  :  https://cdn.rawgit.com/sebastien/libparsing/master/README.html

libparsing is a parsing element grammar (PEG) library written in C with Python bindings. It offers decent performance while allowing for a lot of flexibility. It is mainly intended to be used to create programming languages and software engineering tools.

As opposed to more traditional parsing techniques, the grammar is not compiled but constructed using an API that allows dynamic update of the grammar.

The parser does not do any tokeninzation, the instead input stream is consumed and parsing elements are dynamically asked to match the next element of it. Once parsing elements match, the resulting matched input is processed and an action is triggered.

libparsing supports the following features:

  • backtracking, ie. going back in the input stream if a match is not found
  • cherry-picking, ie. skipping unrecognized input
  • contextual rules, ie. a rule that will match or not depending on external variables

Parsing elements are usually slower than compiled or FSM-based parsers as they trade performance for flexibility. It's probably not a great idea to use libparsing if the parsing has to happen as fast as possible (ie. a protocol implementation), but it is a great use for programming languages, as it opens up the door to dynamic syntax plug-ins and multiple language embedding.

If you're interested in PEG, you can start reading Brian Ford's original article. Projects such as PEG/LEG by Ian Piumarta http://piumarta.com/software/peg/ ,OMeta by Alessandro Warth http://www.tinlizzie.org/ometa/ or Haskell's Parsec library https://www.haskell.org/haskellwiki/Parsec are of particular interest in the field.

Here is a short example of what creating a simple grammar looks like in Python:

g = Grammar()
s = g.symbols
g.token("WS",       "\s+")
g.token("NUMBER",   "\d+(\.\d+)?")
g.token("VARIABLE", "\w+")
g.token("OPERATOR", "[\/\+\-\*]")
g.group("Value",     s.NUMBER, s.VARIABLE)
g.rule("Suffix",     s.OPERATOR._as("operator"), s.Value._as("value"))
g.rule("Expression", s.Value, s.Suffix.zeroOrMore())
g.axiom(s.Expression)
g.skip(s.WS)
match = g.parseString("10 + 20 / 5")

and the equivalent code in C

Grammar* g = Grammar_new()
SYMBOL(WS,         TOKEN("\\s+"))
SYMBOL(NUMBER,     TOKEN("\\d+(\\.\\d+)?"))
SYMBOL(VARIABLE,   TOKEN("\\w+"))
SYMBOL(OPERATOR,   GROUP("[\\/\\+\\-\\*]"))
SYMBOL(Value,      GOUP(_S(NUMBER), _S(VARIABLE)))
SYMBOL(Suffix,     RULE(_AS(_S(OPERATOR), "operator"), _AS(_S(Value), "value")))
SYMBOL(Expression, RULE(_S(Value), _MO(Suffix))
g->axiom = s_Expression;
g->skip(s_WS);
Grammar_prepare(g);
Match* match = Grammar_parseString(g, "10 + 20 / 5")

Installing

To install the Python parsing module:

easy_install libparsing    # From Setuptools
pip install  libparsing    # From PIP

Note that for the above to work, you'll need a C compiler libffi-dev and libpcre-dev. On Ubuntu, do sudo apt install build-essential libffi-dev libprcre-dev.

To compile the C parsing module:

git clone http://github.com/sebastien/libparsing
cd libparsing
make
make install               # You can set PREFIX

libparsing works with GCC4 and Clang and is written following the c11 standard.

C API

Input data

The parsing library is configured at compile-time to iterate on specific elements of input, typically char. You can redefine the macro ITERATION_UNIT to the type you'd like to iterate on.

By default, the ITERATION_UNIT is a char, which works both for ASCII and UTF8. On the topic of Unicode/UTF8, the parsing library only uses functions that are UTF8-savvy.

#ifndef ITERATION_UNIT
#define ITERATION_UNIT char
#endif

typedef ITERATION_UNIT iterated_t;

Input data is acquired through iterators. Iterators wrap an input source (the default input is a FileInput) and a move callback that updates the iterator's offset. The iterator will build a buffer of the acquired input and maintain a pointer for the current offset within the data acquired from the input stream.

You can get an iterator on a file by doing:

Iterator* iterator = Iterator_Open("example.txt");

type Iterator

typedef struct Iterator {
	char           status;    // The status of the iterator, one of STATUS_{INIT|PROCESSING|INPUT_ENDED|ENDED}
	char*          buffer;    // The buffer to the read data, note how it is a (void*) and not an `iterated_t`
	iterated_t*    current;   // The pointer current offset within the buffer
	iterated_t     separator; // The character for line separator, `\n` by default.
	size_t         offset;    // Offset in input (in bytes), might be different from `current - buffer` if some input was freed.
	size_t         lines;     // Counter for lines that have been encountered
	size_t         capacity;  // Content capacity (in bytes), might be bigger than the data acquired from the input
	size_t         available; // Available data in buffer (in bytes), always `<= capacity`
	bool           freeBuffer;
	void*          input;     // Pointer to the input source
	bool          (*move) (struct Iterator*, int n); // Plug-in function to move to the previous/next positions
} Iterator;

type FileInput

The file input wraps information about the input file, such as the FILE object and the path.

typedef struct FileInput {
	FILE*        file;
	const char*  path;
} FileInput;

shared EOL

The EOL character used to count lines in an iterator context.

extern iterated_t         EOL;

operation Iterator_Open

Returns a new iterator instance with the given open file as input

Iterator* Iterator_Open(const char* path);

operation Iterator_FromString

Returns a new iterator instance with the text

Iterator* Iterator_FromString(const char* text);

constructor Iterator

Iterator* Iterator_new(void);

destructor Iterator_free

void      Iterator_free(Iterator* this);

method Iterator_open

Makes the given iterator open the file at the given path. This will automatically assign a FileInput to the iterator as an input source.

bool Iterator_open( Iterator* this, const char *path );

method Iterator_hasMore

Tells if the iterator has more available data. This means that there is available data after the current offset.

bool Iterator_hasMore( Iterator* this );

method Iterator_remaining

Returns the number of bytes available from the current iterator's position up to the last available data. For dynamic streams, where the length is unknown, this should be lesser or equalt to ITERATOR_BUFFER_AHEAD.

size_t Iterator_remaining( Iterator* this );

method Iterator_moveTo

Moves the iterator to the given offset

bool Iterator_moveTo ( Iterator* this, size_t offset );

method String_move

bool String_move ( Iterator* this, int offset );

define ITERATOR_BUFFER_AHEAD

The number of iterated_t that should be loaded after the iterator's current position. This limits the numbers of iterated_t that a Token could match.

#define ITERATOR_BUFFER_AHEAD 64000

constructor FileInput

FileInput* FileInput_new(const char* path );

destructor FileInput_free

void       FileInput_free(FileInput* this);

method FileInput_preload

Preloads data from the input source so that the buffer has up to ITERATOR_BUFFER_AHEAD characters ahead.

size_t FileInput_preload( Iterator* this );

method FileInput_move

Advances/rewinds the given iterator, loading new data from the file input whenever there is not ITERATOR_BUFFER_AHEAD data elements ahead of the iterator's current position.

bool FileInput_move   ( Iterator* this, int n );

Grammar

The Grammar is the concrete definition of the language you're going to parse. It is defined by an axiom and input data that can be skipped, such as white space.

The axiom and skip properties are both references to parsing elements.

typedef struct ParsingContext ParsingContext;
typedef struct ParsingElement ParsingElement;
typedef struct ParsingResult  ParsingResult;
typedef struct Reference      Reference;
typedef struct Match          Match;
typedef void                  Element;

typedef struct Element { char type; // Type is used du differentiate ParsingElement from Reference int id; // The ID, assigned by the grammar, as the relative distance to the axiom } Element;

type Grammar

typedef struct Grammar {
	ParsingElement*  axiom;       // The axiom
	ParsingElement*  skip;        // The skipped element
	int              axiomCount;  // The count of parsing elemetns in axiom
	int              skipCount;   // The count of parsing elements in skip
	Element**        elements;    // The set of all elements in the grammar
	bool             isVerbose;
} Grammar;

constructor Grammar

Grammar* Grammar_new(void);

destructor Grammar_free

void Grammar_free(Grammar* this);

method Grammar_prepare

void Grammar_prepare ( Grammar* this );

method Grammar_symbolsCount

int Grammar_symbolsCount ( Grammar* this );

method Grammar_parseIterator

ParsingResult* Grammar_parseIterator( Grammar* this, Iterator* iterator );

method Grammar_parsePath

ParsingResult* Grammar_parsePath( Grammar* this, const char* path );

method Grammar_parseString

ParsingResult* Grammar_parseString( Grammar* this, const char* text );

method Grammar_freeElements

void Grammar_freeElements(Grammar* this);

Elements

callback WalkingCallback

typedef int (*WalkingCallback)(Element* this, int step, void* context);

method Element_walk

int Element_walk( Element* this, WalkingCallback callback, void* context);

method Element__walk

int Element__walk( Element* this, WalkingCallback callback, int step, void* context);

Parsing Elements

Parsing elements are the core elements that recognize and process input data. There are 4 basic types: Work, Token, Group and Rule.

Parsing elements offer two main operations: recognize and process. The recognize method generates a Match object (that might be the FAILURE singleton if the data was not recognized). The process method tranforms corresponds to a user-defined action that transforms the Match object and returns the generated value.

Parsing element are assigned an id that corresponds to their breadth-first distance to the axiom. Before parsing, the grammar will re-assign the parsing element's id accordingly.

type Match

typedef struct Match {
	char            status;     // The status of the match (see STATUS_XXX)
	size_t          offset;     // The offset of `iterated_t` matched
	size_t          length;     // The number of `iterated_t` matched
	Element*        element;
	ParsingContext* context;
	void*           data;      // The matched data (usually a subset of the input stream)
	struct Match*   next;      // A pointer to the next  match (see `References`)
	struct Match*   children;  // A pointer to the child match (see `References`)
	void*           result;    // A pointer to the result of the match
} Match;

define STATUS_INIT

The different values for a match (or iterator)'s status

#define STATUS_INIT        '-'

define STATUS_PROCESSING

#define STATUS_PROCESSING  '~'

define STATUS_MATCHED

#define STATUS_MATCHED     'M'

define STATUS_SUCCESS

#define STATUS_SUCCESS     'S'

define STATUS_PARTIAL

#define STATUS_PARTIAL     's'

define STATUS_FAILED

#define STATUS_FAILED      'X'

define STATUS_INPUT_ENDED

#define STATUS_INPUT_ENDED '.'

define STATUS_ENDED

#define STATUS_ENDED       'E'

define TYPE_ELEMENT

#define TYPE_ELEMENT    'E'

define TYPE_WORD

#define TYPE_WORD       'W'

define TYPE_TOKEN

#define TYPE_TOKEN      'T'

define TYPE_GROUP

#define TYPE_GROUP      'G'

define TYPE_RULE

#define TYPE_RULE       'R'

define TYPE_CONDITION

#define TYPE_CONDITION  'c'

define TYPE_PROCEDURE

#define TYPE_PROCEDURE  'p'

define TYPE_REFERENCE

#define TYPE_REFERENCE  '#'

define ID_UNBOUND

A parsing element that is not bound to a grammar will have ID_UNBOUND by default.

#define ID_UNBOUND      -10

define ID_BINDING

A parsing element that being bound to a grammar (see Grammar_prepare) will have an id of ID_BINDING temporarily.

#define ID_BINDING       -1

singleton FAILURE_S

A specific match that indicates a failure

extern Match FAILURE_S;

shared FAILURE

extern Match* FAILURE;

operation Match_Empty

Creates new empty (successful) match

Match* Match_Empty(Element* element, ParsingContext* context);

operation Match_Success

Creates a new successful match of the given length

Match* Match_Success(size_t length, Element* element, ParsingContext* context);

constructor Match

Match* Match_new(void);

destructor Match_free

Frees the given match. If the match is FAILURE, then it won't be feed. This means that most of the times you won't need to free a failed match, as it's likely to be the FAILURE singleton.

void Match_free(Match* this);

method Match_isSuccess

bool Match_isSuccess(Match* this);

method Match_getOffset

int Match_getOffset(Match* this);

method Match_getLength

int Match_getLength(Match* this);

method Match__walk

int Match__walk(Match* this, WalkingCallback callback, int step, void* context );

method Match_countAll

int Match_countAll(Match* this);

type ParsingElement

typedef struct ParsingElement {
	char           type;       // Type is used du differentiate ParsingElement from Reference
	int            id;         // The ID, assigned by the grammar, as the relative distance to the axiom
	const char*    name;       // The parsing element's name, for debugging
	void*          config;     // The configuration of the parsing element
	struct Reference*     children;   // The parsing element's children, if any
	struct Match*         (*recognize) (struct ParsingElement*, ParsingContext*);
	struct Match*         (*process)   (struct ParsingElement*, ParsingContext*, Match*);
	void                  (*freeMatch) (Match*);
} ParsingElement;

operation ParsingElement_Is

Tells if the given pointer is a pointer to a ParsingElement.

bool         ParsingElement_Is(void *);

constructor ParsingElement

Creates a new parsing element and adds the given referenced parsing elements as children. Note that this is an internal constructor, and you should use the specialized versions instead.

ParsingElement* ParsingElement_new(Reference* children[]);

destructor ParsingElement_free

void ParsingElement_free(ParsingElement* this);

method ParsingElement_add

Adds a new reference as child of this parsing element. This will only be effective for composite parsing elements such as Rule or Token.

ParsingElement* ParsingElement_add(ParsingElement *this, Reference *child);

method ParsingElement_clear

ParsingElement* ParsingElement_clear(ParsingElement *this);

method ParsingElement_clear

Returns the match for this parsing element for the given iterator's state. inline Match* ParsingElement_recognize( ParsingElement* this, ParsingContext* context );

method ParsingElement_process

Processes the given match once the parsing element has fully succeeded. This is where user-bound actions will be applied, and where you're most likely to do things such as construct an AST.

Match* ParsingElement_process( ParsingElement* this, Match* match );

method ParsingElement_name

Transparently sets the name of the element

ParsingElement* ParsingElement_name( ParsingElement* this, const char* name );

Word

Words recognize a static string at the current iterator location.

type WordConfig

The parsing element configuration information that is used by the Token methods.

typedef struct WordConfig {
	const char* word;
	size_t      length;
} WordConfig;

constructor ParsingElement

ParsingElement* Word_new(const char* word);

destructor Word_free

void Word_free(ParsingElement* this);

method Word_recognize

The specialized match function for token parsing elements.

Match*          Word_recognize(ParsingElement* this, ParsingContext* context);

method Word_word

const char* Word_word(ParsingElement* this);

method WordMatch_group

const char* WordMatch_group(Match* match);

Tokens

Tokens are regular expression based parsing elements. They do not have any children and test if the regular expression matches exactly at the iterator's current location.

type TokenConfig

The parsing element configuration information that is used by the Token methods.

typedef struct TokenConfig {
	const char* expr;
#ifdef WITH_PCRE
	pcre*       regexp;
	pcre_extra* extra;
#endif
} TokenConfig;

type TokenMatch

typedef struct TokenMatch {
	int             count;
	const char**    groups;
} TokenMatch;

method Token_new

Creates a new token with the given POSIX extended regular expression

ParsingElement* Token_new(const char* expr);

destructor Token_free

void Token_free(ParsingElement*);

method Token_recognize

The specialized match function for token parsing elements.

Match*          Token_recognize(ParsingElement* this, ParsingContext* context);

method Token_expr

const char* Token_expr(ParsingElement* this);

method TokenMatch_free

Frees the TokenMatch created in Token_recognize

void TokenMatch_free(Match* match);

method TokenMatch_group

const char* TokenMatch_group(Match* match, int index);

method TokenMatch_count

int TokenMatch_count(Match* match);

References

We've seen that parsing elements can have children. However, a parsing element's children are not directly parsing elements but rather parsing elements' References. This is why the ParsingElement_add takes a Reference object as parameter.

References allow to share a single parsing element between many different composite parsing elements, while decorating them with additional information such as their cardinality (ONE, OPTIONAL, MANY and MANY_OPTIONAL) and a name that will allow process actions to easily access specific parts of the parsing element.

type Reference

typedef struct Reference {
	char            type;            // Set to Reference_T, to disambiguate with ParsingElement
	int             id;              // The ID, assigned by the grammar, as the relative distance to the axiom
	char            cardinality;     // Either ONE (default), OPTIONAL, MANY or MANY_OPTIONAL
	const char*     name;            // The name of the reference (optional)
	struct ParsingElement* element;  // The reference to the parsing element
	struct Reference*      next;     // The next child reference in the parsing elements
} Reference;

define CARDINALITY_OPTIONAL

The different values for the Reference cardinality.

#define CARDINALITY_OPTIONAL      '?'

define CARDINALITY_ONE

#define CARDINALITY_ONE           '1'

define CARDINALITY_MANY_OPTIONAL

#define CARDINALITY_MANY_OPTIONAL '*'

define CARDINALITY_MANY

#define CARDINALITY_MANY          '+'

operation Reference_Is

Tells if the given pointer is a pointer to Reference

bool Reference_Is(void * this);

operation Reference_Ensure

Ensures that the given element (or reference) is a reference.

Reference* Reference_Ensure(void* elementOrReference);

operation Reference_FromElement

Returns a new reference wrapping the given parsing element

Reference* Reference_FromElement(ParsingElement* element);

constructor Reference

References are typically owned by their single parent composite element.

Reference* Reference_new(void);

destructor Reference_free

void Reference_free(Reference* this);

method Reference_cardinality

Sets the cardinality of this reference, returning it transprently.

Reference* Reference_cardinality(Reference* this, char cardinality);

method Reference_name

Reference* Reference_name(Reference* this, const char* name);

method Reference_hasNext

bool Reference_hasNext(Reference* this);

method Reference_hasElement

bool Reference_hasElement(Reference* this);

method Reference__walk

int Reference__walk( Reference* this, WalkingCallback callback, int step, void* nothing );

method Reference_recognize

Returns the matched value corresponding to the first match of this reference. OPTIONAL references might return EMPTY, ONE references will return a match with a next=NULL while MANY may return a match with a next pointing to the next match.

Match* Reference_recognize(Reference* this, ParsingContext* context);

Groups

Groups are composite parsing elements that will return the first matching reference's match. Think of it as a logical or.

constructor ParsingElement

ParsingElement* Group_new(Reference* children[]);

method Group_recognize

Match*          Group_recognize(ParsingElement* this, ParsingContext* context);

Rules

Groups are composite parsing elements that only succeed if all their matching reference's.

constructor ParsingElement

ParsingElement* Rule_new(Reference* children[]);

method Rule_recognize

Match*          Rule_recognize(ParsingElement* this, ParsingContext* context);

Procedures

Procedures are parsing elements that do not consume any input, always succeed and usually have a side effect, such as setting a variable in the parsing context.

callback ProcedureCallback

typedef void (*ProcedureCallback)(ParsingElement* this, ParsingContext* context);

callback MatchCallback

typedef void (*MatchCallback)(Match* m);

constructor ParsingElement

ParsingElement* Procedure_new(ProcedureCallback c);

method Procedure_recognize

Match*          Procedure_recognize(ParsingElement* this, ParsingContext* context);

Conditions

Conditions, like procedures, execute arbitrary code when executed, but they might return a FAILURE.

callback ConditionCallback

typedef Match* (*ConditionCallback)(ParsingElement*, ParsingContext*);

constructor ParsingElement

ParsingElement* Condition_new(ConditionCallback c);

method Condition_recognize

Match*          Condition_recognize(ParsingElement* this, ParsingContext* context);

The parsing process

The parsing itself is the process of taking a grammar and applying it to an input stream of data, represented by the iterator.

The grammar's axiom will be matched against the iterator's current position, and if necessary, the grammar's skip parsing element will be applied to advance the iterator.

typedef struct ParsingStep    ParsingStep;
typedef struct ParsingOffset  ParsingOffset;

type ParsingStats

typedef struct ParsingStats {
	size_t   bytesRead;
	double   parseTime;
	size_t   symbolsCount;
	size_t*  successBySymbol;
	size_t*  failureBySymbol;
	size_t   failureOffset;   // A reference to the deepest failure
	size_t   matchOffset;
	size_t   matchLength;
	Element* failureElement;  // A reference to the failure element
} ParsingStats;

constructor ParsingStats

ParsingStats* ParsingStats_new(void);

destructor ParsingStats_free

void ParsingStats_free(ParsingStats* this);

method ParsingStats_setSymbolsCount

void ParsingStats_setSymbolsCount(ParsingStats* this, size_t t);

method ParsingStats_registerMatch

Match* ParsingStats_registerMatch(ParsingStats* this, Element* e, Match* m);

type ParsingContext

typedef struct ParsingContext {
	struct Grammar*              grammar;      // The grammar used to parse
	struct Iterator*             iterator;     // Iterator on the input data
	struct ParsingOffset* offsets;      // The parsing offsets, starting at 0
	struct ParsingOffset* current;      // The current parsing offset
	struct ParsingStats*  stats;
} ParsingContext;

constructor ParsingContext

ParsingContext* ParsingContext_new( Grammar* g, Iterator* iterator );

method ParsingContext_text

iterated_t* ParsingContext_text( ParsingContext* this );

destructor ParsingContext_free

void ParsingContext_free( ParsingContext* this );

type ParsingResult

typedef struct ParsingResult {
	char            status;
	Match*          match;
	ParsingContext* context;
} ParsingResult;

constructor ParsingResult

ParsingResult* ParsingResult_new(Match* match, ParsingContext* context);

method ParsingResult_free

Frees this parsing result instance as well as all the matches it referes to.

void ParsingResult_free(ParsingResult* this);

method ParsingResult_isFailure

bool ParsingResult_isFailure(ParsingResult* this);

method ParsingResult_isPartial

bool ParsingResult_isPartial(ParsingResult* this);

method ParsingResult_isComplete

bool ParsingResult_isComplete(ParsingResult* this);

method ParsingResult_text

iterated_t* ParsingResult_text(ParsingResult* this);

method ParsingResult_textOffset

int ParsingResult_textOffset(ParsingResult* this);

method ParsingResult_remaining

size_t ParsingResult_remaining(ParsingResult* this);

The result of recognizing parsing elements at given offsets within the input stream is stored in ParsingOffset. Each parsing offset is a stack of ParsingStep, corresponding to successive attempts at matching parsing elements at the current position.

The parsing offset is a stack of parsing steps, where the tail is the most specific parsing step. By following the tail's previous parsing step, you can unwind the stack.

The parsing steps each have an offset within the iterated stream. Offsets where data has been fully extracted (ie, a leaf parsing element has matched and processing returned a NOTHING) can be freed as they are not necessary any more.

type ParsingOffset

typedef struct ParsingOffset {
	size_t                offset; // The offset matched in the input stream
	ParsingStep*          last;   // The last matched parsing step (ie. corresponding to the most specialized parsing element)
	struct ParsingOffset* next;   // The link to the next offset (if any)
} ParsingOffset;

constructor ParsingOffset

ParsingOffset* ParsingOffset_new( size_t offset );

destructor ParsingOffset_free

void ParsingOffset_free( ParsingOffset* this );

The parsing step allows to memoize the state of a parsing element at a given offset. This is the data structure that will be manipulated and created/destroyed the most during the parsing process.

typedef struct ParsingStep {
	ParsingElement*     element;       // The parsing elemnt we're matching
	char                step;          // The step corresponds to current child's index (0 for token/word)
	unsigned int        iteration;     // The current iteration (on the step)
	char                status;        // Match status `STATUS_{INIT|PROCESSING|FAILED}`
	Match*              match;         // The corresponding match, if any.
	struct ParsingStep* previous;      // The previous parsing step on the parsing offset's stack
} ParsingStep;

constructor ParsingStep

ParsingStep* ParsingStep_new( ParsingElement* element );

destructor ParsingStep_free

void ParsingStep_free( ParsingStep* this );

Processor

typedef struct Processor Processor;

callback ProcessorCallback

typedef void (*ProcessorCallback)(Processor* processor, Match* match);

typedef struct Processor {
	ProcessorCallback   fallback;
	ProcessorCallback*  callbacks;
	int                 callbacksCount;
} Processor;

constructor Processor

Processor* Processor_new( );

method Processor_free

void Processor_free(Processor* this);

method Processor_register

void Processor_register (Processor* this, int symbolID, ProcessorCallback callback) ;

method Processor_process

int Processor_process (Processor* this, Match* match, int step);

Utilities

method Utilities_indent

void Utilities_indent( ParsingElement* this, ParsingContext* context );

method Utilities_dedent

void Utilities_dedent( ParsingElement* this, ParsingContext* context );

method Utilites_checkIndent

Match* Utilites_checkIndent( ParsingElement *this, ParsingContext* context );

Syntax Sugar

The parsing library provides a set of macros that make defining grammars a much easier task. A grammar is usually defined in the following way:

  • leaf symbols (words & tokens) are defined ;
  • compound symbolds (rules & groups) are defined.

Let's take as simple grammar and define it with the straight API:

// Leaf symbols
ParsingElement* s_NUMBER   = Token_new("\\d+");
ParsingElement* s_VARIABLE = Token_new("\\w+");
ParsingElement* s_OPERATOR = Token_new("[\\+\\-\\*\\/]");

// We also attach names to the symbols so that debugging will be easier
ParsingElement_name(s_NUMBER,   "NUMBER");
ParsingElement_name(s_VARIABLE, "VARIABLE");
ParsingElement_name(s_OPERATOR, "OPERATOR");

// Now we defined the compound symbols
ParsingElement* s_Value    = Group_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_NUMBER),   CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_VARIABLE), CARDINALITY_ONE)
NULL
});
ParsingElement* s_Suffix    = Rule_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_OPERATOR),  CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_Value),     CARDINALITY_ONE)
NULL
});
* ParsingElement* s_Expr    = Rule_new((Reference*[3]),{
Reference_cardinality(Reference_Ensure(s_Value),  CARDINALITY_ONE),
Reference_cardinality(Reference_Ensure(s_Suffix), CARDINALITY_MANY_OPTIONAL)
NULL
});

// We define the names as well
ParsingElement_name(s_Value,  "Value");
ParsingElement_name(s_Suffix, "Suffix");
ParsingElement_name(s_Expr, "Expr");

As you can see, this is quite verbose and makes reading the grammar declaration a difficult task. Let's introduce a set of macros that will make expressing grammars much easier.

Symbol declaration & creation

macro SYMBOL

Declares a symbol of name n as being parsing element e.

#define SYMBOL(n,e)       ParsingElement* s_ ## n = ParsingElement_name(e, #n);

macro WORD

Creates a Word parsing element with the given regular expression

#define WORD(v)           Word_new(v)

macro TOKEN

Creates a Token parsing element with the given regular expression

#define TOKEN(v)          Token_new(v)

macro RULE

Creates a Rule parsing element with the references or parsing elements as children.

#define RULE(...)         Rule_new((Reference*[(VA_ARGS_COUNT(__VA_ARGS__)+1)]){__VA_ARGS__,NULL})

macro GROUP

Creates a Group parsing element with the references or parsing elements as children.

#define GROUP(...)        Group_new((Reference*[(VA_ARGS_COUNT(__VA_ARGS__)+1)]){__VA_ARGS__,NULL})

macro PROCEDURE

Creates a Procedure parsing element

#define PROCEDURE(f)      Procedure_new(f)

macro CONDITION

Creates a Condition parsing element

#define CONDITION(f)      Condition_new(f)

Symbol reference & cardinality

macro _S

Refers to symbol n, wrapping it in a CARDINALITY_ONE reference

#define _S(n)             ONE(s_ ## n)

macro _O

Refers to symbol n, wrapping it in a CARDINALITY_OPTIONAL reference

#define _O(n)             OPTIONAL(s_ ## n)

macro _M

Refers to symbol n, wrapping it in a CARDINALITY_MANY reference

#define _M(n)             MANY(s_ ## n)

macro _MO

Refers to symbol n, wrapping it in a CARDINALITY_MANY_OPTIONAL reference

#define _MO(n)            MANY_OPTIONAL(s_ ## n)

macro _AS

Sets the name of reference r to be v

#define _AS(r,v)          Reference_name(Reference_Ensure(r), v)

Supporting macros

The following set of macros is mostly used by the set of macros above. You probably won't need to use them directly.

macro NAME

Sets the name of the given parsing element e to be the name n.

#define NAME(n,e)         ParsingElement_name(e,n)

macro ONE

Sets the given reference or parsing element's reference to CARDINALITY_ONE If a parsing element is given, it will be automatically wrapped in a reference.

#define ONE(v)            Reference_cardinality(Reference_Ensure(v), CARDINALITY_ONE)

macro OPTIONAL

Sets the given reference or parsing element's reference to CARDINALITY_OPTIONAL If a parsing element is given, it will be automatically wrapped in a reference.

#define OPTIONAL(v)       Reference_cardinality(Reference_Ensure(v), CARDINALITY_OPTIONAL)

macro MANY

Sets the given reference or parsing element's reference to CARDINALITY_MANY If a parsing element is given, it will be automatically wrapped in a reference.

#define MANY(v)           Reference_cardinality(Reference_Ensure(v), CARDINALITY_MANY)

macro MANY_OPTIONAL

Sets the given reference or parsing element's reference to CARDINALITY_MANY_OPTIONAL If a parsing element is given, it will be automatically wrapped in a reference.

#define MANY_OPTIONAL(v)  Reference_cardinality(Reference_Ensure(v), CARDINALITY_MANY_OPTIONAL)

Grammar declaration with macros

The same grammar that we defined previously can now be expressed in the following way:

SYMBOL(NUMBER,   TOKEN("\\d+"))
SYMBOL(VAR,      TOKEN("\\w+"))
SYMBOL(OPERATOR, TOKEN("[\\+\\-\\*\\/]"))

SYMBOL(Value,  GROUP( _S(NUMBER),   _S(VAR)     ))
SYMBOL(Suffix, RULE(  _S(OPERATOR), _S(Value)   ))
SYMBOL(Expr,   RULE(  _S(Value),    _MO(Suffix) ))

All symbols will be define as s_XXX, so that you can do:

ParsingGrammar* g = Grammar_new();
g->axiom = s_Expr;

License

Revised BSD License Copyright (c) 2014, FFunction inc (1165373771 Quebec inc) All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the FFunction inc (CANADA) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.