Object-oriented regular expressions. The syntax is the same as .NET regular expressions.


Keywords
.NET, Regex, ORegex, C#
License
BSD-3-Clause
Install
Install-Package ORegex -Version 2.0.1

Documentation

ORegex

An object-oriented regular expressions implementation.

This implementation is based on the original Microsoft regular expression syntax and follows it as closely as possible. To declare a predicate in a pattern, type:

{myPredicateName}

...and pass a predicate table to ORegex.

PredicateTable is a simple key-value dictionary for predicates. Predicate tables accept lambdas as well as values paired with a comparer (IEqualityComparer). Each lambda or value must have a unique name inside the pattern.

Features

  • Default regex engine support;
  • Capture groups support;
  • Greedy/Lazy support;
  • Exact begin/end match support;
  • RE2 algorithm with modifications (backtracking is greatly reduced);
  • Reverse pattern/input support. Also, this includes RightToLeft option support;
  • Negative/Positive look ahead/behind support.

Syntax

Concatenation:

{a}{b}{c}
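To make the idea of concatenating predicates concrete, here is a minimal, self-contained sketch that does not use ORegex at all. It naively slides a window over a sequence and reports every start index where the elements satisfy a list of predicates in order. The class name `ConcatenationSketch`, the method `FindAll`, and the odd/even predicates are all hypothetical, and this is not how the ORegex engine works internally. Unlike a regex `Matches` call, it also reports overlapping windows.

```csharp
using System;
using System.Collections.Generic;

public static class ConcatenationSketch
{
    // Returns every start index where input[start + i] satisfies predicates[i] for all i.
    // Naive illustration only; overlapping windows are reported too.
    public static List<int> FindAll<T>(IReadOnlyList<T> input, params Func<T, bool>[] predicates)
    {
        var starts = new List<int>();
        for (int start = 0; start + predicates.Length <= input.Count; start++)
        {
            bool ok = true;
            for (int i = 0; i < predicates.Length && ok; i++)
            {
                ok = predicates[i](input[start + i]);
            }
            if (ok) starts.Add(start);
        }
        return starts;
    }

    public static void Main()
    {
        var input = new[] { 1, 2, 4, 7, 8, 10, 3 };
        // Hypothetical predicates playing the role of {odd}{even}{even}.
        var starts = FindAll<int>(input, x => x % 2 != 0, x => x % 2 == 0, x => x % 2 == 0);
        Console.WriteLine(string.Join(",", starts)); // prints "0,3"
    }
}
```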

Repetition operators:

{a}?            - match zero or one times.
{a}*            - match zero or more times.
{a}+            - match one or more times.
{a}{n,}         - match at least 'n' times.
{a}{n,m}        - match between 'n' and 'm' times.
{a}{n,n}        - match exactly 'n' times.

Note that all of these operators support the 'lazy' modifier:

{a}+?
{a}{n,}?
...
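Because the syntax follows .NET regular expressions, the difference between greedy and lazy repetition can be illustrated with the standard `System.Text.RegularExpressions.Regex` class on plain characters. This is only a sketch of the semantics; it assumes, as the text above states, that ORegex applies the same rules to predicates. The class name `GreedyLazyDemo` is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Returns the text of the first match of pattern in input.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("aaaa", "a+"));     // greedy: prints "aaaa"
        Console.WriteLine(FirstMatch("aaaa", "a+?"));    // lazy: prints "a"
        Console.WriteLine(FirstMatch("aaaa", "a{2,}?")); // lazy with a minimum: prints "aa"
    }
}
```

A greedy operator takes as many repetitions as it can, while its lazy form stops at the minimum that still lets the overall pattern match.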

Groups and Capturing:

({a}{b})?{c}
(?<groupName>{a}{b})?{c}
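The same grouping syntax exists in the standard .NET `Regex` class, so a character-based sketch can show what an optional named group captures. The pattern below mirrors `(?<groupName>{a}{b})?{c}` with the letters a, b and c standing in for predicates. The class name `NamedGroupDemo` is hypothetical, and the ORegex capture API itself may differ from what is shown here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the captured text of the named group, or null if the group did not participate.
    public static string Capture(string input)
    {
        Match match = Regex.Match(input, "(?<groupName>ab)?c");
        Group group = match.Groups["groupName"];
        return group.Success ? group.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Capture("abc") ?? "(no capture)"); // prints "ab"
        Console.WriteLine(Capture("c") ?? "(no capture)");   // prints "(no capture)"
    }
}
```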

Lookaround:

(?={a})         - positive lookahead.
(?!{a})         - negative lookahead.
(?<={a})        - positive lookbehind.
(?<!{a})        - negative lookbehind.
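The four lookaround forms behave as in the standard .NET `Regex` class: they test the surrounding context without consuming it. The following sketch shows each form on a plain string, with characters standing in for predicates. The class name `LookaroundDemo` and the sample input are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the text of every match of pattern in input.
    public static string[] All(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        const string input = "a1 b2 a3";
        Console.WriteLine(string.Join(",", All(input, @"[ab](?=3)")));  // positive lookahead: prints "a"
        Console.WriteLine(string.Join(",", All(input, @"[ab](?!3)")));  // negative lookahead: prints "a,b"
        Console.WriteLine(string.Join(",", All(input, @"(?<=a)\d")));   // positive lookbehind: prints "1,3"
        Console.WriteLine(string.Join(",", All(input, @"(?<!a)\d")));   // negative lookbehind: prints "2"
    }
}
```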

Example

Simple prime sequence test:

    public void PrimeTest()
    {
        var oregex = new ORegex<int>("{0}(.{0})*", IsPrime);
        var input = new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
        foreach (var match in oregex.Matches(input))
        {
            Trace.WriteLine(string.Join(",", match.Values));
        }

        //OUTPUT:
        //2
        //3,4,5,6,7
        //11,12,13
    }

    private static bool IsPrime(int number)
    {
        //Numbers below 2 (0, 1 and negatives) are not prime.
        if (number < 2) return false;
        int boundary = (int)Math.Floor(Math.Sqrt(number));
        for (int i = 2; i <= boundary; ++i)
        {
            if (number % i == 0) return false;
        }
        return true;
    }

Or a more complex test from the NLP field:

    public void PersonSelectionTest()
    {
        //INPUT_TEXT: Пяточкова Тамара решила выгулять Джека и встретилась с Михаилом А.М.
        //(English: Tamara Pyatochkova decided to walk Jack and met with Mikhail A.M.)
        var sentence = new Word[]
        {
            new Word("Пяточкова", SemanticType.FamilyName),
            new Word("Тамара", SemanticType.Name),
            new Word("решила", SemanticType.Other),
            new Word("выгулять", SemanticType.Other),
            new Word("Джека", SemanticType.Name),
            new Word("и", SemanticType.Other),
            new Word("встретилась", SemanticType.Other),
            new Word("с", SemanticType.Other),
            new Word("Михаилом", SemanticType.Name),
            new Word("А.", SemanticType.Other),
            new Word("М", SemanticType.Other),
        };

        //Creating table which will contain our predicates.
        var pTable = new PredicateTable<Word>();
        pTable.AddPredicate("Фамилия", x => x.SemType == SemanticType.FamilyName);  //Check if word is FamilyName.
        pTable.AddPredicate("Имя", x => x.SemType == SemanticType.Name);            //Check if word is simple Name.
        pTable.AddPredicate("Инициал", x => IsInitial(x.Value));                    //Complex check if Value is an initial character.

        var oregex = new ORegex<Word>(@"
            {Фамилия}(?<name>{Имя})                     //Comments can be written inside pattern...
            |
            (?<name>{Имя})({Фамилия}|{Инициал}{1,2})?  /*...even complex ones.*/
        ", pTable);

        var persons = oregex.Matches(sentence).Select(x => new Person(x)).ToArray();

        foreach (var person in persons)
        {
            Console.WriteLine("Person found: {0}, length: {1}", person.Name, person.Words.Length);
        }

        //OUTPUT:
        //Person found: Тамара, length: 2
        //Person found: Джека, length: 1
        //Person found: Михаилом, length: 3
    }

    public enum SemanticType
    {
        Name,
        FamilyName,
        Other,
    }

    public class Word
    {
        public readonly string Value;
        public readonly SemanticType SemType;

        public Word(string value, SemanticType semType)
        {
            Value = value;
            SemType = semType;
        }
    }

    public class Person
    {
        public readonly Word[] Words;
        public readonly string Name;
        public Person(OMatch<Word> match)
        {
            Words = match.Values.ToArray();
            Name = match.OCaptures["name"].First().Values.First().Value;
            //Now just normalize this name and you are good.
        }
    }

    private static bool IsInitial(string str)
    {
        var inp = str.Trim(new[] { '.', ' ', '\t', '\n', '\r' });
        return inp.Length == 1 && char.IsUpper(inp[0]);
    }

A good starting point is the Unit Test project, which shows how to use the library; more examples will be added over time. It also contains a test utility that lets you see how things work inside the engine.

Performance

  • In general, ORegex is 2-3 times slower than the original .NET Regex; on simple patterns without many repetitions, however, it is ~2 times faster.
  • The greedy exhaustion test (the x+x+y+ pattern on the string 'xxxxxxxxxxxxxxxxxxxx') runs ~20 times faster than the .NET Regex engine. This is achieved through a double finite-state automaton implementation (fast DFA lookup, slow NFA command flow on captures), which greatly reduces backtracking.
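For readers who want a baseline to compare against, the greedy exhaustion case can be reproduced with the standard .NET `Regex` class alone. The sketch below times the `x+x+y+` pattern against the string quoted above; the class name `GreedyExhaustBenchmark` and the iteration count are arbitrary choices, and the ORegex side of the comparison is deliberately left out because its setup is not shown here. Timings depend on the machine and runtime, so no figures are claimed.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class GreedyExhaustBenchmark
{
    // Runs the pattern repeatedly and returns the last match result with the total elapsed time.
    public static (bool Matched, TimeSpan Elapsed) Run(int iterations)
    {
        var regex = new Regex("x+x+y+");
        const string input = "xxxxxxxxxxxxxxxxxxxx"; // no 'y', so every attempt must fail
        bool matched = false;

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            matched = regex.IsMatch(input);
        }
        stopwatch.Stop();

        return (matched, stopwatch.Elapsed);
    }

    public static void Main()
    {
        var result = Run(10000);
        Console.WriteLine("Matched: {0}, elapsed: {1} ms", result.Matched, result.Elapsed.TotalMilliseconds);
    }
}
```

The input never matches, so a backtracking engine has to try every way of splitting the run of x's between the two `x+` groups before giving up, which is what makes this case a useful stress test.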

Future

  • C/C++ macro definition support;
  • Overlapping capture support.