ORegex
Object-oriented regular expressions implementation.
This implementation is based on the original Microsoft regular expression syntax and follows it as closely as possible. To declare a predicate in a pattern, write its name in braces:
{myPredicateName}
...and pass a predicate table to ORegex.
PredicateTable is a simple key-value dictionary of predicates. It accepts lambdas as well as values paired with comparers (IEqualityComparer). Each lambda or value must have a unique name within the pattern.
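As a rough illustration of the idea (not ORegex's actual implementation), a predicate table can be pictured as a dictionary from names to predicates, where a value plus a comparer is wrapped into a predicate. The `SimplePredicateTable` class and its `AddValue` and `Test` methods are hypothetical names made up for this sketch; only `AddPredicate` appears in the examples in this README.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: NOT the library's PredicateTable, just the concept.
public class SimplePredicateTable<T>
{
    private readonly Dictionary<string, Func<T, bool>> _predicates =
        new Dictionary<string, Func<T, bool>>();

    // Register a lambda under a unique name.
    public void AddPredicate(string name, Func<T, bool> predicate)
    {
        // Dictionary.Add throws ArgumentException on a duplicate name.
        _predicates.Add(name, predicate);
    }

    // Register a value; the comparer decides whether an input element equals it.
    public void AddValue(string name, T value, IEqualityComparer<T> comparer = null)
    {
        var cmp = comparer ?? EqualityComparer<T>.Default;
        _predicates.Add(name, x => cmp.Equals(x, value));
    }

    // Evaluate the predicate registered under `name` against one input element.
    public bool Test(string name, T item)
    {
        return _predicates[name](item);
    }
}
```

With this picture in mind, a name such as {myPredicateName} in a pattern is simply a key that is looked up and evaluated against each input element.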
Features
- Default regex engine support;
- Capture groups support;
- Greedy/Lazy support;
- Exact begin/end match support;
- RE2 algorithm with modifications (backtracking is greatly reduced);
- Reverse pattern/input support, including the RightToLeft option;
- Negative/Positive look ahead/behind support.
Syntax
Concatenation:
{a}{b}{c}
Repetition operators:
{a}? - match zero or one times.
{a}* - match zero or more times.
{a}+ - match one or more times.
{a}{n,} - match at least 'n' times.
{a}{n,m} - match between 'n' and 'm' times.
{a}{n,n} - match exactly 'n' times.
All of these operators support the 'lazy' modifier:
{a}+?
{a}{n,}?
...
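Because ORegex follows the Microsoft syntax, the effect of the lazy modifier can be previewed with the standard .NET Regex class on plain characters. The sketch below uses only System.Text.RegularExpressions, not ORegex itself; the `QuantifierDemo` class and `FirstMatch` helper are names made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the first match of `pattern` in `input` (empty string if none).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("aaaa", "a+"));     // greedy: aaaa
        Console.WriteLine(FirstMatch("aaaa", "a+?"));    // lazy:   a
        Console.WriteLine(FirstMatch("aaaa", "a{2,}"));  // greedy: aaaa
        Console.WriteLine(FirstMatch("aaaa", "a{2,}?")); // lazy:   aa
    }
}
```

A greedy operator consumes as many elements as it can, while its lazy counterpart stops at the shortest sequence that still lets the rest of the pattern match.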
Groups and Capturing:
({a}{b})?{c}
(?<groupName>{a}{b})?{c}
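The same group syntax can be tried out on characters with the standard .NET Regex class. This is a sketch of the syntax only, not of ORegex's capture API; `GroupDemo` and `Capture` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the text captured by the named group,
    // or null if the group did not take part in the match.
    public static string Capture(string input, string pattern, string groupName)
    {
        Match m = Regex.Match(input, pattern);
        Group g = m.Groups[groupName];
        return g.Success ? g.Value : null;
    }

    public static void Main()
    {
        const string pattern = "(?<pair>ab)?c"; // optional named group, then 'c'
        Console.WriteLine(Capture("abc", pattern, "pair") ?? "<none>"); // ab
        Console.WriteLine(Capture("xc", pattern, "pair") ?? "<none>");  // <none>
    }
}
```

Because the group is optional, the overall match can succeed while the named capture stays empty, so callers should check whether a capture exists before reading it.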
Lookaround:
(?={a}) - positive lookahead.
(?!{a}) - negative lookahead.
(?<={a}) - positive lookbehind.
(?<!{a}) - negative lookbehind.
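The four lookaround forms behave as in the standard .NET Regex class, so a character-based sketch can show what each one selects. It relies only on System.Text.RegularExpressions; `LookaroundDemo` and `All` are hypothetical names for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Collects the values of all matches of `pattern` in `input`.
    public static string[] All(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input = "a1 b2 a3";
        Console.WriteLine(string.Join(",", All(input, @"[ab](?=2)")));  // b
        Console.WriteLine(string.Join(",", All(input, @"[ab](?!2)")));  // a,a
        Console.WriteLine(string.Join(",", All(input, @"(?<=a)\d")));   // 1,3
        Console.WriteLine(string.Join(",", All(input, @"(?<!a)\d")));   // 2
    }
}
```

In every case the lookaround only tests the neighbouring input; the tested elements are not included in the match.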
Example
Simple prime sequence test:
public void PrimeTest()
{
    //{0} refers to the IsPrime predicate passed after the pattern.
    var oregex = new ORegex<int>("{0}(.{0})*", IsPrime);
    var input = new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
    foreach (var match in oregex.Matches(input))
    {
        Trace.WriteLine(string.Join(",", match.Values));
    }
    //OUTPUT:
    //2
    //3,4,5,6,7
    //11,12,13
}
private static bool IsPrime(int number)
{
    if (number < 2) return false; //0, 1 and negative numbers are not prime.
    if (number == 2) return true;
    int boundary = (int)Math.Floor(Math.Sqrt(number));
    for (int i = 2; i <= boundary; ++i)
    {
        if (number % i == 0) return false;
    }
    return true;
}
Or a more complex test from the NLP field:
public void PersonSelectionTest()
{
    //INPUT_TEXT: Пяточкова Тамара решила выгулять Джека и встретилась с Михаилом А.М.
    //(In English: Tamara Pyatochkova decided to walk Jack and met with Mikhail A.M.)
    var sentence = new Word[]
    {
        new Word("Пяточкова", SemanticType.FamilyName),
        new Word("Тамара", SemanticType.Name),
        new Word("решила", SemanticType.Other),
        new Word("выгулять", SemanticType.Other),
        new Word("Джека", SemanticType.Name),
        new Word("и", SemanticType.Other),
        new Word("встретилась", SemanticType.Other),
        new Word("с", SemanticType.Other),
        new Word("Михаилом", SemanticType.Name),
        new Word("А.", SemanticType.Other),
        new Word("М", SemanticType.Other),
    };
    //Creating table which will contain our predicates.
    var pTable = new PredicateTable<Word>();
    pTable.AddPredicate("Фамилия", x => x.SemType == SemanticType.FamilyName); //Check if word is FamilyName.
    pTable.AddPredicate("Имя", x => x.SemType == SemanticType.Name); //Check if word is simple Name.
    pTable.AddPredicate("Инициал", x => IsInitial(x.Value)); //Complex check if Value is an initial character.
    var oregex = new ORegex<Word>(@"
        {Фамилия}(?<name>{Имя}) //Comments can be written inside pattern...
        |
        (?<name>{Имя})({Фамилия}|{Инициал}{1,2})? /*...even complex ones.*/
    ", pTable);
    var persons = oregex.Matches(sentence).Select(x => new Person(x)).ToArray();
    foreach (var person in persons)
    {
        Console.WriteLine("Person found: {0}, length: {1}", person.Name, person.Words.Length);
    }
    //OUTPUT:
    //Person found: Тамара, length: 2
    //Person found: Джека, length: 1
    //Person found: Михаилом, length: 3
}
public enum SemanticType
{
    Name,
    FamilyName,
    Other,
}
public class Word
{
    public readonly string Value;
    public readonly SemanticType SemType;
    public Word(string value, SemanticType semType)
    {
        Value = value;
        SemType = semType;
    }
}
public class Person
{
    public readonly Word[] Words;
    public readonly string Name;
    public Person(OMatch<Word> match)
    {
        Words = match.Values.ToArray();
        Name = match.OCaptures["name"].First().Values.First().Value;
        //Now just normalize this name and you are good.
    }
}
private static bool IsInitial(string str)
{
    var inp = str.Trim(new[] { '.', ' ', '\t', '\n', '\r' });
    return inp.Length == 1 && char.IsUpper(inp[0]);
}
A good starting point is the Unit Test project, which shows how to use the library; more examples will be added over time. It also contains a test utility that shows how the engine works internally.
Performance
- In general, ORegex is 2-3 times slower than the original .NET Regex; however, it is ~2 times faster on simple patterns without many repetitions.
- The greedy exhaustion test (pattern x+x+y+ on the string 'xxxxxxxxxxxxxxxxxxxx') runs ~20 times faster than the .NET Regex engine. This is achieved by a double finite-state automaton implementation (fast DFA lookup, with a slower NFA command flow for captures), which greatly reduces backtracking.
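To get a feel for the greedy exhaustion case, the pattern can be timed against the stock .NET Regex engine with a small harness like the one below. This is only a sketch: `BacktrackingDemo`, `Time` and the iteration count are made up for illustration, it measures .NET Regex alone (an ORegex run would have to be added for a comparison), and the numbers depend on machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Runs `pattern` against `input` `iterations` times and returns the total elapsed time.
    // `matched` reports whether the pattern matched at all.
    public static TimeSpan Time(string pattern, string input, int iterations, out bool matched)
    {
        var regex = new Regex(pattern);
        matched = regex.IsMatch(input); // warm-up run, also records the result
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        // 20 'x' characters and no 'y': the pattern can never match, so a
        // backtracking engine tries many ways to split the input between
        // the two x+ groups before giving up.
        string input = new string('x', 20);
        bool matched;
        TimeSpan elapsed = Time("x+x+y+", input, 10000, out matched);
        Console.WriteLine("matched: {0}, elapsed: {1} ms", matched, elapsed.TotalMilliseconds);
    }
}
```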
Future
- C/C++-style macro definition support;
- Overlapping capture support.