Apply the unfold transform to a collection of terminal nodes in the parse tree. This program is part of the Trash toolkit.


Keywords
antlr, antlr4, refactoring, transformation, xpath
License
MIT
Install
Install-Package trunfold -Version 0.22.0

Documentation

Trash

Status: The toolset is undergoing a large rewrite due to the way parse trees are represented. Some tools have not been rewritten yet.

The repo g4-scripts contains a collection of Bash scripts that use Trash to check or find properties of Antlr grammars and parse trees. It is the best place to see Trash in action. You can also read about Trash details in my blog.

Trash is a collection of ~40 command-line tools to analyze and transform Antlr parse trees and grammars. The toolkit can: generate a parser application for an Antlr4 grammar for any target and any OS; analyze the grammar for common problems; automate changes applied to a grammar scraped from a specification; transform parse trees for transpiling and preprocessing source code. With the Antlr toolkit and the collection of Antlr grammars, one can write programming language tools quickly and easily.

The toolkit is designed around a JSON representation of parse trees and command-line tools that read, modify, and write those trees via standard input and output. Complex refactorings can be achieved by chaining different commands together.

Each app in Trash is implemented as a Dotnet Tool console application, and can be used on Windows, Linux, or Mac. The only prerequisites are the .NET SDK and the toolchains for any other targets you want to use.

The toolkit uses Antlr and XPath2. The code is implemented in C#.

The toolkit was used to scrape and refactor the Dart2 grammar from its specification. See this script.

Installation

Requirements

Install Dotnet 7.0.x

Install

Copy this script and execute it in a command-line prompt.

dotnet tool install -g trcaret
dotnet tool install -g trcombine
dotnet tool install -g trconvert
dotnet tool install -g trcover
dotnet tool install -g trdelete
dotnet tool install -g trdeltree
dotnet tool install -g trfoldlit
dotnet tool install -g trgen
dotnet tool install -g trglob
dotnet tool install -g triconv
dotnet tool install -g trinsert
dotnet tool install -g trjson
dotnet tool install -g trparse
dotnet tool install -g trperf
dotnet tool install -g trrename
dotnet tool install -g trreplace
dotnet tool install -g trsplit
dotnet tool install -g trsponge
dotnet tool install -g trstrip
dotnet tool install -g trtext
dotnet tool install -g trtokens
dotnet tool install -g trtree
dotnet tool install -g trunfold
dotnet tool install -g trwdog
dotnet tool install -g trxgrep
dotnet tool install -g trxml
dotnet tool install -g trxml2

Uninstall

dotnet tool uninstall -g trcaret
dotnet tool uninstall -g trcombine
dotnet tool uninstall -g trconvert
dotnet tool uninstall -g trcover
dotnet tool uninstall -g trdelete
dotnet tool uninstall -g trdeltree
dotnet tool uninstall -g trfoldlit
dotnet tool uninstall -g trgen
dotnet tool uninstall -g trglob
dotnet tool uninstall -g triconv
dotnet tool uninstall -g trinsert
dotnet tool uninstall -g trjson
dotnet tool uninstall -g trparse
dotnet tool uninstall -g trperf
dotnet tool uninstall -g trrename
dotnet tool uninstall -g trreplace
dotnet tool uninstall -g trsplit
dotnet tool uninstall -g trsponge
dotnet tool uninstall -g trstrip
dotnet tool uninstall -g trtext
dotnet tool uninstall -g trtokens
dotnet tool uninstall -g trtree
dotnet tool uninstall -g trunfold
dotnet tool uninstall -g trwdog
dotnet tool uninstall -g trxgrep
dotnet tool uninstall -g trxml
dotnet tool uninstall -g trxml2
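
The install and uninstall lists above differ only in the dotnet tool subcommand, so both can be driven from a single list of tool names. The following is a minimal sketch, not something shipped with Trash: the function name trash_tools and the DRY_RUN variable are made up for illustration, and the tool names are copied from the install list above.

```shell
# Hypothetical helper (not shipped with Trash). Tool names are copied from
# the install list above.
TRASH_TOOLS="trcaret trcombine trconvert trcover trdelete trdeltree trfoldlit
trgen trglob triconv trinsert trjson trparse trperf trrename trreplace trsplit
trsponge trstrip trtext trtokens trtree trunfold trwdog trxgrep trxml trxml2"

# trash_tools ACTION: run "dotnet tool ACTION -g TOOL" for every tool.
# ACTION is "install" or "uninstall". With DRY_RUN=1 (an assumption of this
# sketch), the commands are printed instead of executed.
trash_tools() {
  action="$1"
  for tool in $TRASH_TOOLS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "dotnet tool $action -g $tool"
    else
      dotnet tool "$action" -g "$tool"
    fi
  done
}
```

Running DRY_RUN=1 trash_tools install prints the commands without executing them, which is a cheap way to check the list before touching the global tool directory.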

List of commands

NB: Out of date

  1. tranalyze -- Analyze a grammar
  2. trcombine -- Combine a split Antlr4 grammar
  3. trconvert -- Convert a grammar from one form to another
  4. trdelabel -- Remove labels from an Antlr4 grammar
  5. trdelete -- Delete nodes in a parse tree
  6. trdot -- Print a parse tree in Graphviz Dot format
  7. trenum -- Not functional, to enumerate strings from grammar.
  8. trfirst -- Outputs first sets of a grammar
  9. trfold -- Perform fold transform on a grammar
  10. trfoldlit -- Perform fold transform on grammar with literals
  11. trformat -- Format a grammar
  12. trgen -- Generate an Antlr4 parser for a given target language
  13. trgen2 -- Generate files from template and XML doc list.
  14. trgroup -- Perform a group transform on a grammar
  15. trinsert -- Insert string into points in a parse tree
  16. tritext -- Get strings from a PDF file
  17. trjson -- Print a parse tree in JSON structured format
  18. trkleene -- Perform a Kleene transform of a grammar
  19. trmove -- Move nodes in a parse tree
  20. trparse -- Parse a grammar or use a generated parser to parse input
  21. trperf -- Perform performance analysis of an Antlr grammar parse
  22. trpiggy -- Perform a parse tree rewrite
  23. trprint -- Print a parse tree, including off-token characters
  24. trrename -- Rename symbols in a grammar
  25. trreplace -- Replace nodes in a parse tree with text
  26. trrr -- (No description.)
  27. trrup -- Remove useless parentheses in a grammar
  28. trsem -- Read static semantics and generate code
  29. trsort -- Sort rules in a grammar
  30. trsplit -- Split a combined Antlr4 grammar
  31. trsponge -- Extract parsing results output of Trash command into files
  32. trst -- Print a parse tree in Antlr4 ToStringTree()
  33. trstrip -- Strip a grammar of all actions, labels, etc.
  34. trtext -- Print a parse tree with a specific interval
  35. trthompson -- (No description.)
  36. trtokens -- Print tokens in a parse tree
  37. trtree -- Print a parse tree in a human-readable format
  38. trull -- Transform a grammar with upper- and lowercase string literals
  39. trunfold -- Perform an unfold transform on a grammar
  40. trungroup -- Perform an ungroup transform on a grammar
  41. trwdog -- Kill a program that runs too long
  42. trxgrep -- "Grep" for nodes in a parse tree using XPath
  43. trxml -- Print a parse tree in XML structured format
  44. trxml2 -- Print an enumeration of all paths in a parse tree to leaves

Examples

Parse a grammar, create a parser for the grammar, build, and test

git clone https://github.com/antlr/grammars-v4
cd grammars-v4/python/python
trparse *.g4 | trxgrep ' //grammarDecl' | trtext
# Output:
# PythonLexer.g4:lexer grammar PythonLexer;
# PythonParser.g4:parser grammar PythonParser;
trgen
cd Generated
dotnet build
cat - <<EOF | trparse | trxgrep ' //test' | trtext
x == y
x == y if z == b else a == u
lambda: a
lambda x, y: a
EOF
# Output:
# a
# lambda x, y: a
# a
# lambda: a
# a == u
# x == y if z == b else a == u
# x == y

Display parse tree

trparse -i "a == b" | trtree

trtree is only one of several ways to view parse tree data. Other programs produce different output: trjson for JSON, trxml for XML, trst for Antlr runtime ToStringTree output, trdot for Graphviz Dot, trprint for the input text of the parse, and tragl.

Convert grammars to Antlr4

trparse ada.g2 | trconvert | trprint | less

This command parses an old Antlr2 grammar using trparse, converts the parse tree data to Antlr4 syntax using trconvert, and finally prints the converted grammar (ada.g4) using trprint. Other grammars that can be converted are Antlr3, Bison, and ISO EBNF. To use a grammar to parse data, you must first convert it to an Antlr4 grammar.

Generate an Arithmetic parser application

mkdir foobar; cd foobar; trgen

This command creates a parser application for the C# target. If executed in an empty directory, as in the example shown above, trgen creates an application using the Arithmetic grammar. If executed in a directory containing an Antlr Maven plugin file (pom.xml), trgen creates a program according to the information specified in the pom.xml file. Either way, it creates a directory Generated/, and places the source code there.

trgen has many options to generate a parser from any Antlr4 grammar, for any target. However, if a parser is generated for the C# target and built using the .NET SDK, then trparse can execute the generated parser, and it can be used with all the other tools in Trash. NB: In order to use the generated parser application, you must first build it:

dotnet restore Generated/Test.csproj
dotnet build Generated/Test.csproj

Run the generated parser application

trparse -i "1+2+3" | trtree

After using trgen to generate a parser program in C#, shown previously, and after building the program, you can run the parser using trparse. This program looks for the generated parser in directory Generated/. If it exists, it will run the parser application in the directory. You can pass as command-line arguments an input string or input file. If no command-line arguments are supplied, the program will read stdin. The output of trparse, as with most tools of Trash, is parse tree data.

Find nodes in the parse tree using XPath

mkdir empty; cd empty; trgen; dotnet build Generated/Test.csproj; \
    trparse -i "1+2+3" | trxgrep " //SCIENTIFIC_NUMBER" | trst

With this command, a directory is created, a parser for the Arithmetic grammar is generated and built, and then the parser is run using trparse. The trparse tool unifies all parsing, whether it's parsing a grammar or parsing input using a generated parser application. The output from the trparse tool is a parse tree which you can search. trxgrep is the generalized search program for parse trees; it uses XPath expressions to precisely identify nodes in the parse tree.

XPath was added to Antlr4, but Trash takes the idea further with the addition of an XPath2 engine ported from the Eclipse Web toolkit. XPath is a well-defined language that should be used more often in compiler construction.

Rename a symbol in a grammar, generate a parser for new grammar

trparse Arithmetic.g4 | trrename "//parserRuleSpec//labeledAlt//RULE_REF[text() = 'expression']" "xxx" | trtext > new-source.g4
trparse Arithmetic.g4 | trrename -r "expression,expression_;atom,atom_;scientific,scientific_" | trprint

In these two examples, the Arithmetic grammar is parsed. trrename reads the parse tree data and renames symbols in two ways: the first uses an XPath expression to identify the LHS terminal symbol of the expression rule; the second assumes the tree is an Antlr4 grammar parse tree and applies a semicolon-separated list of paired renames. The resulting code is reconstructed and saved. trrename does not rename symbols in actions, nor does it rename identifiers corresponding to the grammar symbols in any support source code (but it could if the tool is extended).
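
The -r argument in the second command is a semicolon-separated list of old,new pairs. When many rules receive the same suffix, that list can be generated rather than typed by hand. The following is a minimal sketch; rename_pairs is a hypothetical helper for illustration, not a Trash command.

```shell
# rename_pairs SUFFIX NAME...: build "old,new;old,new" pairs by appending
# SUFFIX to each NAME. Hypothetical helper, not part of Trash.
rename_pairs() {
  suffix="$1"
  shift
  out=""
  for name in "$@"; do
    # Prefix a ';' separator only when out already holds a pair.
    out="${out:+$out;}$name,$name$suffix"
  done
  printf '%s\n' "$out"
}

# Example (requires Trash), using the -r option shown above:
#   trparse Arithmetic.g4 | trrename -r "$(rename_pairs _ expression atom scientific)" | trprint
```

Called as rename_pairs _ expression atom scientific, the helper produces the same string that is passed to -r in the example above.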

Count method declarations in a Java source file

git clone https://github.com/antlr/grammars-v4.git; \
    cd grammars-v4/java/java9; \
    trgen; dotnet build Generated/Test.csproj;\
    trparse examples/AllInOne8.java | trxgrep " //methodDeclaration" | trst | wc

This command clones the Antlr4 grammars-v4 repo, generates and builds a parser for the Java9 grammar, then runs the parser on examples/AllInOne8.java. The parse tree is piped to trxgrep to find all nodes of type methodDeclaration; trst converts the matches to string form, and wc counts the result.

Strip a grammar of all non-essential CFG

trparse Java9.g4 | trstrip | trtext > Essential-Java9.g4

Split a grammar

Since Antlr2, one can write a combined parser/lexer grammar in one file, or a split parser/lexer grammar in two files. While it's not hard to split or combine a grammar, it's tedious. For automating transformations, it's necessary because Antlr4 requires the grammars to be split when super classes are needed for different targets.

trcombine ArithmeticLexer.g4 ArithmeticParser.g4 | trprint > Arithmetic.g4

This command calls trcombine which parses two split grammar files ArithmeticLexer.g4 and ArithmeticParser.g4, and creates a combined grammar for the two.

trparse Arithmetic.g4 | trsplit | trsponge -o true

This command calls trsplit, which splits the grammar into two parse tree results, one that defines ArithmeticLexer.g4 and the other that defines ArithmeticParser.g4. The tool trsponge is similar to tee in Linux: the parse tree data is split and placed in files.

Parsing Result Sets -- the data passed between commands

A parsing result set is a JSON serialization of an array of:

  • A set of parse tree nodes.
  • Parser information related to the parse tree nodes.
  • Lexer information related to the parse tree nodes.
  • The name of the input corresponding to the parse tree nodes.
  • The input text corresponding to the parse tree nodes.

Most commands in Trash read and/or write parsing result sets.
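
Because a parsing result set travels between commands as JSON text on standard input and output, an intermediate result can be saved with ordinary shell tools and inspected later. The following is a minimal sketch that assumes only the standard tee utility; capture_stage and the file name result.json are hypothetical names chosen for illustration.

```shell
# capture_stage FILE: copy stdin to FILE and pass it through unchanged.
# Hypothetical helper built on the standard tee utility.
capture_stage() {
  tee "$1"
}

# Example (requires Trash and a built parser in Generated/):
#   trparse -i "1+2+3" | capture_stage result.json | trtree
```

The saved file can then be opened in any JSON viewer to see what one command handed to the next.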

Supported grammars

Grammar      File suffix
Antlr4       .g4
Antlr3       .g3
Antlr2       .g2
Bison        .y
LBNF         .cf
W3C EBNF     .ebnf
ISO 14977    .iso14977, .iso
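
The suffix table can be mirrored in a script that sorts a mixed directory of grammar files before handing them to trparse. The following is a minimal sketch; grammar_kind is a hypothetical helper that looks only at the file suffix, and Trash's own file-type detection may work differently.

```shell
# grammar_kind FILE: print the grammar kind from the table above, based
# only on the file suffix. Hypothetical helper, not part of Trash.
grammar_kind() {
  case "$1" in
    *.g4) echo "Antlr4" ;;
    *.g3) echo "Antlr3" ;;
    *.g2) echo "Antlr2" ;;
    *.y) echo "Bison" ;;
    *.cf) echo "LBNF" ;;
    *.ebnf) echo "W3C EBNF" ;;
    *.iso14977 | *.iso) echo "ISO 14977" ;;
    *)
      # Suffix not listed in the table.
      echo "unknown"
      return 1
      ;;
  esac
}
```

For example, grammar_kind ada.g2 prints Antlr2, while a file with an unlisted suffix prints unknown and returns a non-zero status.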

Analysis

Recursion

Refactoring

Trash provides a number of transformations that can help to make grammars cleaner (reformatting), more readable (reducing the length of the RHS of a rule), and more efficient (reducing the number of non-terminals) for Antlr.

Some of these refactorings are very specific for Antlr due to the way the parser works, e.g., converting a prioritized chain of productions recognizing an arithmetic expression to a recursive alternate form. The refactorings implemented are:

Raw tree editing

Reordering

Changing rules

Splitting and combining

Conversion


The source code for the extension is open source, free of charge, and free of ads. For the latest developments on the extension, check out my blog.

Building

git clone https://github.com/kaby76/Domemtech.Trash
cd Domemtech.Trash
make clean; make; make install

You must have the .NET SDK installed to build and run.

Current release

0.21.12

Prior Releases

0.18.1 -- Nov 12, 2022

Re-adding CI tests and stabilizing the tools.

0.18.0 -- Nov 7, 2022

  • Adding Xalan code.
  • Fix #180.
  • Fix crash in trgen antlr/grammars-v4#2818.
  • Fix #134.
  • Add -e option to trrename.
  • Update Antlr4BuildTasks version.
  • Fix #197, #198.
  • Fix trparse exit code.
  • Add --quiet option to trparse to just get exit code.
  • Change trgen templates to remove -file option, make file name parsing the default.

Roadmap/Goals

Trash is a long-term project (already going on 3 years). For the "first" release, I envision support for:

  • grep utility that finds data in parse trees ✓
  • print a parse tree in various formats ✓
  • sponge (converts parse tree data into files) ✓
  • be able to specify analyses and refactorings via high-level specifications
    • basic refactorings (insert, delete, rename, reorder, split, combine, fold, unfold) ✓
    • piggy -- a parse tree rewriter
  • basic analyses (indirect and direct recursion, infinite recursion, LL(1), LR(1), LALR(1), SLR(1), LR(0), etc)
  • grammar extraction from pdfs and text
  • extract context-free grammars directly from source code via machine learning
  • reading and conversion of ABNF, Antlr2/3/4, Bison, Coco/R, ISO14977, JavaCC, Lark, LBNF, Pegen, Peg.js, Pest, Rex, W3C EBNF, XText ~✓

If you have any questions, email me at ken.domino gmail.com