Where a Gopher wanders around and meets lots of 🌳 Sitters...
First of all, credit where it is due: this repository started as a fork of @smacker's go-tree-sitter repo, until I realized I did not want to also handle the bindings library itself in the same project (i.e. the stuff in the root of that repo, exposing the `sitter.Language` type itself & co.); I just want a (big) collection of all the tree-sitter parsers I can add.
So here it is: it started with the parsers and the automation from the above mentioned repo, then a bunch more parsers were added on top of it and the automation was updated (to support more parsers and also to automatically update the PARSERS.md file, git tags, etc.).
The credits for the parsers go to their respective authors (see grammars.json for the source of each and every parser).
This repository does NOT implement any parsers at all; it simply automates pulling them in from upstream, regenerating them from `grammar.js` and providing the Go bindings for them.
See PARSERS.md for the list of supported parsers. The end goal is to maintain parity with nvim_treesitter and to add any other parsers I find, so that it becomes as complete as possible.
For contributing (or just to see how the automation works) see CONTRIBUTING.md.
- ~490 parsers in this repo vs. ~30 in the parent repo;
- all (but 7) are regenerated from `grammar.js` (via `tree-sitter generate`) instead of copying the pre-generated files from the parser repos;
- end-to-end "tree-sitter version alignment", see below;
- queries fetching mechanism, so that you can get the queries along with the parsers;
- filetype detection mechanism that allows you to quickly determine which parser and/or query you should pull for a given file (see filetype.json);
- 3 different ways you can use parsers;
- kept up to date pretty much on a daily basis with the parsers' changes;
- kept up to date with tree-sitter;
- constantly adding new parsers as they become available, even new/experimental ones;
- test suite to ensure all the parsers in the repo can actually parse content (build & run successfully).
This library is designed to ensure that we are aligned end-to-end with the latest tree-sitter version:
- the bindings library go-tree-sitter-bare (itself forked from the same repo - see its own README for the why and/or the differences),
- the `tree-sitter-cli` in package.json, and consequently
- the generated parsers included,

ALL use the same tree-sitter version (generally the latest).
Contrast that with the parent repo, where the bindings lag quite a bit behind tree-sitter (last updated to v0.22.5) and, on top of that, the parsers' files (`parser.c`, `parser.h`, etc.) are copied from the parser repos, meaning each one is built with whatever version was last used by that parser repo's maintainers, so the versions can vary greatly from one another AND from the version the bindings are built with.
Just as an example, in the parent repo the toml parser was generated with https://github.com/ikatyang/tree-sitter/tree/fc5a692, which is `v0.19.3~10` of tree-sitter, whereas here it was last rebuilt with `v0.24.2` (actually with `0.24.4`, but there were no changes in `parser.c` so nothing was committed).
The language name used is the same as the tree-sitter language name (the name exported by `grammar.js`) and the same as the query folder name in nvim_treesitter (where applicable). This keeps things simple and consistent.
In rare cases, the Go package name differs from the language name:
- `go` actually has the package name `Go`, because `package go` does not go well in Go (pun intended), but otherwise the language name remains "go" (see the import sketch right after this list);
- `func` language, same problem as above, so the package name is actually `FunC` (but everything else is `func` as normal: folder, language name, etc.);
- `context` language, same problem (conflict with the stdlib `context` package), so it uses the name `ConTeXt`;
- `COBOL` language is named `COBOL` in grammar.js but we expose it as `cobol` (to align with the rest of the parsers);
- `dotenv` language is named `env` in grammar.js but we expose it as `dotenv`;
- `walnut` language is named `cwal` in grammar.js but we retain it as `walnut`;
- `janet` language is named `janet_simple` in grammar.js but in here it is simply named `janet`;
- `cgsql` is named `cql` in grammar.js but we expose it as `cgsql`;
- `verilog` from https://github.com/gmlarumbe/tree-sitter-systemverilog is renamed to `systemverilog` to disambiguate it from the plain `verilog` grammar.
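For instance, a minimal sketch of importing a couple of the packages whose Go package name differs from the language name (the exact import paths are an assumption, based on the folder naming described above):

```go
package main

import (
	"fmt"

	// NOTE: illustrative import paths, assuming the folders keep the language names.
	// The exported package name is "Go", since "package go" would not compile.
	golang "github.com/alexaandru/go-sitter-forest/go"
	// Same idea for FunC: folder and language name are "func", package name is "FunC".
	fc "github.com/alexaandru/go-sitter-forest/func"
)

func main() {
	// In standalone mode both return an unsafe.Pointer (see the usage section below).
	fmt.Println(golang.GetLanguage() != nil, fc.GetLanguage() != nil)
}
```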
Also, some languages may have names that are not very straightforward acronyms. In those cases, an `altName` field will be populated, i.e. the `requirements` language has an `altName` of "Pip requirements", `query` has "Tree-Sitter Query Language" and so on. Search grammars.json for your grammar of interest.
See the README in go-tree-sitter-bare, as well as the `example_*.go` files in this repo. This repo only gives you the `GetLanguage()` function; you will still use the sibling repo for all your interactions with the tree.
You can use the parsers in this repo in several ways:
You can use the parsers one (or more) at a time, as you'd use any other Go package:
```go
package main

import (
	"context"
	"fmt"

	"github.com/alexaandru/go-sitter-forest/risor"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')\n")

	node, err := sitter.Parse(context.TODO(), content, sitter.NewLanguage(risor.GetLanguage()))
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(node)
}
```
Please note that in standalone mode, `GetLanguage()` returns an `unsafe.Pointer` rather than a `*sitter.Language`. To pass it to `sitter.Parse()` or `parser.SetLanguage()` you need to wrap it in `sitter.NewLanguage()`, as above.
The rationale is to enable end users to use the parsers with the bindings library of their choice, whether it's @smacker's library or go-tree-sitter-bare.
If (and only IF) you want to use ALL (or most of) the parsers (beware, your binary size will be huge, as in 350MB+ huge) then you can use the root (`forest`) package:
```go
package main

import (
	"context"
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	content := []byte("print('It works!')\n")

	parser := sitter.NewParser()
	parser.SetLanguage(forest.GetLanguage("risor"))

	tree, err := parser.Parse(context.TODO(), nil, content)
	if err != nil {
		panic(err)
	}

	// Do something interesting with the parsed tree...
	fmt.Println(tree.RootNode())
}
```
This way you can fetch and use any of the parsers dynamically, without having to import them manually. You should rarely need this though, unless you're writing a text editor or something.
Note that unlike individual mode, `forest.GetLanguage()` returns a `*sitter.Language`, which can be passed directly to `parser.SetLanguage()`.
A third way, and perhaps the most convenient (no, it's not: it's ~300MB with all the parsers built into the binary, whereas all the parsers built as plugins took ~1400MB for all 354 parsers), is to use the included `Plugins.make` makefile, which allows easy creation of any and all plugins. Simply copy it to your repo, and then you can easily `make -f Plugins.make plugin-risor`, etc. or use the `plugin-all` target, which creates all the plugins.
Then you can selectively load them in your app using Go's plugin mechanism, as sketched below.

IMPORTANT: You MUST use `-trimpath` when building your app when using plugins (the Plugins.make file already includes it, but the app that uses the plugins also needs it).
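As a rough sketch of the plugin route (the plugin path and the exported symbol name are assumptions for illustration; check the Plugins.make output for the actual names), loading a parser built as a plugin could look like this:

```go
package main

import (
	"fmt"
	"plugin"
	"unsafe"

	sitter "github.com/alexaandru/go-tree-sitter-bare"
)

func main() {
	// NOTE: the plugin file name and the "GetLanguage" symbol are assumptions;
	// verify them against what Plugins.make actually produces.
	p, err := plugin.Open("plugins/risor.so")
	if err != nil {
		panic(err)
	}

	sym, err := p.Lookup("GetLanguage")
	if err != nil {
		panic(err)
	}

	// As in standalone mode, GetLanguage() yields an unsafe.Pointer,
	// so wrap it with sitter.NewLanguage() before use.
	getLang := sym.(func() unsafe.Pointer)
	lang := sitter.NewLanguage(getLang())

	fmt.Println(lang != nil)
}
```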
You can mix and match the above, obviously.
Probably the best approach would be to build your own "mini-forest", using the forest package as a template but only including the languages you are interested in.
I'm not excluding offering "mini forests" in the future, guarded by build tags, if I ever figure out some subsets that make sense (most used/popular/known/whatever).
Each individual parser (as well as the bulk loader) offers an `Info()` function which can be used to retrieve information about a parser. It exposes its entry from `grammars.json`, either raw (as a string holding the JSON encoded entry) or as an object (only available in bulk mode). The returned `Grammar` type implements `Stringer`, so it should give a nice summary when printed (to screen or logs, etc.).
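For example, a quick sketch of both flavors (the exact signatures, in particular whether the bulk `Info()` takes the language name as shown here, are assumptions; check the generated packages and forest.go):

```go
package main

import (
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
	"github.com/alexaandru/go-sitter-forest/risor"
)

func main() {
	// Standalone mode: the raw, JSON encoded grammars.json entry as a string.
	fmt.Println(risor.Info())

	// Bulk (forest) mode: the entry as a Grammar object; since Grammar implements
	// Stringer, printing it should give a nice summary.
	fmt.Println(forest.Info("risor"))
}
```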
The root package not only includes the "parsers forest" but also the corresponding queries. The queries are compiled from two sources:
- the nvim_treesitter project, and
- the individual sitter repos' own `queries` folders.
The queries are embedded in the packages (at the time of writing, for 359 parsers, the queries are only 11MB) and they can be fetched exactly the same way as the languages: just replace `GetLanguage()` with `GetQuery(kind)` or `forest.GetQuery(lang, kind)`. They are available for standalone packages, plugins, as well as the forest itself.
The kind is one of {`highlights`, `indent`, `folds`, etc.} (preferably without the ".scm" extension, but it will work with it included as well). E.g. to get the highlights query for Go, one would call `forest.GetQuery("go", "highlights")`.
You can optionally pass the query lookup preference; see `NvimFirst`, `NativeFirst`, etc. in `forest.go` for details, as in: `forest.GetQuery("go", "highlights", forest.NvimOnly)`.
At the forest level, the queries respect the "inherits:" directive (nvim_treesitter specific), recursively, returning the final query with all the inherited queries included. The individual packages' own `GetQuery()` obviously cannot do that, since they do not have access to other parsers' queries; only the forest has that. See `forest.GetQuery()` for how to replicate that on your end if using the individual packages.
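Putting it together, a minimal sketch of fetching a query via the forest (assuming `GetQuery` returns just the query source; check forest.go for the exact signature and return type):

```go
package main

import (
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
)

func main() {
	// Fetch the highlights query for Go; the kind may be passed with or without
	// the ".scm" extension, and the lookup preference argument is optional.
	// Going through the forest, any "inherits:" directives are already resolved.
	query := forest.GetQuery("go", "highlights", forest.NvimFirst)

	// Feed the query source to your bindings library's query API from here.
	fmt.Println(query)
}
```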
The root package also includes a file type detector: `forest.DetectLanguage(<abs path|rel path|filename>)`. For best results, the absolute path to the file should be provided, as that enables all the available detectors, in order of priority:

- shebang or vim modeline - whichever is available in the first 255 bytes of the file;
- glob matching against the path tail (i.e. `*/*/foo.txt` will match `.../a/b/foo.txt` regardless of the rest of the path);
- file name;
- file extension.
The language name returned is obviously the same as the parser and query name.
You can optionally register your own "patterns" (only for languages that are part of the forest, as they are validated against it) or override existing ones (particularly useful where file extensions clash, like both V and Verilog using the `.v` extension - you can opt for one or the other, etc.). See `forest.RegisterLanguage()` for details.
You can inspect the mapping in the filetype.json file.
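For illustration, a minimal sketch of the detection flow (assuming `DetectLanguage` takes a path and returns just the language name; check forest.go for the actual signature):

```go
package main

import (
	"fmt"

	forest "github.com/alexaandru/go-sitter-forest"
)

func main() {
	// Prefer an absolute path, so the shebang/modeline detector can run as well.
	// NOTE: a single string return value is an assumption; check forest.go.
	lang := forest.DetectLanguage("/home/me/projects/demo/main.go")

	// The detected name matches the parser and query names, so it can be passed
	// straight to forest.GetLanguage(lang) or forest.GetQuery(lang, "highlights").
	fmt.Println(lang)
}
```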
For transparency, any and all changes made to the parsers' files (and, to be clear, I include in this term ALL the files coming from the parsers, not just parser.c) are documented below. ALL changes are fully automated (any exceptions are noted below); no change is ever made manually, so inspecting the automation should give you a clear picture of all the changes performed to the code, which are detailed below:
- the include paths are rewritten to use a flat structure (i.e. `"tree_sitter/parser.h"` becomes `"parser.h"`); this is needed so that the files are part of the same package, plus it also makes the automation simpler;
- for `unison`, the scanner file includes `maybe.c`, which causes `cgo` to include the file twice and throw a duplicate symbols error. The solution chosen was to copy the content of the included file into the scanner file and set the included file to zero bytes; this way all the code is in one file and the compilation is possible;
- similar to `unison`, `comment` and `perl` also use the same technique of combining C files;
- for parsers that include a `tag.h` file: the `TAG_TYPES_BY_TAG_NAME` variable clashes between them (when those parsers are all included into one app). The solution chosen was to rename the variable by adding the `_<lang>` suffix, i.e. we currently have:
  - `TAG_TYPES_BY_TAG_NAME_astro`;
  - `TAG_TYPES_BY_TAG_NAME_html`;
  - `TAG_TYPES_BY_TAG_NAME_svelte`;
  - `TAG_TYPES_BY_TAG_NAME_vue`;
- for parsers that define `serialize()`, `deserialize()`, `scan()` (and a few others) (i.e. `org`, `beancount`, `html` & a few others): the offending identifiers are renamed by appending the `_<lang>` suffix to them (i.e. `serialize` -> `serialize_org`, etc.); see the `putFile()` function in `internal/automation/main.go` for details;
- some parsers' `grammar.js` files were not yet updated to work with the latest tree-sitter, in which case we hot patch them before regenerating the parser. See the `replMap` in the `downloadGrammar()` function.