Query the Hack AST in a light and portable manner.
Just want linters? look here :)
Want to build your own tools on top of the AST?:
composer require hershel-theodore-layton/portable-hack-ast hershel-theodore-layton/portable-hack-ast-extras
- Read this README
- Familiarize yourself with Node kinds and Members
- Autocomplete your way through the API
- ???
- Profit
An AST is how Hack and HHVM see and reason about your source code. If you see this piece of code, you can immediately identify and name parts.
$var = Math\minva(1, 2 + 3 * 4);
This is a statement, let's identify some parts:
-
minva
is a name, but it belongs together withMath
. -
Math\minva
is a qualified name, which probably resolves toHH\Lib\Math\minva
. -
Math\minva(1, 2 + 3 * 4)
is a function call, which has arguments. -
1
is a simple literal number expression. -
2 + 3 * 4
is a complex expression, the3 * 4
binds more tightly. -
3 * 4
is a binary expression and the operand is theKIND_STAR
(*
) token -
2 + 3 * 4
is one too, where the whole part3 * 4
is the right hand side. -
$var = Math\minva(1, 2 + 3 * 4)
is an assignment expression.
If you have been programming for a couple of years (or decades), you do this classification almost without thinking about it1. Computers also reason about code this way for a while, before they translate it to something you can run.
Let's say this function was a widely used function in a massive codebase.
namespace MyNamespace\Rendering;
function to_html(
Renderable $render,
bool $be_unsafe = false,
bool $use_cache = false,
): string {
// Code...
}
// In a completely different file...
$rendered = Rendering\to_html($something_untrusted, true);
Oh no, you thought you were enabling the cache, but you turned off safety checks!
Luckily this was caught in code review, but this could have ended badly.
Let's grep
around to see if there are other instances of api confusion.
You will quickly hit a stumbling block, Rendering\to_html(...)
is called from
hundreds of thousands of places, many of which only pass a Renderable
.
Oh, let's make the regex more complex to find results with multiple arguments.
And while we are at it, let's also exclude to_html(..., false, ...)
.
Good luck with that! If the first argument is complex, you are stuck.
You will spend most your regex trying to skip it, but that is almost impossible.
Take a step back and use the right too for the job. There should be a tool for this, but pfff got archived in 2017. Let's build our own, it can't be that much work, can it? The example is a little large for a README, but very illustrative.
function check_for_unsafe_render_calls(
Pha\Script $script,
Pha\SyntaxIndex $syntax_index,
Pha\Resolver $resolver,
)[]: vec<shape('code' => string, 'line' => int)> {
$get_thing_being_called =
Pha\create_member_accessor($script, Pha\MEMBER_FUNCTION_CALL_RECEIVER);
$get_arguments =
Pha\create_member_accessor($script, Pha\MEMBER_FUNCTION_CALL_ARGUMENT_LIST);
$to_txt = $n ==> Pha\node_get_code($script, $n);
$to_short_txt = $n ==> Pha\node_get_code_compressed($script, $n);
$to_function_name = $n ==> Pha\resolve_name($resolver, $script, $n);
$to_line = $n ==>
Pha\node_get_line_and_column_numbers($script, $n)->getStartLineOneBased();
$is_calling_to_html = $call ==> $get_thing_being_called($call)
|> $to_function_name($$) === 'MyNamespace\Rendering\to_html';
$is_unsafe = $call ==> $get_arguments($call)
|> Pha\as_syntax($$)
|> Pha\list_get_items_of_children($script, $$)
|> C\count($$) > 1 && $to_short_txt($$[1]) !== 'false';
return Pha\index_get_nodes_by_kind(
$syntax_index,
Pha\KIND_FUNCTION_CALL_EXPRESSION,
)
|> Vec\filter($$, $call ==> $is_calling_to_html($call) && $is_unsafe($call))
|> Vec\map(
$$,
$call ==> shape('code' => $to_txt($call), 'line' => $to_line($call)),
);
}
We start this function with a Script
, a SyntaxIndex
, and a Resolver
:
-
Script
respresents your source code. We can query it to get insights. -
SyntaxIndex
is used for getting a list of function calls in the Script. -
Resolver
will figure out which function you are calling, because namespaces.
Let's define some functions for accessing the data we need:
-
$get_thing_being_called
getsRendering\to_html
fromRendering\to_html(...)
. -
$get_arguments
gets the$r, true
fromRendering\to_html($r, true)
. -
$to_txt
gets the full text with comments and spaces for the final result. -
$to_short_txt
gets thefalse
without spaces or comments to compare with. -
$to_function_name
gets the function name including the whole namespace. -
$to_line
gets the line number at which the code started.
Combine these parts step by step to get $is_calling_to_html
and $is_unsafe
.
Pha\index_get_nodes_by_kind(...)
picks all the function calls from a file.
We filter out only those ones we are interested in.
Transform the output for easy viewing, and you have your codebase wide search ready.
We have just written our very own super specialized linter.
For some more examples of what you can do with an ast, see HTL\PhaLinters
$code = ''; // your source code goes here...
// You want to reuse $ctx if you can.
$ctx = Pha\create_context();
// The $script is what you are after, the $ctx is the updated $ctx.
// The object in $ctx is never changed, so use this `list()` assignment to get the new one.
list($script, $ctx) = Pha\parse($code, $ctx);
// Some tools...
// These indexes allow you to use Pha\index_get_nodes_by_kind()
$syntax_index = Pha\create_syntax_kind_index($script);
$token_index = Pha\create_token_kind_index($script);
$trivium_index = Pha\create_trivium_kind_index($script);
// Store your work in a cache (sqlite, apc, files on disk).
$ready_to_serialize = Pha\dematerialize_script($script);
// And get them back.
$deserialized = $ready_to_serialize;
$ctx = Pha\materialize_context($deserialized['context']);
Pha\materialize_script($deserialized['script'], $ctx);
// resolver, and pragma_map require you install portable-hack-ast-extras
// This allows you to resolve names to the namespace they belong in.
$resolver = Pha\create_name_resolver($script, $syntax_index, $token_index);
// This gives you all the `pragma()` declarations and `<<Pragma()>>` annotations.
$pragma_map = Pha\create_pragma_map($script, $syntax_index);
The full API for interacting with all these values can be found in node_functions.hack. There are about 50 functions at the time of writing, so with some auto complete, you should get the hang of it quite quickly.
For definitions of kinds and members, see Kind.hack and Member.hack. If the definitions are incomplete for your hhvm version, you can create them at runtime.
Pha\syntax_kind_from_string(...)
Pha\token_kind_from_string(...)
Pha\trivium_kind_from_string(...)
Pha\member_from_tuple(...)
This library pulls out all the stops in the name of performance. You can parse very large codebases and keep all the Scripts in memory, no sweat2. HHAST is the target of this benchmark, since it contains a lot of codegenned definitions, which adequately represent codebases with large classes.
Parsing: ../vendor/hhvm/hhast with hhast
1063.71 megabytes used.
Parsing: ../vendor/hhvm/hhast with pha
29.2615 megabytes used.
Runtime is more difficult to pin a number on, but Pha is blazingly fast.
I am able to lint everything in the HTL\
namespace in < 400 milliseconds.
When adding parsing to it (not caching anything), the results still outshine
HHAST, even with HHAST's .var/cache/hhvm/hhast/parser-cache
mechanism enabled.
"Portable" Hack AST, what does portable mean?
This codebase is portable between Hack AST versions (read hhvm versions).
Everything in the HTL\
namespace supports a wide range of hhvm versions.
In order to do that with an AST library, you can't hardcode definitions.
The structure and layout of the AST is dynamically learned at runtime.
The parsed structure is portable between different invocations of the program.
It is represented as two vec<Node ~ int>
, and a dict<Kind ~ string, int>
.
When you dematerialize and serialize a Script, you encode arrays of value types.
They can be deserialized and materialized without the loss of information.
This operation is very quick, and may even be used to "swap" large Scripts
to disk if memory pressure becomes too large.
This code is simple enough to be ported to a different language all together. 95% of the code performs simple operations, which would translate 1-to-1 to any other programming language which would perform better than Hack on HHVM. The performance of Pha on HHVM suffices for codebases I work with (for now). When the amount of code I ingest grows another 20×, I know there is a path forward I can take to achieve more performance in a couple of days.
x_from_y
: (also xs_from_ys)
- X is a newtype and Y is the underlying type.
- The value is not checked in any way.
- It is an unchecked downcast.
x_to_y
: (also xs_to_ys)
- X is a newtype and Y is the underlying type.
- This function strips the newtype away.
as_x
(where x is not an array):
- X is a newtype.
- The unnamed argument is a less specific newtype.
- Throws if the runtime value does not match.
- This is a checked downcast.
as_x
(where x is an array):
- Restores the runtime value after serialization.
- Will perform array kind conversion if needed.
x_hide
:
- X is an object type.
- This function "hides" (removes) the methods.
- No check needed to be performed.
x_reveal
:
- X is an object type.
- The argument is a newtype created by x_hide.
- This function "reveals" the methods again.
cast_away_nil
:
- Perform an unchecked cast from Nillable<T> to T.
- Pha uses a lot less memory and is faster than HHAST.
- This means you can write large, complex, whole codebase analysis tools.
- Pha can represent an invalid Hack file.
- This makes it particularly suitable for as-you-type tooling.
- HHAST will break type invariants when your code is not syntactically correct.
- Pha is portable between different versions/builds of hhvm.
- This unshackles the linters you get from the version of hhvm you are running.
- Pha doesn't suppress errors or perform unsafe casts.
- Sound types for the win!!!
- Pha runs in the pure context, read
[]
.- This makes code easier to reason about and predictable.
- It makes auditing code easier, because pure code can't do funny business3
without type system breakers like HH_FIXME and
Coeffects\backdoor
This library contains a lot of functions that take a Script
and one or more
Node
parameters. Every function expects that the Node
is part of the Script
or NIL
. If you hand an unrelated Script
and a Node
to a function,
the result of the operation is undefined4.
The caching mechanism is very fast and small. This is important, because most parsing is actually reparsing. Uncached performance only matters when:
- Checking out a new repository for the first time.
- Switching branches to one that you have never seen before.
- Pulling changes / syncing with HEAD.
As noted in performance, uncached parse performance is still good, but cached performance is a lot better still. In order to cache Scripts effectively, you must dematerialize them first. You may not observe the dematerialized representation, it is subject to change. The following table doesn't account for serialization overhead. The serialized size of a given script is often about 10 × the source size in bytes5.
Key name | Type | Approximate size |
---|---|---|
VERSION | int | 8 bytes |
SOURCE_ORDER | vec<int> | node count × 8 bytes |
SIBLINGS | vec<int> | syntax&token count × 8 bytes |
LIST_SIZES | dict<int, int> | list count where size > 253 × 16 bytes |
SOURCE_TEXT | string | Str\length() of source in bytes |
CONTEXT_ID | string | 40 bytes |
The Context will also need to be serialized, but a Context is rarely unique.
If you deduplicate them by context_hash
, the storage requirements fade away.
Footnotes
-
Some some cases, you do think about this, when the precedence is weird. ↩
-
In order to verify these claims, you will have to
git checkout
this commit and run mem_usage.hack in repo auth mode. ↩ -
I am not saying HHAST does, has done, or will do funny business. I am just saying that auditing for funny business is easier in pure well-typed code. ↩
-
This doesn't mean it is unsafe, but you may get incorrect results or exceptions. ↩
-
A quick utility is included in
bin/serializer.hack
to verify serialized sizes. ↩