# code_tokenizers
This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
## Install

```sh
pip install code_tokenizers
```
## How to use
The main interface of code_tokenizers is the `CodeTokenizer` class. You can use a pretrained BPE tokenizer from the popular transformers library and a tree-sitter parser from the tree-sitter library.

To specify a `CodeTokenizer` using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:
```python
from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
```
You can specify any pretrained BPE tokenizer from the Hugging Face Hub or a local directory, along with the language whose AST you want to parse.
Now, we can tokenize some code:
```python
from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
```
```
{'ast_ids': [...],
 'attention_mask': [...],
 'input_ids': [...],
 'is_builtins': [...],
 'is_internal_methods': [...],
 'merged_ast': [...],
 'offset_mapping': [...],
 'parent_ast_ids': [...]}
```
And we can print out the associated AST types:
> **Note**: the N/As are tokens that are not part of the AST, such as spaces and newline characters. Their AST IDs are set to -1.
```python
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
```
```
N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
```
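To build intuition for what the alignment does, here is a minimal, library-free sketch of the idea: each BPE token carries a character span (the `offset_mapping` above), each AST leaf node also covers a character span, and a token is matched to the node it overlaps most, or -1 ("N/A") when it overlaps none. The `align` helper and the toy spans below are hypothetical illustrations, not the library's actual implementation.

```python
def align(token_offsets, node_spans):
    """Assign each BPE token the AST node type it overlaps most.

    token_offsets: list of (start, end) character spans, one per BPE token.
    node_spans: list of (start, end, node_type) for AST leaf nodes.
    Returns a node type per token, or -1 if the token overlaps no node.
    """
    ast_types = []
    for tok_start, tok_end in token_offsets:
        best, best_overlap = -1, 0
        for node_start, node_end, node_type in node_spans:
            # Number of characters the token and node share
            overlap = min(tok_end, node_end) - max(tok_start, node_start)
            if overlap > best_overlap:
                best_overlap, best = overlap, node_type
        ast_types.append(best)
    return ast_types

# Toy example: "def foo()" tokenized as ["def", " foo", "()"].
# The " foo" token includes a leading space, so span containment would
# fail; maximal overlap still matches it to the identifier node.
token_offsets = [(0, 3), (3, 7), (7, 9)]
node_spans = [(0, 3, "def"), (4, 7, "identifier"), (7, 9, "parameters")]
print(align(token_offsets, node_spans))  # → ['def', 'identifier', 'parameters']
```

A whitespace-only token (say a newline between statements) overlaps no AST leaf, so it comes back as -1, mirroring the N/A rows in the output above.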