Tree-sitter is an incremental parsing library in C and C++, intended to be used via bindings to higher-level languages. It allows documents to be efficiently re-parsed after localized edits, making it suitable for use in performance-intensive text-editing programs.
Tree-sitter uses a sentential-form incremental LR parsing algorithm, as described in the paper Efficient and Flexible Incremental Parsing by Tim Wagner. It handles ambiguity at compile-time via precedence annotations, and at run-time via the GLR algorithm. This allows it to generate a fast parser for any context-free grammar.
script/configure.sh # Generate a Makefile using gyp
make # Build static libraries for the compiler and runtime
Tree-sitter's interface for creating grammars is a C++ library, libcompiler
.
This allows grammars and rules to be defined, manipulated and
extended as simple values in high-level languages like javascript,
and then converted into tree-sitter's native representation and compiled to C
parsers. These parsers can then be used from any language that has a binding to
tree-sitter's runtime library, libruntime
.
Here's a simple example that uses libcompiler
directly:
// arithmetic_grammar.cc
#include <assert.h>
#include <stdio.h>
#include "tree_sitter/compiler.h"
using namespace tree_sitter;
int main() {
auto arithmetic_grammar = Grammar({
// The first rule listed in a grammar becomes the 'start rule'.
{ "expression", choice({
sym("sum"),
sym("product"),
sym("number"),
sym("variable"),
// Error recovery is controlled by wrapping rule subtrees with `err`.
seq({
str("("),
err(sym("expression")),
str(")") }) }) },
// Tokens like '+' and '*' are described directly within the grammar's rules,
// as opposed to in a seperate lexer description.
{ "sum", prec_left(1, seq({
sym("expression"),
str("+"),
sym("expression") })) },
// Ambiguities can be resolved at compile time by assigning precedence
// values to rule subtrees.
{ "product", prec_left(2, seq({
sym("expression"),
str("*"),
sym("expression") })) },
// Tokens can be specified using ECMAScript regexps.
{ "number", pattern("\\d+") },
{ "variable", pattern("[a-zA-Z]+\\w*") },
{ "comment", pattern("//.*") },
}).ubiquitous_tokens({
// Things that can appear anywhere in the language are expressed as
// 'ubiquitous tokens'.
sym("comment"),
pattern("\\s+")
});
// Generate C code for parsing this language.
auto output = compile(arithmetic_grammar, "arithmetic");
std::string c_code = output.first;
const GrammarError *error = output.second;
assert(!error);
puts(c_code.c_str());
return 0;
}
To create a parser for this language, compile and run this grammar like this:
clang++ -stdlib=libc++ -std=c++11 \
-I tree-sitter/include -L tree-sitter/out/Debug -l compiler \
arithmetic_grammar.cc -o arithmetic_grammar
./arithmetic_grammar > arithmetic_parser.c
The tree_sitter/runtime
C library exposes a DOM-style interface for inspecting
documents.
Functions like ts_node_child(node, index)
and ts_node_next_sibling(node)
expose every node in the concrete syntax tree. This is useful for operations
like syntax-highlighting, that operate on a token-by-token basis. You can also
traverse the tree in a more abstract way by using functions like
ts_node_named_child(node, index)
and ts_node_next_named_sibling(node)
. These
functions don't expose nodes that were specified in the grammar as anonymous
tokens, like (
and +
. This is useful when analyzing the meaning of a document.
#include <stdio.h>
#include "tree_sitter/runtime.h"
// Declare the language constructor that was generated from your grammar.
TSLanguage *ts_language_arithmetic();
int main() {
TSDocument *document = ts_document_make();
ts_document_set_language(document, ts_language_arithmetic());
// Usually, you would use the more general `ts_document_set_input`, which
// takes a struct with function pointers for seeking to positions in the text,
// and reading chunks of text. This allows you to efficiently parse text
// stored in your own data structure.
ts_document_set_input_string(document, "a + b * 5");
ts_document_parse(document);
TSNode root_node = ts_document_root_node(document);
printf(
"Root name: %s, start: %lu, end: %lu\n",
ts_node_name(root_node, document),
ts_node_start_char(root_node),
ts_node_end_char(root_node)
);
TSNode product_node = ts_node_named_child(ts_node_child(root_node, 0), 1);
printf(
"Child name: %s, start: %lu, end: %lu\n",
ts_node_name(product_node, document),
ts_node_start_char(product_node),
ts_node_end_char(product_node)
);
ts_document_free(document);
return 0;
}
To demo this parser's capabilities, compile this program like this:
clang \
-I tree-sitter/include -L tree-sitter/out/Debug -l runtime \
arithmetic_parser.c test_parser.c -o test_parser
./test_parser