Skip to content

An incremental parsing system for programming tools

Notifications You must be signed in to change notification settings

Baael/tree-sitter

 
 

Repository files navigation

tree-sitter

Build Status

Tree-sitter is an incremental parsing library in C and C++, intended to be used via bindings to higher-level languages. It allows documents to be efficiently re-parsed after localized edits, making it suitable for use in performance-intensive text-editing programs.

Tree-sitter uses a sentential-form incremental LR parsing algorithm, as described in the paper Efficient and Flexible Incremental Parsing by Tim Wagner. It handles ambiguity at compile-time via precedence annotations, and at run-time via the GLR algorithm. This allows it to generate a fast parser for any context-free grammar.

Installation

script/configure.sh  # Generate a Makefile using gyp
make                 # Build static libraries for the compiler and runtime

Writing a grammar

Tree-sitter's interface for creating grammars is a C++ library, libcompiler. This allows grammars and rules to be defined, manipulated and extended as simple values in high-level languages like javascript, and then converted into tree-sitter's native representation and compiled to C parsers. These parsers can then be used from any language that has a binding to tree-sitter's runtime library, libruntime.

Here's a simple example that uses libcompiler directly:

// arithmetic_grammar.cc

#include <assert.h>
#include <stdio.h>
#include "tree_sitter/compiler.h"

using namespace tree_sitter;

int main() {
  auto arithmetic_grammar = Grammar({

    // The first rule listed in a grammar becomes the 'start rule'.
    { "expression", choice({
      sym("sum"),
      sym("product"),
      sym("number"),
      sym("variable"),

      // Error recovery is controlled by wrapping rule subtrees with `err`.
      seq({
        str("("),
        err(sym("expression")),
        str(")") }) }) },

    // Tokens like '+' and '*' are described directly within the grammar's rules,
    // as opposed to in a seperate lexer description.
    { "sum", prec_left(1, seq({
      sym("expression"),
      str("+"),
      sym("expression") })) },

    // Ambiguities can be resolved at compile time by assigning precedence
    // values to rule subtrees.
    { "product", prec_left(2, seq({
      sym("expression"),
      str("*"),
      sym("expression") })) },

    // Tokens can be specified using ECMAScript regexps.
    { "number", pattern("\\d+") },
    { "variable", pattern("[a-zA-Z]+\\w*") },
    { "comment", pattern("//.*") },

  }).ubiquitous_tokens({

    // Things that can appear anywhere in the language are expressed as
    // 'ubiquitous tokens'.
    sym("comment"),
    pattern("\\s+")
  });

  // Generate C code for parsing this language.
  auto output = compile(arithmetic_grammar, "arithmetic");
  std::string c_code = output.first;
  const GrammarError *error = output.second;

  assert(!error);
  puts(c_code.c_str());

  return 0;
}

To create a parser for this language, compile and run this grammar like this:

clang++ -stdlib=libc++ -std=c++11                             \
  -I tree-sitter/include -L tree-sitter/out/Debug -l compiler \
  arithmetic_grammar.cc -o arithmetic_grammar

./arithmetic_grammar > arithmetic_parser.c

Using the parser

The tree_sitter/runtime C library exposes a DOM-style interface for inspecting documents.

Functions like ts_node_child(node, index) and ts_node_next_sibling(node) expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, that operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like ts_node_named_child(node, index) and ts_node_next_named_sibling(node). These functions don't expose nodes that were specified in the grammar as anonymous tokens, like ( and +. This is useful when analyzing the meaning of a document.

#include <stdio.h>
#include "tree_sitter/runtime.h"

// Declare the language constructor that was generated from your grammar.
TSLanguage *ts_language_arithmetic();

int main() {
  TSDocument *document = ts_document_make();
  ts_document_set_language(document, ts_language_arithmetic());

  // Usually, you would use the more general `ts_document_set_input`, which
  // takes a struct with function pointers for seeking to positions in the text,
  // and reading chunks of text. This allows you to efficiently parse text
  // stored in your own data structure.
  ts_document_set_input_string(document, "a + b * 5");
  ts_document_parse(document);

  TSNode root_node = ts_document_root_node(document);
  printf(
    "Root name: %s, start: %lu, end: %lu\n",
    ts_node_name(root_node, document),
    ts_node_start_char(root_node),
    ts_node_end_char(root_node)
  );

  TSNode product_node = ts_node_named_child(ts_node_child(root_node, 0), 1);
  printf(
    "Child name: %s, start: %lu, end: %lu\n",
    ts_node_name(product_node, document),
    ts_node_start_char(product_node),
    ts_node_end_char(product_node)
  );

  ts_document_free(document);
  return 0;
}

To demo this parser's capabilities, compile this program like this:

clang                                                        \
  -I tree-sitter/include -L tree-sitter/out/Debug -l runtime \
  arithmetic_parser.c test_parser.c -o test_parser

./test_parser

References

About

An incremental parsing system for programming tools

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 78.7%
  • C++ 20.6%
  • Other 0.7%