
Tree-sitter: fix empty match, fix comments, word token, makefile #533

Merged
andreasabel merged 6 commits into BNFC:master from rina-forks:matches-empty-merge
Mar 18, 2026

@katrinafyi
Contributor

@katrinafyi katrinafyi commented Feb 23, 2026

Broad improvements to tree-sitter backend.

  • Most importantly, transform rules that match the empty string, as tree-sitter requires, using a recursive fixed point. This should work correctly in all cases except nullable regex tokens, which could easily be added if wanted.
  • Also important: the backend was previously emitting // .* as a single-line comment, rather than // .*\n. This led to state space explosion in the generated tree-sitter parser.
  • Add Makefile generation so make can be used to build the generated grammar. Then, make parse will accept input from stdin and print the parse tree.
  • Change tree-sitter to generate files in a directory called ./tree-sitter-lang-name. This helps organise things, because tree-sitter commands like tree-sitter generate will create a src folder next to the grammar.js.
  • Change the word token to default to the built-in Ident while remaining user-customisable. The old one didn't work because tree-sitter requires the word token to be a direct rule reference like $.identifier, not a choice or other function call.
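To make the first point concrete, here is a minimal Python sketch of how a fixed point can find the rules that match empty. It is an illustration only, not the PR's Haskell implementation: the dictionary encoding of the grammar and the name `nullable_nonterminals` are assumptions made for this example, and it covers only the detection step, not the rewriting of the rules.

```python
def nullable_nonterminals(grammar):
    """Return the set of nonterminals that can derive the empty string.

    `grammar` maps each nonterminal to a list of productions; a production
    is a list of symbols. Symbols that are not keys of `grammar` are treated
    as terminals and assumed to be non-empty (hypothetical encoding).
    """
    nullable = set()
    changed = True
    # Iterate until no new nullable nonterminal is found (the fixed point).
    while changed:
        changed = False
        for nonterminal, productions in grammar.items():
            if nonterminal in nullable:
                continue
            # A nonterminal is nullable if some production consists only of
            # nullable symbols; the empty production qualifies trivially.
            if any(all(sym in nullable for sym in prod) for prod in productions):
                nullable.add(nonterminal)
                changed = True
    return nullable


# A small example grammar: a program is a possibly empty list of statements.
EXAMPLE = {
    "Prog": [["ListStm"]],
    "ListStm": [[], ["Stm", "ListStm"]],
    "Stm": [["Exp", ";"]],
    "Exp": [["Ident"]],
}

result = nullable_nonterminals(EXAMPLE)
```

In this example `ListStm` is nullable directly, and `Prog` only becomes nullable on a later pass because it depends on `ListStm`, which is why a single pass over the rules is not enough.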

You can also test this by using the Makefile in source/test/BNFC/Backend/TreeSitter. Just run

cd source/test/BNFC/Backend/TreeSitter
make generate-all

and this will try to generate a tree-sitter grammar for each .cf file in that folder and run it through tree-sitter generate. You can also see a copy of the output in the .expected.js files.

If you want to run it on a custom CF file, you can just use

cabal run bnfc -- --tree-sitter --makefile x.cf
make -C tree-sitter-x

You can also read the docs for this change at https://matches-empty.lychee-docs-katrinafyi.pages.dev/BNFC-Backend-TreeSitter-MatchesEmpty

@andreasabel andreasabel added this to the 2.9.7 milestone Mar 12, 2026
@andreasabel andreasabel added the tree-sitter Concerning the tree-sitter backend label Mar 12, 2026
@andreasabel andreasabel added the pr: squash PR needs squashing label Mar 12, 2026
@andreasabel
Member

Thanks for the PR, @katrinafyi !
The tree-sitter backend does not have an active maintainer, so thanks for contributing.

If you would like to become the maintainer, I'd be happy to accept your help.
@chaserhkj was previously in charge, but I think he has not had time recently.

Apologies about the nuisance with GHC < 8.4. I think we should drop these (see #548), but since you have already put in the work to be compatible with these versions, I think this PR isn't affected by this choice (anymore).

I'd squash the commits by default because the later ones are fixups, but if you prefer to do some manual reorganization of commits to keep things separate, please let me know.

Member

@andreasabel andreasabel left a comment

Thanks!

(I did not review this carefully: the tree-sitter backend is experimental, so I am not worried about backwards compatibility.)

@katrinafyi
Contributor Author

Hi Andreas, thanks for looking over it!

I would be fine with squash merging the commits. The later commits are just extra changes I discovered after using it for a bit.

I would also be happy to help with maintaining the backend as long as I have the time. I'd need some direction on what you'd like to see for documentation, testing, and code quality. I see there's some recent work on running system tests with Docker? But overall, I'm happy to help :)

Thanks again!

@andreasabel
Member

andreasabel commented Mar 12, 2026

Since I haven't used tree-sitter, I have no idea how things are supposed to work and thus do not qualify much as maintainer.
Improvements to the tree-sitter backend should be directed by what you and the other users need from it.

My ideal for BNFC (that I worked on a lot in 2020/21) is that all backends should have the same feature set. This isn't a must, though. They should, however, all interpret a BNFC grammar in the same way, which is already a challenge and also not realized 100%. For instance, ANTLR is LL and not LR, so the Java/ANTLR backend generates an LL parser and not the usual LALR(1) parser that the other parser generators produce.
Also, some lexer generators do not seem to implement the usual priority of longest-match, which was a problem with the Python backend (PR #485).
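As an illustration of the longest-match point, the Python sketch below contrasts a lexer that takes the first rule that matches with one that takes the longest match and breaks ties by rule order. The rule table and function names are hypothetical, and this is not a reproduction of the issue discussed in PR #485.

```python
import re

# Hypothetical token rules, listed in priority order.
TOKEN_RULES = [
    ("KW_IF", r"if"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
]


def first_match(text):
    """Return the first rule that matches at the start of `text`."""
    for name, pattern in TOKEN_RULES:
        m = re.match(pattern, text)
        if m:
            return name, m.group(0)
    return None


def longest_match(text):
    """Return the rule with the longest match; earlier rules win ties."""
    best = None
    for name, pattern in TOKEN_RULES:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best
```

With these rules, `first_match("iffy")` stops at the keyword and returns `("KW_IF", "if")`, while `longest_match("iffy")` returns `("IDENT", "iffy")`. On the input `"if"` both rules match two characters, and the tie goes to the keyword because it is listed first.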

In essence, I have aimed at a testsuite containing an example set derived from case studies and issues that all backends process with the same results. Some exceptions have remained.

Concerning the future development of BNFC, I would not want a lot of backend-specific pragmas in the LBNF language. It should remain universal.
Historically, many pragmas were only implemented in the Haskell backend. I have tried to port most to the other backends (e.g. the define pragma). Some, like the layout pragma, have not been ported, since their design was too immature.

This is roughly the framework of BNFC as I conceive it.

When it comes to documentation, there is some technical debt concerning the backends. (The C backends are still undocumented, for instance.)
A user guide for tree-sitter would be very welcome.

@Commelina is doing great work currently; she added the testsuite to CI, as you noted.

@katrinafyi
Contributor Author

Thanks for the info, it's helpful to know!

Tree-sitter is LR(1) using a GLR parser so hopefully that lines up well. I'll have to look closer at the lexer behaviour.

However, for the system tests, I noticed that they seem to expect a pretty-printer to display the parsed AST, and then the test suite checks its output matches what is expected. This might be hard for tree-sitter because it's a standalone grammar, not necessarily with a pretty-printer (similar to the Pygments backend in this way). We'll have to think about how to test this, maybe in a future PR.

I would definitely like to write some docs! Also, I have an idea for implementing layout, but that would be further down the line.

Anyway, this PR is ready to merge now I think, and I'll keep working on follow-ups as I have time.

@andreasabel andreasabel merged commit 24e6165 into BNFC:master Mar 18, 2026
27 checks passed
@andreasabel
Member

@katrinafyi I merged this, but now the bnfc-system-tests CI reports errors: https://github.com/BNFC/bnfc/actions/runs/23257141559/job/67615195934#step:7:6496

shelly did not find tree-sitter in the PATH

It would be great if you could find a solution for this.
Maybe you can patch the Docker image so that tree-sitter is available.

@katrinafyi

for the system tests, I noticed that they seem to expect a pretty-printer to display the parsed AST, and then the test suite checks its output matches what is expected. This might be hard for tree-sitter because it's a standalone grammar, not necessarily with a pretty-printer

A solution could be to disable this "golden value check" for the tree-sitter backend.

@katrinafyi
Contributor Author

Oh yeah I'll get onto that

@Commelina
Member

@katrinafyi Something that may help: currently the latest Haskell Docker image is based on Debian 12 "Bookworm" (libc6 2.36), while the recent binary release of tree-sitter-cli requires libc6>=2.39.

This can be resolved either by building from source with cargo or by using an older version (it seems that the newest runnable one is 0.24.7). I suggest trying the second option first unless you need some recent features of tree-sitter-cli, because building from source can be quite slow (and setting up caches with GitHub Actions is a little annoying, especially with a total limit of 10GB).
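The version mismatch above can be checked mechanically. The following Python sketch compares dotted version strings numerically; the helper names are hypothetical, and the two version numbers are the ones quoted in this comment.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.36' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, required):
    """True if `installed` is at least `required`, compared numerically."""
    return parse_version(installed) >= parse_version(required)


BOOKWORM_LIBC = "2.36"   # libc6 version in the Debian 12 based image
REQUIRED_LIBC = "2.39"   # minimum needed by the recent tree-sitter-cli binary

can_run_recent_binary = satisfies_minimum(BOOKWORM_LIBC, REQUIRED_LIBC)
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.4" would sort after "2.36", while numerically 2.4 is the older release.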

@katrinafyi
Contributor Author

Thanks! Also, feel free to disable the tree-sitter system tests for now to fix the CI. Sorry about that :/
