Tree-sitter: fix empty match, fix comments, word token, makefile #533
andreasabel merged 6 commits into BNFC:master
Broad improvements to the tree-sitter backend.

- Most importantly, transforms rules which match empty as needed by tree-sitter, and does so with a recursive fixed point to account for all cases.
- Also important, the backend was previously emitting `// .*` as a single line comment, rather than `// .*\n`. This led to state space explosion in the generated tree-sitter parser.
- Add Makefile generation so `make` can be used to build the generated grammar. Then, `make parse` will accept input from stdin and print the parse tree.
- Change tree-sitter to generate files in a directory called `./tree-sitter-lang-name`. This helps organise things, because tree-sitter commands like `tree-sitter generate` will create a `src` folder next to the grammar.js.
- Change the word token to default to the built-in Ident, while remaining user-customisable. The old one didn't work because tree-sitter requires the word token to be a direct rule reference like `$.identifier`, not a choice or other function.

You can also test this by using the Makefile in source/test/BNFC/Backend/TreeSitter. Just run

```
cd source/test/BNFC/Backend/TreeSitter
make generate-all
```

and this will try to generate tree-sitter for each .cf in that folder, and run it through `tree-sitter generate`. You can also see a copy of the output in the .expected.js files.

If you want to run it on a custom CF file, you can just use

```
cabal run bnfc -- --tree-sitter --makefile x.cf
make -C tree-sitter-x
```
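To illustrate the "recursive fixed point" idea from the first bullet, here is a minimal sketch in Python of how one might find every nonterminal that can match the empty string. It is not the backend's actual code (which is Haskell); the grammar representation, the function name `nullable_nonterminals`, and the example rules are all assumptions made for illustration. A single pass is not enough, because a rule can become nullable only after the rules it refers to have been found nullable, so the loop repeats until nothing changes.

```python
from typing import Dict, List, Set

# Hypothetical grammar representation: each nonterminal maps to a list of
# alternatives, and each alternative is a list of symbols. Any symbol that
# is not a key of the grammar is treated as a terminal.
Grammar = Dict[str, List[List[str]]]


def nullable_nonterminals(grammar: Grammar) -> Set[str]:
    """Return the nonterminals that can derive the empty string."""
    nullable: Set[str] = set()
    changed = True
    while changed:  # iterate until a fixed point is reached
        changed = False
        for name, alternatives in grammar.items():
            if name in nullable:
                continue
            # An alternative matches empty if all of its symbols are
            # nullable; an empty alternative satisfies this trivially.
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(name)
                changed = True
    return nullable


# Illustrative rules only: Wrapper is nullable because Block is, and Block
# is nullable because ListStm has an empty alternative.
example: Grammar = {
    "Wrapper": [["Block", "Block"]],
    "Block": [["ListStm"]],
    "ListStm": [[], ["Stm", ";", "ListStm"]],
    "Stm": [["Ident", "=", "Exp"]],
}

print(sorted(nullable_nonterminals(example)))
# prints ['Block', 'ListStm', 'Wrapper']
```

Once this set is known, a backend could rewrite each nullable rule so that it no longer matches empty and mark its uses as optional instead; that rewriting step is not shown here.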
Thanks for the PR, @katrinafyi! If you would like to become the maintainer, I'd be happy to accept your help.

Apologies about the nuisance with GHC < 8.4. I think we should drop these (see #548), but since you have already put in the work to be compatible with these versions, I think this PR isn't affected by this choice (anymore).

I'd squash the commits by default because the later ones are fixups, but if you prefer to do some manual reorganization of commits to keep things separate, please let me know.
andreasabel left a comment:
Thanks!
(I do not review this carefully since the tree-sitter backend is experimental so I do not worry about backwards-compatibility.)
Hi Andreas, thanks for looking over it! I would be fine with squash merging the commits. The later commits are just extra changes I discovered after using it for a bit.

I would also be happy to help with maintaining the backend as long as I have the time. I'd need some direction on what you'd like to see for documentation, testing, and code quality. I see there's some recent work on running system tests with Docker? But overall, I'm happy to help :)

Thanks again!
Since I haven't used tree-sitter, I have no idea how things are supposed to work and thus do not qualify much as maintainer.

My ideal for BNFC (that I worked on a lot in 2020/21) is that all backends should have the same feature set. This isn't a must, though. They should however all interpret a BNFC grammar in the same way, which is already a challenge and also not realized 100%. For instance, ANTLR is LL and not LR, so the Java/ANTLR backend generates an LL parser and not the usual LALR(1) parser that the other parser generators produce.

In essence, I have aimed at a testsuite containing an example set derived from case studies and issues that all backends process with the same results. Some exceptions have remained.

Concerning the future development of BNFC, I would not want a lot of backend-specific pragmas in the LBNF language. It should remain universal. This is roughly the framework of BNFC as I conceive it.

When it comes to documentation, there is some technical debt concerning the backends. (The C backends are still undocumented, for instance.)

@Commelina is doing great work currently; she added the testsuite to CI, as you noted.
Thanks for the info, it's helpful to know! Tree-sitter is LR(1) using a GLR parser so hopefully that lines up well. I'll have to look closer at the lexer behaviour.

However, for the system tests, I noticed that they seem to expect a pretty-printer to display the parsed AST, and then the test suite checks that its output matches what is expected. This might be hard for tree-sitter because it's a standalone grammar, not necessarily with a pretty-printer (similar to the Pygments backend in this way). We'll have to think about how to test this, maybe in a future PR.

I would definitely like to write some docs! Also, I have an idea for implementing layout, but that would be further down the line.

Anyway, this PR is ready to merge now I think, and I'll keep working on follow-ups as I have time.
@katrinafyi I merged this, but now the bnfc-system-tests CI reports errors: https://github.com/BNFC/bnfc/actions/runs/23257141559/job/67615195934#step:7:6496
It would be great if you could find a solution for this.
A solution could be to disable this "golden value check" for the tree-sitter backend.
Oh yeah, I'll get onto that
@katrinafyi Something may help: Currently the latest Haskell docker image is based on Debian 12 "Bookworm" (libc6 2.36), while the recent binary release of It can be resolved by either building from source with
Thanks! Also feel free to disable the tree-sitter system tests for now to fix the CI... sorry about that :/
You can also read the docs for this change at https://matches-empty.lychee-docs-katrinafyi.pages.dev/BNFC-Backend-TreeSitter-MatchesEmpty !