-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996
Conversation
…TLRv3 transitive deps
@NightOwl888 Rebased off latest master to resolve a conflict, ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
This seems to work as good as it did previously, so nice work. The only thing we are missing is the attribution in the LICENSE.txt
file, which we need to do each time a file or group of files that is under a different license is added.
…script.g4. Normalized dividers between different license attributions.
Migrate ANTLRv3 to v4, and automate lexer/parser generation.
Fixes #977
Description
The ANTLRv3 NuGet package we depend on has not been maintained, and targets .NET Standard 1.6 which is reporting some vulnerabilities. Additionally, our current codebase has what appears to be hand-ported lexer and parser code from the original Java, which in some cases has been updated manually as well. The original ANTLRv3 grammar is not in our repo. Ideally, we would generate the lexer and parser using ANTLR from the grammar. On top of this, we've had to change some assertions from the original Java due to discrepancies in the implementation.
This adds the 4.8 grammar to the repo, but updates its syntax to use the v4 format. Notably, ANTLRv4 removes the old v3 AST generation and instead lets you walk the parse tree however you see fit. This means that things like "root nodes" (indicated by
!
) in the old grammar are no longer necessary or supported, and thus require a different approach to walking the syntax tree. Additionally, empty lexer tokens likeAT_CALL
are no longer supported, so that use was changed into a newcall
rule instead, with correspondingLUCENENET-specific
callout. Error handling is done another way now, resulting in not needing to add inline C# to the grammar. Other than these changes, the grammar is highly similar to the upstream 4.8 v3 grammar. I decided to stick with the 4.8 grammar instead of updating to the most recent grammar which is already in v4 format to keep it as close as possible to the 4.8 code.The added Antlr4BuildTasks NuGet package (MIT-licensed) generates the lexer and parser code at compile time now, which reduces manual error and shrinks the size of the code we have to maintain. By removing the lexer and parser from the codebase, this PR results in a net reduction of about 3k lines of code. This also is configured to generate a Listener base class, which is the v4 approach that is closest to the existing 4.8 code (as opposed to a Visitor that returns nodes, as the latest Lucene code uses).
The JavascriptCompiler C# code now looks a bit different as a result of implementing the visitor, but this seemed like a cleaner and more maintainable approach than recursively walking every parse tree child. The
Context
classes that are generated for each rule are now strongly-typed, so you get the benefit of not having to compare rule/token indices. Another goal of this PR is to upgrade this dependency and change the approach in code without changing any existing tests or public API surface. As a result, the listener implementation is a private nested class instead of public, and this also helps keep the logic mostly in line with where it was before (just a little out of order). Note that we are not using the ParseTreeWalker because we want to control the order of entering nodes, so this might look a little different than other implementations of the Listener in ANTLR.This also fixes some assertions that were changed from the original Java code, because this implementation produces the same results as the Java Lucene 4.8 code. Notably this is due to casting operands to longs before performing shift and bitwise operations, whereas the previous code was erroneously converting all numbers to int/long if there were any shifts or bitwise operations in the source text. This worked fine for some unit test examples, but could quickly break down for anything more complex than that. All numbers are treated as System.Double except for when shifts or bitwise operations are involved, in which case they are casted to Int64/I8 which truncates any decimal place values.