
Link based parse memoization #2100

Open
wants to merge 3 commits into
base: main

Conversation

PieterOlivier
Contributor

During conversion of the parse graph to a parse forest, AbstractNode was used as a memoization key. This turned out to be incorrect even with the "no memoization within cycles" hack.

The memoCycleBug test in this PR tests for this problem.

The solution turned out to be to use the Link objects as memoization key instead. This ensures correct conversion from parse graph to parse forest and as a bonus improved performance when a lot of cycles are present as memoization can occur normally within cycles.

Note that this means all of the CycleMark related code has also been removed as it was only used to disable memoization within cycles.
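
To illustrate the difference between the two keying schemes, here is a minimal Java sketch. It is not the code from this PR: `Node`, `Link` and `entryCounts` are hypothetical stand-ins, and it only assumes that the same node can be reached through more than one link.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class MemoKeySketch {
    // Hypothetical stand-in for a parse-graph node.
    static final class Node {
        final String label;
        Node(String label) { this.label = label; }
    }

    // Hypothetical stand-in for a link: one node plus (at most) one prefix link.
    static final class Link {
        final Node node;
        final Link prefix;
        Link(Node node, Link prefix) { this.node = node; this.prefix = prefix; }
    }

    // Returns {entries when keyed by node, entries when keyed by link}.
    static int[] entryCounts(Link... links) {
        Map<Node, String> byNode = new IdentityHashMap<>();
        Map<Link, String> byLink = new IdentityHashMap<>();
        for (Link link : links) {
            // Keyed by node: links that share a node also share one cached result.
            byNode.putIfAbsent(link.node, "tree for " + link.node.label);
            // Keyed by link: every link gets its own cached result.
            byLink.putIfAbsent(link, "tree for " + link.node.label);
        }
        return new int[] { byNode.size(), byLink.size() };
    }
}
```

With two links that wrap the same node, the node-keyed map holds one entry and the link-keyed map holds two, so a result cached for one path is not silently reused for the other.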


codecov bot commented Dec 14, 2024

Codecov Report

Attention: Patch coverage is 67.07317% with 27 lines in your changes missing coverage. Please review.

Project coverage is 49%. Comparing base (d83ff92) to head (539cace).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...ser/gtd/result/out/ListContainerNodeFlattener.java 68% 11 Missing and 3 partials ⚠️
...ser/gtd/result/out/SortContainerNodeFlattener.java 64% 10 Missing and 2 partials ⚠️
...pl/parser/gtd/result/out/DefaultNodeFlattener.java 75% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##              main   #2100   +/-   ##
=======================================
  Coverage       49%     49%           
- Complexity    6318    6334   +16     
=======================================
  Files          666     665    -1     
  Lines        59695   59672   -23     
  Branches      8670    8663    -7     
=======================================
+ Hits         29601   29605    +4     
+ Misses       27860   27834   -26     
+ Partials      2234    2233    -1     


@jurgenvinju
Member

Perhaps @arnoldlankamp would also have a look?

Arnold, this fix was very well regression tested; but it would be great if you can interpret these changes and tell us what you think. The issues that were solved were strictly related to cycles on the parse graph (i.e. reductions without accepting any characters, arriving at an earlier node again).

@arnoldlankamp
Contributor

@jurgenvinju Sure, I'll have a look when I have some time.

@arnoldlankamp
Contributor

Looking back at the original implementation, the 'no memoization within cycles' code indeed does not seem to work correctly in this case.
However, while the proposed solution works for the included test case, unfortunately it will not do so in general.

I'll try to describe the problem I encountered while writing the original implementation:

Cycles are tracked by a stack that is maintained during the traversal of the parse forest. Because of ambiguities, nodes and links can be encountered more than once, via different paths through the parse forest. You can therefore run into a situation where a node or link is part of a cycle via one path and not part of a cycle via another. In that case caching the sub-tree will lead to incorrect results, since the sub-tree will be different for different paths. Caching sub-trees containing cycles may not always be safe, since such sub-trees essentially contain 'references' to nodes encountered outside of themselves and are thus not nicely self-contained. Conversely, caching sub-trees which do not contain cycles, but for which a variant containing a cycle may exist, may not be safe either. Both should be done conditionally.
For this reason, the original implementation excluded sub-trees containing cycles from being cached. But, as you noticed, if the non-cycle-containing version of the sub-tree is encountered first, it is cached and reused for the path which should contain a cycle, leading to incorrect results (at least that is what I assume you were encountering, judging by the code alone, since I didn't have time to get Rascal back up and running again to check).
The proposed solution now seems to use links instead of nodes as keys for the sub-tree cache. Links are (at the very least) unique at the tail end of results for all alternatives at a specific offset + length combination. This resolves the issue for the included test case, since it reduces the number of caching opportunities and also happens to exclude the one which was originally displaying invalid behavior. However, it might still cause problems in combination with prefix-shared alternatives (e.g.: ABC | ABD like rules, where part of the links are shared between different alternatives) and will often still lead to invalid results for parse forests which contain the aforementioned issue of cycles in sub-trees being conditional on the traversal path through the forest.

The easiest way to fix all these issues is to remove sub-tree caching entirely, but this will likely incur a performance penalty for parse forests containing a lot of ambiguities. The actual performance impact is however something you could test.
Not having to check the cache at every step, reducing code complexity and reducing memory usage & GC pressure (by not having to maintain the sub-tree cache) will improve performance in the general case (i.e.: parse forests with limited to no ambiguity). It may even offset the potential performance impact for flattening highly ambiguous parse forests enough for it not to be too much of an issue, if at all.

Either way, this issue will likely not be easy to resolve correctly in any other way 🤔 .

If a better solution comes to me I'll let you know.

P.S.: If you want to try your hand at constructing an alternative solution it would be nice to have a test case for it. You'll need a grammar which produces a parse forest that bifurcates and merges and in which only one of the two alternatives of this bifurcation produces a cycle with a node later on in the parse forest. And you'll need one for both orderings, cycle first and no cycle first (so you may need to do some fiddling with the grammar to get the parse forest to come out of the parser the right way for both variants). Also, to be a valid starting point, it needs to break for both the original and the proposed implementation for at least one of the orderings of the tree, which may take some fiddling too.
Either that or you need to write a unit test with a manually constructed parse forest which exhibits the problem (which is a headache as well 😞 ).

I wish I could help more, but unfortunately I currently can't dedicate too much time to this.

@PieterOlivier
Contributor Author

PieterOlivier commented Dec 28, 2024 via email

@arnoldlankamp
Contributor

Ok, so disabling memoization entirely is not an option.

After sleeping on this: we could also selectively disable memoization, since we know the conditions under which cycles can occur. These are:

  • Results containing only nullables.
  • Results containing only one non-nullable sort (optionally surrounded by nullables).

This means we can always safely cache:

  1. Any non-zero length node that is part of a link with a non-zero length prefix.
  2. Any node which is part of a prefix for which the above holds true.
  3. Any nullable node that is part of a link for which all prefixes adhere to rule one at some point.

I think modifying the proposed solution to take rule one and two into account shouldn't be too difficult. Rule one is a fairly simple check and rule two is just a flag you can pass down while traversing a link's prefixes to indicate whether rule one held true for the tail at some point.
Enabling caching for rule three is slightly more involved, since it will require keeping track of state while traversing all prefixes. I would only attempt adding this if enabling caching for nodes that adhere to rule one and two proves to be inadequate.

Implementing these specific caching rules should resolve the issue I described, while still providing memoization for most nodes.
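
A minimal Java sketch of rules one and two follows. It is not the Rascal implementation and rests on two assumptions: an alternative is represented as an already flattened list of children with input offsets (instead of real link/prefix chains), and "a non-zero length prefix" is read as "the node starts after the start of its alternative". `Node` and `cacheable` are hypothetical names.

```java
import java.util.List;

final class CacheRuleSketch {
    // Hypothetical stand-in for a forest node; only the input offsets matter here.
    static final class Node {
        final String label;
        final int start, end;
        Node(String label, int start, int end) {
            this.label = label; this.start = start; this.end = end;
        }
        boolean isEmpty() { return start == end; }
    }

    // For one alternative (children in input order), reports per child
    // whether it may be cached under rules one and two.
    static boolean[] cacheable(List<Node> children) {
        boolean[] result = new boolean[children.size()];
        if (children.isEmpty()) return result;
        int alternativeStart = children.get(0).start;
        // Flag passed "down" from tail to head: did rule one hold somewhere
        // at or after the current position?
        boolean ruleOneHeldForTail = false;
        for (int i = children.size() - 1; i >= 0; i--) {
            Node node = children.get(i);
            // Rule one: non-zero length node with a non-zero length prefix.
            boolean ruleOne = !node.isEmpty() && node.start > alternativeStart;
            ruleOneHeldForTail |= ruleOne;
            // Rule two: everything in the prefix of such a node is cacheable too.
            result[i] = ruleOneHeldForTail;
        }
        return result;
    }
}
```

For children `A0-1, B1-3, ε3-3` this marks `A` and `B` as cacheable and leaves the trailing nullable to rule three. For `ε0-0, T0-1, ε1-1` nothing is cacheable, which matches the second cycle condition listed above.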

I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content). By using links instead of nodes for caching you avoid having to deal with this problem.

I hope this helps.

@arnoldlankamp
Contributor

I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content). By using links instead of nodes for caching you avoid having to deal with this problem.

Having said that, using the original implementation as a starting point (with the cycle mark stuff removed) will provide a more optimal solution, as it will provide more sharing opportunities compared to using links as keys.
The original implementation should also share identical nodes between different productions and alternatives with the same content (e.g.: The A in X::=AB and Y::=AC and some of the Es in rules like E::=E*E | E/E > E+E | E-E ...) and identical nodes with different (length) prefixes (e.g.: ambiguous results).

@PieterOlivier
Contributor Author

This sounds like a good approach. Before my "link caching" approach I was looking for a way to disable caching in cycles but keep caching everything branching off the "cycle stems". It seems you are suggesting an approach that does just that.

However your description is not completely clear to me, especially the first condition:

You wrote: "Any non-zero length node"
A non-zero length node just means calling isEmpty() on the node returns false right?

You wrote: "that is part of a link with a non-zero length prefix."
This part is unclear to me. Do you mean the link has only one prefix link and the node in that prefix link is a non-zero length node?

But does that not mean we will only cache nodes that have two non-nullable nodes in succession in their prefix chain?

@arnoldlankamp
Contributor

arnoldlankamp commented Jan 3, 2025

This sounds like a good approach. Before my "link caching" approach I was looking for a way to disable caching in cycles but keep caching everything branching off the "cycle stems". It seems you are suggesting an approach that does just that.

That is exactly the idea.
This was also what the "cycle mark" thingy was supposed to achieve (preventing the caching of sub-trees containing cycle nodes up to their respective roots, so the cycle nodes don't reference a node outside of a cached sub-tree). But apparently this solution didn't always work correctly.

However your description is not completely clear to me, especially the first condition:

You wrote: "Any non-zero length node" A non-zero length node just means calling isEmpty() on the node returns false right?

Correct.

You wrote: "that is part of a link with a non-zero length prefix." This part is unclear to me. Do you mean the link has only one prefix link and the node in that prefix link is a non-zero length node?

But does that not mean we will only cache nodes that have two non-nullable nodes in succession in their prefix chain?

I was referring to the entire prefix. All prefixes associated with a node always start at the same offset. Any node at the head of a prefix chain will give you the start location for all alternatives of the node one level higher in the forest.

A whiteboard would be very handy to explain this step by step, but to answer both questions clearly, I'll attempt to give a short description of how the parse forest is structured (1) and how we can recognize situations that could potentially contain cycles (2).
You can skip down to the second part in case any of this information is superfluous.

1. Parse forest
Basically the parse forest is a tree containing multi-headed, head-shared linked lists. It's structured this way since the parser produces binarized results (long story, let's leave it at that for now 😉 ).
Every node in the forest relates to a location in the input stream and can have one or more prefixes. Every prefix associated with a specific node has the same start location. Together these represent the collection of all matched alternatives (usually only one, unless the result is ambiguous). The links are the chains that make up the list of alternatives and each contains a node and a list of prefixes; each prefix is a link in the chain.
Nodes that represent non-terminals contain a list of alternatives for the next level down. Each of these alternatives starts with a link containing the tail node of the alternative.
Also note, that since each link can have more than one prefix, determining if a result is ambiguous can only be established by traversing all the links of all the prefixes.

To give an example: S ::= AAB; A ::= a | aa; B ::= b for input aaab, would give the following (flattened) tree:
*The numbers indicate the matched input string location.

amb([
  S0-4(A0-1('a'), A1-3('aa'), B3-4('b')),
  S0-4(A0-2('aa'), A2-3('a'), B3-4('b'))
])

The binarized parse forest (with terminals omitted for readability's sake) would look like:

                     S0-4
A0-1 <- A1-3 <-       /
                \    /
A0-2 <- A2-3 <---  B3-4

I.e.: S has one child (B), B has two prefixes to As that match different locations in the input stream which both have their own unique A prefix, since they reference a different input location and consequently will have different children.
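
The structure described above can be sketched in Java as follows. This is an illustration only: `Node`, `Link`, `alternatives` and `example` are hypothetical names, not the classes of the actual parser, and the head of a chain is assumed to be a link with an empty prefix list.

```java
import java.util.ArrayList;
import java.util.List;

final class ForestSketch {
    // Hypothetical node: a label plus the matched input range.
    static final class Node {
        final String label;
        final int start, end;
        Node(String label, int start, int end) {
            this.label = label; this.start = start; this.end = end;
        }
        @Override public String toString() { return label + start + "-" + end; }
    }

    // Hypothetical link: a node and the list of prefixes (each again a link).
    static final class Link {
        final Node node;
        final List<Link> prefixes;
        Link(Node node, List<Link> prefixes) { this.node = node; this.prefixes = prefixes; }
    }

    // Walks from the tail link back through all prefixes and
    // returns every alternative as a space-separated child list.
    static List<String> alternatives(Link tail) {
        List<String> out = new ArrayList<>();
        collect(tail, "", out);
        return out;
    }

    private static void collect(Link link, String suffix, List<String> out) {
        String path = suffix.isEmpty() ? link.node.toString() : link.node + " " + suffix;
        if (link.prefixes.isEmpty()) { out.add(path); return; } // reached the head
        for (Link prefix : link.prefixes) collect(prefix, path, out);
    }

    // The children of S0-4 for S ::= AAB; A ::= a | aa; B ::= b on input "aaab".
    static Link example() {
        Link a01 = new Link(new Node("A", 0, 1), List.of());
        Link a13 = new Link(new Node("A", 1, 3), List.of(a01));
        Link a02 = new Link(new Node("A", 0, 2), List.of());
        Link a23 = new Link(new Node("A", 2, 3), List.of(a02));
        return new Link(new Node("B", 3, 4), List.of(a13, a23));
    }
}
```

Calling `ForestSketch.alternatives(ForestSketch.example())` yields the two child sequences `A0-1 A1-3 B3-4` and `A0-2 A2-3 B3-4`, which also shows why ambiguity can only be established by following all prefixes of all links.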

2. Cycles
Now as for recognizing potential cycle candidates. There can only be cycles between nodes that represent the same input string location. E.g. : S ::= T; T ::= S | a (full one-to-one match) or similarly when rules contain nullables: S ::= UTU; T ::= S | a; U ::= ε, in which case S and T would also match the same input.

So while traversing the parse forest, as soon as we enter an alternative which represents a smaller part of the input stream than the parent alternative, we know no cycles can occur on any direct line between this sub-tree and the root of the parse forest.
If this holds true for all related alternatives, we know we can safely cache the parent node.
So basically if all alternatives for a node contain at least two non-nullable nodes we're good.
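
As a rough sketch of this check in Java: the code below assumes each alternative is given as a flattened list of child spans and treats "non-nullable" as "non-zero length", in line with the `isEmpty()` remark earlier in this thread. `Span`, `coversSameInput` and `safeToCache` are hypothetical names.

```java
import java.util.List;

final class CycleCandidateSketch {
    // Hypothetical input range of a node.
    static final class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
        boolean isEmpty() { return start == end; }
    }

    // A child can only be on a cycle with its parent if both match the same input.
    static boolean coversSameInput(Span parent, Span child) {
        return parent.start == child.start && parent.end == child.end;
    }

    // A node is considered safe to cache if every alternative contains
    // at least two non-empty children, so no child spans the parent's full input.
    static boolean safeToCache(List<List<Span>> alternatives) {
        for (List<Span> alternative : alternatives) {
            int nonEmpty = 0;
            for (Span child : alternative) {
                if (!child.isEmpty()) nonEmpty++;
            }
            if (nonEmpty < 2) return false;
        }
        return true;
    }
}
```

For `S ::= UTU` on input `a` the single alternative has only one non-empty child, so `safeToCache` returns false. For the `S ::= AAB` example both alternatives have three non-empty children, so it returns true.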

@arnoldlankamp
Contributor

arnoldlankamp commented Jan 3, 2025

The alternative would be that I have a look at the cycle mark thingy, figure out why it goes wrong and fix that. The code is rather hard to follow, but I still remember how it is supposed to work, even though it must be about 14 years since I wrote it 😅 .

Do you know what the original issue was?

I assume the included test case reproduces the problem?

Unfortunately I have very little time I can devote to this at the moment beyond answering some questions, otherwise I'd already have tried to fix the issue 😞 . Bugs always keep bugging me.

@arnoldlankamp
Contributor

arnoldlankamp commented Jan 8, 2025

Since it seems we could use some additional tests related to cycles and their edge cases, it might be good to add one which resembles the following grammar. It produces two cycles to the same root node at different depths. If caching happens at the wrong level we'd end up with an incorrect result.

S ::= A | B | a
A ::= B
B ::= S

For input "a", the expected result should look something like:

amb([
  S(A(B(cycle(S, 3)))),
  S(B(cycle(S, 2))),
  S("a")
])
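
To see why caching at the wrong level breaks this example, here is a small Java sketch. It expands the grammar directly instead of flattening a real parse forest, and its memo table is keyed by symbol name, which is a deliberately naive assumption for illustration. `RULES` and `flatten` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CycleDepthSketch {
    // The grammar above; every alternative is a single symbol.
    static final Map<String, List<String>> RULES = Map.of(
        "S", List.of("A", "B", "a"),
        "A", List.of("B"),
        "B", List.of("S"));

    // Returns all trees for sym; pass memo == null to disable caching.
    static List<String> flatten(String sym, List<String> stack, Map<String, List<String>> memo) {
        int index = stack.indexOf(sym);
        if (index >= 0) {
            // sym is already on the traversal stack: emit a cycle marker
            // with the distance back to that earlier occurrence.
            return List.of("cycle(" + sym + ", " + (stack.size() - index) + ")");
        }
        if (!RULES.containsKey(sym)) return List.of("\"" + sym + "\""); // terminal
        if (memo != null && memo.containsKey(sym)) return memo.get(sym);
        stack.add(sym);
        List<String> out = new ArrayList<>();
        for (String child : RULES.get(sym)) {
            for (String sub : flatten(child, stack, memo)) {
                out.add(sym + "(" + sub + ")");
            }
        }
        stack.remove(stack.size() - 1);
        if (memo != null) memo.put(sym, out); // naive: ignores the traversal path
        return out;
    }
}
```

Without a memo table, `flatten("S", new ArrayList<>(), null)` produces the three alternatives shown above. With `new HashMap<>()` as memo table, the sub-tree for `B` is cached while it sits below `A` and is then reused directly under `S`, giving `S(B(cycle(S, 3)))` instead of `S(B(cycle(S, 2)))`.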

@jurgenvinju
Member

Very happy to see this collaboration. Thanks both! Have you considered caching only the nodes that are ambiguous?

In the end only the ambiguous nodes can cause superlinear flattening times, due to the Cartesian product effect of nested ambiguity. The non-ambiguous nodes cannot be shared anyway without an ambiguous parent, due to the position information on every node. All cycles have an ambiguous parent (in other words, they contain an ambiguous link), so this strategy also covers the cycles. It's not an optimal solution in terms of the size of the resulting forest, but it might be optimal in terms of the time it takes to construct it.

Groetjes!

@PieterOlivier
Contributor Author

It turns out pure link-based memoization indeed does not work. I will abandon this PR and my plan is to replace it with a better one based on Arnold's ideas.
The current status is that I have implemented node memoization based on all three conditions and initially this seemed to work great, including some counter-examples of earlier approaches. The example above yields exactly the forest it should.
Unfortunately, during extensive error recovery tests, 5 out of almost a million tests failed, all in a single input file with errors introduced around the same location. "Failed" in this context means that the forest built with memoization is not equal to the forest constructed without memoization. I am currently investigating this issue.

The original node memoization code failed over 14,000 times in the same test set, so I guess there is some progress there.

@jurgenvinju I have briefly tried to only cache ambiguous nodes, but that also resulted in differences between memoized and non-memoized trees. I have not investigated this further.
