-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Link based parse memoization #2100
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2100 +/- ##
=======================================
Coverage 49% 49%
- Complexity 6318 6334 +16
=======================================
Files 666 665 -1
Lines 59695 59672 -23
Branches 8670 8663 -7
=======================================
+ Hits 29601 29605 +4
+ Misses 27860 27834 -26
+ Partials 2234 2233 -1 ☔ View full report in Codecov by Sentry. |
Perhaps @arnoldlankamp would also have a look? Arnold, this fix was very well regression tested; but it would be great if you can interpret these changes and tell us what you think. The issues that were solved were strictly related to cycles on the parse graph (i.e. reductions without accepting any characters, arriving at an earlier node again). |
@jurgenvinju Sure, I'll have a look when I have some time. |
Looking back at the original implementation, the 'no memoization within cycles' code indeed does not seem to work correctly in this case. I'll try to describe the problem I encountered while writing the original implementation: Cycles are tracked by a stack that is maintained during the traversal of the parse forest, since nodes/links can be encountered more than once (due to ambiguities), via different paths through the parse forest. You can run into a situation where a node or link is both part of a cycle via one path and not part of a cycle via another one. In which case caching the sub-tree will lead to incorrect results, since they will be different for different paths. Caching sub-trees containing cycles may not always be safe, since sub-trees containing cycle nodes essentially contain 'references' to nodes encountered outside of itself and are thus not nicely self contained. Conversely, caching sub-trees which do not contain cycles, but for which a variant containing a cycle may exist, may also not be safe either. Both should be done conditionally. The easiest way to fix all these issues is to remove sub-tree caching entirely, but this will likely incur a performance penalty for parse forests containing a lot of ambiguities. The actual performance impact is however something you could test. Either way, this issue will likely not be easy to resolve correctly in any other way 🤔 . If a better solution comes to me I'll let you know. P.S.: If you want to try your hand at constructing an alternative solution it would be nice to have a test case for it. You'll need a grammar which produces a parse forest that bifurcates and merges and in which only one of the to alternatives of this bifurcation produces a cycle with a node later on in the parse forest. And you'll need one for both orderings, cycle first and no cycle first (so you may need do some fiddling with the grammar to get the parse forest to come the right way out of the parser to get both variants). Also it needs to break for both the original and the proposed implementation for at least one of the orderings of the tree, to be a valid starting point, which may take some fiddling too. I wish I could help more, but I currently can't dedicate to much time on this unfortunately. |
Thank you for having a look into this!
Unfortunately, disabling caching in the flattener is not an option in the
context of error-recovery. Parse forest generated by error recovery tend to
be highly ambiguous and full of cycles. So ambiguous in fact that any
algorithm that traverses ambiguous trees (like the flattener) must use some
form of memoization or it might take forever.
We have stress-tested the flattener memoization options by using
error-recovery on all Rascal source files in the Rascal repo, running close
to a million different recovery tests with different modifications of the
original Rascal source files. We ran each of these tests three times:
- Without memoization (cut off if the flattener took longer than 2 seconds
which was about 3% of the time)
- With the old "node memoization"
- With the new "link memoization"
Each time comparing the results of the two memoization options with the
result of the "no memoization" test if that yielded a result within 2
seconds. Note that in those cases the "node memoization" solution also
often did not deliver a result within 2 seconds, which is understable given
that memoization is disabled within cycles in this case.
With the old "node memoization" in place, over 14000 (out of a million) of
these tests failed because the memoized result differed from the
no-memoization result.
With the "link memoization" enabled, none of the tests failed, the forests
with and without link memoization where always identical.
This of course leaves us with the 3% cases where the test without
memoization took too much time, and with the possibility that the tests
just did not hit the issue you described. It might well be that the Rascal
syntax does not have the required constructs to ever produce a counter
example. It might even be the case that parse graphs produced by error
recovery never exhibit the issue for some reason.
I will have a go at hand-crafting a counter example for the proposed
solution, maybe in combination with generating different inputs.
In the meantime it seems that the link memoization option is at least
"better" (less erroneous trees are produced) than the node memoization
solution, while in all cases delivering decent performance. This is
especially true for error forests.
It is very unsatisfactory that we have a known issue in the flattener
without a clear path to a solution.
…On Sat, Dec 28, 2024 at 2:36 AM arnoldlankamp ***@***.***> wrote:
Looking back at the original implementation, the 'no memoization within
cycles' code indeed does not seem to work correctly in this case.
However, while the proposed solution works for the included test case,
unfortunately it will not do so in general.
I'll try to describe the problem I encountered while writing the original
implementation:
Cycles are tracked by a stack that is maintained during the traversal of
the parse forest, since nodes/links can be encountered more than once (due
to ambiguities), via different paths through the parse forest. You can run
into a situation where a node or link is both part of a cycle via one path
and not part of a cycle via another one. In which case caching the sub-tree
will lead to incorrect results, since they will be different for different
paths. Caching sub-trees containing cycles may not always be safe, since
sub-trees containing cycle nodes essentially contain 'references' to nodes
encountered outside of itself and are thus not nicely self contained.
Conversely, caching sub-trees which do not contain cycles, but for which a
variant containing a cycle may exist, may also not be safe either. Both
should be done conditionally.
For this reason, the original implementation excluded sub-trees containing
cycles from being cached. But, as you noticed, if the non-cycle containing
version of the sub-tree is encountered first, it *would* be cached and
reused for the path which *should* contain a cycle; leading to incorrect
results (at least that's what I assume is what you were encountering, by
just looking at the code, since I didn't have time to get Rascal back up
and running again to check).
The proposed solution seems to now use links as keys instead of nodes for
the sub-tree cache, which are (at the very least) unique at the tail end of
results for all alternatives at a specific offset + length combination
(this resolves the issue for the included test case, since it reduces the
amount of caching opportunities and also happens to excludes the one which
was originally displaying invalid behavior). However it still *might*
cause problems in combination with prefix shared alternatives (e.g.: ABC
| ABD like rules, where part of the links are shared between different
alternatives) and *will* often still lead to invalid results for parse
forests which contain the aforementioned issue of cycles in sub-trees being
conditional on the traversal path through the forest.
The easiest way to fix all these issues is to remove sub-tree caching
entirely, but this will likely incur a performance penalty for parse
forests containing a lot of ambiguities. The actual performance impact is
however something you could test.
Not having to check the cache at ever step, reducing code complexity and
reducing memory usage & GC pressure (by not having to maintain the sub-tree
cache), will improve performance in the general case (i.e.: parse forests
with limited to no ambiguity) and may even offset the potential performance
impact for flattening highly ambiguous parse forests enough for it not to
be too much of an issue, if at all.
Either way, this issue will likely not be easy to resolve correctly in any
other way 🤔 .
If a better solution comes to me I'll let you know.
P.S.: If you want to try your hand at constructing an alternative solution
it would be nice to have a test case for it. You'll need a grammar which
produces a parse forest that bifurcates and merges and in which only one of
the to alternatives of this bifurcation produces a cycle with a node later
on in the parse forest. And you'll need one for both orderings, cycle first
and no cycle first (so you may need do some fiddling with the grammar to
get the parse forest to come the right way out of the parser to get both
variants). Also it needs to break for both the original and the proposed
implementation for at least one of the orderings of the tree, to be a valid
starting point, which may take some fiddling too.
Either that or you need write a unit test with a manually constructed
parse forest, which exhibits the problem (which is a headache as well 😞 ).
I wish I could help more, but I currently can't dedicate to much time on
this unfortunately.
—
Reply to this email directly, view it on GitHub
<#2100 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AYU6HZSVJOO47TOYHRTCR7T2HX6DFAVCNFSM6AAAAABTTNQALOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRUGEZTCNRRHA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Ok, so disabling memoization entirely is not an option. After sleeping on this. We could also selectively disable memoization, since we know the conditions under which cycles can occur. Which are:
This means we can always safely cache:
I think modifying the proposed solution to take rule one and two into account shouldn't be too difficult. Rule one is a fairly simple check and rule two is just a flag you can pass down while traversing a link's prefixes to indicate whether rule one held true for the tail at some point. Implementing these specific caching rules should resolve the issue I described, while still providing memoization for most nodes. I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content), by using links instead of nodes for caching you avoid having to deal with this problem. I hope this helps. |
Having said that. Using the original implementation as starting point (with the cycle mark stuff removed) will provide a more optimal solution, as it will provide more sharing opportunities, compared using links as keys. |
This sounds like a good approach. Before my "link caching" approach I was looking for a way to disable caching in cycles but keep caching everything branching of the "cycle stems". It seems you are suggesting an approach that does just that. However your description is not completely clear to me, especially the first condition: You wrote: "Any non-zero length node" You wrote: "that is part of a link with a non-zero length prefix." But does that not mean we will only cache nodes that have two non-nullable nodes in succession in their prefix chain? |
That is exactly the idea.
Correct.
I was referring to the entire A whiteboard be very handy to explain this in a step by step way, but to answer both questions clearly, I'll attempt to give a short description of how the parse forest is structured (1) and how we can recognize situations that could potentially contain cycles (2). 1. Parse forest To give an example:
The binarized parse forest (with terminals omitted for readability's sake) would look like:
I.e.: 2. Cycles So while traversing the parse forest, as soon as we enter an |
The alternative would be that I have a look at the cycle mark thingy; figure out why it goes wrong and fix that. The code is rather hard to follow, but I still remember how it is supposed to work, even though must be about 14 years since I wrote it 😅 . Do you know what the original issue was? I assume the included test case reproduces the problem? Unfortunately I have very little time I can devote to this at the moment beyond answering some questions, otherwise I'd already have tried to fix the issue 😞 . Bugs always keep bugging me. |
Since it seems we could use some additional tests related to cycles and their edge cases, it might be good to add one which resembles the following grammar. It produces two cycles to the same root node at different depths. If caching happens at the wrong level we'd end up with an incorrect result.
For input
|
Very happy to see this collaboration. Thanks both! Have you considered caching only the nodes that are ambiguous? In the end only the ambiguous nodes can cause super lineair flattening times due to the Cartesian product effect of nested ambiguity, and the non ambiguous nodes can not cause sharing anyway, without an ambiguous parent, due to the position information on every node. All cycles have an ambiguous parent (i.o.w. contain an ambiguous link), so this strategy also covers the cycles. It's not an optimal solution in terms of the size of the resulting forest, but it might be optimal in terms of the time it takes to construct it. Groetjes! |
It turns out pure link-based memoization indeed does not work. I will abandon this PR and my plan is to replace it with a better one based on Arnold's ideas. The original node memoization code failed over 14.000 times in the same test set so I guess there is some progress there. @jurgenvinju I have briefly tried to only cache ambiguous nodes but that also resulted in differences between memoized and non-memoized trees. I have not investiged this further. |
During conversion of the parse graph to a parse forest,
AbstractNode
was used as a memoization key. This turned out to be incorrect even with the "no memoization within cycles" hack.The
memoCycleBug
test in this PR tests for this problem.The solution turned out to be to use the
Link
objects as memoization key instead. This ensures correct conversion from parse graph to parse forest and as a bonus improved performance when a lot of cycles are present as memoization can occur normally within cycles.Note that this means all of the
CycleMark
related code has also been removed as it was only used to disable memoization within cycles.