Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement compaction #32

Open
djspiewak opened this issue Feb 17, 2017 · 4 comments
Open

Implement compaction #32

djspiewak opened this issue Feb 17, 2017 · 4 comments

Comments

@djspiewak
Copy link
Owner

djspiewak commented Feb 17, 2017

The original Might, Darias and Spiewak paper reported a 90x speed-up when implementing compaction. Adams, Hollenbeck and Might validated this, as well as providing some improvements to the grammar traversals introduced by repeated compaction.

@djspiewak
Copy link
Owner Author

djspiewak commented Feb 20, 2017

Compaction for everything aside from Union is now implemented in master. The benefits were measurable, but not massive. I saw a roughly 2x improvement.

Two things remain to try:

  • Conservative Union compaction. This is more difficult to implement with our encoding, but I think it may be possible.
  • Filter bubbling. This isn't relevant to the benchmarks right now, but it will be in practice. Filter nodes should be "bubbled up" similar to Apply nodes.

@djspiewak
Copy link
Owner Author

I was able to find a way to implement Union compaction by way of deferred delegation. This is essentially the lazy version of what is done in the Adams Racket implementation (which uses tag mutation instead). I also attempted to implement even more conservative Union compaction: only compacting unions generated by Sequence derivation (since these are guaranteed not to be recursive).

In both cases, I found a serious problem with the whole concept of Union compaction: errors. Aggressive Union compaction makes this very clear with grammars like the following:

lazy val p: Parser[Any] = p ~ "a" | "a"

Apply p to an input string of "b". The error message should be "expected 'a', got 'b'", or something of that sort. Unfortunately, compaction makes this impossible. The right side of the alternate will reduce to a Failure, which will in turn result in a compaction which leaves only the derivative of p ~ "a" w.r.t. b, which in turn will be a compaction of a unionized failure, and thus is simply p ~ "a". The problem is that this is a degenerate form, and the only error it can produce is UnboundedRecursion. The expectation of the next character is completely lost; the information no longer exists in the degenerate form, nor can it be recovered!

More conservative compaction (of just Sequence derivatives) has the same problem, because the form above (p ~ "a" | "a") is in fact derivable from other valid forms. Thus, we cannot safely compact Union in general without throwing away significant error information, which is a non-starter. There may be some very, very specific circumstances wherein it is safe, but I doubt those circumstances form a particularly significant performance optimization. Similarly, it may be possible to carry along "compacted error" information. So rather than compacting to just the right/left of a nulled union, you would compact to the "hypothesized error" of the right/left, where the hypothetical error is merged with any produced errors in the event of failure, or ignored if success. This is unlikely to be substantially more efficient than just Union of a Failure, though it's probably slightly better.

So basically, Union compaction is out. Filter bubbling is still on the table, but unlikely to make a significant impact.

@weihsiu
Copy link

weihsiu commented Jul 18, 2017

does this mean that the goal of coming within the ballpark of parboiled performance is not attainable anymore?

@djspiewak
Copy link
Owner Author

@weihsiu Sorry for the delay in responding…  It's unclear actually what the attainable performance is. Efforts here have stalled out on two extremely stupid factors: obtaining a large (millions of lines) set of C sources to test a realistic grammar, and simply my own lack of spare time over the past few months. It's entirely possible that my current benchmarking (which is based on a simple expr grammar) is hitting an inherent pathology in the algorithm. It's certainly quite similar to a known pathological case, so this seems like a real possibility and not just wishful thinking. The goal would be to benchmark with a real grammar (in this case, ANSII C) to see how things actually work in practice, but that has been held up on fiddling with the C preprocessor in order to generate an appropriate corpus.

In the worst case, the performance will remain bad, and we won't be able to replicate the results from the Adams paper due to the constraints of our problem space (thread safety, producing errors, and parsing rather than recognition). If that is indeed the case, then it calls into question the practical applicability of the algorithm, though more implementation work will be needed to really make that assertion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants