-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Idea to encode grammar #1
Comments
Hi, thanks. I already do that, even if I use Elias delta encoding instead of interpolative coding. Some more details are in section 3.3.here https://arxiv.org/pdf/1704.08558.pdf (I don't remember the details, a few years have passed). |
If I understood that correctly your algorithm selects the best candidate (only) when there are multiple pairs with the same frequency. Will that produce the best ordering for the final encoding of the whole grammar array? (that wasn't clear to me to be honest, I don't have a black belt in computer science 😄 ) And I proposed to make the remapping of the symbol values (in the grammar and in the output text) at the very end of the re-pair. You have complete freedom how to remap/reorder them to improve the compression rate of the grammar rules array. Could it make sense? |
Ah, right: we sort only when the frequency is the same. Ok, I got your idea. I think there are some issues, but the underlying optimization problem looks interesting. One issue that I see is that if you re-map symbols so that first symbols are increasing, then also the left-hand sides of the rules get remapped, so you could lose that order. For example: 1 -> 3 4 to apply your idea, we need to re-map 3->1 and 1->3 (or any other map such that map(3) < map(1)) but that would give us 3 -> 1 4 so now first symbols are sorted, but left-hand sides are no longer sorted. if you wish, we can continue this discussion via email (you can find it in my page https://www.unive.it/data/persone/24170321) |
You can remap the grammar symbols making the array of pairs sorted.
Imagine you have the rules:
256->[17][23]
257->[10][256]
258->[15][257]
After remapping the grammar symbol values you will have the table sorted by the ([firstSymbol],[secondSymbol]) pairs:
256->[10][258]
257->[15][256]
258->[17][23]
The pairs can be treated as long numbers and the sorted table could be compressed using Interpolative Coding
The text was updated successfully, but these errors were encountered: