generate small/fast table for changes_when_{uppercased,lowercased} #4
base: master
Conversation
Went through and pointed at a set of "yep, that's completely clownshoes" things, just to save you the time of actually pointing out the issues in the code.
@@ -48,17 +48,13 @@ jobs:
        args: -- --check

  clippy:
    name: Clippy
I don't even remember touching this stuff, but I guess I did.
@@ -0,0 +1,36 @@
# Table generator for cow-utils-rs
This documentation is probably wrong or incomplete; I'll clean it up later.
@@ -0,0 +1,61 @@
pub(super) fn changes_when_casemapped_nonascii<const MAP_LOWER: bool>(
There are two copies of this file (one in tabgen/src/search.rs) just to keep rust-analyzer quiet. It's absolutely not needed.
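For context on what this shared search code does: the lookup is a binary search over a sorted table of packed ranges. Below is a minimal sketch under assumptions; the table contents, bit layout, and the helper name `in_table` are invented for illustration and are not the PR's actual code.

```rust
// Hedged sketch: binary-search a sorted table of packed ranges for a
// codepoint. Table contents, layout, and names are illustrative only.

// Assumed layout per entry: start << 11 | kind << 8 | len (kind left 0 here).
const TABLE: &[u32] = &[
    (0x00C0 << 11) | 23, // U+00C0..=U+00D6
    (0x00D8 << 11) | 31, // U+00D8..=U+00F6
];

fn in_table(c: char) -> bool {
    let cp = c as u32;
    // Find the first range that starts after `cp`; the only candidate
    // containing `cp` is the entry just before that position.
    let idx = TABLE.partition_point(|&packed| (packed >> 11) <= cp);
    if idx == 0 {
        return false;
    }
    let packed = TABLE[idx - 1];
    let (start, len) = (packed >> 11, packed & 0xFF);
    cp < start + len
}

fn main() {
    assert!(in_table('À'));  // U+00C0, inside the first range
    assert!(!in_table('A')); // ASCII, before any range
    assert!(!in_table('×')); // U+00D7, in the gap between ranges
}
```

`partition_point` keeps the lookup O(log n) over the sorted starts, which is why the table only needs to store range starts and lengths.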
@@ -0,0 +1,56 @@
mod search;
All this stuff can be simplified a lot if I give up on trying to reuse the search.rs file in both the generator and the generated code (which is worth doing). That's the reason it's three files here; it should just be one.
"#[cfg(test)]\npub(super) const UNICODE_VERSION: (u8, u8, u8) = ({}, {}, {});\n",
genver.0, genver.1, genver.2,
);
let _ = std::fs::write("table.rs", &file);
This is hacky too.
@@ -0,0 +1,495 @@
//! The basic idea is that we segment codepoints into one of a,
Yep, this one is wrong too. Also, I'm aware there's too much redundant stuff in this file; it can be simplified.
Oh yeah, so I actually did try to use icu-properties for this a couple of days ago, but somehow it made the code even slower than it is right now (with manual

I'm going to try to review this soon-ish, but, as you said, it's a lot of code, so it might take some time. Meanwhile, could you post benchmarks for before/after this change? I wonder whether it actually speeds things up.
This is a rough PR with a finished design/impl, but the code is just a mess¹, and I wasn't really going to PR this, but here it is. The output, combined into one playground, can be seen here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3568421c224d239f574e1eec7a964381. It has a test which shows it answers both queries correctly for every char.
The basic notes are:

- I could probably get it faster than this, but it's already overengineered enough.
- The overall approach is to store a list of ranges in a table and binary-search it. A range might indicate:
  - a run like U+00c0..=U+00d6 whose characters only change under lowercase (but not uppercase);
  - an alternating upper,lower,upper,lower,upper,lower run (or similarly lower,upper,lower,upper,...). This is very common in Unicode, and special-casing it is why there are only 200-ish ranges (and not over 1000, most of them for a single character).
- Runs with lengths that don't fit into 8 bits are split into multiple contiguous smaller ones, and then each range is encoded as a u32, as MSB[21 bit start_char | 3 bit range type | 8 bit length]LSB. This seems likely to work indefinitely, since 21 bits can fit any char and we don't use all the values of the 3-bit range type. That said, it probably won't have to change.
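A minimal sketch of that packed layout; only the bit widths come from the description above, and the particular kind value used is made up for illustration:

```rust
// Hedged sketch of the u32 range encoding described above:
// MSB[21-bit start_char | 3-bit range type | 8-bit length]LSB.

fn encode(start: u32, kind: u32, len: u8) -> u32 {
    debug_assert!(start < (1 << 21)); // any char fits in 21 bits
    debug_assert!(kind < (1 << 3));   // 3-bit range type
    (start << 11) | (kind << 8) | (len as u32)
}

fn decode(packed: u32) -> (u32, u32, u8) {
    (packed >> 11, (packed >> 8) & 0b111, (packed & 0xFF) as u8)
}

fn main() {
    // U+00C0..=U+00D6 is a 23-char range; round-trip it through the encoding.
    let packed = encode(0x00C0, 2, 23);
    assert_eq!(decode(packed), (0x00C0, 2, 23));
}
```

Because 21 + 3 + 8 = 32, the three fields tile the u32 exactly, and sorting the packed values sorts by start_char first, which is what the binary search relies on.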
The generator uses a greedy-ish algorithm to categorize every character into a range. It then filters out ASCII and "no changes" ranges, splits them up (with minimal cleanup), and encodes each range into a u32.
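The run-splitting step (breaking runs longer than 255 into contiguous chunks that fit the 8-bit length field) can be sketched like this; `split_run` is an assumed helper, not the generator's real code:

```rust
// Hedged sketch: split a run longer than 255 characters into contiguous
// chunks whose lengths each fit the 8-bit length field.

fn split_run(start: u32, len: u32) -> Vec<(u32, u8)> {
    let mut out = Vec::new();
    let mut pos = start;
    let mut remaining = len;
    while remaining > 0 {
        let chunk = remaining.min(255) as u8;
        out.push((pos, chunk));
        pos += chunk as u32;
        remaining -= chunk as u32;
    }
    out
}

fn main() {
    // A 600-char run splits into 255 + 255 + 90, back to back.
    let chunks = split_run(0x1_0000, 600);
    assert_eq!(chunks, vec![(0x1_0000, 255), (0x1_00FF, 255), (0x1_01FE, 90)]);
    // The chunks tile the original run with no gaps.
    let total: u32 = chunks.iter().map(|&(_, l)| l as u32).sum();
    assert_eq!(total, 600);
}
```

Since the chunks stay contiguous and sorted, the split is invisible to the binary search; it only costs a few extra table entries.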
I started this a while ago, then forgot about it, then came back and banged it all out this weekend. It's a problem space near and dear to my heart though -- I've spent an unreasonable amount of time on making Unicode tables better.
Re: #3 (comment)
Well, the size impact of this on the generated binary is very small, and it runs fast too. But it's rather high complexity in the table generator, so the size impact in this repo is high. I don't mind owning/maintaining that, and I'm happy to clean the code up more if you want, but I wouldn't be offended if you don't.
FWIW, it only relies on Unicode stability promises (like the number of bits in a codepoint), and it handles the cases that could change (8 length bits not being enough). That means that, in theory, future Unicode updates should not come with any drama.
My general feeling is that with enough elbow grease you can get any of the Unicode tables small. It's a compression (and data access) problem, just not a very well-studied one, for whatever reason. I had a scratch workspace that could produce all the tables regex needed with ~40 KB of data. It would have required a lot of changes to actually use, so I put it in my own regex engine, which never saw the light of day. So it goes.
Still, it's not ideal that everything has its own tables, but I don't really see an alternative. I don't want to load them from the system, and stuff like icu4x feels like the wrong choice for code that cares about footprint.
Footnotes
Like, it's a huge mess aside from proving the approach (the code has tons of debugging stuff and duplication, etc.); I'd clean it up a lot if you were interested. ↩