Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generate small/fast table for changes_when_{uppercased,lowercased} #4

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

thomcc
Copy link
Contributor

@thomcc thomcc commented Oct 11, 2023

This a rough pr with finished design/impl but the code is just a mess1, and I wasn't really going to PR this, but here it is . The output combined all into one playground can be seen here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3568421c224d239f574e1eec7a964381. It has a test which shows answers both queries correclty for every char.

The basic notes are:

  • both props are answered w/ a binary search against the same ~1kbish table. Code size impact is minimal.
  • the logic to search and interpret the table is small and fast, especially for characters that never change under lowercase/uppercase mapping.
  • There's a special case for ASCII inputs so they don't have to hit a table at all.
  • The table encoding/format/generator are totally custom for this purpose. Otherwise it'd be bigger.

I could probably get it faster than this but it's already overengineered enough.


The overall approach is to store a list of ranges in a table, and binary search that. The ranges might indicate:

  • a run chars that all have the same property values for those two properties (range of chars which only change when uppercased, range of chars which change for both, etc). for example U+00c0..=U+00d6 only change under lowercase (but not uppercase)
  • A run of chars which alternate between upper,lower,upper,lower,upper,lower (or similarly lower,upper,lower,upper,...). This is very common in Unicode, and special casing this is why the is only 200ish ranges (and not over 1000, most of them for a single character)

Runs with lengths that dont fit into 8 bits are split into multiple contiguous smaller ones, and then it's encoded as u32, as MSB[21 bit start_char | 3 bit range type | 8 bit length]LSB.

This seems likely to work indefinitely since 21 bits can fit any char, and we dont use all the values for the 3 bit range type. That said, it probably won't have to change.

The generator uses a greedyish algorithm to categorize every character into a range. It then filters out ASCII and "no changes" ranges, splits them up (with minimal cleanup) and encodes each range into a u32.

I started this a while ago, then forgot, then came back and banged it all out this weekend. It's a problem space near/dear to my hear tho -- ive spent an unreasonable amount of time on making unicode tables better.


Re: #3 (comment)

Yeah, as I said, as long as it doesn't blow up the size, it should be fine

Well, the size impact of this on generated binary is very small, and it runs fast too. But it's rather high complexity in the table generator. So the size impact in this repo is high. I don't mind owning/maintaining that, and I'm happy to get the code more cleaned up if you want, but I wouldn't be offended if you don't.

FWIW, it only relies on unicode stable promises (like the number of bits in a codepoint), and handles cases that could change (8 len bits not being enough). That means in theory future unicode updates should not come with any drama.

It's unfortunate that essentially every crate (std, regex, us) has to ship its own version of unicode data though.

My general feeling is that with enough elbow grease you can get any of the unicode tables small. It's a compression (and data access) problem, just not a very well-studied one for whatever reason. I had a scratch workspace that could produce all tables regex needed with ~40kb data. It would have required a lot of changes to actually use, so I put it in my own regex engine, which never saw the light of day. So it goes.

Still, it's not ideal that everything has their own tables, but I don't see an alternative really. I don't want load them from the system, and stuff like icu4x feels like the wrong choice for code which cares about footprint.

Footnotes

  1. Like, it's a huge mess aside from proving the approach (the code has tons of debugging stuff and duplication, etc), I'd clean it up a lot if you were interested.

Copy link
Contributor Author

@thomcc thomcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went through and pointed at a set of "yep thats completely clownshoes" things.

just to save you the time from actually pointing out the issues in the code.

@@ -48,17 +48,13 @@ jobs:
args: -- --check

clippy:
name: Clippy
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont even remember touching this stuff but I guess I did.

@@ -0,0 +1,36 @@
# Table generator for cow-utils-rs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this documentation is probably wrong or incomplete. i'll clean it up later.

@@ -0,0 +1,61 @@
pub(super) fn changes_when_casemapped_nonascii<const MAP_LOWER: bool>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two copies of this file (one in tabgen/src/search.rs) just to shut rust-analyzer up. its absolutely not needed.

@@ -0,0 +1,56 @@
mod search;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this stuff can be much simplified if i give up on trying to reuse the search.rs file in both the generator and generatee (which is worth doing). like, that's the reason it's 3 files here, it should just be 1.

"#[cfg(test)]\npub(super) const UNICODE_VERSION: (u8, u8, u8) = ({}, {}, {});\n",
genver.0, genver.1, genver.2,
);
let _ = std::fs::write("table.rs", &file);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hacky too.

@@ -0,0 +1,495 @@
//! The basic idea is that we segment codepoints into one of a,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep this one is wrong too. also, i'm aware theres too much redundant stuff in this file, it can be simplified.

@RReverser
Copy link
Owner

and stuff like icu4x feels like the wrong choice for code which cares about footprint

Oh yeah so I actually did try to use icu-properties for this couple of days ago, but somehow it made code even slower than it is right now (with manual changes_when_uppercased implementation).

I'm going to try and review this soon-ish, but, as you said, it's a lot of code so might take some time.

Meanwhile, could you post benchmarks for before/after this change? I wonder if it actually speeds things up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants