generate small/fast table for changes_when_{uppercased,lowercased} #4

thomcc · 2023-10-11T01:27:59Z

This is a rough PR: the design and implementation are finished, but the code is just a mess¹. I wasn't really going to PR this, but here it is. The output, all combined into one playground, can be seen here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3568421c224d239f574e1eec7a964381. It has a test which shows that it answers both queries correctly for every char.

The basic notes are:

- Both props are answered with a binary search against the same ~1 KB table. Code size impact is minimal.
- The logic to search and interpret the table is small and fast, especially for characters that never change under lowercase/uppercase mapping.
- There's a special case for ASCII inputs so they don't have to hit a table at all.
- The table encoding/format/generator are totally custom for this purpose. Otherwise it'd be bigger.
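
As a rough illustration of the notes above (not the code in this PR), here is a minimal sketch of an ASCII fast path followed by a binary search over sorted runs. The `Run` struct, `TOY_TABLE`, and `lookup` are hypothetical names, and the table is a tiny hand-written excerpt covering only a few Latin-1 letters, not the generated table.

```rust
// Hypothetical, unpacked run representation used only for this sketch.
#[derive(Clone, Copy)]
struct Run {
    start: u32,  // first code point of the run
    len: u32,    // number of code points in the run
    lower: bool, // chars in the run change when lowercased
    upper: bool, // chars in the run change when uppercased
}

// Toy excerpt, sorted by `start`. The real table is generated and far larger.
const TOY_TABLE: &[Run] = &[
    Run { start: 0xC0, len: 23, lower: true, upper: false }, // U+00C0..=U+00D6
    Run { start: 0xD8, len: 7, lower: true, upper: false },  // U+00D8..=U+00DE
    Run { start: 0xE0, len: 23, lower: false, upper: true }, // U+00E0..=U+00F6
];

fn lookup<const MAP_LOWER: bool>(c: char) -> bool {
    // ASCII never touches the table.
    if c.is_ascii() {
        return if MAP_LOWER { c.is_ascii_uppercase() } else { c.is_ascii_lowercase() };
    }
    let cp = c as u32;
    // Index of the first run whose start is greater than `cp`.
    let idx = TOY_TABLE.partition_point(|r| r.start <= cp);
    if idx == 0 {
        return false; // before the first run: no change
    }
    let run = TOY_TABLE[idx - 1];
    if cp - run.start >= run.len {
        return false; // falls in a gap between runs: no change
    }
    if MAP_LOWER { run.lower } else { run.upper }
}

pub fn changes_when_lowercased(c: char) -> bool {
    lookup::<true>(c)
}

pub fn changes_when_uppercased(c: char) -> bool {
    lookup::<false>(c)
}
```

Characters not covered by any run fall through to `false`, which is why "never changes" characters are cheap: they only pay for the search.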

I could probably get it faster than this but it's already overengineered enough.

The overall approach is to store a list of ranges in a table, and binary search that. The ranges might indicate:

- A run of chars that all have the same property values for those two properties (a range of chars which only change when uppercased, a range of chars which change for both, etc.). For example, U+00C0..=U+00D6 only change under lowercase (but not uppercase).
- A run of chars which alternate between upper, lower, upper, lower, ... (or similarly lower, upper, lower, upper, ...). This is very common in Unicode, and special-casing it is why there are only 200ish ranges (and not over 1000, most of them for a single character).
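
To make the two kinds of run concrete, here is a small sketch of how a range type plus an offset into the run could answer both queries. The `RangeKind` variants and the `changes` function are hypothetical names chosen for illustration. The PR's actual set of range types may differ.

```rust
// Hypothetical range kinds; five values fit comfortably in 3 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RangeKind {
    LowerOnly,     // every char changes only when lowercased
    UpperOnly,     // every char changes only when uppercased
    Both,          // every char changes under both mappings
    AltUpperFirst, // upper, lower, upper, lower, ...
    AltLowerFirst, // lower, upper, lower, upper, ...
}

/// Does the char at `offset` within a run of `kind` change under the
/// requested mapping? `map_lower == true` asks about lowercasing.
pub fn changes(kind: RangeKind, offset: u32, map_lower: bool) -> bool {
    // In an alternating run, is this slot occupied by an uppercase char?
    let is_upper_slot = |upper_first: bool| (offset % 2 == 0) == upper_first;
    match kind {
        RangeKind::LowerOnly => map_lower,
        RangeKind::UpperOnly => !map_lower,
        RangeKind::Both => true,
        // An uppercase char changes when lowercased; a lowercase one when uppercased.
        RangeKind::AltUpperFirst => is_upper_slot(true) == map_lower,
        RangeKind::AltLowerFirst => is_upper_slot(false) == map_lower,
    }
}
```

For instance, U+0100 and U+0101 start an upper, lower, upper, lower stretch, so `changes(RangeKind::AltUpperFirst, 0, true)` and `changes(RangeKind::AltUpperFirst, 1, false)` are both true.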

Runs with lengths that don't fit into 8 bits are split into multiple contiguous smaller ones, and then each run is encoded as a u32, as MSB[21 bit start_char | 3 bit range type | 8 bit length]LSB.
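
A minimal sketch of that bit layout, assuming the length field stores the number of chars in the run (it could just as well be stored differently). The `pack` and `unpack` helper names are made up for this example.

```rust
// Field widths from the layout above: 21 + 3 + 8 = 32 bits.
const LEN_BITS: u32 = 8;
const KIND_BITS: u32 = 3;

/// Pack a run into MSB[21 bit start | 3 bit kind | 8 bit len]LSB.
pub fn pack(start: char, kind: u8, len: u8) -> u32 {
    debug_assert!(u32::from(kind) < (1 << KIND_BITS));
    // `char` is at most 0x10FFFF, which fits in 21 bits.
    ((start as u32) << (KIND_BITS + LEN_BITS)) | (u32::from(kind) << LEN_BITS) | u32::from(len)
}

/// Split a packed entry back into (start code point, kind, len).
pub fn unpack(entry: u32) -> (u32, u8, u8) {
    let start = entry >> (KIND_BITS + LEN_BITS);
    let kind = ((entry >> LEN_BITS) & ((1 << KIND_BITS) - 1)) as u8;
    let len = (entry & 0xFF) as u8;
    (start, kind, len)
}
```

Because the start char sits in the most significant bits, packed entries for non-overlapping runs sort in the same order as their start chars, so the table can be binary searched as plain u32 values.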

This seems likely to work indefinitely since 21 bits can fit any char, and we don't use all the values for the 3 bit range type. That said, it probably won't have to change.

The generator uses a greedyish algorithm to categorize every character into a range. It then filters out ASCII and "no changes" ranges, splits them up (with minimal cleanup) and encodes each range into a u32.
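
A simplified sketch of the grouping, filtering, and splitting steps, under stated assumptions: it only merges neighbours with identical property values (the alternating-run detection is left out), and `Props`, `Run`, and `build_runs` are hypothetical names rather than anything from tabgen.

```rust
/// (changes_when_lowercased, changes_when_uppercased) for one char.
pub type Props = (bool, bool);

#[derive(Debug, PartialEq, Eq)]
pub struct Run {
    pub start: u32,
    pub len: u8,
    pub props: Props,
}

/// Greedily group consecutive code points (starting at `first`) that share
/// the same property values, drop "no changes" runs, and split anything
/// longer than 255 chars into contiguous pieces that fit an 8-bit length.
pub fn build_runs(first: u32, props: &[Props]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < props.len() {
        let p = props[i];
        // Extend the run while the next char has the same properties.
        let mut j = i + 1;
        while j < props.len() && props[j] == p {
            j += 1;
        }
        if p != (false, false) {
            let mut start = first + i as u32;
            let mut remaining = (j - i) as u32;
            while remaining > 0 {
                let chunk = remaining.min(u32::from(u8::MAX));
                runs.push(Run { start, len: chunk as u8, props: p });
                start += chunk;
                remaining -= chunk;
            }
        }
        i = j;
    }
    runs
}
```

Each resulting run would then be packed into a single u32 for the table.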

I started this a while ago, then forgot, then came back and banged it all out this weekend. It's a problem space near/dear to my heart though: I've spent an unreasonable amount of time on making Unicode tables better.

Re: #3 (comment)

> Yeah, as I said, as long as it doesn't blow up the size, it should be fine

Well, the size impact of this on generated binary is very small, and it runs fast too. But it's rather high complexity in the table generator. So the size impact in this repo is high. I don't mind owning/maintaining that, and I'm happy to get the code more cleaned up if you want, but I wouldn't be offended if you don't.

FWIW, it only relies on unicode stable promises (like the number of bits in a codepoint), and handles cases that could change (8 len bits not being enough). That means in theory future unicode updates should not come with any drama.

It's unfortunate that essentially every crate (std, regex, us) has to ship its own version of unicode data though.

My general feeling is that with enough elbow grease you can get any of the unicode tables small. It's a compression (and data access) problem, just not a very well-studied one for whatever reason. I had a scratch workspace that could produce all tables regex needed with ~40kb data. It would have required a lot of changes to actually use, so I put it in my own regex engine, which never saw the light of day. So it goes.

Still, it's not ideal that everything has its own tables, but I don't see an alternative really. I don't want to load them from the system, and stuff like icu4x feels like the wrong choice for code which cares about footprint.

¹ Like, it's a huge mess aside from proving the approach (the code has tons of debugging stuff and duplication, etc). I'd clean it up a lot if you were interested.

thomcc

went through and pointed at a set of "yep thats completely clownshoes" things.

just to save you the time from actually pointing out the issues in the code.

thomcc · 2023-10-11T01:28:15Z

.github/workflows/rust.yml

@@ -48,17 +48,13 @@ jobs:
          args: -- --check

  clippy:
-    name: Clippy


I don't even remember touching this stuff, but I guess I did.

thomcc · 2023-10-11T01:28:52Z

tabgen/README.md

@@ -0,0 +1,36 @@
+# Table generator for cow-utils-rs


this documentation is probably wrong or incomplete. i'll clean it up later.

thomcc · 2023-10-11T01:29:21Z

src/case/search.rs

@@ -0,0 +1,61 @@
+pub(super) fn changes_when_casemapped_nonascii<const MAP_LOWER: bool>(


There are two copies of this file (the other is in tabgen/src/search.rs) just to shut rust-analyzer up. It's absolutely not needed.

thomcc · 2023-10-11T01:30:22Z

src/case/mod.rs

@@ -0,0 +1,56 @@
+mod search;


all this stuff can be much simplified if i give up on trying to reuse the search.rs file in both the generator and generatee (which is worth doing). like, that's the reason it's 3 files here, it should just be 1.

thomcc · 2023-10-11T01:31:27Z

tabgen/src/main.rs

+        "#[cfg(test)]\npub(super) const UNICODE_VERSION: (u8, u8, u8) = ({}, {}, {});\n",
+        genver.0, genver.1, genver.2,
+    );
+    let _ = std::fs::write("table.rs", &file);


this is hacky too.

thomcc · 2023-10-11T01:33:59Z

tabgen/src/gen.rs

@@ -0,0 +1,495 @@
+//! The basic idea is that we segment codepoints into one of a,


Yep, this one is wrong too. Also, I'm aware there's too much redundant stuff in this file; it can be simplified.

RReverser · 2023-10-11T02:40:18Z

> and stuff like icu4x feels like the wrong choice for code which cares about footprint

Oh yeah, so I actually did try to use icu-properties for this a couple of days ago, but somehow it made the code even slower than it is right now (with the manual changes_when_uppercased implementation).

I'm going to try and review this soon-ish, but, as you said, it's a lot of code so might take some time.

Meanwhile, could you post benchmarks for before/after this change? I wonder if it actually speeds things up.

thomcc added 2 commits October 10, 2023 18:24

wip

a642c48

code's a nightmare but works perfectly and much faster

3250603
