
Feature request: Better conversions between Utf32String and Utf32Str #50

Open

bugnano opened this issue Sep 6, 2024 · 2 comments

@bugnano commented Sep 6, 2024

To maximise matching performance, I have a Vec<Utf32String> that I want to match against.
It's fairly big (hundreds of thousands of records).
For matching I do something like:

    let mut scores: Vec<(Utf32String, usize, u32, usize)> = entries
        .iter()
        .enumerate()
        .filter_map(|(i, file_name)| {
            let score = pattern.score(file_name.slice(..), matcher);

            score.map(|score| (file_name.clone(), i, score, file_name.len()))
        })
        .collect();

So that I can sort it like so:

    scores.sort_by(|(_file1, i1, score1, len1), (_file2, i2, score2, len2)| {
        score2.cmp(score1).then(len1.cmp(len2)).then(i1.cmp(i2))
    });

And finally, construct a new Vec<Utf32String> with the filtered result:

        scores
            .iter()
            .map(|(file_name, _i, _score, _len)| file_name.clone())
            .collect();

Notice that I do file_name.clone() twice.

Now, to avoid at least 1 clone, I was thinking of refactoring the scoring vector like so:

    let mut scores: Vec<(Utf32Str, usize, u32, usize)> = entries
        .iter()
        .enumerate()
        .filter_map(|(i, file_name)| {
            let slice = file_name.slice(..);
            let score = pattern.score(slice, matcher);

            score.map(|score| (slice, i, score, file_name.len()))
        })
        .collect();

So far so good, but then I found no way of converting a Utf32Str into a Utf32String for the final Vec<Utf32String>.

It would be nice to have, first of all, an AsRef<Utf32Str> impl for Utf32String, so that I don't have to call .slice(..). Next, it would be awesome to have a From<Utf32Str> impl for Utf32String, so that I can do something like Utf32String::from(file_name) when building the final Vec<Utf32String>.
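
Just to sketch what I have in mind (this impl does not exist in nucleo today; it only relies on the chars() accessor of Utf32Str and the existing From<&str> impl of Utf32String, and going through a String re-runs grapheme segmentation, so it isn't guaranteed to be an exact round-trip):

    // Hypothetical impl (not part of nucleo today): rebuild an owned
    // Utf32String from a borrowed Utf32Str by collecting its chars into a
    // String and re-converting.
    impl<'a> From<Utf32Str<'a>> for Utf32String {
        fn from(slice: Utf32Str<'a>) -> Self {
            let s: String = slice.chars().collect();
            // Re-conversion re-runs grapheme segmentation, so this is only
            // suitable when an exact round-trip is not required.
            Utf32String::from(s.as_str())
        }
    }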

@alexrutar (Contributor)

As far as I understand the codebase, I don't think AsRef is sensible to implement. The problem with AsRef is that the trait implementation would have to look something like

impl<'a> AsRef<Utf32Str<'a>> for Utf32String {
    // Required method
    fn as_ref(&self) -> &Utf32Str<'a> {
        &self.slice(..)
    }
}

but this does not work, since the Utf32Str created by slice(..) is a temporary that is immediately dropped, so the returned reference would dangle.
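
What does work is handing out the borrowed view by value rather than behind a reference, which is exactly what slice(..) already does; a hypothetical convenience method inside the crate would just be a thin wrapper:

// Hypothetical method (not part of nucleo): because Utf32Str is returned
// by value, there is no temporary for a reference to dangle on.
impl Utf32String {
    pub fn as_utf32_str(&self) -> Utf32Str<'_> {
        self.slice(..)
    }
}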

More generally though, it seems to me that you are treating Utf32String and Utf32Str as literal string types, and (despite the suggestive names) this is not really correct in my opinion. This is described in the documentation (though it should certainly be clarified more): Utf32String internally stores only the first char of each grapheme cluster, which means the conversion String -> Utf32String is lossy. So it only really makes sense to treat a Utf32String as a sort of "match object".

Here is an example straight from UAX #29:

use nucleo_matcher::Utf32String;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "क्षि";
    let utf32_s = Utf32String::from(s);

    // Each extended grapheme cluster, together with the chars it contains:
    for g in s.graphemes(true) {
        println!("{g} {:?}", g.chars());
    }
    // Only the first char of each cluster is kept in the Utf32String:
    for ch in utf32_s.slice(..).chars() {
        println!("{ch}");
    }
}

In my opinion, the fact that Utf32String implements Display is semantically wrong: at least to me, it is better to think of it as an "internal String-like matching object". Maybe @pascalkuthe has a good reason for this?

Unfortunately, rendering of grapheme clusters is not handled correctly by many terminals anyway, so actually doing things properly in your own code does not guarantee that things will go as expected...

In your code examples, I would use your own types and your own Display implementation, and never use Utf32String inside your own code. In other words, having a Vec<Utf32String> in the first place is (in my opinion) quite weird. I think in most cases the only time you should ever touch a Utf32String is in the closure passed to the Injector::push method.
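
For concreteness, here is a rough sketch of that pattern. It assumes nucleo's high-level API, in particular that Injector::push takes the value plus a closure receiving the value and the column buffer to fill; treat the exact signature as an assumption and check the docs:

// Rough sketch: `nucleo_instance` is assumed to be an already-configured
// nucleo::Nucleo<String>, and `file_names` an iterator of owned Strings.
let injector = nucleo_instance.injector();
for file_name in file_names {
    injector.push(file_name, |name: &String, columns: &mut [Utf32String]| {
        // The only place application code touches Utf32String: converting
        // the owned String once, at the injection boundary.
        columns[0] = Utf32String::from(name.as_str());
    });
}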

I hope this is helpful!

@bugnano (Author) commented Nov 11, 2024

It's somewhat helpful, but I'm not satisfied with it.

The conversion String -> Utf32String is technically lossy, but for my use case I don't really care.

What I was asking for is for the Utf32String -> Utf32Str relationship to be consistent with the one between String and str, OsString and OsStr, and so on. That means, for example, that if a function expects &str as its parameter, I can pass a &String, because String dereferences to str and implements AsRef<str>.

Now, if a function expects Utf32Str, I expect to be able to pass it a reference to Utf32String, which is not the case today.
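
For illustration, this is the standard-library behaviour I mean; both calls below accept an owned String thanks to the std impls, and this is the ergonomics I would like Utf32String/Utf32Str to mirror:

    fn takes_str(s: &str) -> usize {
        s.len()
    }

    fn takes_as_ref(s: impl AsRef<str>) -> usize {
        s.as_ref().len()
    }

    fn main() {
        let owned = String::from("hello");
        // &String coerces to &str via Deref, and String implements AsRef<str>.
        takes_str(&owned);
        takes_as_ref(&owned);
    }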
