-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
quoting_style: Add support for non-UTF-8 bytes #6882
Conversation
GNU testsuite comparison:
|
is the increase of MSRV really necessary? |
Assuming we don't want to use/implement a non-std UTF-8 parser, yes (see the second paragraph of the top post). |
b002ff2
to
b34dd3b
Compare
@cakebaker ok with the MSRV bump ? |
@sylvestre yes, that's fine for me. |
I'm linking this PR to the issue I opened about it #6817 Lastly, you may want to take a look at my implementation here (which I refrained to open a PR for because of the MSRV bump) to see if we have some matching implementation details. I haven't checked your changes (i will when I get the time), but I think I remember my implementation was working, with 250 line changes to |
Whoops, sorry! I checked for issues and PRs, not sure how I missed #6817; this ended up taking me a while, so I maybe started working on this before it was filed. In any case, I definitely should have found it before filing a PR.
So it looks to me that your commit is analogous to my second commit in this patch series, 2f0072e. That commit is +401/-106, where over 200 of those new lines are additional unit tests, and another chunk is new comments on existing code, so comparable to yours in size. It looks like your commit doesn't implement returning literal bytes yet, basically getting things to the point where my I ran my unit test vectors with your branch, and as expected, your commit doesn't yet pass the "literal", "literal-show", "shell-show", or "shell-always-show" tests with non-unicode vectors, and fails the test vectors I add in e894e57. Other than that though, the tests do pass, so that's good news for the expected values I wrote. :) As for the implementation, the strategies are pretty similar, if different in exact implementation: iterate over the UTF-8 chunks, keep the existing behavior if it's valid, do something smarter if it's not. No obvious sources of performance difference between the two versions, though I didn't run any benchmarks (your version does have dynamic dispatch in a potential hot path for long non-unicode names, but neither of us were especially careful about avoiding copies, and that path has a bunch of string construction anyway, so very unlikely to be noticeable). Sorry again for not seeing the issue, but at least now we have two people proposing a similar approach! |
ok, too bad :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your implementation is achieved while mine is only an unfinished draft. Aside from the few comments, everything looks good to me 👍
b34dd3b
to
d91aac4
Compare
GNU testsuite comparison:
|
This new functionality is implemented, but not yet exposed here.
This exposes the non-UTF-8 functionality to callers. Support in `argument`, `spec`, and `wc` are implemented, as their usage is simple. A wrapper only returning valid unicode is used in `ls`, since proper handling of OsStrings there is more involved (outputs that escape non-unicode work now though).
d91aac4
to
43229ae
Compare
(first force push is just a rebase on main, second is the suggested changes) |
This adds the `os_str_as_bytes_lossy` function, for when we want infallible conversion across platforms, and improves the doc comments of similar functions to be more accurate and better formatted.
GNU testsuite comparison:
|
Assuming everyone is fine with the new commit, should be ready to go now. |
I'm fine with all the changes 👌. I like the new docstrings of helper functions |
@sylvestre It's good for both of us 👌 |
well done |
oh, i pressed "merge" too quickly |
This adds support for non-UTF-8 bytes in the quoting_style library on Unix platforms. This is necessary for proper support of non-unicode inputs in a few utilities, including
wc
,ls
, andprintf
(as of this PR,wc
should be good,ls
is in a much better state but will need some work to close the final gaps, andprintf
needs @andrewliebenow's #6812, which might conflict this this, but if so, should be a quick fix).The first commit bumps the MSRV, because we need access to Utf8Chunks, since we need to operate on strings and non-unicode bytes in the same OsString (namely, we need to be able to tell if something is invalid unicode, or valid unicode but a control character, and apply the appropriate escaping). Avoiding that would require implementing or using another UTF-8 parser.
The third commit fixes a preexisting bug that was in some sense independent of this patch set (multi-byte control characters weren't being handled properly), but it touches the same code so I'm including it.