quoting_style: Add support for non-UTF-8 bytes #6882

jtracey · 2024-11-23T01:37:27Z

This adds support for non-UTF-8 bytes in the quoting_style library on Unix platforms. This is necessary for proper support of non-unicode inputs in a few utilities, including wc, ls, and printf (as of this PR, wc should be good, ls is in a much better state but will need some work to close the final gaps, and printf needs @andrewliebenow's #6812, which might conflict with this, but if so, should be a quick fix).

The first commit bumps the MSRV, because we need access to Utf8Chunks, since we need to operate on strings and non-unicode bytes in the same OsString (namely, we need to be able to tell if something is invalid unicode, or valid unicode but a control character, and apply the appropriate escaping). Avoiding that would require implementing or using another UTF-8 parser.
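To illustrate the distinction described above, here is a minimal sketch (not the code from this PR) that uses `<[u8]>::utf8_chunks`, stable since Rust 1.79, to separate plain text, valid-but-control characters, and invalid bytes. The `Piece` enum and the `classify` function are hypothetical names introduced only for this example.

```rust
/// How one piece of a possibly non-UTF-8 name should be treated.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Piece {
    /// Valid UTF-8 that contains no control characters.
    Text(String),
    /// A valid Unicode character that is a control character.
    Control(char),
    /// Bytes that are not valid UTF-8.
    Invalid(Vec<u8>),
}

/// Split `bytes` into pieces by walking the UTF-8 chunks of the input.
fn classify(bytes: &[u8]) -> Vec<Piece> {
    let mut out = Vec::new();
    for chunk in bytes.utf8_chunks() {
        // `valid()` is the longest valid UTF-8 prefix of this chunk.
        let mut text = String::new();
        for c in chunk.valid().chars() {
            if c.is_control() {
                if !text.is_empty() {
                    out.push(Piece::Text(std::mem::take(&mut text)));
                }
                out.push(Piece::Control(c));
            } else {
                text.push(c);
            }
        }
        if !text.is_empty() {
            out.push(Piece::Text(text));
        }
        // `invalid()` holds the invalid bytes that ended the chunk, if any.
        if !chunk.invalid().is_empty() {
            out.push(Piece::Invalid(chunk.invalid().to_vec()));
        }
    }
    out
}
```

For the input `b"a\tb\xffc"`, this yields the text `a`, the control character `\t`, the text `b`, the invalid byte `0xff`, and the text `c`, so each kind of piece can be escaped by its own rule.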

The third commit fixes a preexisting bug that was in some sense independent of this patch set (multi-byte control characters weren't being handled properly), but it touches the same code so I'm including it.
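As background on why multi-byte control characters need care: the C1 controls (U+0080 to U+009F) are valid Unicode, but each takes two bytes in UTF-8, so code that assumes a control character is a single byte can mis-escape them. The sketch below is only an assumption about the general shape of such handling, not the fix made in this PR; `escape_controls` and its backslash-octal output format are chosen for illustration.

```rust
/// Replace every control character in `s` with backslash-octal escapes,
/// emitting one escape per UTF-8 byte of the character.
/// (Illustrative format; not taken from the PR.)
fn escape_controls(s: &str) -> String {
    let mut out = String::new();
    let mut buf = [0u8; 4];
    for c in s.chars() {
        if c.is_control() {
            // A control character may occupy more than one byte in UTF-8,
            // so every encoded byte is escaped, not just the first.
            for b in c.encode_utf8(&mut buf).bytes() {
                out.push_str(&format!("\\{b:03o}"));
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

With this sketch, `escape_controls("a\u{85}b")` produces `a\302\205b`, because U+0085 is encoded as the two bytes `0xC2 0x85`.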

github-actions · 2024-11-23T02:03:24Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

sylvestre · 2024-11-29T22:57:52Z

is the increase of MSRV really necessary?

jtracey · 2024-11-29T23:33:48Z

Assuming we don't want to use/implement a non-std UTF-8 parser, yes (see the second paragraph of the top post).

sylvestre · 2024-12-12T12:59:54Z

@cakebaker ok with the MSRV bump ?

cakebaker · 2024-12-12T13:03:45Z

@sylvestre yes, that's fine for me.

src/uu/ls/src/ls.rs

RenjiSann · 2024-12-17T11:43:50Z

I'm linking this PR to the issue I opened about it #6817
I support the bump of MSRV, as it grants us access to slice::utf8_chunks, which is a mandatory piece we don't want to bother re-implementing ourselves.

Lastly, you may want to take a look at my implementation here (which I refrained from opening a PR for because of the MSRV bump) to see if we have some matching implementation details. I haven't checked your changes (I will when I get the time), but I think I remember my implementation was working, with 250 line changes to quoting_style.rs instead of your 650.

jtracey · 2024-12-18T04:39:07Z

@RenjiSann

> I'm linking this PR to the issue I opened about it #6817 I support the bump of MSRV, as it grants us access to slice::utf8_chunks, which is a mandatory piece we don't want to bother re-implementing ourselves.
>
> Lastly, you may want to take a look at my implementation here (which I refrained from opening a PR for because of the MSRV bump) to see if we have some matching implementation details.

Whoops, sorry! I checked for issues and PRs, not sure how I missed #6817; this ended up taking me a while, so I maybe started working on this before it was filed. In any case, I definitely should have found it before filing a PR.

> I haven't checked your changes (I will when I get the time), but I think I remember my implementation was working, with 250 line changes to quoting_style.rs instead of your 650.

So it looks to me that your commit is analogous to my second commit in this patch series, 2f0072e. That commit is +401/-106, where over 200 of those new lines are additional unit tests, and another chunk is new comments on existing code, so comparable to yours in size. It looks like your commit doesn't implement returning literal bytes yet, basically getting things to the point where my ls usage is at with this PR.

I ran my unit test vectors with your branch, and as expected, your commit doesn't yet pass the "literal", "literal-show", "shell-show", or "shell-always-show" tests with non-unicode vectors, and fails the test vectors I add in e894e57. Other than that though, the tests do pass, so that's good news for the expected values I wrote. :)

As for the implementation, the strategies are pretty similar, if different in exact implementation: iterate over the UTF-8 chunks, keep the existing behavior if it's valid, do something smarter if it's not. No obvious sources of performance difference between the two versions, though I didn't run any benchmarks (your version does have dynamic dispatch in a potential hot path for long non-unicode names, but neither of us were especially careful about avoiding copies, and that path has a bunch of string construction anyway, so very unlikely to be noticeable).

Sorry again for not seeing the issue, but at least now we have two people proposing a similar approach!

sylvestre · 2024-12-18T09:06:17Z

ok, too bad :(
so, could you please work together to find the best solution ? :)

RenjiSann

Your implementation is complete, while mine is only an unfinished draft. Aside from the few comments, everything looks good to me 👍

src/uucore/src/lib/features/quoting_style.rs

src/uu/ls/src/ls.rs

src/uu/wc/src/wc.rs

github-actions · 2024-12-18T19:31:01Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

This new functionality is implemented, but not yet exposed here.

This exposes the non-UTF-8 functionality to callers. Support in `argument`, `spec`, and `wc` is implemented, as their usage is simple. A wrapper only returning valid unicode is used in `ls`, since proper handling of OsStrings there is more involved (outputs that escape non-unicode work now though).
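The idea of a wrapper that only returns valid unicode can be sketched roughly as follows. This is an assumption about the general approach, not the wrapper in `ls.rs`: `quote_literal_bytes` is a hypothetical stand-in for a byte-returning quoting function, and `quote_literal_lossy` converts its result with `String::from_utf8_lossy`.

```rust
/// Hypothetical byte-oriented quoting function: wraps the name in single
/// quotes and passes every byte through unchanged.
fn quote_literal_bytes(name: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(name.len() + 2);
    out.push(b'\'');
    out.extend_from_slice(name);
    out.push(b'\'');
    out
}

/// Wrapper for callers that can only work with `String`: invalid UTF-8
/// sequences are replaced with U+FFFD, so the result is lossy for such names.
fn quote_literal_lossy(name: &[u8]) -> String {
    String::from_utf8_lossy(&quote_literal_bytes(name)).into_owned()
}
```

In this sketch a style that already escapes invalid bytes would produce valid UTF-8 and lose nothing in the conversion, while a literal style loses the original bytes, which fits the remark that only the escaping outputs work in `ls` for now.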

jtracey · 2024-12-18T20:35:21Z

(first force push is just a rebase on main, second is the suggested changes)

src/uucore/src/lib/features/quoting_style.rs

This adds the `os_str_as_bytes_lossy` function, for when we want infallible conversion across platforms, and improves the doc comments of similar functions to be more accurate and better formatted.
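A rough sketch of what an infallible, cross-platform conversion of this kind could look like is given below. It is not copied from uucore, and the real `os_str_as_bytes_lossy` may differ in signature and behavior; the name `os_str_as_bytes_lossy_sketch` is used here to make that explicit. The assumption is that Unix can borrow the raw bytes directly, while other platforms fall back to a lossy UTF-8 conversion.

```rust
use std::borrow::Cow;
use std::ffi::OsStr;

/// On Unix an `OsStr` is arbitrary bytes, so they can be borrowed as-is.
#[cfg(unix)]
fn os_str_as_bytes_lossy_sketch(s: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(s.as_bytes())
}

/// Elsewhere, fall back to a lossy UTF-8 conversion, which replaces
/// anything that is not valid unicode with U+FFFD.
#[cfg(not(unix))]
fn os_str_as_bytes_lossy_sketch(s: &OsStr) -> Cow<'_, [u8]> {
    match s.to_string_lossy() {
        Cow::Borrowed(valid) => Cow::Borrowed(valid.as_bytes()),
        Cow::Owned(replaced) => Cow::Owned(replaced.into_bytes()),
    }
}
```

Returning a `Cow` lets the common case avoid an allocation while still giving every platform the same return type.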

github-actions · 2024-12-19T22:38:47Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

jtracey · 2024-12-19T22:44:01Z

Assuming everyone is fine with the new commit, should be ready to go now.

RenjiSann · 2024-12-20T08:28:08Z

I'm fine with all the changes 👌. I like the new docstrings of helper functions

RenjiSann · 2024-12-21T14:11:16Z

> ok, too bad :( so, could you please work together to find the best solution ? :)

@sylvestre It's good for both of us 👌

sylvestre · 2024-12-21T22:17:47Z

well done

sylvestre · 2024-12-21T22:18:43Z

oh, i pressed "merge" too quickly
would it be possible to add some tests in test_ls.rs & test_wc.rs ?
thanks

sylvestre force-pushed the quoting_style_bytes branch from b002ff2 to b34dd3b on December 2, 2024 09:00

sylvestre reviewed Dec 12, 2024

src/uu/ls/src/ls.rs

RenjiSann reviewed Dec 18, 2024

Bump MSRV to 1.79

cb3be5e

jtracey force-pushed the quoting_style_bytes branch from b34dd3b to d91aac4 on December 18, 2024 19:04

jtracey added 3 commits December 18, 2024 15:28

quoting_style: add support for non-unicode bytes

3551031

quoting_style: fix multi-byte control characters

2331600

jtracey force-pushed the quoting_style_bytes branch from d91aac4 to 43229ae on December 18, 2024 20:28

RenjiSann reviewed Dec 18, 2024

src/uucore/src/lib/features/quoting_style.rs

core: improve OsStr(ing) helpers

db1ed4c

sylvestre merged commit bb2fb66 into uutils:main Dec 21, 2024
62 checks passed

jtracey mentioned this pull request Dec 23, 2024

wc: fix escaping #6993

Merged

RenjiSann mentioned this pull request Jan 3, 2025

Handle non-UTF-8 in quoting style #6817

Closed
