[Feature request] Strings as byte arrays #938

Rangi42 · 2021-11-12T16:40:41Z

In most contexts, strings are already just byte sequences. String literals can contain any bytes (except for \0, which currently terminates the string, but using C++ std::string could avoid this). String functions like STRCAT and STRRPL operate on the bytes and do not care about encoding. Even print and println just send the bytes to stdout; things print as UTF-8 iff that is set as the console's locale.

The only functions that warn about strings which aren't valid UTF-8 are STRLEN and STRSUB. I think this is actually a mistake, and we should have STRLENUTF8 and STRSUBUTF8 if that behavior is desired.

You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.
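
To make the mismatch concrete, here is a small sketch in Python (not rgbasm) that contrasts the code-point count with the encoded byte count. It assumes the source is UTF-8 and that only the default charmap is active; `utf8_len` and `byte_len` are made-up helper names, not rgbasm functions.

```python
def utf8_len(s: str) -> int:
    """Count code points, which is what the UTF-8-aware STRLEN reports."""
    return len(s)

def byte_len(s: str) -> int:
    """Count the bytes emitted when the string is written out as UTF-8."""
    return len(s.encode("utf-8"))

s = "h\u00e9llo"  # "héllo", with the é written as an escape
print(utf8_len(s))                 # 5
print(byte_len(s))                 # 6
print(s.encode("utf-8").hex(" "))  # 68 c3 a9 6c 6c 6f
```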

If strings acted as byte arrays, and #885 allowed \0 bytes in strings, then #933 could implement a single READFILE function for both text and binary files. We would not need to implement numeric arrays (#67) just for that one use case (and given all the open questions about how arrays should behave, and the lack of string arrays anyway, I'd rather not have them.)

Changing the behavior of STRLEN and STRSUB would be a potentially breaking change, but I think it would be better than adding "STRBYTELEN" and "STRBYTESUB" functions, since UTF-8 encoding is the unusual special case. (Note that rgbds-struct's uses of STRLEN and STRSUB would all remain valid even if the definitions were changed, and hypothetical cases that would break should probably be using CHARLEN and CHARSUB anyway.)

(One other useful function would be STRBYTE(str, idx), to get the raw byte value at an index, without going through the charmap. That is, STRSUB("ABCD", 2, 1) and CHARSUB("ABCD", 2) return the string "B" which coerces to the number $42 if you haven't charmapped it; but STRBYTE("ABCD", 2) would return $42 directly.)
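
A rough Python model of the difference, assuming 1-based indexing as in the example above. `str_sub` and `str_byte` are illustrative names, and raising an error on an out-of-range index is an assumption; the proposal does not say what should happen in that case.

```python
def str_sub(data: bytes, pos: int, length: int) -> bytes:
    """Byte-wise substring, 1-indexed like STRSUB; the result is still a string."""
    return data[pos - 1 : pos - 1 + length]

def str_byte(data: bytes, idx: int) -> int:
    """Hypothetical STRBYTE: the raw byte value at a 1-based index."""
    if not 1 <= idx <= len(data):
        raise IndexError(f"index {idx} out of range for {len(data)} bytes")
    return data[idx - 1]

print(str_sub(b"ABCD", 2, 1))     # b'B'
print(hex(str_byte(b"ABCD", 2)))  # 0x42
```

The substring still has to be coerced to a number (through the charmap), whereas the byte lookup yields the number directly.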

(Another nice addition along with this would be to allow \0 as a way to put $00 bytes in strings. It can be inconvenient to have literal null bytes in a file, but all the others are fine.)

We would probably also want to get rid of the "Input string is not valid UTF-8!" warning in charmap.c, which I think is the only other place where UTF-8 encoding matters.


ISSOtm · 2021-11-12T22:25:18Z

> You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.

No, you wouldn't, because charmaps.

Rangi42 · 2021-11-12T22:27:49Z

> No, you wouldn't, because charmaps.

That's assuming there are no charmaps involved besides the default one, so STRLEN("{s}") == CHARLEN("{s}").
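
The charmap point can be illustrated with a simplified Python model. It assumes greedy longest-match lookup and a fallback to raw UTF-8 bytes for unmapped text; `char_map` and the sample mapping are hypothetical, and the model works on Python characters rather than on bytes as rgbasm does.

```python
def char_map(s: str, charmap: dict[str, int]) -> list[int]:
    """Convert a string to output values using greedy longest-match lookup.

    Text with no charmap entry falls back to its raw UTF-8 bytes.
    """
    out = []
    i = 0
    longest = max((len(k) for k in charmap), default=0)
    while i < len(s):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(s) - i), 0, -1):
            if s[i : i + n] in charmap:
                out.append(charmap[s[i : i + n]])
                i += n
                break
        else:
            # No mapping matched here: emit the character's UTF-8 bytes.
            out.extend(s[i].encode("utf-8"))
            i += 1
    return out

s = "h\u00e9llo"  # "héllo"
print(len(char_map(s, {})))                  # 6
print(len(char_map(s, {"\u00e9": 0x80})))    # 5
```

With no custom mappings the emitted count equals the UTF-8 byte count; once "é" is mapped to a single value, the emitted count no longer matches it.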

Rangi42 · 2021-11-13T16:04:12Z

Basically these are what I see as the four ways forward:

1. The status quo: we have strings for which STRLEN and STRSUB expect UTF-8 encoding. We're going to switch from leaky char *s to ref-counted struct Strings or RAII-with-smart-pointers std::strings to allow unlimited string lengths. We could also add a READFILE function to read the contents of a UTF-8 text file as a string, but can't handle binary files. One of the motivating use cases for even adding file-reading functions was to add an offset to the bytes of a tilemap, which would have to be binary, so I think we should try to allow that.
2. Add arrays/lists and a READBIN function to return an array for binary files, plus READFILE for strings for UTF-8 text files. Given the uncertainties and tradeoffs we ran into when considering how arrays would be implemented, and how major a feature it would be mostly just for the sake of enabling READBIN, I'd rather not do that.
3. Let READFILE return a string for binary files too. Define new functions STRBYTELEN, STRBYTESUB, and STRBYTE to get the length, substrings, and individual bytes from a string, without expecting any particular text encoding. This would still require us to allow $00 bytes in strings, but that's not a problem; it's feasible as long as we don't rely on string.h functions for algorithms (which neither the struct Strings nor std::strings need to do).
4. Let READFILE return a string for binary files too. Change STRLEN and STRSUB to not expect any particular text encoding, and add STRBYTE to get individual bytes from a string. Optionally add STRLENUTF8 and STRSUBUTF8 to allow the current behavior (which I do think is worthwhile, even though in most cases where you care, you should probably be using CHARLEN and CHARSUB).
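
The tilemap use case mentioned above can be sketched in Python under the byte-array model. `read_file` stands in for the proposed READFILE and `add_offset` is a made-up helper; wrapping at 8 bits is an assumption, since the thread does not say how overflow should be handled.

```python
import os
import tempfile

def read_file(path: str) -> bytes:
    """Hypothetical READFILE: return the file's contents as raw bytes."""
    with open(path, "rb") as f:
        return f.read()

def add_offset(data: bytes, offset: int) -> bytes:
    """Add `offset` to every byte, wrapping at 8 bits (an assumption)."""
    return bytes((b + offset) & 0xFF for b in data)

# Write a small stand-in tilemap that includes a $00 byte and a byte
# ($FF) that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes([0x00, 0x01, 0x7F, 0xFF]))
    path = tmp.name

try:
    tilemap = read_file(path)
    print(len(tilemap))                        # 4
    print(add_offset(tilemap, 0x80).hex(" "))  # 80 81 ff 7f
finally:
    os.remove(path)
```

Neither the $00 byte nor the $FF byte needs special treatment here, which is the property options 3 and 4 rely on.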

I could certainly be missing an even better fifth way of allowing users to access binary file contents, so here or #933 is fine for discussing that (or #67 if arrays are the preferred solution).

aaaaaa123456789 · 2021-11-14T02:57:59Z

I'd say #3 is the best option by far.

Rangi42 · 2021-11-14T03:02:09Z

Hm, I would somewhat prefer 4 since I expect (a) non-UTF-8-specific would be the more common use case, and (b) hopefully few/no users are depending on UTF-8 STRLEN and STRSUB so far; but either would be fine with me.

aaaaaa123456789 · 2021-11-14T03:04:13Z

STRLEN and STRSUB have to expect some encoding; there's no meaningful concept of "string length" without one. The encoding where every byte encodes itself is an encoding (ISO-8859-1).

Rangi42 · 2021-11-14T03:08:34Z

Option 4 would make STRLEN behave like C's strlen (except without needing the $00 terminator after we finish PR #885, i.e. STRLEN would return the struct String's size value). STRSUB would likewise act like taking a segment of a char[] array. Neither of those cares about the encoding; the string is just an array of bytes.

True, ISO-8859-1 is an encoding that has single-byte characters, but it's not the only one. And the rgbasm language would not be taking a position on which Unicode code points go with which byte values in strings. So I don't think of option 4 as "switch from UTF-8 to ISO-8859-1", but "switch from UTF-8 to arbitrary unsigned byte values". Even charmaps don't really care about Unicode; the character set only becomes relevant when you print things, and that's up to your console. (Also ISO-8859-1 does not define characters for 00-1F or 7F-9F.)
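
A minimal Python sketch of the contrast between a terminator-based length and a stored size. All three helper names are invented for illustration, and `bytes` merely stands in for a length-carrying string type.

```python
def c_strlen(buf: bytes) -> int:
    """Bytes before the first $00, which is how C's strlen measures a string.

    If there is no $00, this sketch returns the whole length (real C would
    read past the end of the buffer).
    """
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

def byte_strlen(buf: bytes) -> int:
    """Stored size of the byte array; a $00 byte is counted like any other."""
    return len(buf)

def byte_strsub(buf: bytes, pos: int, length: int) -> bytes:
    """1-indexed slice of the byte array, with no decoding step."""
    return buf[pos - 1 : pos - 1 + length]

raw = b"AB\x00CD\xff"  # contains a $00 and a byte that is invalid as UTF-8
print(c_strlen(raw))           # 2
print(byte_strlen(raw))        # 6
print(byte_strsub(raw, 4, 2))  # b'CD'
```

None of these functions decodes anything, so no encoding is assumed.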

ISSOtm · 2022-03-28T12:53:51Z

Given https://hsivonen.fi/string-length, I'm for option 4 as well.
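
For context, the following Python snippet shows how a single visible emoji has a different "length" depending on the unit being counted. The code points are spelled out as escapes, and the choice of this particular emoji sequence is only for illustration.

```python
# Facepalm + skin tone modifier + zero-width joiner + male sign + variation selector
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s.encode("utf-8")))           # 17 (UTF-8 bytes)
print(len(s.encode("utf-16-le")) // 2)  # 7 (UTF-16 code units)
print(len(s))                           # 5 (Unicode code points)
```

Since no single count is the obviously right one, counting bytes is at least simple and predictable.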

Rangi42 added the enhancement and rgbasm labels Nov 12, 2021

Rangi42 added this to the v0.5.3 milestone Nov 12, 2021

Rangi42 mentioned this issue Dec 21, 2021

[Feature request] numeric escape sequences in strings #962

Closed

Rangi42 removed this from the v0.9.0 milestone Aug 6, 2024
