Feature request: More UTF-8 string functions #13

ghost · 2016-06-14T10:26:39Z

Would it be possible to add a few more utility functions for dealing with UTF-8 text?

These are some of the functions that already exist in SDL_FontCache:

U8_strlen
U8_charsize
U8_charcpy
U8_next
U8_strinsert
U8_strdel

I would love to also see something like this:

U8_mid
U8_left
U8_right

U8_upper
U8_lower

These kinds of functions can be found in a lot of programming languages, and it should be fairly obvious what they do. As there is no way to do this with UTF-8 text in C or C++ by default, I think this would be a great additional feature for SDL_FontCache.

I have no idea how difficult it is to convert between upper and lower case UTF-8, but I think at least the mid, left and right functions should not require too much code since you already have functions such as U8_next.

grimfang4 · 2016-06-14T14:13:39Z

left() and right() are special cases of mid(), so to keep the API lean, I'd just implement a kind of substring function like mid().
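To illustrate the point, here is a minimal sketch of a code-point-based mid() with left() and right() layered on top of it. The names u8_seq_len, u8_length, u8_mid, u8_left and u8_right are hypothetical and not part of SDL_FontCache, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// lead byte c. Assumes valid UTF-8; an unexpected byte is stepped over as 1.
static std::size_t u8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Number of code points in s.
std::size_t u8_length(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n)
        i += u8_seq_len(static_cast<unsigned char>(s[i]));
    return n;
}

// Up to `count` code points of s, starting at code point index `start`.
std::string u8_mid(const std::string& s, std::size_t start, std::size_t count)
{
    std::size_t i = 0;
    // Skip `start` code points to find the first byte of the result.
    for (std::size_t n = 0; n < start && i < s.size(); ++n)
        i += u8_seq_len(static_cast<unsigned char>(s[i]));
    if (i > s.size()) i = s.size();    // guard against a truncated tail
    const std::size_t begin = i;
    // Walk over `count` more code points to find the end of the result.
    for (std::size_t n = 0; n < count && i < s.size(); ++n)
        i += u8_seq_len(static_cast<unsigned char>(s[i]));
    if (i > s.size()) i = s.size();
    return s.substr(begin, i - begin);
}

// left() and right() are thin wrappers around mid().
std::string u8_left(const std::string& s, std::size_t count)
{
    return u8_mid(s, 0, count);
}

std::string u8_right(const std::string& s, std::size_t count)
{
    const std::size_t len = u8_length(s);
    return u8_mid(s, len > count ? len - count : 0, count);
}
```

With this, u8_mid("Привет, мир", 0, 6) gives "Привет" and u8_right("Привет, мир", 3) gives "мир". Indices count code points, not bytes or user-perceived characters, so combining marks would still be split from their base letters.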

upper() and lower() are tough ones because the usual locale handling is platform-specific (e.g. setlocale()'s support for UTF-8 is not guaranteed). I'd probably defer to other libraries which specifically handle Unicode, unfortunately.

ghost · 2016-06-14T17:00:56Z

Yeah okay, that sounds reasonable I guess. It would have been great for me personally to have all these functions in one place so I would not have to include a bunch of different files every time I work with UTF-8 text, but I also understand that you do not want to bloat your project too much with these kinds of things.

As for the cases, I have been using this myself to convert Russian UTF-8 text between lower and upper case:

// upper case russian letter D
const char* upper = "Д";
int codepoint = FC_GetCodepointFromUTF8(&upper,0);

// add 32 to get codepoint for lower case d
codepoint += 32;

char lower[5];
FC_GetUTF8FromCodepoint(lower,codepoint);

std::cout << "Upper: " << upper << " Lower: " << lower << std::endl;

To get the upper case, you would subtract 32 instead. I guess this only works for plain English and Russian letters, though.
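For anyone who wants to try the offset trick with explicit range checks, here is a rough sketch. It uses its own small decoder and encoder (u8_decode and u8_append, hypothetical names) instead of the SDL_FontCache functions, so the arithmetic is done on true Unicode code points; that is an assumption of this sketch, as is valid UTF-8 input. The +32 offset holds for A-Z and for Cyrillic А-Я (U+0410 to U+042F), but Ё (U+0401) already needs +80, so even Russian text is not fully covered by a single offset.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one code point at s[i] and advance i.
// Assumes valid UTF-8 input.
static char32_t u8_decode(const std::string& s, std::size_t& i)
{
    const unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t cp = c & (0x3F >> extra);                   // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: append cp to out as UTF-8.
static void u8_append(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Offset-based lowercase mapping, limited to ranges where it is known to hold.
static char32_t lower_basic(char32_t cp)
{
    if (cp >= U'A' && cp <= U'Z') return cp + 32;      // ASCII A-Z
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 32;  // Cyrillic А-Я
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 80;  // Cyrillic Ѐ-Џ, includes Ё
    return cp;                                         // everything else unchanged
}

// Lowercase a UTF-8 string using only the ranges above.
std::string u8_lower_basic(const std::string& s)
{
    std::string out;
    std::size_t i = 0;
    while (i < s.size())
        u8_append(out, lower_basic(u8_decode(s, i)));
    return out;
}
```

For example, u8_lower_basic("ЁЛКА Tree") returns "ёлка tree". Letters outside the three listed ranges pass through untouched, so this is a stopgap rather than general Unicode case conversion.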

grimfang4 · 2016-06-15T02:14:27Z

I'm not sure about the specification of Unicode and if it always separates upper/lower pairs by 32 for all languages. That'd be really great if it did! One clear exception to that and a few other rules is the German ß (though not in all German locales), which has no actual capitalization but can sometimes be rendered in caps as SS. That is a single character that would map to two. :-/
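One consequence of the ß case is that an uppercase function cannot have a "one code point in, one code point out" signature. A tiny sketch of what that could look like, working on UTF-32 strings to keep it short; upper_basic and u32_upper_basic are hypothetical names, and only ASCII, basic Cyrillic and ß are handled:

```cpp
#include <string>

// Uppercase a single code point. The result is a string because some
// mappings expand: ß (U+00DF) becomes the two letters "SS".
std::u32string upper_basic(char32_t cp)
{
    if (cp == 0x00DF) return U"SS";                   // one-to-two mapping
    if (cp >= U'a' && cp <= U'z')                     // ASCII a-z
        return std::u32string(1, static_cast<char32_t>(cp - 32));
    if (cp >= 0x0430 && cp <= 0x044F)                 // Cyrillic а-я
        return std::u32string(1, static_cast<char32_t>(cp - 32));
    return std::u32string(1, cp);                     // everything else unchanged
}

// Uppercase a whole UTF-32 string; the result may be longer than the input.
std::u32string u32_upper_basic(const std::u32string& s)
{
    std::u32string out;
    for (char32_t cp : s)
        out += upper_basic(cp);
    return out;
}
```

Here u32_upper_basic(U"stra\u00DFe") yields U"STRASSE", one code point longer than the input, so callers cannot convert in place into a buffer of the original size.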

Regardless, if someone did contribute a robust capitalization scheme, even just for a couple of locales, I'd put it in and accept contributions to fill it out further.
