Feature request: More UTF-8 string functions #13

Open
ghost opened this issue Jun 14, 2016 · 3 comments

Comments

@ghost

ghost commented Jun 14, 2016

Would it be possible to add a few more utility functions for dealing with UTF-8 text?

These are some of the functions that already exist in SDL_FontCache:

U8_strlen
U8_charsize
U8_charcpy
U8_next
U8_strinsert
U8_strdel

I would love to also see something like this:

U8_mid
U8_left
U8_right

U8_upper
U8_lower

These kinds of functions can be found in a lot of programming languages, and it should be fairly obvious what they do. As there is no way to do this with UTF-8 text in C or C++ by default, I think this would be a great additional feature for SDL_FontCache.

I have no idea how difficult it is to convert between upper and lower case UTF-8, but I think at least the mid, left and right functions should not require too much code since you already have functions such as U8_next.

@grimfang4
Owner

left() and right() are special cases of mid(), so to keep the API lean, I'd just implement a kind of substring function like mid().
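To illustrate the point about left() and right() reducing to mid(), here is a minimal sketch of a code-point-based substring. The names u8_seq_len, u8_length, u8_mid, u8_left and u8_right are hypothetical and are not part of SDL_FontCache. The sketch assumes well-formed UTF-8 and counts code points, not user-perceived characters, so combining sequences may be split.

```cpp
#include <cstddef>
#include <string>

// Byte length of the UTF-8 sequence introduced by lead byte `c`.
// Assumes valid input; unexpected bytes are treated as length 1.
static std::size_t u8_seq_len(unsigned char c) {
    if (c < 0x80) return 1;          // 0xxxxxxx (ASCII)
    if ((c >> 5) == 0x06) return 2;  // 110xxxxx
    if ((c >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((c >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Number of code points in `s`.
std::size_t u8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();
         i += u8_seq_len(static_cast<unsigned char>(s[i])))
        ++n;
    return n;
}

// Returns up to `count` code points starting at code point index `start`.
std::string u8_mid(const std::string& s, std::size_t start, std::size_t count) {
    std::size_t i = 0;
    // Skip the first `start` code points.
    for (std::size_t n = 0; n < start && i < s.size(); ++n)
        i += u8_seq_len(static_cast<unsigned char>(s[i]));
    if (i > s.size()) i = s.size();  // guard against a truncated final sequence
    const std::size_t begin = i;
    // Walk over the next `count` code points.
    for (std::size_t n = 0; n < count && i < s.size(); ++n)
        i += u8_seq_len(static_cast<unsigned char>(s[i]));
    if (i > s.size()) i = s.size();
    return s.substr(begin, i - begin);
}

// left() and right() expressed through mid().
std::string u8_left(const std::string& s, std::size_t count) {
    return u8_mid(s, 0, count);
}

std::string u8_right(const std::string& s, std::size_t count) {
    const std::size_t len = u8_length(s);
    return u8_mid(s, len > count ? len - count : 0, count);
}
```

With these definitions, u8_mid("Привет, мир", 2, 3) would yield "иве", and u8_right on the same string with a count of 3 would yield "мир".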

upper() and lower() are tough ones because the usual locale handling is platform-specific (e.g. setlocale()'s support for UTF-8 is not guaranteed). I'd probably defer to other libraries which specifically handle Unicode, unfortunately.

@ghost
Author

ghost commented Jun 14, 2016

Yeah okay, that sounds reasonable, I guess. It would have been great for me personally to have all these functions in one place, so I would not have to include a bunch of different files every time I work with UTF-8 text, but I also understand that you do not want to bloat your project too much with these kinds of things.

As for the cases, I have been using this myself to convert Russian UTF-8 text between lower and upper case:

// Upper-case Russian letter Д
const char* upper = "Д";
int codepoint = FC_GetCodepointFromUTF8(&upper, 0);

// Add 32 to get the codepoint for lower-case д
codepoint += 32;

char lower[5];
FC_GetUTF8FromCodepoint(lower, codepoint);

std::cout << "Upper: " << upper << " Lower: " << lower << std::endl;

And to get the upper case you would just do -= 32 instead. But I guess this only works for plain English or Russian UTF-8 letters/characters.
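For anyone who wants to try the offset idea without SDL_FontCache, here is a rough, standalone sketch that works on Unicode code points decoded by hand. The helpers u8_decode, u8_append, simple_lower and u8_lower_basic are made-up names, and the sketch makes no assumption about what numeric values the FC_ functions return. It only handles ASCII A to Z and basic Cyrillic А to Я (U+0410 to U+042F), where the pairs are 32 apart. It also shows that the offset is not universal: Ё (U+0401) and ё (U+0451) are 80 apart, and pairs such as Ā (U+0100) and ā (U+0101) are 1 apart, so they are left unchanged here.

```cpp
#include <cstddef>
#include <string>

// Decode the UTF-8 sequence starting at s[i] and advance i past it.
// Assumes well-formed input.
char32_t u8_decode(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t cp = c & (0x3F >> extra);                  // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append code point `cp` to `out` as UTF-8.
void u8_append(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Lower-case mapping for a deliberately tiny set of ranges.
char32_t simple_lower(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') return static_cast<char32_t>(cp + 0x20);
    if (cp >= 0x0410 && cp <= 0x042F) return static_cast<char32_t>(cp + 0x20);  // А..Я
    if (cp == 0x0401) return 0x0451;  // Ё -> ё: offset is 80, not 32
    return cp;                        // everything else is left untouched
}

std::string u8_lower_basic(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();)
        u8_append(out, simple_lower(u8_decode(s, i)));
    return out;
}
```

A real implementation would more likely be driven by the Unicode case-mapping tables than by hand-written ranges like these.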

@grimfang4
Owner

I'm not sure about the specification of Unicode and if it always separates upper/lower pairs by 32 for all languages. That'd be really great if it did! One clear exception to that and a few other rules is the German (but not for all German locales) ß, which has no actual capitalization, but can sometimes be rendered in caps as SS. This is a single character that would map to 2. :-/
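The ß case suggests that a capitalization function cannot have a "one code point in, one code point out" signature. A small sketch of what a one-to-many mapping might look like follows. The names upper_full and to_upper are hypothetical, the input is taken as already-decoded UTF-32 to keep the example short, and only ASCII, basic Cyrillic а to я and ß are covered. The ß to SS mapping follows Unicode's default full upper-case mapping; locale-specific tailoring is not attempted.

```cpp
#include <string>

// Upper-case mapping that may expand to more than one code point.
std::u32string upper_full(char32_t cp) {
    if (cp == 0x00DF) return U"SS";  // ß -> SS (one code point becomes two)
    if (cp >= U'a' && cp <= U'z')
        return std::u32string(1, static_cast<char32_t>(cp - 0x20));
    if (cp >= 0x0430 && cp <= 0x044F)  // а..я
        return std::u32string(1, static_cast<char32_t>(cp - 0x20));
    return std::u32string(1, cp);      // unknown: pass through unchanged
}

// Upper-case a whole string; the result can be longer than the input.
std::u32string to_upper(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) out += upper_full(cp);
    return out;
}
```

Because the output can grow, callers cannot convert in place into a fixed-size buffer sized from the input, which is one reason returning a new string seems like the safer API shape here.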

Regardless, if someone did contribute a robust capitalization scheme, even just for a couple of locales, I'd put it in and accept contributions to fill it out further.
