[Feature Request] Unicode support #24
Comments
Could D. Bunzli's Uutf be used to iterate over Unicode characters? That might also help to parametrize over the input stream (string, bigarray, stream of strings, etc.) for #20 ...
The main obstacle to making the implementation generic is that it is table-based. This works well when there are only 256 possible characters, but does not scale to the more than one million Unicode code points... One thing that should work is to translate regular expressions defined in terms of Unicode code points into regular expressions defined in terms of bytes, and then match UTF-8 strings byte by byte.
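The translation described above can be sketched in a few lines. This is only an illustration, not code from the library: `utf8_encode` and `split_by_length` are hypothetical helper names, and only the standard library is assumed. The first function turns one code point into the byte sequence a byte-level matcher would have to recognize; the second cuts a code-point range into sub-ranges whose members all encode to the same number of bytes, which is the usual first step before expanding a range into byte-level alternatives.

```ocaml
(* Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes. *)
let utf8_encode (cp : int) : string =
  let b = Buffer.create 4 in
  let add i = Buffer.add_char b (Char.chr i) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_encode: not a Unicode scalar value";
  if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end
  else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end
  else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b

(* Hypothetical helper: split the inclusive range [lo, hi] into sub-ranges
   whose code points all have the same UTF-8 length (1, 2, 3 or 4 bytes).
   Surrogates are not treated specially here; a real translation would
   have to exclude 0xD800..0xDFFF. Requires OCaml 4.08+ (List.filter_map). *)
let split_by_length ((lo, hi) : int * int) : (int * int) list =
  let classes =
    [ (0x0, 0x7F); (0x80, 0x7FF); (0x800, 0xFFFF); (0x10000, 0x10FFFF) ]
  in
  List.filter_map
    (fun (a, b) ->
      let lo' = max lo a and hi' = min hi b in
      if lo' <= hi' then Some (lo', hi') else None)
    classes
```

With such helpers, a single code point becomes an ordinary byte-string literal for the existing 256-entry tables, and a range becomes an alternation of byte-level sub-expressions, one group per encoded length. Expanding each sub-range into exact byte ranges is the harder part and is left out of this sketch.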
Any hope of having Unicode supported soon?
I don't think @nojb or anyone else is working on it right now, but that could change if someone were motivated. ;)
Surprising that it still hasn't been implemented.
Someone needs to do it, and it's hard™ 🙂
As far as I understand from the discussion in #48, the implementation there is viable and could be used as a basis for further work. I can rebase that PR against the current master, but unfortunately I am rather overloaded at the moment, so I cannot commit to doing the "further work" that may be necessary to get it integrated.
At a glance, this whole library seems like a very well-thought-out piece of software (limited scope, defined solution). Unfortunately, it does not support Unicode right now, and Unicode should be the standard in this millennium. So here is my proposal: instead of using chars and strings exclusively, abstract the library over the concrete code-point and input representations. Then someone (me) could extend the library by providing suitable Unicode support. I understand that this kind of abstraction might cause some performance regressions, but it would open up a whole batch of new use cases.
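To make the proposal concrete, here is a minimal sketch of what abstracting over the input representation could look like. None of these names exist in the library: `INPUT`, `decode`, `Latin1`, `Utf8` and `Make` are all assumptions made for illustration, and the UTF-8 decoder assumes well-formed input and does no validation.

```ocaml
(* Hypothetical signature: how a matcher could read code points from an
   input without knowing its concrete encoding. *)
module type INPUT = sig
  type t
  val length : t -> int
  (* [decode x pos] returns the code point starting at byte offset [pos]
     together with the number of bytes it occupies. *)
  val decode : t -> int -> int * int
end

(* One byte per character: the behaviour of a char-based engine. *)
module Latin1 : INPUT with type t = string = struct
  type t = string
  let length = String.length
  let decode s pos = (Char.code s.[pos], 1)
end

(* UTF-8 strings. Assumes well-formed input: no validation is done. *)
module Utf8 : INPUT with type t = string = struct
  type t = string
  let length = String.length
  let decode s pos =
    let byte i = Char.code s.[pos + i] in
    let cont i = byte i land 0x3F in
    let b0 = byte 0 in
    if b0 < 0x80 then (b0, 1)
    else if b0 < 0xE0 then (((b0 land 0x1F) lsl 6) lor cont 1, 2)
    else if b0 < 0xF0 then
      (((b0 land 0x0F) lsl 12) lor (cont 1 lsl 6) lor cont 2, 3)
    else
      ( ((b0 land 0x07) lsl 18) lor (cont 1 lsl 12)
        lor (cont 2 lsl 6) lor cont 3,
        4 )
end

(* Code written once against INPUT works for every representation. *)
module Make (I : INPUT) = struct
  (* Decode every code point of the input, left to right. *)
  let code_points (x : I.t) : int list =
    let n = I.length x in
    let rec go pos acc =
      if pos >= n then List.rev acc
      else
        let cp, width = I.decode x pos in
        go (pos + width) (cp :: acc)
    in
    go 0 []
end

module L = Make (Latin1)
module U = Make (Utf8)
```

Applied to the same bytes, `U.code_points "h\xC3\xA9"` yields the two code points `[0x68; 0xE9]`, while `L.code_points` yields three, one per byte. The extra indirection through `I.decode` is where the performance cost mentioned above would come from.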