U+034F COMBINING GRAPHEME JOINER not working with Windows fonts #1188

Open
moyogo opened this issue Dec 16, 2024 · 3 comments

moyogo commented Dec 16, 2024

The Microsoft Windows fonts that have both U+034F COMBINING GRAPHEME JOINER and some combining mark characters fail to display them properly when they are used together.

The Unicode Standard, version 16.0, section 23.2.4 defines its use; in particular:
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24326

U+034F COMBINING GRAPHEME JOINER (CGJ) is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching. It is also used to distinguish sequences that would otherwise be canonically equivalent.

https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24492

The CGJ has no visible glyph and no other format effect on neighboring characters but simply blocks reordering of combining marks. It can therefore be used as a tool to distinguish two alternative orderings of a sequence of combining marks for some exceptional processing or rendering purpose, whenever normalization would otherwise eliminate the distinction between the two sequences.

https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24500

The CGJ can also be used in German, for example, to distinguish in sorting between “ü” in the meaning of u-umlaut, which is the more common case and often sorted like <u,e>, and “ü” in the meaning u-diaeresis, which is comparatively rare and sorted like “u” with a secondary key weight. This also requires no tailoring of either the combining grapheme joiner or the sequence. Because CGJ is invisible and has the Default_Ignorable_Code_Point property, data that are marked up with a CGJ should not cause problems for other processes.

See also:

Compare ü (U+00FC) and u͏̈ (U+0075, U+034F, U+0308) in Windows fonts that have glyphs for both:
[Image: Screenshot 2024-12-16 at 15 28 03]

There probably shouldn’t be any visual difference between ü (U+00FC) and u͏̈ (U+0075, U+034F, U+0308) in those fonts, as they contain all the necessary glyphs.
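
For reference, the character-level behaviour quoted above can be checked with Python's standard `unicodedata` module. This is only a small sketch of the normalisation side (composition and mark reordering being blocked by CGJ); it says nothing about how any particular font or shaper renders the result. The variable names are just for illustration.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# CGJ has canonical combining class 0, which is why it blocks reordering.
cgj_class = unicodedata.combining(CGJ)

# Without CGJ, NFC composes u + U+0308 into the precomposed U+00FC.
composed = unicodedata.normalize("NFC", "u\u0308")

# With CGJ in between, the sequence is left as three code points.
with_cgj = "u" + CGJ + "\u0308"
blocked = unicodedata.normalize("NFC", with_cgj)

# Reordering: acute (ccc 230) followed by dot below (ccc 220) is
# reordered by normalisation, unless CGJ sits between the two marks.
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
kept = unicodedata.normalize("NFD", "a\u0301" + CGJ + "\u0323")

print([f"U+{ord(c):04X}" for c in composed])
print([f"U+{ord(c):04X}" for c in blocked])
print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in kept])
```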


tiroj commented Dec 16, 2024

> Because CGJ is invisible and has the Default_Ignorable_Code_Point property, data that are marked up with a CGJ should not cause problems for other processes.

So this statement is evidently wrong when those other processes are glyph processes rather than character processes. Inserting CGJ in a mark sequence does cause problems for glyph processing, and it isn’t clear to me whether this is something that needs to be addressed at the font level or at the shaping engine level.

Unicode effectively places no limitations on where CGJ can be inserted in a character string, or even how many instances of CGJ could be inserted in a character string. There are places where one might anticipate it being used to affect sorting or to prevent canonical composition or normalisation reordering, but it could occur anywhere and potentially disrupt both GSUB and GPOS operations.

To solve this at the font level requires every font to make accommodation for filtering CGJ in every lookup. So, for example, if you want a ligature substitution to occur, you need to use a mark filter set that excludes CGJ. If you want marks to be positioned relative to a base or to other marks, you need to use a mark filter set that excludes CGJ. This isn’t an issue of something ‘not working with Windows fonts’: I am not aware of any fonts that accommodate CGJ in this way.
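
To illustrate why every lookup would need this, here is a toy model in Python of ligature matching with and without a skip set. It is not the OpenType lookup algorithm or any shaper's API: `match_ligature`, the glyph names, and the choice to keep skipped glyphs after the ligature are all assumptions made for the sketch, and it assumes the CGJ glyph is classed so that a filter can skip it.

```python
def match_ligature(glyphs, components, ligature, skip=frozenset()):
    """Replace the first run matching `components` with `ligature`.

    Glyphs in `skip` are stepped over between components and kept after
    the ligature, loosely like a lookup whose filter ignores them.
    """
    for start in range(len(glyphs)):
        i = start
        matched = 0
        skipped = []
        while i < len(glyphs) and matched < len(components):
            g = glyphs[i]
            if g == components[matched]:
                matched += 1
            elif g in skip and matched > 0:
                skipped.append(g)  # ignored for matching, preserved in output
            else:
                break  # an unexpected glyph interrupts the match
            i += 1
        if matched == len(components):
            return glyphs[:start] + [ligature] + skipped + glyphs[i:]
    return list(glyphs)


# Hypothetical glyph names; "uni034F" stands for the font's CGJ glyph.
run = ["f", "uni034F", "i"]

unfiltered = match_ligature(run, ["f", "i"], "f_i")
filtered = match_ligature(run, ["f", "i"], "f_i", skip={"uni034F"})

print(unfiltered)  # CGJ interrupts the match
print(filtered)    # CGJ is skipped, so the ligature forms
```

The same reasoning would apply to mark-to-base and mark-to-mark positioning: unless the lookup skips the CGJ glyph, it sits between the mark and whatever it should attach to.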

I think it makes more sense to look at solving this at the shaping engine level, where it could be done in a way that would enable correct behaviour for most existing fonts while probably breaking very few of them. Since CGJ, as described in Unicode, is not supposed to have a visual impact on glyph strings after normalisation operations at the character level, it seems to me that shaping engines should probably suppress CGJ from glyph strings. It will have performed its standard text functions before glyph processing operations begin, so should be excluded from those operations. Initially, I was going to propose that it be excluded from GPOS operations to avoid the kind of mark positioning disruption that Denis illustrates, but I can’t think of a good reason why it shouldn’t also be excluded from GSUB.

There probably are some fonts in the wild that use CGJ as a hack in GSUB. This is a character whose original intent was rapidly abandoned by Unicode, and then repurposed in the standard, and there is no implementation recommendation for it in OpenType documentation. So I am pretty sure someone somewhere will have looked at it and thought, ‘Oh, I can use this to join graphemes’ or simply to force some non-standard behaviour in a particular font. Such hacks will become non-operational if shaping engines are changed to fix the outcomes for most fonts.

One of the few places I am aware of where CGJ is actively used for the purposes specified in Unicode is Biblical Hebrew, where CGJ is used to prevent reordering of marks occurring due to broken-but-unfixable canonical combining class assignments. So fonts for Biblical Hebrew do make accommodation for filtering CGJ. These should not break if shaping engines suppress the CGJ glyph.


moyogo commented Dec 16, 2024

By "with Windows fonts", I meant Windows fonts that have glyphs for CGJ and combining marks, as fallback fonts are used when CGJ is not present.

This should be a font shaper issue, for the reasons @tiroj mentioned and because the other font shapers tested don’t hit this problem.
It probably shouldn’t matter whether CGJ is in the font either.


tiroj commented Dec 17, 2024

> It probably shouldn’t matter whether CGJ is in the font either.

I was pondering that idea too.

If CGJ is suppressed at the glyph level, that should happen immediately after the cmap operation establishes the initial glyph string but before GSUB begins.
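
A rough sketch of that ordering, in Python, might look like the following. The `cmap` here is a plain dict standing in for a font's character-to-glyph mapping, and `initial_glyph_string` and the glyph names are hypothetical; no real shaping engine is being described.

```python
CGJ = 0x034F  # U+034F COMBINING GRAPHEME JOINER


def initial_glyph_string(text, cmap):
    """Map characters through a toy cmap, then drop CGJ before GSUB/GPOS.

    `cmap` is a dict of code point -> glyph name (an assumption for this
    sketch). The code point is tracked alongside each glyph so that CGJ is
    removed whether or not the font has a glyph for it.
    """
    mapped = [(ord(ch), cmap.get(ord(ch), ".notdef")) for ch in text]
    # Suppression step: after cmap, before any GSUB lookup would run.
    return [glyph for cp, glyph in mapped if cp != CGJ]


toy_cmap = {0x0075: "u", 0x0308: "uni0308", 0x034F: "uni034F"}

glyphs = initial_glyph_string("u\u034f\u0308", toy_cmap)
print(glyphs)  # the mark now directly follows its base
```

With CGJ gone from the glyph string, existing mark positioning lookups see the same input they would for u followed directly by U+0308.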

If CGJ is processed wholly at the character level, then it needs to be taken into account during the cmap operation, where it may prevent the NFC-like normalisation often applied at that stage, but doesn’t actually need to be present in the font as a glyph with a cmap entry for this to work.
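
The character-level alternative could be sketched like this, again in Python and again under assumptions: `prepare_characters` is a hypothetical name, standard NFC from `unicodedata` stands in for the "NFC-like" step, and the toy `cmap` dict deliberately has no entry for CGJ.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def prepare_characters(text):
    """Normalise with CGJ still present, then remove it before cmap lookup.

    CGJ blocks composition during normalisation, so its effect is already
    baked into the character string by the time it is dropped.
    """
    normalized = unicodedata.normalize("NFC", text)
    return normalized.replace(CGJ, "")


# Toy cmap with no CGJ entry: the font does not need a glyph for it.
toy_cmap = {0x0075: "u", 0x00FC: "udieresis", 0x0308: "uni0308"}

plain = prepare_characters("u\u0308")
marked = prepare_characters("u\u034f\u0308")

plain_glyphs = [toy_cmap.get(ord(ch), ".notdef") for ch in plain]
marked_glyphs = [toy_cmap.get(ord(ch), ".notdef") for ch in marked]

print(plain_glyphs)   # composed to the precomposed glyph
print(marked_glyphs)  # base plus combining mark, no .notdef for CGJ
```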
