Longer term idea: reversible glyph file naming scheme #164
Comments
Solution 1 sounds better to me, although I would suggest a different separator. Among a handful of possible characters that is second only to pipes, quotes, spaces, and slashes (okay sixth-ish) of problematic characters for file systems, build systems, URL encoding, etc. Oh yah and anything that's a glob operator should probably be avoided. Why not just |
Whichever separator we'd choose: it will either have to be escaped if it occurs in the glyph name, or the separator will have to be a non-optional part of the file name, so we can say "take the last one". I prefer it to be optional, though, so all-lowercase glyph names won't need a disambiguation code at all. Since I didn't realize
There are two problems that need to be solved, and I proposed one solution for each :) |
The separator and the case sensitivity is not the biggest problem, those are relatively easy to fix (and already fixed in the current implementation by adding an The issue is all those see https://unifiedfontobject.org/versions/ufo3/conventions/#example-implementation Adding those A possible solution is to have marker that the for example:
bonus: this only adds max 2 extra characters to the filename compared to the glyph name. |
Except that is not reversible!
I didn't think the reserved file names issue through yet, but escaping the illegal characters with URL-style Leading periods should also be percent-escaped:
Reserved filenames:
A potential alternate separator character is |
A backtick would make a nightmare for a fencing marker. Don't go there. In fact I don't think the fencing scheme works at all, and as Just noted following capitals with an underscore isn't reversible unless you also have some way to escape underscores where they naturally occur.
Yes, What about |
I'd like to avoid
Visually, I like |
Fair enough on scratching out Semantically, Also while
Okay so GitHub Markdown parsing isn't going to make the exhibits easy. Here are the URLS:
|
Here's a quick test implementation: https://gist.github.com/justvanrossum/c1055da1041f8976a31a93ea838cc05e (Setting it up with These are my test cases so far:
_testCases = [
("Aring", "Aring^1.glif"),
("aring", "aring.glif"),
("ABCDEGF", "ABCDEGF^V3.glif"),
("f_i", "f_i.glif"),
("F_I", "F_I^5.glif"),
(".notdef", "%2Enotdef.glif"),
(".null", "%2Enull.glif"),
("CON", "CON^7.glif"),
("con", "con^0.glif"),
("aux", "aux^0.glif"),
("con.alt", "con.alt.glif"),
("A:", "A%3A^1.glif"),
("A^321", "A%5E321^1.glif"),
("a\\", "a%5C.glif"),
("a\t", "a%09.glif"),
# ("a ", "a%20.glif"), # escape space?
("a ", "a .glif"), # or not?
("a\"", "a%22.glif"),
("aaaaaaaaaA", "aaaaaaaaaA^0G.glif"),
("AAAAAAAAAA", "AAAAAAAAAA^VV.glif"),
] |
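For readers who want to see how file names like the ones in these test cases could be produced, here is a minimal sketch. It is not the code from the gist: the digit alphabet (base32hex, least significant bit first), the `ILLEGAL` character set, the `RESERVED` name list, and all function names are assumptions chosen so that the cases listed above come out as shown. Space is left unescaped, matching the uncommented test case, and glyph names are assumed to be ASCII.

```python
# Hypothetical sketch, not the gist implementation.
BASE32 = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # assumed digit alphabet
SEPARATOR = "^"
# Assumed set of characters to %-escape (includes "%" and the separator).
ILLEGAL = set('"*/:<>?\\|%^') | {chr(c) for c in range(32)} | {chr(127)}
# Assumed, partial list of Windows reserved device names.
RESERVED = {"con", "prn", "aux", "nul"}
RESERVED |= {f"com{i}" for i in range(1, 10)} | {f"lpt{i}" for i in range(1, 10)}


def case_code(glyph_name):
    """One bit per glyph name character (1 = uppercase), 5 bits per digit."""
    digits = []
    for start in range(0, len(glyph_name), 5):
        value = 0
        for offset, ch in enumerate(glyph_name[start:start + 5]):
            if ch.isupper():
                value |= 1 << offset  # first character is the lowest bit
        digits.append(BASE32[value])
    return "".join(digits).rstrip("0")  # trailing zero digits are omitted


def escape(glyph_name):
    """%XX-escape illegal characters and a leading period."""
    out = []
    for index, ch in enumerate(glyph_name):
        if ch in ILLEGAL or (index == 0 and ch == "."):
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


def glyph_name_to_file_name(glyph_name):
    code = case_code(glyph_name)
    if not code and glyph_name.lower() in RESERVED:
        code = "0"  # force a suffix so the file name is no longer reserved
    suffix = SEPARATOR + code if code else ""
    return escape(glyph_name) + suffix + ".glif"
```

Note that the bitfield is computed from the glyph name rather than from the escaped file name, so the hex digits of an escape such as `%5E` do not contribute bits.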
I wrote:
Except that I was wrong! The "add underscore after uppercase letter" part of the current scheme is reversible, as the underscore itself is doubled. Eg. So, while my base32 scheme would allow longer glyph names than the current one, and keeps the original glyph name intact as part of the file name (as long as no reserved chars are used), it is otherwise debatable whether my proposal is even an improvement. It's possible we keep the (To repeat: it can't be fully reversible if we don't specify a length limitation on the glyph name.) |
And why not using (no globbing issue for filenames used in shell scripts like with Also the Base32 bitmap for uppercase mapping is not very friendly. I tend to think that we should better just tag individual characters (and avoid UTF-8 hexadecimal escapes as well, causing more problems for embedding in URLs), for example:
I would call such escaping mechanism a "filename-safe" encoding scheme. It could be generic and not limited to mapping glyph names to filenames, and designed to be safe for filesystems with non significant lettercases. Other schemes are still possible, including the Punycode transform (as used in IDNA for domain names, but without the IDN restrictions for authorized characters and without its lettercase unification, but it is a complex scheme in its trailing part after the |
Another way of avoiding reserved names is to, uhm, err, prepend e.g. |
A leading Other possible solution would be to use the "trigrams" as documented for use in C/C++ preprocessors: in C/C++, except that you would like to use something else than the |
I don't follow. If you prepend every file name with an underscore (reserved or not) and remove it on reading the file, you can use it anywhere else in the name like before. |
The unconditional "_" prefix would then apply to every filename, just to solve the problem of possible reserved filenames (which is variable depending on OS/environments/filesystem/versions), but it still does not solve the problem of reserved characters and case insensitivity (e.g. for naming files for glyphs, whose case is significant and that may include restricted characters like "*", "#", "/" or "?"). Also, if there are filename length restrictions (on FAT without LFN support), we need something else: an extra metadata file containing the mappings between filenames and glyphnames (even if we try to circumvent some restriction using an archive format like ZIP which holds filenames with fewer restrictions, there would be problems for extracting and archiving the archive in a restricted filesystem). This is in fact general for any development project for naming their source files: usually, programming environments enforce some restrictions for safe source filenames: restricting "/", "", "*", "?", "#", compressing whitespaces, and ".", reserved for filename extensions, not in their basename. I don't see any definitive solution, except by using a mapping metadata file (just like what fonts already do internally in their tables): this mapping could be optional as long as its values can be inferred from a limited set of production rules, however these rules could be overridden at any time in an explicit mapping file, allowing more flexibility for the naming scheme. Some default production rules are those already defined in Postscript for naming glyphs; but if there are none to reliably name a glyph, using hexadecimal codepoints then a custom extension for contextual or linguistic variants (separated by "_" maybe? fixed number of digits or without leading zeroes). Each font project can then choose how to manage their own namespace using the mapping metadata file in their project.
But if we are building an OTF/TTF font, there should be a Postscript glyph names table which can be part of the generated OTF/TTF font. As well it is possible to automatically generate a suitable mapping file using the basic Postscript glyph names rules (those recognized as well in PDF readers), then a list of hex code codepoints and an extension for contextual glyph variants (default extensions can be just created by numeral increments). If this initial mapping is not the best for font designers, they can rename as they want by updating the name mapping file (which should be a one-to-one bijection, if we want for example PDF readers to be able to infer the Unicode encoding from a list of glyph ids, without needing to use OCR techniques: this reverse conversion from list of glyphs to encoded text is something not very easy, given the existence of Unicode Bidi reordering, OpenType reorderings for example with prepended vowels, and other GSUB/GPOS rules that could be used for complex ligatures). |
Oh yes. I mean, prepend an underscore unconditionally and then the reserved name concern goes away (I'm only aware of Windows imposing something here) and you can focus on how to handle special characters and case-insensitivity. Maybe relying on a flaky string storage for fully reversible names is a fool's errand and we do need some kind of mapping file, I don't know.
Already exists in the form of https://unifiedfontobject.org/versions/ufo3/lib.plist/#publicpostscriptnames. |
I like the scheme that Just proposed in #164 (comment) Unfortunately the caret E.g. "Aring.1.glif" or "CON.7.glif" (instead of "Aring^1.glif" and "CON^7.glif") I think imposing a maximum glyph name length of 255 is reasonable if it allows us to devise a reversible filenaming scheme. I doubt one would ever fit a whole Lorem Ipsum paragraph inside a glyph name, e.g. these are 255 characters, and I'd argue that they ought to be enough:
Also, I think that space character should also be escaped, to avoid having to quote glif path names if used in command lines. |
I would like the suffix to be optional, and using
|
well, sorry, forget period |
whatever the separator, it has to be one that can't occur in glyph names themselves, or it must always be there at the end of the filename |
Whatever the separator, if it occurs in the glyph name itself, it will have to be %-escaped. |
you're right. By the way the percent symbol % proposed for URL-style encoding is also a special character in windows command line, but similarly to the caret one can escape it by doubling it (%% or ^^) or wrapping in quotes, so I don't think we should block on this. It's gonna be hard or even impossible to make posix shells, windows command prompt, regular expression syntax, and what have you -- all happy. |
ok so to recap. To make it a reversible naming scheme that doesn't require contents.plist we have two options, both of which can use url-style %-encoding to escape illegal characters:
Option 2) is basically the same as we currently use, with the difference that illegal characters are %-encoded, instead of replaced with "_", so they can be reversed. The underscore notation keeps the readability, and is already familiar to font devs, at the cost perhaps of a longer filename.
this is the bit I don't quite understand yet. Why is the maximum length on glyph names a requirement for fully reversible, contents.plist-less naming mechanism? I think that imposing a max glyph name length could be reasonable and won't ever be hit in practice, so I'm ok with that. |
Yes and yes.
I agree. It would just be nice to figure out and document what the actual limit will be. With either of the two schemes it will depend on the number of capital letters in the glyph name... |
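To get a feel for what the limit might be, here is a rough worked calculation. It assumes a 255-character maximum file name, the `.glif` extension, ASCII glyph names with no %-escaped characters, and the worst case of an all-uppercase name. The two length formulas are my reading of the two schemes discussed in this thread (a one-character separator plus one base32 digit per 5 characters, versus one extra underscore per capital), not figures from the spec.

```python
MAX_FILE_NAME = 255  # assumed file name limit
EXTENSION = ".glif"


def bitfield_length(n):
    """Worst-case file name length for an all-uppercase name of n characters:
    name + separator + one base32 digit per 5 characters + extension."""
    digits = -(-n // 5)  # integer ceiling of n / 5
    return n + 1 + digits + len(EXTENSION)


def underscore_length(n):
    """Worst-case length when every capital is followed by an underscore."""
    return 2 * n + len(EXTENSION)


def longest_name(length_fn):
    """Largest n whose worst-case file name still fits in MAX_FILE_NAME."""
    n = 0
    while length_fn(n + 1) <= MAX_FILE_NAME:
        n += 1
    return n


print(longest_name(bitfield_length))    # 207
print(longest_name(underscore_length))  # 125
```

Each %-escaped character costs two extra characters on top of this, so a documented limit would have to be lower still if escapes are to be allowed in worst-case names.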
If my last scheme was too complex, it can also be reduced (but not above that the lowercase and uppercase letters were considered equivalent, due to case-insensitive filesystems). Just retain Similar (may be even simpler) escaping could as well reduce to
Literal underscores for escaping uppercase letters must be doubled if they occur before a literal lowercase letter, or before a literal uppercase letter or underscore needing their own underscore escape. Note that the behavior of underscores for handling lettercase only applies to ASCII letters; filesystems may or may not treat case differences for other letters (depending on versions of the UCD they are using for case mappings) so uppercase non-ASCII letters should be escaped in hex. But in fact non-ASCII characters should probably all be hex-escaped (for having too fuzzy support in filesystems, possibly also changing and enforcing a Unicode normalization form): glyph names themselves should preferably be limited to ASCII, but if not, these extra characters have to be hex-escaped. Finally all this thread is only about finding a solution to restrictions/limitations of filesystems (forbidding or not distinguishing some filenames or creating some additional aliases). We should not care about limitations/restrictions added by shells. So what is relevant is just what is found in common filesystems, the most restrictive being those used by Windows (case insensitive names, the handling of leading/trailing whitespaces or dots, the behavior of wildcards, a few legacy reserved names, and path separators, plus the Windows-specific behavior of tilde "~" related to the generation of "short filenames" for compatibility with legacy programs inherited from DOS for FAT filesystems when they still did not have LFN support, this behavior being still used on Windows filesystems having LFN support; on Linux/Unix, we are just concerned by wildcards, a single path separator "/" and special names "." and "..", which are also restricted on Windows filesystems).
Additional characters that are restricted on Windows are "|", "<", ">", as well as double quotation marks (they are not restricted on Linux/Unix, just used specifically by its common command-line shells, providing escaping mechanisms when needed, including for wildcards "?" and "*", for character classes with "[...]", and for sets of alternate names with "{..., ...}"). Other things like the "=", "%", "$", "&", "{...}", "[...]" and "(...)" characters, specifically used by the syntactic parsers of command line shells should not concern us: there's a wide set of shells, each one having their tricks and their own escaping mechanism when needed, but not adding restrictions on filesystems on which they are used. The most problematic case is if we want to use legacy filesystems that don't have LFN support (basically old FAT filesystems without the extension supported since Windows 95): these are extremely unlikely to be ever used here for developing/supporting "unified font objects". The only way to support these would be to use a "mapping file" containing the list of short filenames mapped for what was intended to be long filenames (it is possible to maintain here such mapping file, but in reality there are alternatives, such as storing these files in a ZIP container: extracting files from the ZIP could also create and maintain such mapping file automatically: the short filenames would be generated on the fly using a scheme similar to what is used on Windows for "8.3" names, using a few letters in prefix, plus a basic numeric counter, before a shortened extension; those files with short names could remain just in a temporary working folder, along with the temporary mapping file and discarded once we are done and they are rearchived). This does not mean that the project development here on GitHub must generate ZIP archives for its collection of glyph files: it's up to the client to manage their local archives when talking with GitHub; reading files from ZIP archives can be
extremely fast, as they don't even need to use extreme compression level (they are just there as an easy workaround possible for any client that could not store individual files directly "as is" on its local filesystems). This is exactly like what already happens everyday within all web browsers for managing their cache: browsers maintain their own mapping file to index the long names referenced on external sites in their URLs or web APIs. Web sites do not ever have to know or manage these client-side index themselves. And today, clients can avoid that local "cost" for managing ZIP archives and index, by just using a better-capable filesystem for the storage (today NTFS, or modern FAT32 with LFN, or exFAT, ReFS... Let's forget old ISO-FS on CDROMs/DVDs without support of the "Joliet" extension, old FAT on MSDOS and Windows before Windows 95, or antique filesystems like CP/M that did not even have the concept of distinct directories in their root). However, for deployment of apps, using ZIP archives or libraries can still be interesting if these files are to be mostly used as read-only resources: they take less space and (un)install themselves much faster with less overhead on local filesystems and frequently improve the overall performance of the app using them: that is what most modern apps are doing today for their packages (including for their "theme packs", "resource bundles"...), with the additional possibility of embedding other metadata along with their embedded mapping file (e.g. digital signatures, versioning info, security descriptors, permissions, intended usage, etc.) But our archiving format should ultimately be the complete font file (in some OpenType/XML/SVG/PostScript/webfont container format) that this project intends to produce so that they become instantly installable or referenceable in applications needing fonts.
Our individual ".glif" files are intermediate development files only to be used by very few users/developers/designers, and will almost never be used "as is" by final applications. We just want individual ".glif" files to manage the development, design, interchange in a more granular way than just hosting plain ".ttf" files on GitHub or other source repositories (because they offer no facility such as diffs, history of changes, development comments, patches, conditional testing of changes, reusability, restructuring of font contents...) |
The main problems a glyph file naming scheme needs to solve:
UFO does not specify a maximum glyph name length, but in practice we're tied to .fea, which does.
If we were to set a maximum length for glyph names, then it is possible to create a completely reversible glyph file naming scheme that does not need a contents.plist-like mechanism at all. If we keep insisting on not imposing a maximum length, a hybrid solution may be possible (but I'm not necessarily in favor of that).
Proposed solution to 1:
Append a disambiguation code to the glyph name that encodes a bitfield, using one bit per character in the glyph name, corresponding to the case of the character: 0 = lowercase, 1 = uppercase.
If we allow trailing zeros to be omitted, this scheme is very efficient for mostly lowercase glyph names (esp. if any uppercase characters appear early in the glyph name).
If this code is encoded in a base32-like encoding, we need one ascii character per 5 bits of data. We can choose the encoding to use [a-z0-5]. Perhaps use # as a separator character.
If the glyph name is entirely lowercase, the disambiguation code can be omitted.
Example: the file name for Aring would become Aring#b.glif (b would encode the bits 10000). The file name for aring would be simply aring.glif.
This scheme guarantees that if two glyph names only differ in case, their corresponding file names will be unique, even on a non-case sensitive file system, while still containing the full glyph name.
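A minimal sketch of the disambiguation code described here, under a couple of assumptions: the first character of the glyph name maps to the least significant bit of the first digit (this is what makes Aring come out as b), and the alphabet is a-z followed by 0-5. The function names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz012345"  # [a-z0-5], 32 symbols


def disambiguation_code(glyph_name):
    """Encode the case bitfield, 5 glyph name characters per code character."""
    digits = []
    for start in range(0, len(glyph_name), 5):
        value = 0
        for offset, ch in enumerate(glyph_name[start:start + 5]):
            if ch.isupper():
                value |= 1 << offset  # assumed bit order: first char = lowest bit
        digits.append(ALPHABET[value])
    # "a" encodes five zero bits, so trailing "a" digits can be dropped.
    return "".join(digits).rstrip("a")


def file_name(glyph_name):
    code = disambiguation_code(glyph_name)
    return glyph_name + ("#" + code if code else "") + ".glif"


# Names that differ only in case stay distinct after case folding.
names = ["aring", "Aring", "ARING"]
folded = {file_name(n).lower() for n in names}
print(len(folded) == len(names))  # True
```

Because the code is itself drawn from a case-insensitive-safe alphabet, comparing the lowercased file names is a fair stand-in for what a case-insensitive file system would see.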
Proposed solution to 2:
Use url-style %XX escaping.
With an assumed maximum file name length of 255 (which is what ufoLib currently assumes), we can still use fairly long glyph names with this scheme (longer than .fea's 64-character maximum).
To get a glyph name from a file name:
- Strip the .glif file extension
- Strip the # and everything after it, if there is one
- Decode the %XX sequences
Relates to #122
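The steps for recovering a glyph name from a file name could be sketched as follows. This assumes that a literal # or % in a glyph name is always %-escaped, so any # left in the file name is the separator; the function name is hypothetical, and `urllib.parse.unquote` is used only as a convenient %XX decoder.

```python
from urllib.parse import unquote


def file_name_to_glyph_name(file_name):
    if not file_name.endswith(".glif"):
        raise ValueError(f"not a .glif file name: {file_name!r}")
    stem = file_name[: -len(".glif")]   # 1. strip the extension
    separator = stem.rfind("#")         # 2. drop the disambiguation code
    if separator != -1:
        stem = stem[:separator]
    return unquote(stem)                # 3. decode %XX sequences


print(file_name_to_glyph_name("Aring#b.glif"))    # Aring
print(file_name_to_glyph_name("%2Enotdef.glif"))  # .notdef
```

This relies on the file system preserving the case of the stored name. A stricter reader could instead lowercase the stem and reapply the case bits from the code, which would also survive a file system that does not preserve case.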