What is the relationship between fonts and Unicode characters?

“It is my understanding that not all fonts contain the Unicode character set. Are they contained in certain fonts or are they independent? If a code does not exist in a font then what is used?”
An older version of this was originally published at https://www.quora.com/What-is-the-relationship-between-Fonts-and-Unicode-characters/answer/Thomas-Phinney

Unicode is the standard for characters in computing. It assigns a unique code to each character. For example, the capital A is a character. Some things that look the same are different characters: the capital Greek Alpha and the capital Latin A usually look the same, but get different Unicode numbers.
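To make the A-versus-Alpha point concrete, here is a small Python sketch using only the standard library. The helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the codepoint in U+XXXX notation plus the official Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_a = "\u0041"      # looks like: A
greek_alpha = "\u0391"  # looks (nearly) identical in most fonts

print(describe(latin_a))        # U+0041 LATIN CAPITAL LETTER A
print(describe(greek_alpha))    # U+0391 GREEK CAPITAL LETTER ALPHA
print(latin_a == greek_alpha)   # False: same shape, different characters
```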

A font can contain zero or more glyphs—a glyph is a single slot in the font that usually contains a representation of … something, a letter or symbol. In most cases, one glyph represents one character, although sometimes more than one glyph can be used for one character (for example, an accented character can be composed from a base character plus a combining accent), or more than one character can be represented by a single glyph (for example, a ligature, such as the o-f-f-i ligature in Caflisch Script).

Aside from such complications, usually most (often nearly all) of the glyphs in a font have Unicode codepoints (numbers) assigned to them. If a glyph does not have a Unicode codepoint, it might be related to a Unicode value via an OpenType feature. So for example, the ‘liga’ ligature feature in Caflisch Script would have code that says, if you have the sequence o-f-f-i then replace it with the ligature glyph named “o_f_f_i”. So while that ligature glyph does not have a single or direct Unicode codepoint, it is related to a group of characters that do have Unicode codepoints.
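The ligature substitution described above can be mimicked with a toy Python model. This is only a sketch of the idea: real fonts store such rules in OpenType layout tables, not in Python, and the rule table, the glyph names other than “o_f_f_i”, and the longest-match strategy are all assumptions made for illustration.

```python
# Toy model only. Each rule maps a sequence of input glyph names to one
# ligature glyph name. Only "o_f_f_i" comes from the article; the other
# entries are hypothetical.
LIGATURES = {
    ("o", "f", "f", "i"): "o_f_f_i",
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
}

def apply_liga(glyphs: list[str]) -> list[str]:
    """Replace the longest matching glyph sequence at each position."""
    # Try longer sequences first so o-f-f-i wins over f-f-i or f-i.
    rules = sorted(LIGATURES.items(), key=lambda kv: len(kv[0]), reverse=True)
    out = []
    i = 0
    while i < len(glyphs):
        for seq, lig in rules:
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(lig)   # one glyph now stands for several characters
                i += len(seq)
                break
        else:
            out.append(glyphs[i])  # no rule matched: keep the plain glyph
            i += 1
    return out

print(apply_liga(list("office")))  # ['o_f_f_i', 'c', 'e']
print(apply_liga(list("fig")))     # ['f_i', 'g']
```

Note that the text itself still contains the four characters o, f, f, i, each with its own Unicode codepoint; only the displayed glyph changes.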

When it comes to combining accents (more technically called “diacritics” by font geeks), the Unicode standard itself has info about some characters that can be assembled from other characters. For common western European languages this is all pretty straightforward: Unicode has codepoints assigned to combinations such as é and ü, as well as separate “combining accent” characters that can go with the base letter to make the combo. But Unicode does not have all the possible combinations as predefined characters. So even for some characters built from a–z plus diacritics, such as those needed by some African languages, there is no precombined character; in the computer it is only represented as base-letter-plus-combining-diacritic.
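Python's standard `unicodedata` module can show both situations. In this sketch, the second example (e with dot below and grave, as used in Yoruba orthography) is my own choice of illustration rather than one named in the article, and `codepoints` is a made-up helper.

```python
import unicodedata

def codepoints(s: str) -> list[str]:
    """List the codepoints of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# Case 1: é exists as one precombined character, and also as base + accent.
e_acute = "\u00E9"
decomposed = unicodedata.normalize("NFD", e_acute)
print(codepoints(decomposed))                                # ['U+0065', 'U+0301']
print(unicodedata.normalize("NFC", decomposed) == e_acute)   # True

# Case 2: e + combining dot below + combining grave accent.
# Unicode has a precombined "e with dot below" but nothing for the full combo,
# so even the composed (NFC) form keeps a separate combining accent.
e_dot_grave = "e\u0323\u0300"
composed = unicodedata.normalize("NFC", e_dot_grave)
print(codepoints(composed))                                  # ['U+1EB9', 'U+0300']
```

In the second case it is up to the font (and the text layout software) to position the leftover combining accent correctly.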

For many languages, including Indic languages, Arabic, and others, the processing is even more complex. Let’s just say that the further we are from the simple confines of English the less often it is true that one character equals exactly one glyph.

An average western-language font has about 200 to 400 glyphs. A more extensive one might have 500 to 700, and a really extensive one thousands (2000–5000). Fonts for other writing systems such as Chinese or Japanese routinely have 5,000, 10,000 or even 20,000 glyphs, but because of that, and the complexity of the individual glyphs, there are fewer such fonts designed.

“Not all fonts contain the Unicode character set” is an understatement. No single font on earth contains the entire Unicode character set, and perhaps no single font ever will. Unicode currently defines about 150,000 characters and is updated (and expanded) annually, while a font is limited to 64K (65,535) encoded glyphs (in any major format, anyway).

The Unicode character set is completely independent of specific fonts, although specific fonts may attempt to be thorough in covering particular sections of Unicode. (And the origins of Unicode include trying to be a superset of all preexisting font encoding standards.)

“If a code does not exist in a font then what is used?” Aside from cases where the character might be assembled from others (like with the combining accents mentioned previously), if a called-for Unicode character is not supported in any way in the currently selected font, then the behavior still depends on the application and the operating system. In some cases a “notdef” glyph may be shown to indicate a missing glyph in the current font—more common with high-end graphics apps such as Adobe Creative Cloud. Many apps and environments will at least attempt to do font fallback, substituting some other font that does support the desired character. In such cases the right letter or symbol will appear, but in a different font! This is why sometimes you will see a document where most of the characters are in one font, but perhaps an accented character or something else less common is in a clearly non-matching font.
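As a rough sketch of that fallback logic, the following Python models each font as nothing more than a set of supported characters. The font names, their coverage, and the functions `pick_font` and `layout` are all hypothetical; real applications and operating systems each have their own, more elaborate rules.

```python
from typing import Optional

# Hypothetical fonts: each is just the set of characters it supports.
FONTS = {
    "BodySerif": set("abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "FallbackSans": set("abcdefghijklmnopqrstuvwxyz \u00e9"),  # also has é
}

def pick_font(ch: str, selected: str, fallbacks: list[str]) -> Optional[str]:
    """Return the first font that supports ch, trying the selected font first."""
    for name in [selected, *fallbacks]:
        if ch in FONTS[name]:
            return name
    return None  # nothing supports it

def layout(text: str, selected: str, fallbacks: list[str]) -> list[tuple[str, str]]:
    """Pair each character with the font used, or '.notdef' if none has it."""
    return [(ch, pick_font(ch, selected, fallbacks) or ".notdef") for ch in text]

# "café" followed by a check mark (U+2713) that neither toy font supports.
for ch, font in layout("caf\u00e9\u2713", "BodySerif", ["FallbackSans"]):
    print(repr(ch), font)
```

Here the é lands in the fallback font, which is exactly the mismatched-accent effect described above, and the check mark ends up as a notdef because no font in the list covers it.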

In extreme cases (more common for especially rare or newly-defined characters), even environments that do attempt such fallback may fail to find a match because they have no font that supports the character in question! In such situations, one may still see a notdef, or get fallback to a special Last Resort font. (I have a whole separate article about the notdef, pending!)
