Unicode's characters

This chapter looks at Unicode as a coded character set: at Unicode's character repertoire and character numbering, but not at the various interchangeable 7-/8-/16-/32-bit binary representations, nor at the underlying history of writing from genetic DNA coding to human writing on clay tablets or paper and later with movable type or computers. This text paraphrases a lot of information that can better be looked up in the more authoritative official standard documents, on the Unicode web server, and in the many scientific articles about Unicode, but I had to write yet another such beast for my thesis.

Why do we need characters?

Text carved in stone or written on paper is stored as a graphical image. Why aren't we simply coding all text as concrete graphics? After all, there are excellent standards for encoding graphics on computers: either as pixel scans (GIF, PNG) or as smoothly scalable outline graphics (PostScript, PDF). Using graphics provides maximal freedom of expression: we can draw all the signs we want and send them as graphics to our readers. We are not limited to some stupid ASCII or Latin-1 or Unicode character table that lacks our favorite signs. And this technique is indeed still being used: many Japanese businesses prefer telefaxing handwritten letters over slow-to-type e-mails, and many Arabic sites on the World-Wide Web present their texts as GIF images instead of using character encodings unsupported by most platforms.

Encoding concrete graphical shapes does have a number of drawbacks, though. More severe than the compression inefficiency (graphics tend to take much more disk space, telephone bandwidth and processing time) is the consequence that your computer will not be able to search, edit, sort, index or otherwise process the text without applying artificial intelligence techniques (computer vision, optical character recognition, OCR) that tend to use insecure error-prone heuristics and need an intermediate character encoding like ASCII or Unicode as a target to transform into anyway.

What is a character?

A character is an abstract concept in the sense that many different concrete glyphs or sounds can only be recognized as being the same character (sequence) by educated brains.

An abstract character is a unit of textual information such that a sequence of characters defines an abstract text that can be written or recited in various concrete ways all of which are obviously presenting the same underlying text.

A coded character set (CCS) is a mapping from natural numbers into a repertoire of abstract characters so that a number sequence represents text. If each character in the CCS is reachable through only one unique number and there is only one standard way to split a text into characters, then we also have a well-defined mapping for the reverse operation: each text has one unique encoded representation, which is very easy to search for. We will see later that Unicode follows this theoretical ideal but has historically been compromised to allow different encoding variants for some instances of text.

Unicode gives each character a unique number, name, and additional properties

Unlike any traditional CCS with a regionally limited repertoire, the universal character set (UCS, Unicode, ISO 10646) has been numbering the identified characters of all the world's major languages so that there is universal agreement on which number represents which character and they can all be communicated by e-mail.

To speak in database terminology, the Unicode character number serves as a primary key to index virtually all the world's characters. Consequently, when standardizers speak about any particular character, they nowadays usually identify it by the hexadecimal representation of its Unicode number prefixed with a U: either the four-digit U+xxxx or the eight-digit U-xxxxxxxx.
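
To illustrate the notation (this is just my own little helper, nothing defined by the standard), a few lines of Perl that print either form for a given code point:

#!/usr/local/bin/perl
# Print a code point in the conventional notation: four hex digits
# with a U+ prefix inside the BMP, eight hex digits with U- beyond it.
sub ucode {
    my $n = shift;
    return $n <= 0xFFFF ? sprintf("U+%04X", $n) : sprintf("U-%08X", $n);
}
print ucode(0x41),    "\n";   # U+0041 LATIN CAPITAL LETTER A
print ucode(0x9AA8),  "\n";   # U+9AA8 the bone ideograph
print ucode(0x10000), "\n";   # U-00010000, the first extraplanar code point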

Each defined Unicode character also has a unique official long English character name that is sometimes quite logical (such as "LATIN CAPITAL LETTER A" for U+0041) but sometimes unhelpful (such as "CJK UNIFIED IDEOGRAPH-9AA8" for the bone ideograph U+9AA8 骨) or even misleading (such as "EURO-CURRENCY SIGN" for the pre-Euro ECU sign U+20A0 ₠).

Besides each character's official name and number, the Unicode standard defines other normative properties such as character type, combining class, (letter) case, (digit) value, bidirectional behaviour and decomposition, and provides additional information such as alias names, compatibility mappings, casing partners, sample glyphs, and usage notes (dearly missed for many characters, though).
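
These properties are also distributed in machine-readable form. A rough sketch of mine, assuming the semicolon-separated format of the UnicodeData.txt file from the Unicode server, that pulls a few of the normative fields out of it:

#!/usr/local/bin/perl
# Read UnicodeData.txt and report some normative properties of each
# character: name, general category, combining class, bidirectional
# category, decomposition and lowercase mapping.
open(UNIDATA, "UnicodeData.txt") || die "UnicodeData.txt: $!";
while (<UNIDATA>) {
    my ($code, $name, $category, $combining, $bidi, $decomposition,
        $decimal, $digit, $numeric, $mirrored, $oldname, $comment,
        $uppercase, $lowercase, $titlecase) = split(/;/);
    print "U+$code $name: category=$category combining=$combining ",
          "bidi=$bidi decomposition=$decomposition lowercase=$lowercase\n";
}
close(UNIDATA);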

Every character has its own history. You could tell long stories about each of them: which languages use it? What words or contexts does it appear in? What does it stand for? Where did it come from? What glyph variants are there? What other characters is it related to / must it be distinguished from / can it be confused with / can it be replaced with? How do you transliterate or transcribe it? While giving more background information than any earlier character set standard, the Unicode book still isn't perfect at giving all the answers to these questions, so that you sometimes have to bother the unicode@unicode.org mailing list.

What characters are distinguished, what characters are unified?

Like ASCII, Unicode distinguishes lowercase characters from their uppercase partners. That makes correct-case display easier but case-insensitive searching or sorting harder.

Like ISO-8859-1, Unicode distinguishes characters by script, but not by language. The U+0041 LATIN CAPITAL LETTER A ('A') is used for an English as well as a French Capital A. This makes it harder to tell which language the text is in. But everything else is made easier: only one code point is occupied, no extra definitions for every language are necessary, French fonts can be used for English and vice versa. Language tagging has to be done via HTML's LANG="fr" attribute or the proposed U-000E0001 LANGUAGE TAG characters.

Most prominently, the several thousand Chinese Han ideographs found as simplified hanzi in China, traditional hanzi in Taiwan, kanji in Japan and hanja in Korea were counted as one script used by several languages and unified into one single Unicode character whenever there were only minor typeface differences.

Unicode does however distinguish letters from different scripts. The GREEK CAPITAL LETTER ALPHA looks like the LATIN CAPITAL LETTER A but it gets a different code point U+0391 in the Greek block. Now you can downcase CAPITAL ALPHA to SMALL ALPHA and CAPITAL A to SMALL A and automatically get the variant that looks right.

Identical punctuation is again shared between scripts: ASCII's U+002E FULL STOP is also used in Greek and Cyrillic text, and only the differently-shaped Armenian, Arabic, Ethiopic and ideographic full stops get code points of their own.

Unicode distinguishes clearer alternatives for ambiguous ASCII punctuation symbols such as U+2010 HYPHEN, U+2011 NON-BREAKING HYPHEN, U+2012 FIGURE DASH, U+2013 EN DASH, U+2014 EM DASH, U+2015 HORIZONTAL BAR (quotation dash), and U+2212 MINUS SIGN for the good old U+002D HYPHEN-MINUS ('-') while keeping the multifunctional ASCII original.

Within one script, Unicode tries to unify glyph variants (like final and medial forms) of the same letter, but many exceptions have been allowed for implementation ease and because they already existed in older standards. Now we have a mixture of distinguished characters representing particular glyphs, like the U+03C2 GREEK SMALL LETTER FINAL SIGMA (now easier to render but more complicated to type or append to), and unified characters representing complex glyph shaping instructions, like the U+0645 ARABIC LETTER MEEM (whose positional shapes are also encoded as deprecated Unicode characters U+FEE1..U+FEE4) or the even more complex U+094D DEVANAGARI SIGN VIRAMA that cancels an inherent vowel.

How does Unicode handle diacritics (accents)?

Many accented characters like U+00FC LATIN SMALL LETTER U WITH DIAERESIS ('ü') are distinguished characters in Unicode like they are in ISO-8859-1. ISO-8859-1 was particularly successful because it only defined graphic characters which fill exactly one rectangular cell with a prefabricated constant glyph and required no overstriking semantics.

Yet Unicode contains another notable breed of characters: combining characters such as U+0308 COMBINING DIAERESIS, generally depicted as a diaeresis floating over a dotted circle that stands for the character it combines with. Unicode can thus separate base characters from their diacritics: our precomposed character U+00FC LATIN SMALL LETTER U WITH DIAERESIS may equivalently be encoded as the decomposed character sequence U+0075 LATIN SMALL LETTER U followed by U+0308 COMBINING DIAERESIS. The combining character is supposed to change the shape of the preceding character, which matches the handwriting sequence: first draw the base letter, then the accent next to it, u¨=ü.
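
In Perl notation (a little sketch of mine, assuming a Perl and a terminal that understand UTF-8), the two equivalent spellings look like this:

#!/usr/local/bin/perl
binmode(STDOUT, ":utf8");              # emit the strings as UTF-8
$precomposed = "\x{00FC}";             # LATIN SMALL LETTER U WITH DIAERESIS
$decomposed  = "\x{0075}\x{0308}";     # u followed by COMBINING DIAERESIS
print "$precomposed and $decomposed should render alike\n";
print "but as raw strings they compare ",
      ($precomposed eq $decomposed ? "equal" : "unequal"), "\n";

A renderer that understands combining characters displays both strings identically, yet a naive character-by-character comparison reports them as unequal, which is exactly the problem the normalization routine discussed below addresses.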

So, theoretically speaking, at least 475 precomposed Latin letters are superfluous as Unicode characters: they could be decomposed in this way and unified with their base letters and diacritics. Combining characters enable you to encode many more character shapes within a limited code range, and they allow you to quickly express accented letters like the Guaraní LATIN SMALL LETTER G WITH TILDE without having to start a standardization initiative to get the overlooked character added to Unicode.

The official opinion is that the existing precomposed characters were only included for compatibility with older standards such as ISO-8859-1, but I don't see why their accented characters couldn't have been decomposed during conversion from ISO-8859-1 to Unicode and recomposed on the way back. Looking at the existing Unicode applications using "Unicode fonts", it seems more likely that the precomposed characters are a compromise thrown in to gain acceptance in the European user community without requiring us to implement tricky support for combining characters. We ought to be thankful for the precomposed glyph characters and be open to granting the same privilege to non-Europeans, perhaps in one of the additional planes, as the code range of the Basic Multilingual Plane is already clogged enough.

The old European Videotext character set ISO 6937 also contained combining characters: its non-spacing accents 0xC1..0xCF. But they were only allowed to appear one at a time (ruling out Vietnamese) and to modify particular letters only, so that the number of combinations to expect was finite and could be covered with a precomposed font.

Unicode defines no such self-restrictions: it allows any sequence of its several hundred combining characters to be applied to any of its several thousand base characters. Unicode does not tell you which combinations are particularly likely to occur and thus worthy of precomposition besides the precomposed Latin, Greek, Hebrew and Arabic "compatibility" characters. Unicode expects renderers to be able to draw accents over, under, into, through and around arbitrary base characters or already-accented glyphs, still get spacing and appearance right, and even know of aberrations like certain cedillas jumping over their base letter or haceks turning into apostrophes.

What Unicode does define is a normalization routine (canonical decomposition) to aid string comparison and sorting. And ISO 10646 offers a reduced implementation level 1 for which the combining characters need not be supported.
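
Continuing the sketch from above, and assuming the Unicode::Normalize Perl module (which postdates this text) is available, canonical decomposition or recomposition makes the two spellings compare equal:

#!/usr/local/bin/perl
use Unicode::Normalize;                # NFD = canonical decomposition, NFC = recomposition
$precomposed = "\x{00FC}";             # LATIN SMALL LETTER U WITH DIAERESIS
$decomposed  = "\x{0075}\x{0308}";     # u followed by COMBINING DIAERESIS
print "NFD: ", (NFD($precomposed) eq NFD($decomposed) ? "equal" : "unequal"), "\n";
print "NFC: ", (NFC($precomposed) eq NFC($decomposed) ? "equal" : "unequal"), "\n";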

What other characters cause trouble?

Another group of characters causing headaches on platforms that are only prepared for left-to-right character glyph concatenation are

  1. the complex scripts such as Arabic and the Indic scripts (Devanagari, Tamil, etc.) that require contextual analysis to select acceptable glyphs and
  2. the right-to-left scripts Hebrew and Arabic.

Arabic appears in both categories, so it may be considered the single most complex script, but its contextual joining is relatively easy to accomplish within Unicode using the Arabic presentation forms, as I have shown in my arabjoin script. Being able to format Arabic text does not equal integrating the algorithm into interactive surfaces, though.
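
As a toy illustration of the principle (this is not arabjoin, and it ignores the fact that right-joining letters such as ALEF never connect to the letter after them), the choice among the four MEEM presentation forms mentioned above could be sketched like this:

#!/usr/local/bin/perl
# Pick one of the four MEEM presentation forms U+FEE1..U+FEE4 depending
# on the neighbouring characters.  Grossly simplified: it pretends that
# every neighbouring Arabic letter joins in both directions.
sub is_arabic_letter {
    my $c = shift;
    return defined($c) && $c =~ /^[\x{0621}-\x{064A}]$/;
}
sub shape_meem {
    my ($prev, $next) = @_;                            # neighbours in logical order
    my $joins_prev = is_arabic_letter($prev);
    my $joins_next = is_arabic_letter($next);
    return "\x{FEE4}" if $joins_prev && $joins_next;   # medial form
    return "\x{FEE3}" if $joins_next;                  # initial form
    return "\x{FEE2}" if $joins_prev;                  # final form
    return "\x{FEE1}";                                 # isolated form
}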

While written from left to right (except for some vowel signs jumping to the left of the consonant they follow), the Indic scripts appear to be more complicated to render because their presentation forms are not Unicode characters: there is no standard numbering for their many ligature glyphs and no published implementation.

Right-to-left monodirectionality would not be much of a problem, as shown by applications such as the Hebrew xhterm. The directionality problem only becomes complicated through the occurrence of numbers, foreign words or quotes flowing opposite to the dominant writing direction, which introduces bidirectionality within the same horizontal line.

Unicode text is usually stored in logical order, and the output lines have to be arranged in readable order algorithmically according to the explicitly defined Unicode BIDI heuristic. In order to produce the right results even in more complicated cases, there are directional control characters such as RIGHT-TO-LEFT EMBEDDING or LEFT-TO-RIGHT OVERRIDE that introduce state.
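
A toy sketch of the logical-to-visual rearrangement for the very simplest case (an overall left-to-right line containing plain Hebrew-letter runs; no digits, no embeddings, no overrides, nothing like the full BIDI algorithm):

#!/usr/local/bin/perl
# Reverse each maximal run of right-to-left characters (here only the
# Hebrew block U+0590..U+05FF) so that a logical-order line can be
# printed left to right on a dumb display.
sub visual_order {
    my $line = shift;
    $line =~ s/([\x{0590}-\x{05FF}]+)/scalar reverse($1)/ge;
    return $line;
}
binmode(STDOUT, ":utf8");
print visual_order("shalom is \x{05E9}\x{05DC}\x{05D5}\x{05DD}\n");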

What scripts are supported?

The following figure from John Clews gives a nice overview of the scripts used in official languages worldwide and their common origins:


   Latin   Cyrillic                Devanagari - - - - - - Tibetan
      \     /                   /  Gujarati                  |
       \   / - Armenian        /   Bengali      SOGDIAN - Mongolian
        \ /                   /    Gurumukhi    SCRIPT
       Greek - Georgian      /     Oriya                  Chinese
         |                  /                            /
         |                 /       Telugu               /
     PHOENICIAN         BRAHMI - - Kannada      SINITIC - Japanese
     SCRIPT    \        SCRIPT     Malayalam    SCRIPT  \
    /    |      \          \       Tamil                 \
Hebrew   |      Arabic      \                             Korean
         |        \          \ - - Sinhala
         |         \          \
         |          \          \ _ _ Burmese
         |           \          \    Khmer
         |            \          \
      Ethiopic      Thaana        \ _ _ Thai
                                        Lao

Another nice overview is Akira Nakanishi's colorful book of the "Writing Systems of the World", ISBN 0-8048-1654-9.

Unicode 2.0 supports the Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari (the script employed by Hindi and Sanskrit), Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin (including the international phonetic alphabet IPA), Lao, Malayalam, Oriya, Tamil, Telugu, Thai, and Tibetan scripts. All of these can be written in horizontal lines. Arabic and Hebrew are written from right to left. The Indic scripts are written from left to right but some vowels jump to the left of or on top or below or around their preceding consonant so that you could say it is sometimes written in a circular motion. Arabic and the Indic scripts require intelligent ligature selection to become readable.

Unicode 3.0 shall extend some of the existing scripts and add Braille, Canadian Aboriginal Syllabics, Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Ogham, Runic, Sinhala, Syriac, Thaana, and Yi. Mongolian is the first script that can only be written in vertical columns.

Besides the characters for writing the world's major languages, there is a whole set of typographic, technical, graphical, mathematical, astrological and other scientific symbols and geometrical shapes in Unicode.

If you have no access to the paper documents defining the Unicode character set, you can look up all Unicode characters except for the Hangul syllables on charts.unicode.org but there you will not find the additional information on how these characters interact.

Where did the Unicode characters come from?

Unicode tried to subsume all major pre-1991 character repertoires: various national, international, and corporate standards were taken as source standards: ANSI X3.4 ASCII, X3.32 control code symbols, Y10.20 math, Z39.47 bibliographic Latin and Z39.64 EACC, Chinese GB 2312, CNS 11643 and CCCII, Indian ISCII-1988, Japanese JIS X 0208 and 0212, Korean KS C 5601, Thai TIS-620, ISO 2033 OCR, ISO 5426 bibliographic Latin, ISO 5427 bibliographic Cyrillic, ISO 5428 bibliographic Greek, ISO 6429 control functions, ISO 6438 African, ISO 6861 Glagolitic, ISO 6862 math, ISO 6937 Videotext Latin, ISO 8859 multilingual, ISO 8957 bibliographic Hebrew, ISO 9036 Arabic, ISO 10595 Armenian, ISO 10586 Georgian, ISO 10754 Extended Cyrillic, ISO 10822 Extended Arabic, many characters that used to be represented as special ASCII sequences in troff, TeX and ISO 8879 SGML, the PostScript symbols and dingbats, the WordPerfect and Xerox multilingual repertoires and others.

For Unicode to be universally useful, its repertoire has to be a superset of all these smaller repertoires. For Unicode to serve as a replacement for all other charsets, it has to be able to display everything the other charsets can display, so as to make them superfluous. And on first inspection Unicode does indeed look like a proper superset of all the repertoires: it went out of its way to include many "illogical" characters for compatibility with legacy charsets. But on closer inspection you will occasionally find characters from the alleged source charsets missing in Unicode. It is not clear to me whether this is intentional (to omit characters that contradicted the Unicode principles) or accidental (simply overlooked or forgotten because no mapping tables have been compiled yet).

The Unicode Standard specifies neither the source of every Unicode character nor the Unicode mappings for all source charsets, but it does at least partially provide this information.

How are the Unicode characters numbered?

Each Unicode character has been assigned a unique nonnegative integer number: its Unicode value. This number is usually printed in U+xxxx hexadecimal representation.

Except for an irregular bunch of numbers reserved for special purposes like control characters or compatibility mappings or future extensions, the linear Unicode code space {U+0000..U+FFFF} is filled tightly with characters ("begin with zero and add the next character" design).

The Unicode numbering is supposed to be logical, with zones and blocks and ranges of related characters grouped together. However, as some blocks turned out to be too small, scripts had to be extended in remote encoding blocks, so that Latin, Greek, Hebrew, Arabic, Hangul, and Han are now split across several blocks, which makes certain characters harder to find and certain aspects of the Unicode ordering more historical than logical.

I have been using a small Perl script of mine called ucoverage that counts the characters per block in any given Unicode font or character table and prints the profile.
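
The idea behind ucoverage, as a rough sketch rather than the script itself (it assumes the block ranges come from a Blocks.txt table in the form "start..end; name" and that the characters to count arrive as U+xxxx values on standard input):

#!/usr/local/bin/perl
# Count how many of the U+xxxx values listed on standard input fall
# into each block defined in Blocks.txt and print the coverage profile.
open(BLOCKS, "Blocks.txt") || die "Blocks.txt: $!";
while (<BLOCKS>) {
    push(@blocks, [hex($1), hex($2), $3])
        if /^([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+); *(.+)/;
}
close(BLOCKS);
while (<STDIN>) {
    next unless /U\+([0-9A-Fa-f]+)/;
    my $u = hex($1);
    foreach $block (@blocks) {
        if ($u >= $block->[0] && $u <= $block->[1]) {
            $count{$block->[2]}++;
            last;
        }
    }
}
foreach $block (@blocks) {
    printf("%5d  %s\n", $count{$block->[2]} || 0, $block->[2]);
}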

The Unicode numbering induces a canonical sort order that is identical for UTF-8 and UCS-4. As collation rules are different in each locale, it would have been impossible to design a universal numbering that automatically produces the culturally preferred lexicographic sort order anyway. Unicode inherited this problem from ASCII. The first 128 codes in Unicode are a copy of US-ASCII so that for each ASCII character the Unicode value is equal to its ASCII value. Furthermore, the first 256 codes in Unicode are a copy of ISO-8859-1 to ease the transition process. The draft UTR10 defines a Unicode Collation Algorithm.

Many stretches of Unicode characters follow the traditional ordering of older standards, so that for many c Unicode(c) equals either ISO-8859-1(c), ISO-8859-7(c) + 0x0350, ISO-8859-5(c) + 0x03E0, ISO-8859-8(c) + 0x0570, ISO-8859-6(c) + 0x05E0, ISCII(c) + 0x08E0, TIS-620(c) + 0x0DE0, or JISX0201(c) + 0xFF40, counting c as the 7-bit position within the right-hand GR half for the non-Latin-1 charsets. Other stretches follow traditional alphabet orders. The big CJK block follows the traditional KangXi dictionary order. Some stretches appear to be randomly thrown together.
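
A sketch of what such an offset conversion looks like for one of these charsets, here ISO-8859-7 Greek (my own illustration, covering only the ASCII half and the Greek letters, not the punctuation in the 0xA0 row which maps elsewhere):

#!/usr/local/bin/perl
# Map ISO-8859-7 (Greek) bytes to Unicode code points: the ASCII half
# maps to itself, the letters in the right-hand GR half map to their
# 7-bit position plus the block offset 0x0350.
sub greek_to_unicode {
    my @codes;
    foreach $byte (unpack("C*", shift)) {
        push(@codes, $byte < 0x80 ? $byte : ($byte & 0x7F) + 0x0350);
    }
    return @codes;
}
printf("U+%04X ", $_) foreach greek_to_unicode("\xC1\xE1");  # capital and small alpha
print "\n";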

A character's assigned Unicode number is supposed to stay valid for eternity, but this ideal was already compromised by changes for Unicode 1.1 (removals and reorderings) and Unicode 2.0 (Hangul reordering).

The characters with values smaller than 2^16 = 65'536 {U+0000..U+FFFF} make up the Basic Multilingual Plane 0000 (BMP). Their numbers can be represented with 16 bits. The BMP has 65'536 - 6'400 private-use - 2'048 surrogate = 57'088 definable code points. There were 38'887 defined characters in Unicode 2.0. There shall be 49'120 defined characters in Unicode 3.0, which already fills 86% of the available BMP code space.

Beyond the 16-bit barrier at U+FFFF, the future definitions for the extra planes, starting with the first extraplanar character U-00010000, are supposed to hold less frequently used characters so that everyday implementations can do without them and limit themselves to BMP support.

What are surrogate characters?

Surrogate characters are the code points {U+D800..U+DFFF} used in high-surrogate + low-surrogate pairs to reference the extraplanar characters according to the UTF-16 scheme.
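
The arithmetic behind these pairs is simple enough for a small sketch:

#!/usr/local/bin/perl
# Split an extraplanar code point (U-00010000..U-0010FFFF) into its
# UTF-16 high and low surrogates and join them back together.
sub to_surrogates {
    my $u = shift() - 0x10000;                    # 20 bits remain
    return (0xD800 + ($u >> 10), 0xDC00 + ($u & 0x3FF));
}
sub from_surrogates {
    my ($high, $low) = @_;
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}
my ($high, $low) = to_surrogates(0x10000);        # the first extraplanar character
printf("U-%08X <-> %04X %04X\n", from_surrogates($high, $low), $high, $low);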

What are compatibility characters and presentation forms?

Unicode contains a lot of characters that would not be necessary in a character encoding standard because they could have been composed out of other characters by intelligent renderers. These compatibility characters are included in Unicode so that all characters from the source standards can be mapped reversibly to Unicode (the source separation rule).

However, many compatibility characters are quite handy to have (and that is probably why they are in Unicode anyway) because they provide presentation forms that can be included in fonts to help less intelligent renderers. Examples of these are the precomposed accented letters, the Arabic isolated, initial, medial and final letter and ligature glyphs, the braces for vertical text, and the precomposed Hangul syllables.

What happened to the Hangul syllables?

Hangul is the Korean syllabic script. Each Hangul syllable is made out of alphabetic components called jamo (leading consonant choseong + vowel jungseong + optional trailing consonant jongseong) stacked upon and beside one another in a square cell (see how this is done in the unifont).

Unicode-1.0 only contained the 2'350 most frequently used precomposed Hangul syllables in the undersized code range {U+3400..U+4DFF} as taken from the Korean source standard KS C 5601-1987.

Unicode-2.0:1996 (and Amendment 5:1998 to ISO-10646-1:1993) deleted the Hangul syllables from the {U+3400..U+4DFF} range and redefined them in the algorithmically ordered {U+AC00..U+D7A3} range:

#!/usr/local/bin/perl
# Enumerate the 19 x 21 x 28 Hangul syllables in their algorithmic
# order starting at U+AC00 and print their official names.
$i = 0;
foreach $choseong (split(/;/,           # 19 leading consonants
"G;GG;N;D;DD;R;M;B;BB;S;SS;;J;JJ;C;K;T;P;H"))
{
  foreach $jungseong (split(/;/,        # 21 vowels
  "A;AE;YA;YAE;EO;E;YEO;YE;O;WA;WAE;OE;YO;U;WEO;WE;WI;YU;EU;YI;I"))
  {
    foreach $jongseong (split(/;/,      # 28 trailing consonants (first one empty)
    ";G;GG;GS;N;NJ;NH;D;L;LG;LM;LB;LS;LT;LP;LH;M;B;BS;S;SS;NG;J;C;K;T;P;H"))
    {
      printf("U+%X:%s\n", 0xAC00 + $i++,
             "HANGUL SYLLABLE $choseong$jungseong$jongseong");
} } }

The resulting 19 × 21 × 28 = 11'172 Hangul syllables are all that is needed for modern Hangul. But in order to also be able to type medieval Hangul syllables, and to encode the modern Hangul syllables without any misunderstandings due to the different Unicode versions, it may be advisable to encode them in decomposed form using the combining jamo characters {U+1100..U+11FF}.
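
The reverse direction, decomposing a precomposed syllable from the {U+AC00..U+D7A3} range into its conjoining jamo, is again pure arithmetic (a sketch of mine, following the same 19 × 21 × 28 ordering as the script above):

#!/usr/local/bin/perl
# Decompose a precomposed Hangul syllable into its conjoining jamo:
# leading consonants start at U+1100, vowels at U+1161, trailing
# consonants at U+11A7 (index 0 means no trailing consonant).
sub decompose_hangul {
    my $s = shift() - 0xAC00;
    my @jamo = (0x1100 + int($s / (21 * 28)),
                0x1161 + int($s / 28) % 21);
    push(@jamo, 0x11A7 + $s % 28) if $s % 28;
    return @jamo;
}
printf("U+%04X ", $_) foreach decompose_hangul(0xAC01);   # U+1100 U+1161 U+11A8
print "\n";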

What characters are not in Unicode?

For some "characters", Unicode allegedly made a conscious decision not to include them. Unicode will not encode "idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor logos or graphics". This is to prevent an inflation of company and product logos from every remote hook of planet Earth cluttering the codespace and font memory. The absence of logo characters has the consequence that the Apple logo present in the Macintosh charset, and millions of Macintosh screens and keyboards has no unambiguous Unicode value and has to be approximated as something like the heart U+2665, an unstandardized code in the private-use range like U+F8FF, an image or an ugly monolingual string like "[APPLE]" (pictograms give better internationalization) and that you cannot teach your preschool kids how to subtract apples from pears in plaintext e-mails. The new U+237E BELL SYMBOL will at least give a better approximation for the the Bell logo used by Unix troff. The dingbat graphics in Unicode {U+2700..U+27BF} appear to be breaking the no-graphics principle already to me.

The encoding of font change, highlighting and formatting functions is also left to higher-level protocols such as the hypertext markup language HTML or the ISO 6429 (VT100) control codes. Unicode restricts itself to plain text characters. This is definitely a wise decision because the plain Unicode text rendering is complicated enough already.

So much for the characters intentionally left blank.

Something as encyclopedic as a universal character set that tries to bring the multitude of all human languages and writing systems into one computer system cannot be perfect from the beginning. Therefore Unicode is a rather open and dynamic standard that is constantly improving. There is a steady flow of additions (allocation pipeline), clarifications (errata and techreports), and amendments defining new characters which eventually lead to the publication of entire new versions of the standard.

Some important characters and scripts are unencoded simply because nobody has done the necessary research and written the formal encoding proposal yet. Unicode 2.0 Appendix B "Submitting New Characters" gives a description of the requirements for such proposals, now detailed on the Unicode proposals web page and the WG2 proposal form.


This whole chapter was requested by Professor Biedl and benefitted from criticism by Joachim Schulz.

Roman Czyborra
$Date: 1998/11/26 16:39:09 $