Why do we need Unicode?

Unicode (ISO 10646) is the one-size-fits-all character encoding standard designed to clean up the mess of dozens of mutually incompatible ASCII extensions and special encodings and to allow the computer interchange of text in any of the world's writing systems.

How can number crunchers handle text?

The Unix operating system of the early 1970s was one of the first to make use of ASCII. ASCII, the American Standard Code for Information Interchange of 1968, assigned a certain textual meaning to each of the numbers between 32 and 126. It assigned codes to some punctuation symbols, the decimal digits and the letters of the English alphabet in upper and lower case. For instance, the Latin capital letter H was assigned the binary number 01001000, which is =48 hexadecimal or 72 decimal. Most computers nowadays adhere to this standard so that you are able to write a program like

	echo Hello world!

which merely outputs the number sequence

	=48=65=6C=6C=6F=20=77=6F=72=6C=64=21=0A

and rest assured that the readable line "Hello world!" will be drawn on your screen when the program is run on your terminal. You can also save the output to a file on a floppy disk and read it back through some other program or on another computer and you will still get to read "Hello world!". "Hello world!" will also appear on the printout when you redirect the output to the printer or on your recipients' screens when you send the output by e-mail. During all of this only binary numbers have been transmitted. Yet you didn't even have to know the numbers or type them in because your computer's operating system automatically enters the =48 when you press the [H] key on your keyboard while holding the [Shift] key for capital letters.
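The number sequence above can be reproduced in a few lines. The following is only a minimal sketch in Python 3 (the original example uses the shell; the language and the helper name to_hex_notation are assumptions made for illustration). It encodes the line as ASCII and renders each byte in the =XX notation used in this text.

```python
def to_hex_notation(text: str) -> str:
    """Encode text as ASCII and render each byte as =XX (uppercase hex)."""
    return "".join(f"={byte:02X}" for byte in text.encode("ascii"))


if __name__ == "__main__":
    # echo appends a line feed (=0A) to its arguments, so we add one too.
    print(to_hex_notation("Hello world!\n"))
```

The capital H comes out as =48, exactly the number that ASCII assigned to it.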

Dealing with languages other than English in ASCII

Now, how do you get your computer to write the equivalent of "Hello world!" in Polish, Russian, Japanese or Arabic?

The sad answer is: not so easily (yet). On computers, some languages are more equal than others. ASCII simply does not contain all the necessary characters. There are no Cyrillic letters, no Hiragana syllables, no Chinese ideographs. You cannot even write correct Polish with its accented Latin letters, let alone handle Arabic's contextual joining and right-to-left writing direction.

Of course, you could go and transliterate everything into ASCII, but chances are that the results will look somewhere between ugly and unreadable and that you will create ambiguities. The Polish accented c might show up as any of c (flat ASCII, accent stripped), c' (postfixed modifier), 'c (prefixed accent), cx (digraph), c\b' (backspace overstrike), \'c (TeX notation) or &cacute; (SGML entity). Conversely, an ASCII c' can stand for a Polish accented c, a Serbian tshe, a c followed by an apostrophe, and many other things. That's why we need a standard to describe the Polish c' in its most accurate form, with a unique and unambiguous encoding.
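Both halves of the problem, the missing characters and the ambiguity of accent stripping, can be checked with a short experiment. This is only a sketch in Python 3. The helper names fits_in_ascii and strip_accents are made up for illustration, and the c with caron (used, for example, in Czech) is an extra sample letter that the text above does not mention.

```python
import unicodedata

POLISH_C_ACUTE = "\u0107"  # c with acute accent
C_WITH_CARON = "\u010d"    # c with caron, a different letter (assumed sample)


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def strip_accents(text: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fits_in_ascii("c"), fits_in_ascii(POLISH_C_ACUTE))
    # Two distinct letters collapse into the same flat ASCII c:
    print(strip_accents(POLISH_C_ACUTE), strip_accents(C_WITH_CARON))
```

Once the accents are stripped, nothing in the ASCII text tells you which letter was meant.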

Extending ASCII: the charset chaos

Computer history has produced dozens of competing ASCII extensions like the Russian KOI-8, the Japanese JIS X 0201, the European Teletex charset ISO 6937, the European single-byte charset series ISO 8859 and manufacturers' code pages. All of them solved some of ASCII's shortcomings and reduced the problem of writing non-English text to outputting the right numbers. But none of them tried to cover all scripts comprehensively. And these ASCII extensions are so incompatible with each other that you simply cannot expect the mere number sequence =FA=C4=D2=C1=D7=D3=D4=D7=D5=CA=20=CD=C9=D2=21 to be recognized as the Russian "Hello world!" on every computer you send it to. Even in Russia it will be misinterpreted by many systems as something else.
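To see the misinterpretation happen, one can decode that very number sequence under several charsets. The sketch below uses Python 3 and its standard codecs. It assumes the sequence was written in the KOI8-R variant of KOI-8, and the three other charsets are merely plausible candidates picked for comparison. The helper name decode_as is hypothetical.

```python
# The number sequence from the text, without the = signs.
RAW = bytes.fromhex("FAC4D2C1D7D3D4D7D5CA20CDC9D221")

# Assumed candidates: KOI8-R, Windows code page 1251, ISO 8859-5, ISO 8859-1.
CANDIDATES = ("koi8_r", "cp1251", "iso8859_5", "latin_1")


def decode_as(data: bytes, charset: str) -> str:
    """Interpret the same bytes under the given charset."""
    return data.decode(charset, errors="replace")


if __name__ == "__main__":
    for charset in CANDIDATES:
        print(f"{charset:10} {decode_as(RAW, charset)}")
```

Only the KOI8-R reading yields the intended greeting. The other three produce three different strings of unrelated letters from exactly the same bytes.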

Disambiguating charsets: charset announcing

Charset announcing techniques like the ISO 2022 escape sequences and the MIME charset label tried to remedy the risk of misinterpretation, but they bring their own problems. They are stateful encodings: the number =FA means different things depending on which charset we have switched into. They require string encapsulation instead of allowing the exchange of bare strings. They require a charset registration authority, which may do its job too superficially or too slowly. And they allow most texts to be encoded in many different ways, so that you will have a hard time finding all occurrences of the Russian "Hello world!" in any of its possible encodings in a gigabyte of multilingual text, even if you have a fault-tolerant pattern matcher like agrep that allows for some orthographic errors. If you want to implement ISO 2022 or MIME charset switching well, you will end up implementing some internal one-code mapping, so you might just as well use the most prominent universal code right away: Unicode.
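The statefulness is easy to observe. The following sketch uses Python 3's iso2022_jp codec as one readily available member of the ISO 2022 family. That choice is an assumption made for illustration: it is a 7-bit encoding, so it demonstrates the principle with different numbers than the =FA mentioned above. The Hiragana greeting is likewise just a sample.

```python
GREETING = "\u3053\u3093\u306b\u3061\u306f"  # Hiragana "konnichiwa"

ENCODED = GREETING.encode("iso2022_jp")

SWITCH_TO_JIS = b"\x1b$B"    # escape sequence: enter two-byte JIS X 0208
SWITCH_TO_ASCII = b"\x1b(B"  # escape sequence: return to ASCII

# The bytes between the two escape sequences.
PAYLOAD = ENCODED[len(SWITCH_TO_JIS):-len(SWITCH_TO_ASCII)]

if __name__ == "__main__":
    print(ENCODED)
    # The same payload bytes, read in two different states:
    print(PAYLOAD.decode("iso2022_jp"))  # initial ASCII state: punctuation and letters
    print(ENCODED.decode("iso2022_jp"))  # after the switch: the Hiragana greeting
```

Without the surrounding escape sequences the payload is indistinguishable from plain ASCII text, which is why such strings cannot be exchanged or searched bare.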

Disambiguating characters: Unicode

The ISO 10646 Universal Character Set (Unicode) is supposed to contain all characters you might require for typing standard text and all characters found in other standard charsets. Each character gets a unique integer number in Unicode. A different chapter will tell you more about what characters are in Unicode, and another one about how Unicode gets around the 8-bit and 16-bit barriers for its integer numbers.
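The one-character-one-number principle can be inspected directly. Here is a minimal sketch in Python 3, whose strings are sequences of Unicode characters. The helper name describe and the choice of sample characters are assumptions made for illustration.

```python
import unicodedata

# Sample characters: Latin H, Polish c with acute, Serbian tshe, Hiragana ko.
SAMPLES = ["H", "\u0107", "\u045b", "\u3053"]


def describe(ch: str) -> str:
    """Return the character's Unicode number and its official name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch))
```

The capital H keeps its ASCII number 48 hexadecimal, while the Polish accented c and the Serbian tshe, which both ended up as c' in the transliteration example, each get a number of their own.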

If Unicode support were so advanced that all Linux boxes around the world had fully implemented it, you could simply walk up to any computer, no matter where you work, study or vacation, and read and write any language with the correct characters, and also assume that what you send out can be understood, quoted and printed.

There would be no more worrying about limiting locales or about incompatible or incomplete charsets and fonts, no more misunderstandings because of them, and no more negotiating over how to encode text. One size simply fits all. And all text is encoded in its most natural form. Unicode can become the communication standard of the global village, accepting and preserving the multitude of human languages, and assigning to each alphabet its very own wavelength to prevent distortions.

Unicode has the potential to enable every foreigner, every linguist, every journalist, every librarian, every mathematician and every internationally operating company to write, store, send, publish, search and read text using the correct characters, including accents and symbols, in a plain WYSIWYG fashion.

Unfortunately, Unicode support is not that far along. Seven years after the publication of Unicode 1.0 in 1991, there is still no system offering a complete and comfortable input-output implementation for the entire Unicode range. And not even the existing half-complete implementations are universally available on every computer. This sad state of affairs is what I set out to analyze...

Roman Czyborra
June 2, 1998