Another fine web page describing these subjects is Jukka Korpela's tutorial on character code issues.
An abstract character is a unit of textual information like the U+0041 LATIN CAPITAL LETTER A ('A'). Exactly how the textual information is broken into units and which units are identical is in many cases open to scientific debate and standardization.
A coded character set (CCS) is a well-defined mapping from integer numbers (code space, code points, code positions, code values, character numbers) to abstract characters.
A small example to play with the terminology: Let ABC := {(65,'A'),(66,'B'),(67,'C')}. ABC would be a CCS by this definition because it is a mapping (table) in set notation. Incidentally, ABC is a very small subset of the ASCII code. The character value of 65 is ABC(65)='A'. The code value of 'A' is 65. In ABC the code points 65, 66, and 67 make up the "code space" {65,66,67} which is completely occupied with character definitions. Within that short range there is no more free code space because none of the integer numbers in {65,66,67} could be mapped to yet another character seeking asylum without deleting a definition from ABC.
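Just to make the lookup direction tangible, here is a minimal C sketch (the function name abc() is made up, this is no standard API) that implements the toy CCS ABC as a table lookup from code points to characters:

#include <stdio.h>

char abc(int codepoint)               /* the toy CCS from above */
{
    static const char table[] = { 'A', 'B', 'C' };
    if (codepoint < 65 || codepoint > 67)
        return 0;                     /* outside the code space */
    return table[codepoint - 65];     /* ABC(65) == 'A' */
}

int main(void)
{
    printf("%c\n", abc(65));          /* prints A */
    return 0;
}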
To avoid confusion, the corresponding simple set of characters such as ABC's character range {'A','B','C'} is not called a character set but a character repertoire.
The UCS is supposed to contain all other coded character sets like ISO-8859-2 as repertoire subsets (it also encodes all of their characters), but only US-ASCII, ISO-8859-1 and MES / SECS are real code subsets with identical character numbers.
Downgrading text to a smaller repertoire may require transliteration.
A CCS like US-ASCII or ISO-8859-1 with 256 or fewer characters and no integer value above 255 can easily serve as a single-byte 8bit charset where each octet of 8 bits (byte) is taken as a binary number to look up the one coded character it represents: 01000001 -> 65 -> 'A'.
The notion of an 8bit byte is deeply ingrained into today's computer systems and their communication protocols. Files and memory sizes are measured and addressed in bytes, strings are represented in char[] arrays, and the read() and write() system calls and most network protocols such as TCP transmit 8bit bytes.
You will probably waste your energy and create lots of incompatibilities if you try to change the byte size to 16, 20, 32 or 64 bits even though today's microprocessors could handle such quantities. Besides the incompatibilities there is also the argument that it is wasteful to have one character occupy 16 or 32 bits instead of 8 bits because that would double or quadruple file sizes and memory images.
Any rich CCS with more than 256 characters (as seen in China, Japan and Korea) thus needs a more or less complex character encoding scheme (CES, encoding, multibyte encoding, transformation format) so that a byte sequence (octet stream, multibyte string) can represent a sequence of larger integers (wcs, wide character string) that can then be mapped through a CCS to a sequence of abstract characters called meaningful text.
As stated in RFC 2278, the combination of CCS + CES is labeled as "charset" in MIME context. Many charsets have the same underlying CCS and repertoire: ISO-2022-KR and EUC-KR both encode some linearization of the 94x94 Korean national standard table KS C 5601, and UTF-16, UTF-8, and SCSU all encode the linear CCS Unicode.
The original Unicode design suggested extending the character size to a fixed length of 16 bits just like ISO-8859-1 uses a fixed length of 8 bits. And there are indeed a couple of systems (namely Windows NT and CE) that frequently read and write 16bit characters.
A fixed length of 16 bits has the problem that only 2^16 == 65'536 characters can be encoded. And the original estimates that this number would be big enough to hold everything the world needs are currently being proven wrong as you can see in Michael Everson's allocation roadmap and the Unicode allocation pipeline.
In order to soothe the code space scarcity ISO 10646 Amendment 1 redefined the range {U+D800..U+DFFF} of (formerly private-use) 16-bit characters as "surrogates" to reference characters from the 20-bit range {U-00010000..U-0010FFFF} which gives us 16 additional 16-bit "planes". (In ISO terminology, our 32-bit code space is divided into 256 Groups of 256 Planes of 256 Rows of 256 character cells each - a "plane" is a 16-bit code space). Plane 1 (U-0001xxxx) is going to hold ancient and invented scripts and musical symbols, while Plane 2 (U-0002xxxx) is reserved for additional Han ideographs, Plane 14 (U-000Exxxx) is going to start with some meta characters for language tagging and there are two entire bonus private-use planes: Plane 15 (U-000Fxxxx) and Plane 16 (U-0010xxxx).
The characters needed for writing today's living languages are still supposed to be placed in the original Unicode Plane 0 (U+xxxx, also known as Basic Multilingual Plane or BMP) so that simpler and pre-UTF-16 implementations don't necessarily have to decode the surrogate pairs.
The UTF-16 reduction of the 20 bit range to 16 bit surrogate pairs goes like this:
putwchar(c)
{
  if (c > 0xFFFF) {
    putwchar (0xD7C0 + (c >> 10));   /* leading high surrogate */
    putwchar (0xDC00 | c & 0x3FF);   /* trailing low surrogate */
  }
This has the effect that
\uD800\uDC00 = U-00010000
\uD800\uDC01 = U-00010001
\uD801\uDC01 = U-00010401
\uDBFF\uDFFF = U-0010FFFF
You can always distinguish the leading high surrogates {U+D800..U+DBFF} from the following low surrogates {U+DC00..U+DFFF} so that long stretches of UTF-16 surrogate pairs are self-segregating just like the stretches of UTF-8 multibyte characters.
However, note that UTF-16 does have an additive offset to get \uD800\uDC00 to encode U-00010000 instead of simply the 10 + 10 bitmask bitshift concatenation
[1101 10]00 0000 0000 [1101 11]00 0000 0000 != U-00000000
The simple UTF-8-like bitmask concatenation would have encoded a closed 20-bit space {U-00000000..U-000FFFFF}. For my taste, that would have sufficed. Now that UTF-16 has the additive offset to skip the first 2^16 characters, we get 2^16 - 2^11 + 2^20 = 1'112'064 encodable characters which is a bit more than 2^20, about 2^20.1. That means that the private use characters of the 17th plane U-0010xxxx accessed through the high surrogates {U+DBC0..U+DBFF} need an odd 21st bit that may only be set if the following four bits are zero.
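Decoding goes the other way: a reader only has to recognize a leading high surrogate and fold the following low surrogate back in. A minimal sketch, with a made-up function that reads from an array of 16-bit units:

long getwchar16(const unsigned short *s, int *len)
{
    unsigned long c = s[0];
    if (c >= 0xD800 && c <= 0xDBFF) {            /* leading high surrogate */
        *len = 2;                                /* consume the low surrogate too */
        return ((c - 0xD7C0) << 10) | (s[1] & 0x3FF);
    }
    *len = 1;
    return c;                                    /* plain BMP character */
}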
Another problem with 16-bit characters is that they are fine on 16-bit systems but suffer from byte-order problems whenever they have to be serialized into 8bit bytes for storage or transmission, because there are little-endian processors that put the lower-value byte at the lower address (and thus first) and there are big-endian processors that put the higher byte first like we write our numbers.
The naïve approach
unsigned short int unicodechar = 0x20AC;
write (1, &unicodechar, sizeof(unicodechar));

will output little-endian =AC=20 on an Intel PC or DECstation and big-endian network byte order =20=AC on a Sparc.
The Unicode Standard specifies that the more human-readable big-endian network byte order is supposed to be used for UTF-16. To ensure that byte-order, our UTF-16 putwchar() function thus ends like this:
  else {
    putchar (c >> 8);      /* higher byte first (network byte order) */
    putchar (c & 0xFF);    /* lower byte second */
  }
}
Annex F of ISO 10646-1:1993 and section 2.4 of Unicode 2.0 recommend that UTF-16 texts start with the no-op character U+FEFF ZERO WIDTH NO-BREAK SPACE as a byte-order mark (BOM), so that byte-swapped UTF-16 text from haphazard programs on little-endian Intel or DEC machines can be recognized by its =FF=FE signature (U+FFFE is guaranteed to be no Unicode character).
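A minimal sketch (the function name is made up) of how a reader could use that signature to decide whether an incoming UTF-16 byte buffer needs byte-swapping:

int utf16_needs_byteswap(const unsigned char *buf, int len)
{
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return 1;      /* =FF=FE signature: byte-swapped little-endian text */
    return 0;          /* =FE=FF or no BOM: big-endian network byte order */
}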
The limited UTF-16 without the surrogate mechanism is called UCS-2: the two-byte subset, which is identical to the Basic Multilingual Plane of the original Unicode.
With two bytes per ideograph or syllable, UTF-16 is considered a minimally compact encoding for East Asian or Ethiopic text. Many applications including Xlib, Yudit and Java use UCS-2 or UTF-16 as internal wchar_t because they don't want to waste four bytes per character as UCS-4 does.
But with two bytes instead of one per letter, UTF-16 is also considered a bloated size-doubling encoding for ASCII or ISO-8859 texts which is why you might prefer SCSU for storage or transmission.
Yet the worst problem with UTF-16 is that its byte stream often contains null bytes (for example in front of every ASCII character) and values that misleadingly look like other meaningful ASCII characters. That means that you cannot simply send UTF-16 text through your mail server, C compiler, or shell, nor use it in filenames or in any application using the C string functions such as strlen().
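A small self-contained demonstration of the problem, using the big-endian serialization from above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* U+0041 U+00DF ("Aß") in big-endian UTF-16: every other byte is null */
    const char utf16[] = "\x00\x41\x00\xDF";
    printf("%d\n", (int)strlen(utf16));   /* prints 0 instead of 2 */
    return 0;
}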
Unix systems with their rich heritage of traditional software are simply based on US-ASCII or ASCII-compatible extensions like ISO-8859 or EUC. For example, they have no stty pass16 or stty cs16 command to simply expand the terminal's character size to 16 bits. Therefore people prefer ASCII-compatible 8-bit transformation formats for which they can simply say stty pass8. Or 7-bit ASCII representations of the Unicode characters.
The most straightforward serialization is UCS-4, which simply writes out each character number in four big-endian bytes:

putwchar(c)
{
  putchar (c >> 24 & 0xFF);  /* group */
  putchar (c >> 16 & 0xFF);  /* plane */
  putchar (c >> 8 & 0xFF);   /* row */
  putchar (c & 0xFF);        /* cell */
}
The byte-order mark U+FEFF turns into the UCS-4 signature =00=00=FE=FF or little-endian byte-permuted into =FF=FE=00=00.
This format will be able to express all UCS character values even if they lie far beyond UTF-16's reach. The UCS-4 format is binary transparent and knows no illegal sequences. All it requires is that the string lengths are multiples of 4 bytes. UCS-4 can easily be processed on the 32-bit computers of the 1990s using the default data type int. This is the internal wide character data type wchar_t of most Unix boxes.
Unfortunately, it has the same null-byte, space-waste and byte-order problems as UTF-16, only worsened by a much bigger number of redundant null bytes. That's why you will hardly ever get to see UCS-4 text out in the wild.
The East Asian EUC encodings, on the other hand, leave the ASCII bytes alone and shift both bytes of their national 94x94 character sets into the upper range, roughly like this:

putwchar (c)
{
  if (c < 0x80) {
    putchar (c);                    /* ASCII character */
  } else {
    putchar (0xA0 + (c >> 8));      /* ku (row) */
    putchar (0xA0 + (c & 0x7F));    /* ten (cell) */
  }
}
ASCII characters thus represent themselves and nothing else. And the square characters are represented by a pair of 8bit bytes each which gave them the name double-byte characters. If you stuff your ASCII characters into fixed-width boxes that equal half of a 14x14, 16x16 or 24x24 pixel square, like cxterm and kterm and many other applications do, you even get the impression that double-byte characters are also double-width. That is because the width stays proportional to the byte count in EUC in spite of the reduced character count. You can put some 40 square characters in one standard screen line that otherwise is 80 ASCII characters wide.
As the first and second byte of a double-byte character both use the same {=A1..=FE} range of values, you cannot easily tell the one from the other and recognize the character boundaries in the middle of a long stretch of 8bit bytes. You have to back up to the first preceding 7bit ASCII byte which might for example be the linefeed \n, use it as synchronization point and then start counting pairs forwards from there. A regular expression search for the =BB=CC square could detect a false positive within the square sequence =AA=BB =CC=DD and cause harm to it.
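As a sketch of that resynchronization, a made-up helper function could decide whether a given position in an EUC byte string starts a character by backing up to the last ASCII byte and counting:

int euc_char_boundary(const unsigned char *s, int pos)
{
    int i = pos;
    while (i > 0 && s[i-1] >= 0xA1 && s[i-1] <= 0xFE)
        i--;                          /* back up over the 8bit bytes */
    return ((pos - i) % 2) == 0;      /* even distance from ASCII: boundary */
}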
Another problem with the {=A1..=FE} range is that it only contains 94 values, so that the character set is limited to 94 x 94 == 8'836 squares which is not enough for the Unicode repertoire. 94 x 94 x 94 == 830'584 would however be enough to cover 16bit Unicode. And EUC does allow three-byte characters. But I have never seen any serious proposal for such an EUC-UN. The transformation arithmetic from the linear Unicode axis to the odd 94-cube is perhaps considered too computation-intensive. It might have turned out differently had Unicode not stopped the original draft UCS DIS-1 10646 of 1990 that was still trying to stick to the ISO 2022 code extension techniques for 7/8-bit coded character sets.
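For illustration only, a hypothetical EUC-UN putwchar() along the lines sketched above might look like this, with its divisions and modulos by 94 (this is not a standard, just a guess at the arithmetic involved):

putwchar(c)
{
  if (c < 0x80) {
    putchar (c);                    /* ASCII */
  } else {
    putchar (0xA1 + c / (94*94));   /* first byte  in {=A1..=FE} */
    putchar (0xA1 + c / 94 % 94);   /* second byte in {=A1..=FE} */
    putchar (0xA1 + c % 94);        /* third byte  in {=A1..=FE} */
  }
}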
Instead, the informative Annex G of ISO 10646-1:1993 defined the transformation format UTF-1:

#define T(z) (z<94?z+33:z<190?z+66:z<223?z-190:z-96)
putwchar(c)
{
  if (c < 160) {
    putchar(c);
  } else if (c < 256) {
    putchar(160);
    putchar(c);
  } else if (c < 16406) {
    putchar(161+(c-256)/190);
    putchar(T((c-256)%190));
  } else if (c < 233005) {
    putchar(246+(c-16406)/190/190);
    putchar(T((c-16406)/190%190));
    putchar(T((c-16406)%190));
  } else {
    putchar(252+(c-233005)/190/190/190/190);
    putchar(T((c-233005)/190/190/190%190));
    putchar(T((c-233005)/190/190%190));
    putchar(T((c-233005)/190%190));
    putchar(T((c-233005)%190));
  }
}
This UTF-1 has a number of serious disadvantages: the modulo-190 arithmetic is slow, the trail bytes reuse the ASCII range {=21..=7E} so that an innocent-looking byte like the slash '/' can appear inside a multibyte character, and the trail bytes also overlap with the lead-byte range so that character boundaries cannot be recognized in the middle of a byte stream.
The use of UTF-1 is deprecated. Annex G (informative) has formally been deleted from ISO-10646 although the page has probably not been ripped out at your local library. Glenn Adams' 1992 implementation utf.c is still floating around.
UTF-8 took its place and works like this:

putwchar(c)
{
  if (c < 0x80) {
    putchar (c);
  } else if (c < 0x800) {
    putchar (0xC0 | c>>6);
    putchar (0x80 | c & 0x3F);
  } else if (c < 0x10000) {
    putchar (0xE0 | c>>12);
    putchar (0x80 | c>>6 & 0x3F);
    putchar (0x80 | c & 0x3F);
  } else if (c < 0x200000) {
    putchar (0xF0 | c>>18);
    putchar (0x80 | c>>12 & 0x3F);
    putchar (0x80 | c>>6 & 0x3F);
    putchar (0x80 | c & 0x3F);
  }
}
The binary representation of the character's integer value is thus simply spread across the bytes and the number of high bits set in the lead byte announces the number of bytes in the multibyte sequence:
bytes | bits | representation
    1 |    7 | 0vvvvvvv
    2 |   11 | 110vvvvv 10vvvvvv
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
(Actually, UTF-8 continues to represent up to 31 bits with up to 6 bytes, but it is generally expected that the one million code points of the 20 bits offered by UTF-16 and 4-byte UTF-8 will suffice to cover all characters and that we will never get to see any Unicode character definitions beyond that.)
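Going back from bytes to character numbers is just as mechanical. Here is a minimal decoding sketch (function name made up, no checking for malformed sequences) that reads one UTF-8 character from a byte string and reports how many bytes it consumed:

long getwchar_utf8(const unsigned char *s, int *len)
{
  if (s[0] < 0x80)      { *len = 1; return s[0]; }
  else if (s[0] < 0xE0) { *len = 2; return (long)(s[0] & 0x1F) << 6
                                         |        (s[1] & 0x3F); }
  else if (s[0] < 0xF0) { *len = 3; return (long)(s[0] & 0x0F) << 12
                                         | (long)(s[1] & 0x3F) << 6
                                         |        (s[2] & 0x3F); }
  else                  { *len = 4; return (long)(s[0] & 0x07) << 18
                                         | (long)(s[1] & 0x3F) << 12
                                         | (long)(s[2] & 0x3F) << 6
                                         |        (s[3] & 0x3F); }
}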
The UTF-8 design pays off with a number of attractive features: ASCII transparency, unambiguity, self-segregation, preservation of the code-point sort order, and a reasonably compact two or three bytes for most non-ASCII characters.
The UTF-8 standardization is a success story that led to the abandonment of UTF-1 and made UTF-8 the prime candidate for external multibyte representation of Unicode texts:
Ken Thompson and Rob Pike (from the famous AT&T Bell Laboratories who had also invented Unix and C) describe in their 1993 Usenix presentation "Hello world or Καλημέρα κόσμε or こんにちは 世界" (your browser supports internationalized HTML if you see some Greek and Japanese here) how the UTF-8 format was crafted (together with the X/Open Group who were writing the internationalized XPG4 portability guide, forefather of today's Unix98 specification) to suit the needs of their new Plan9 operating system. Ken Thompson provided his commented sample implementation fss-utf.c in 1992.
UTF-8 was formally adopted as normative Annex R in Amendment 2 to ISO-10646 in 1996. UTF-8 was also included in The Unicode Standard, Version 2.0, as Appendix A.2, accompanied by Mark Davis' CVTUTF implementation. Also in 1996, François Yergeau's RFC 2044 defined the MIME charset label "UTF-8" which is now on the Internet standards track as updated RFC 2279.
Newer Internet protocols that want to do without charset labeling tend to use UTF-8 as default text encoding, like RFC 1889 RTP, RFC 2141 URN Syntax, RFC 2192 IMAP URL Scheme, RFC 2218 IWPS, RFC 2229 Dictionary Server Protocol, RFC 2241 DHCP, RFC 2244 ACAP, RFC 2251..2255 LDAP, RFC 2261 SNMP, RFC 2284 PPP, RFC 2295 HTTP, RFC 2324 HTCPCP, RFC 2326 RTSP, RFC 2327 SDP, RFC 2376 XML, ...
The mail internationalization report produced by the Internet Mail Consortium recommends since 1998-08-01 that
All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities.
The last but not least argument in favor of UTF-8 is that there is a small but growing number of Unix tools with special UTF-8 handling today already:
text/plain; xviewer yudit < %s; test=case %{charset} in \
    [Uu][Tt][Ff]-8) [ yes ] \;\;\
    *) [ UTF-8 = no ] \; esac

Yudit also comes with a code conversion tool named uniconv that you can use as a replacement for Plan9's unsupported tcs and can be plugged into your .pinerc with:
display-filters=_CHARSET(UTF-8)_ /usr/bin/uniconv -I UTF8 -O JAVA, _CHARSET(UTF-7)_ /usr/bin/uniconv -I UTF7 -O JAVA
Roman Czyborra (Latin) = 'rOman tSi"'bOra (IPA) = Roman CHibora (Cyrillic) = TiBoRa RoMaN6 (Japanese) = roman cibora (Ethiopic) = R+W+M+N% ZJ'J+B+W+R+H+ (Hebrew) = r+w+m+/+n+ t+sny+b+w+r+h+ (Arabic) = UB85CUB9CC UCE58 (Korean).
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8">

Mozilla has brought encapsulated MIME and Unicode rendering capabilities to the masses. Together with some input methods, it might grow into a full-blown UTF-8 mail, news and WWW messaging system. Frank Tang's unixprint.html contains some thoughts about how to make Mozilla print more than ISO-8859-1 PostScript and links to useful documents.
echo 0x20AC | recode -f ucs2/x2..utf8/qp

gives: =E2=82=AC=
A pair of little Perl filter scripts can convert between an ASCII <Uxxxx> notation and UTF-8:

#!/usr/local/bin/perl -p
# assemble <U20AC> into €
sub utf8 {
    local($_)=@_;
    return $_ < 0x80 ? chr($_) :
           $_ < 0x800 ? chr($_>>6&0x3F|0xC0) . chr($_&0x3F|0x80) :
           chr($_>>12&0x0F|0xE0).chr($_>>6&0x3F|0x80).chr($_&0x3F|0x80);
}
s/<U([0-9A-F]{4})>/&utf8(hex($1))/gei;

#!/usr/local/bin/perl -p
# disassemble non-ASCII codes from UTF-8 stream
$format=$ENV{"UCFORMAT"}||'<U%04X>';
s/([\xC0-\xDF])([\x80-\xBF])/sprintf($format,
  unpack("c",$1)<<6&0x07C0|unpack("c",$2)&0x003F)/ge;
s/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
  unpack("c",$1)<<12&0xF000|unpack("c",$2)<<6&0x0FC0|unpack("c",$3)&0x003F)/ge;
s/([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
  unpack("c",$1)<<18&0x1C0000|unpack("c",$2)<<12&0x3F000|
  unpack("c",$3)<<6&0x0FC0|unpack("c",$4)&0x003F)/ge;

The module use utf8; will let you simplify these tasks with future Perl releases. Larry Wall is adding native Unicode and XML support to Perl together with the perl-unicode@perl.org experts.
Yet more needs to be done:
There is hope that Emacs 21 will turn to Unicode completely. Richard Stallman is being advised by the emacs-unicode@gnu.org circle. Since Emacs is not just an editor but a programmable Lisp environment its Unicode support will help a lot to read and write web pages, mails and news in UTF-8 and run shells and other tools in a UTF-8 environment. Emacs has already been using 24-bit-characters (atoms) internally for quite a while.
Other operating systems support UTF-8 as well. Plan9 and BeOS use it as their native representation, Windows NT knows UTF-8 as code page 1208.
With all that hype about UTF-8 having so many advantages and being standardized and supported by implementations, you may find yourself asking what the catch is. And there are indeed things you can criticize about UTF-8:
[Comparison table: the Latin1 characters U+00A0..U+00FF side by side in their Latin1, UTF-1, UTF-8, UTF-7,5, UTF-7, JAVA and HTML representations, e.g. ¡ = +AKE- = \u00a1 = &#161;.]
A variant called UTF-7,5 distributes the bits like this instead:

bytes | bits | representation
    1 |    7 | 0vvvvvvv
    2 |   10 | 1010vvvv 11vvvvvv
    3 |   16 | 1011vvvv 11vvvvvv 11vvvvvv

putwchar(c)
{
  if (c < 0x80) {
    putchar (c);
  } else if (c < 0x400) {
    putchar (0xA0 | c>>6);
    putchar (0xC0 | c & 0x3F);
  } else if (c < 0x10000) {
    putchar (0xB0 | c>>12);
    putchar (0xC0 | c>>6 & 0x3F);
    putchar (0xC0 | c & 0x3F);
  } else if (c < 0x110000) {
    putwchar (0xD7C0 + (c >> 10));
    putwchar (0xDC00 | c & 0x3FF);
  }
}
This UTF shares most of UTF-8's nice and not-so-nice properties, including ASCII and sort-order transparency, unambiguity, self-segregation, and 2/3-byte compactness, and adds some Latin1 transparency without any of UTF-1's disadvantages. It avoids the C1 controls and simply uses Latin1 character sequences to represent all non-ASCII characters. That means that it will not mess up terminals that use the C1 controls and that it allows UTF strings to be cut and pasted between Latin1 applications. Besides that, the UTF-7,5 representation of Latin1's accented letters contains the original code prefixed by a pound sign (£), which means that it remains readable in Latin1 applications.
The price paid for this is that the representations of the Cyrillic, Armenian, Hebrew and Arabic letters {U+0400..U+07FF} grow from two to three bytes in length, =FE and =FF are no longer avoided, and UTF-16 surrogates have to be used for characters beyond the 16-bit range, which could be helped by a scheme like
   16 | 1010vvvv 11vvvvvv 11vvvvvv
   22 | 1011vvvv 11vvvvvv 11vvvvvv 11vvvvvv
Unlike UTF-8, UTF-7,5 is not officially registered as a standard encoding with ISO, Unicode, or IANA. The name "UTF-7,5" or "UTF-7½" is not even a legal charset name by RFC 2278; a better name would be "UTF-3", "UTF-5", "UTF-6", or "UTF-9". Because it is not accepted as a standard, you are confined to using UTF-7,5 in your own backyard and with your own implementations, although its Latin1 compatibility would make it a prime candidate for information interchange with other systems. Had UTF-7,5 been suggested 4 years earlier, it could have taken the place of UTF-8 as the standard interchange encoding and we would have fewer problems with Unicode text. But now it is probably wiser to invest energy into making the affected ISO-8859-1 tools ready for the UTF-8 standard instead of trying to promote UTF-7,5 as a new UTF-9 standard.
All of the above UTFs produce 8bit bytes that are not in ASCII and that will get stripped on any terminal that is still set to character size 7 or any mail gateway that ensures RFC 822's rule that mail messages have to be in ASCII. To solve that problem, David Goldsmith and Mark Davis invented a mail-safe transformation format UTF-7. It was first published in RFC 1642 in 1994, prominently included as Appendix A.1 in The Unicode Standard, Version 2.0, and now updated in RFC 2152. It makes partial use of the MIME base64 encoding and goes roughly like this:
char base64[]=
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
putwchar(c)
{
  if (c == '+') {
    putchar('+');
    putchar('-');
  } else if (c < 0x80) {
    putchar(c);
  } else if (c < 0x10000) {
    putchar('+');
    putchar(base64[c>>10&63]);
    putchar(base64[c>>4&63]);
    putchar(base64[c<<2&63]);
    putchar('-');
  } else if (c < 0x110000) {
    c = 0xD7C0DC00 + (c >> 10 << 16 | c & 0x3FF);   /* surrogate pair */
    putchar('+');
    putchar(base64[c>>26&63]);
    putchar(base64[c>>20&63]);
    putchar(base64[c>>14&63]);
    putchar(base64[c>>8&63]);
    putchar(base64[c>>2&63]);
    putchar(base64[c<<4&63]);
    putchar('-');
  }
}
Except for the '+' escaping, ASCII text remains unchanged with UTF-7. In some situations, the trailing '-' is optional. And by joining a whole stretch of non-ASCII characters into a larger base64 block you can encode an average of 3 Unicode characters in 8 bytes which is much better than the 9 bytes "=E5=A4=A9" for 1 CJK ideograph 天 in quoted-printable UTF-8.
However, base64 or 8bit SCSU can achieve much better compression, and UTF-7 is a bad general-purpose processing format: its flickering base64 grouping is awkward to program, most ASCII values can stand for almost any character and there are many different possible UTF-7 encodings of the same character so that UTF-7 is practically unsearchable without conversion.
The \u notation known from Java escapes characters as a backslash, a small u and four hexadecimal digits:

putwchar(c)
{
  if (c >= 0x10000) {
    printf ("\\u%04x\\u%04x", 0xD7C0 + (c >> 10), 0xDC00 | c & 0x3FF);
  } else if (c >= 0x100) {
    printf ("\\u%04x", c);
  } else {
    putchar (c);
  }
}
The advantage of the \u20ac notation is that it is very easy to type on any old ASCII keyboard and easy to look up the intended character if you happen to have a copy of the Unicode book or the {unidata2,names2,unihan}.txt files from the Unicode FTP site or CD-ROM, or simply know that U+20AC is the €.
What's not so nice about the \u20ac notation is that the small letters are quite unusual for Unicode characters, the backslashes have to be quoted for many Unix tools, the four hexdigits without a terminator may appear merged with the following word as in \u00a333 for £33, it is unclear when and how you have to escape the backslash character itself, 6 bytes for one character may be considered wasteful, there is no way to clearly present the characters beyond \uffff without \ud800\udc00 surrogates, and last but not least the plain hexnumbers may not be very helpful.
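For reading the notation back in, a small sketch (the function name is made up) could consume exactly four hex digits to sidestep the \u00a333 ambiguity and pair up surrogates the same way as above:

#include <stdio.h>

long parse_u_escape(const char *s)    /* s points at a "\u20ac..." sequence */
{
    unsigned long c = 0, low = 0;
    sscanf(s + 2, "%4lx", &c);        /* exactly four hex digits */
    if (c >= 0xD800 && c <= 0xDBFF) { /* high surrogate: read the second \u */
        sscanf(s + 8, "%4lx", &low);
        c = ((c - 0xD7C0) << 10) | (low & 0x3FF);
    }
    return c;
}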
JAVA is one of the target and source encodings of yudit and its uniconv converter.
HTML and SGML offer numeric character references as yet another pure-ASCII representation:

putwchar(c)
{
  if (c < 0x80 && c != '&' && c != '<')
    putchar(c);
  else
    printf ("&#%d;", c);
}
Decimal numbers for Unicode characters are also used in Windows NT's Alt-12345 input method but are still of so little mnemonic value that a hexadecimal alternative &#x1BC; (Ƽ) is being supported by the newer standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers aren't that easy to memorize either. SGML has long allowed symbolic character entities for some character references like &eacute; for é and &euro; for the €, but the table of supported entities differs from browser to browser.
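Reading the numeric character references back in is equally simple; a made-up sketch that accepts both the decimal and the newer hexadecimal form:

#include <stdlib.h>

long parse_charref(const char *s)     /* s points just behind the "&#" */
{
    if (*s == 'x' || *s == 'X')
        return strtol(s + 1, NULL, 16);   /* &#x20AC; hexadecimal form */
    return strtol(s, NULL, 10);           /* &#8364;  decimal form */
}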
This page may contain some errors and omissions in order to provoke your constructive criticism ♆
Roman Czyborra
$Date: 1998/11/30 19:33:40 $