Codepage & Co.

In the early 1980s there were still no agreed international standards such as ISO-8859 or Unicode for extending US-ASCII for international users, so many manufacturers invented their own encodings, identified by hard-to-memorize numbers:

MS-DOS code pages

CP437 (DOSLatinUS)

The industry-standard IBM Personal Computer started out with the famous code page CP437 with lots of box-drawing characters and a select few foreign letters:

charset=cp437 [TXT] [BDF]

CP850 (DOSLatin1)

Some later MS-DOS versions allowed switching the code page on VGA graphics cards to something like CP850, which presents the Latin1 repertoire in positions compatible with CP437 so that line-drawing still worked:

charset=cp850 [TXT] [BDF]
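
The position compatibility can be checked with a short script. This is only a sketch: it assumes that Python's built-in cp437 and cp850 codecs match the tables shown here, and the helper name shared_positions is made up for illustration.

```python
def shared_positions(codec_a, codec_b):
    """Return the byte values 0x80..0xFF that both codecs map to the same character."""
    shared = set()
    for value in range(0x80, 0x100):
        raw = bytes([value])
        if raw.decode(codec_a) == raw.decode(codec_b):
            shared.add(value)
    return shared


same = shared_positions("cp437", "cp850")

# The six corner and edge pieces of a single-line frame, as CP437 byte values.
frame = bytes([0xDA, 0xC4, 0xBF, 0xB3, 0xC0, 0xD9])

print(frame.decode("cp437"))
print(frame.decode("cp850"))
print("frame survives the switch:", all(b in same for b in frame))
print("0xB5 survives the switch:", 0xB5 in same)
```

Single-line and double-line frames come through unchanged, while a mixed single/double piece such as 0xB5 is one of the positions CP850 gave up for Latin1 letters.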

CP852 (DOSLatin2)

CP852 did the same for Latin2 (Eastern Europe):

charset=cp852 [TXT] [BDF]

CP855 (DOSCyrillic)

CP855 was introduced as the corresponding Cyrillic codepage:

charset=cp855 [TXT] [BDF]

CP866 (DOSCyrillicRussian)

CP855 was soon followed by CP866, which follows the more logical Russian alphabet ordering of the alternativny variant preferred by many Russian users:

charset=cp866 [TXT] [BDF]
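
One way to see what "more logical ordering" means is to sort the Russian letters by their encoded byte values. The sketch below assumes Python's built-in cp855 and cp866 codecs; RUSSIAN and sorted_by_bytes are names chosen for this example, and the letter Ё is left out because Unicode itself places it outside the main alphabet block.

```python
# U+0410..U+044F: the capitals А..Я followed by the small letters а..я (without Ё/ё).
RUSSIAN = [chr(cp) for cp in range(0x0410, 0x0450)]


def sorted_by_bytes(letters, codec):
    """Sort letters by the byte value each one gets in the given code page."""
    return sorted(letters, key=lambda ch: ch.encode(codec))


cp866_order = sorted_by_bytes(RUSSIAN, "cp866")
cp855_order = sorted_by_bytes(RUSSIAN, "cp855")

print("CP866 byte order is alphabetical:", cp866_order == RUSSIAN)
print("CP855 byte order is alphabetical:", cp855_order == RUSSIAN)
print("CP855 starts with:", "".join(cp855_order[:8]))
```

A plain byte-wise sort therefore already yields a usable alphabetical order for CP866 text, which is not the case for CP855.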

The even more widely used Cyrillic charset KOI8-R was later numbered CP878.

CP874 (DOSThai)

Microsoft's Thai CP874 also follows an established standard, namely TIS-620, but adds non-standard characters in unused positions:

charset=cp874 [TXT] [BDF]
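
The relationship to TIS-620 can be probed byte by byte. This sketch assumes Python's built-in cp874 and tis_620 codecs reflect the two tables; decode_or_none and THAI_POSITIONS are illustrative names, and the position ranges are the ones where both tables define Thai characters.

```python
def decode_or_none(value, codec):
    """Decode a single byte value, or return None if the codec leaves it undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Positions that carry Thai letters, digits, and signs in TIS-620.
THAI_POSITIONS = list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC))

thai_agrees = all(
    decode_or_none(v, "cp874") == decode_or_none(v, "tis_620")
    for v in THAI_POSITIONS
)

# Printable characters that CP874 places in the 0x80..0x9F range.
extras = {}
for v in range(0x80, 0xA0):
    ch = decode_or_none(v, "cp874")
    if ch is not None:
        extras[v] = ch

print("Thai positions agree:", thai_agrees)
for v, ch in sorted(extras.items()):
    print(f"={v:02X} U+{ord(ch):04X}")
```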

CP737..CP869

Now I have spared you the gory details of

CP737: DOSGreek
CP775: DOSBaltRim
CP857: DOSTurkish
CP860: DOSPortuguese
CP861: DOSIcelandic
CP862: DOSHebrew
CP863: DOSCanadaF
CP864: DOSArabic
CP865: DOSNordic
CP869: DOSGreek2

MS-Windows code pages

CP1252 (WinLatin1)

With the introduction of Windows, Microsoft dared to say goodbye to the line-drawing characters and CP437 compatibility and adopted a modified superset of ISO-8859-1 as CP1252:

charset=Windows-1252 [TXT] [BDF]
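
What "modified superset" amounts to in practice can be sketched as follows, assuming Python's built-in cp1252 and latin-1 codecs: the upper half =A0..=FF is identical, while the range =80..=9F, which ISO-8859-1 reserves for control codes, carries printable characters in CP1252. The helper decode_or_none is a name invented for this example.

```python
def decode_or_none(value, codec):
    """Decode a single byte value, or return None if the codec leaves it undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# =A0..=FF: the printable part of ISO-8859-1.
upper_half_same = all(
    bytes([v]).decode("cp1252") == bytes([v]).decode("latin-1")
    for v in range(0xA0, 0x100)
)

# =80..=9F: control codes in ISO-8859-1, mostly printable characters in CP1252.
additions = {}
for v in range(0x80, 0xA0):
    ch = decode_or_none(v, "cp1252")
    if ch is not None:
        additions[v] = ch

print("=A0..=FF identical:", upper_half_same)
print("CP1252 additions:", "".join(additions[v] for v in sorted(additions)))
```

Text with "smart quotes" typed on Windows and then labelled as ISO-8859-1 is the usual place where this difference becomes visible.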

CP1250 (WinLatin2)

Strangely enough, WinLatin2 got the number CP1250. It differs from ISO-8859-2 in some positions but generated a lot of revenue for Microsoft on the emerging markets of Eastern Europe in the 1990s:

charset=Windows-1250 [TXT] [BDF]

CP1251 (WinCyrillic)

Another such example is the Cyrillic code page CP1251 for which Microsoft registered the label "Windows-1251". As of December 1997, even GOST's new (Lotus Notes) webserver greets you with charset=WINDOWS-1251. GOST (the Russian standardization authority and ISO member body) isn't even following its own standards any more!

CP1251 has a rich repertoire in an ordering incompatible with both ISO-IR-111 (KOI8) and ISO-8859-5:

charset=Windows-1251 [TXT] [BDF]
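
The incompatible ordering is easy to demonstrate by encoding the same word three ways. The sketch assumes Python's built-in cp1251, koi8_r, and iso8859_5 codecs; WORD and CODECS are names picked for the example.

```python
WORD = "Привет"
CODECS = ("cp1251", "koi8_r", "iso8859_5")

# The same six letters yield three different byte strings.
encoded = {codec: WORD.encode(codec) for codec in CODECS}
for codec, raw in encoded.items():
    print(f"{codec:10} {raw.hex()}")

# Reading CP1251 bytes with the KOI8-R table yields other Cyrillic letters.
misread = encoded["cp1251"].decode("koi8_r")
print("CP1251 bytes read as KOI8-R:", misread)
```

Because all three tables fill the upper half with Cyrillic letters, a wrong charset label does not produce an error, only plausible-looking nonsense.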

CP1257 (WinBaltic)

This is WinBaltic, which might have served as a model for ISOLatin7:

charset=Windows-1257 [TXT] [BDF]

CP1253...CP1258

You get the picture. The other Windows codepages are:

1253: WinGreek differs from ISO-8859-7 only in the placement of the capital alpha with tonos and a few symbols.
1254: WinTurkish does to WinLatin1 what ISO-8859-9 does to ISO-8859-1.
1255: WinHebrew is letter-compatible with ISO-8859-8.
1256: WinArabic preserves the symbols and small French letters from WinLatin1 and inserts the Arabic letters in the free slots so that only the positions =C1..=D6 (first half of the Arabic alphabet) are compatible with ISO-8859-6.
1257: WinBaltic is letter-compatible with ISOLatin7.
1258: WinVietnamese looks similar to WinLatin1 and very different from VISCII.
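
The WinTurkish entry in the list above can be verified mechanically. The following sketch assumes Python's built-in cp1252, cp1254, iso8859_1, and iso8859_9 codecs and only looks at =A0..=FF, since the Windows pages also differ from each other in a few =80..=9F positions; differing_positions is a hypothetical helper name.

```python
def differing_positions(codec_a, codec_b, start=0xA0, stop=0x100):
    """Map each byte value in [start, stop) where the codecs disagree to both characters."""
    diff = {}
    for value in range(start, stop):
        raw = bytes([value])
        a, b = raw.decode(codec_a), raw.decode(codec_b)
        if a != b:
            diff[value] = (a, b)
    return diff


win = differing_positions("cp1252", "cp1254")
iso = differing_positions("iso8859_1", "iso8859_9")

for value, (latin1_char, turkish_char) in sorted(win.items()):
    print(f"={value:02X} {latin1_char} -> {turkish_char}")
print("same substitutions in the ISO pair:", win == iso)
```

In both pairs the six Icelandic letters make room for the Turkish ones and nothing else in the upper half moves.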

CJK codepages

Very much unlike the Extended Unix Code (EUC) charsets, all of the following East Asian code pages illegally reuse the C1 control codes {=80..=9F} for their lead bytes and ASCII values {=40..=7E} for their second bytes in order to encode more than ten thousand characters with two bytes. That means that ASCII values beyond =3F in their byte streams do not always mean ASCII characters.

CP932: Shift-JIS combines the Japanese charsets JIS X 0201 (one byte per character) and JIS X 0208 (two bytes per character) so that the JIS X 0201 Katakana remain one-byte half-width characters and the 60 free 8bit code positions that carry no Katakana are used as lead bytes for the 7076 kanji and 648 other full-width characters. Unlike EUC-JP, Shift-JIS has no space left for the additional 5801 kanji from JIS X 0212.
CP936: GBK extends EUC-CN (the 8bit encoding of GB 2312-80 with 6763 hanzi) for simplified zh_CN Mainland Chinese to cover all of the 20902 Han ideographs found in Unicode (GB 13000.1-93).
CP949: UnifiedHangul (UHC) is a superset of the Korean EUC-KR (the 8bit encoding of KS C 5601-1992 with its 2350 Hangul syllables and 4888 hanja) with 8822 additional pre-composed Hangul syllables in the C1 range.
CP950: Big5 (13072 traditional zh_TW Chinese hanzi) is used for Taiwanese instead of EUC-TW (CNS 11643-1992).
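
The warning about ASCII values in second bytes is easiest to appreciate with a concrete character. The sketch below assumes Python's built-in cp932 and euc_jp codecs; ascii_range_trail_bytes is a helper name invented here. It uses the katakana ソ, whose Shift-JIS second byte happens to be the ASCII backslash.

```python
def ascii_range_trail_bytes(text, codec="cp932"):
    """List (character, second byte) pairs whose second byte falls in =40..=7E."""
    hits = []
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) == 2 and 0x40 <= encoded[1] <= 0x7E:
            hits.append((ch, encoded[1]))
    return hits


sjis = "ソ".encode("cp932")
euc = "ソ".encode("euc_jp")

print("Shift-JIS bytes:", sjis.hex())
print("EUC-JP bytes:   ", euc.hex())

# A naive byte search finds a backslash that is not there as a character.
print("backslash byte in Shift-JIS stream:", b"\\" in sjis)
print("backslash character in decoded text:", "\\" in sjis.decode("cp932"))
print("all EUC-JP bytes have the high bit set:", all(b >= 0x80 for b in euc))
print(ascii_range_trail_bytes("ソ"))
```

Any program that scans such byte streams for ASCII delimiters (path separators, escape characters, quotes) has to track lead bytes first, which EUC text does not require.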

Check out Ken Lunde's CJK.INF or the Unicode mapping tables for more details. You'll find these charsets illustrated in Ken Lunde's and Nadine Kano's bestsellers even though the latter is written from a pure Microsoft perspective with little mention of ISO standards.

Other Vendors' Standards

Microsoft is not the only company inventing its own more or less incompatible standards, as you can see in ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/.

Send mail to roman@czyborra.com if you need additional fonts or find errors like Andreas Prilop, Kent Karlsson, Jungshik Shin, and Jan Tomasek did.

Roman Czyborra
$Date: 1998/06/27 08:25:38 $