Good ole' ASCII

The common denominator of all modern character sets like the ones from ISO 8859 is the very successful American Standard Code for Information Interchange (ASCII, /'æski:/) which was published thirty years ago in 1968 as ANSI X3.4 and looks like this:


charset=US-ASCII [TXT] [BDF]

Only EBCDIC ignores ASCII

Nowadays most computers use encodings based on ASCII. The only notable exception are the IBM mainframes that never departed from the myriad of dialects of their Extended Binary Coded Decimal Information Code (EBCDIC) that predated ASCII and gave a more readable BCD coding on paper punch-cards:


charset=EBCDIC-US [TXT] [BDF]

US-ASCII is 7bit only

But back to ASCII:

Don't let your computer manufacturer tell you that there are ASCII characters beyond those shown above. ASCII uses only 7 bits and allows the most significant eighth bit to be used as parity bit, highlight bit, end-of-string bit (all of which are considered bad practice nowadays) or to include additional characters for internationalization (i18n for which we need 8bit-clean programs that do none of afore-mentioned silly tricks) but ASCII defined no standard for this and many manufacturers invented their own proprietary codepages.

ISO-646: First attempts to internationalize the 7bit code

The only languages that can comfortably be written with the repertoire of US-ASCII happen to be Latin, Swahili, Hawaiian and American English without most typographic frills. It is rumoured that there are more languages in the world.

That's why foreign standardizers crafted up national variants of ASCII like the German DIN 66003 which my first CP/M computer (a Sharp MZ-731) used in 1984 to communicate with its daisy wheel printer or the Danish DS 2089 mentioned by Bjarne Stroustrup in § 6.5.3 of his C++ history. These 7bit charsets used the code positions of the less needed ASCII characters

	# $ @ [ \ ] ^ ` { | } ~
to insert their accented letters so that
	,,schöne Grüße``
would get encoded as
	,,sch|ne Gr}~e``
and C programs like
	{*argv[1]='\0'}
showed up as
	æ*argvÆ1Å='Ø0'å
because C didn't offer alternative digraphs like Pascal's (. for [ and (* for {.

This is how these national variants looked like:

German

charset=ISO-646-DE [TXT] [BDF]
Danish

charset=ISO-646-DK [TXT] [BDF]

To save space, I have omitted the variants

CA
Canadian CSA Z243.4
CN
Chinese GB 1988-80
CU
Cuban NC NC00-10
ES
Spanish
FR
French NF Z 62-010
HU
Hungarian MSZ 7795.3
IT
Italian
JP
Japanese JIS X 0201 Roman
KR
Korean KS C 5636
NO
Norwegian NS 4551-1
PT
Portuguese
SE
Swedish SEN 850200 B
UK
British BS 4730
YU
Yugoslavian JUS I.B1.002

ASCII and its national variants were declared international standard ISO 646 in 1972. Back then, the socialist countries managed to substitute the international currency sign ¤ for ASCII's capitalist dollar sign $ in the the first international reference version ISO-646-IRV but this was revised in 1991 and now ISO-646-IRV is a synonym for ISO-646-US or US-ASCII as it is used in the core Internet protocols. ISO-646's national variants have fallen out of use because of their obvious problems and because the 8bit ASCII extension ISO-8859-1 has offered a more interoperable standard.

ISO-10646/Unicode: ASCII's ultimate successor

ISO 8859 still isn't the ultimate solution. The even wider Universal Character Set (UCS, Unicode) wants to simplify world-wide multilingual and multi-platform text exchange just like the ASCII standard simplified Latin text interchange and aims to become equally successful as the ultimate encoding standard and thus it was numbered ISO 10646 = ISO 646 + 10000. And the inevitable success seems to be there: RFC 2070 internationalized the Internet's hypertext markup language HTML and declared ISO 10646 its new base charset. RFC 2277 recommends the use of ISO 10646 to all new Internet protocols.

Unicode begins with US-ASCII and ISO-8859-1 but goes beyond the 8bit barrier and encodes all the world's characters in a 16bit space and a 20bit extension zone for everything that did not fit into the 16bit space. The ASCII-compatible Unicode transformation scheme UTF-8 lets all ASCII characters pass through transparently and encodes all other characters as unambiguous 8bit character sequences.

Roman Czyborra
1998-05-24 .. 1998-11-30