The traditional CJK charsets

CJK Ideographs

CJK is an acronym for "Chinese, Japanese, Korean". What these three East Asian languages have in common is that they all use a substantial number of ideographic characters (several thousand, far more than 8 bits can enumerate) that historically date back to the Chinese Han dynasty. These Han ideographs are called hanzi in Chinese, kanji in Japanese, and hanja in Korean, and are unified as "CJK ideographs" in Unicode. For most ideographs, the pronunciation varies between the languages but the meaning is similar, so that Chinese text is somewhat comprehensible to Japanese readers.

CJK Syllabics

Besides the ideographs, syllabic scripts are used in Japanese and Korean (interspersed with the Han ideographs): Japanese employs its two kana syllabaries (Hiragana and Katakana) and Korean its Hangul syllables.

The first double-byte character set: JIS X 0208

The Japanese character set JIS X 0208 (originally named JIS C 6226) was the first standardized charset to break the 8bit barrier, in 1978. In a way, it is an early Unicode: besides the Latin alphabet, the Hiragana and Katakana syllables and the most important Kanji (Chinese ideographs) required for Japanese, it also contains the basic Greek and Cyrillic alphabets and a set of symbols.

On X11, you can have a look at the JIS X 0208 code chart with the command

	xfd -fn "*jisx0208*" &

[xfd screenshot: first page of JIS X 0208]

The JIS X 0208 characters are organized in 94 rows ("ku") of 94 cells ("ten") so that each row and cell number can be mapped onto the 94 printable G0 positions of ASCII as specified by ISO 2022. JIS X 0208 is thus structurally limited to 94×94 = 8'836 characters, far too few to make it a feasible replacement for Unicode with its 40'000+ characters. JIS X 0208 has been extended through a companion standard, JIS X 0212, which holds another 94×94 grid of supplementary symbols, accented letters and additional kanji that can be accessed through ISO-2022 shift sequences.
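The row/cell layout can be made visible with a few lines of Python. The following is only a minimal sketch: it assumes that Python's built-in "euc_jp" codec is available and relies on the fact that EUC-JP stores ku and ten as two bytes offset by =A0. The function name kuten and the sample characters are chosen here for illustration.

```python
# Sketch: recover the ku/ten (row/cell) position of a JIS X 0208 character
# from the two bytes that Python's built-in "euc_jp" codec produces for it.

ROWS = CELLS = 94  # 94 rows ("ku") of 94 cells ("ten")


def kuten(char):
    """Return the (ku, ten) position of one JIS X 0208 character."""
    data = char.encode("euc_jp")
    # Only two bytes that both lie in 0xA1..0xFE denote a JIS X 0208 cell;
    # ASCII, halfwidth katakana and JIS X 0212 are encoded differently.
    if len(data) != 2 or not all(0xA1 <= b <= 0xFE for b in data):
        raise ValueError(f"{char!r} is not a JIS X 0208 character")
    # EUC-JP adds 0xA0 (0x20 for the G0 position plus 0x80) to ku and ten.
    return data[0] - 0xA0, data[1] - 0xA0


if __name__ == "__main__":
    print("grid capacity:", ROWS * CELLS)
    # One sample each: Greek, Cyrillic, Hiragana, Katakana, Kanji.
    for ch in "ΑАあア亜":
        ku, ten = kuten(ch)
        print(f"{ch}  ku={ku:02d} ten={ten:02d}")
```

Running it shows the different scripts sitting in different rows of the same grid, with the kanji starting at row 16.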

JIS X 0208 was the first double-byte character set (DBCS). There are various character encoding schemes to access these double-byte characters:

- mail-safe 7bit schemes using escape sequences, either
  - the full ISO-2022 registered escape sequences or
  - the abbreviated ISO-2022-JP, or
- 8bit projections like
  - the offset mapping used for EUC-JP,
    ```
    ku/ten 01/01 => 7bit =21=21 + offset =80=80 => EUC =A1=A1
    ```
    or
  - Microsoft's Shift-JIS (CP932) codes, which float around the halfwidth katakana codes found in the 8bit range =A1..=DF of the first (1969) Japanese single-byte 8bit charset JIS X 0201.
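The schemes listed above can be sketched as small conversion functions from a ku/ten pair. This is an illustrative sketch, not a reference implementation: the function names are made up here, the Shift-JIS arithmetic is the commonly published JIS-to-Shift-JIS folding, and the results are only spot-checked against Python's built-in "iso2022_jp", "euc_jp" and "shift_jis" codecs.

```python
# Sketch: derive the byte sequences of three encoding schemes from ku/ten.

ESC_JISX0208 = b"\x1b$B"  # ISO-2022-JP: switch to JIS X 0208-1983
ESC_ASCII = b"\x1b(B"     # ISO-2022-JP: switch back to ASCII


def kuten_to_jis(ku, ten):
    """7bit form: ku and ten mapped onto the G0 positions 0x21..0x7E."""
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_iso2022jp(ku, ten):
    """7bit form wrapped in the ISO-2022-JP escape sequences."""
    return ESC_JISX0208 + kuten_to_jis(ku, ten) + ESC_ASCII


def kuten_to_euc(ku, ten):
    """EUC-JP: the 7bit form with the high bit set on both bytes."""
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_sjis(ku, ten):
    """Shift-JIS: two rows are folded into one lead byte so that the
    range 0xA1..0xDF stays free for the halfwidth katakana."""
    j1, j2 = ku + 0x20, ten + 0x20
    s1 = (j1 + 1) // 2 + (0x70 if j1 < 0x5F else 0xB0)
    if j1 % 2:  # odd row: trail bytes 0x40..0x9E, skipping 0x7F
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:       # even row: trail bytes 0x9F..0xFC
        s2 = j2 + 0x7E
    return bytes([s1, s2])


if __name__ == "__main__":
    for ku, ten in [(1, 1), (16, 1)]:
        print(f"ku/ten {ku:02d}/{ten:02d}:",
              "JIS", kuten_to_jis(ku, ten).hex(),
              "EUC", kuten_to_euc(ku, ten).hex(),
              "SJIS", kuten_to_sjis(ku, ten).hex())
```

The EUC-JP mapping is a plain offset, while Shift-JIS needs the row folding, which is why its codes cannot be read off the ku/ten position as directly.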

Much of this is described in Ken Lunde's book Understanding Japanese Information Processing and its online companion CJK.INF.

On Unix/X11, the JIS X 0208 kanji can be used in the kterm terminal emulator.

GB, KSC & Co.: the Chinese and Korean national charsets

Chinese and Korean standardizers followed the JIS X 0208 example and defined their own 94×94 grids. {ISO-2022,EUC}-{CN,TW,KR} are the MIME labels used for these national standard coded character sets. GB 2312 is the Chinese equivalent to JIS X 0208, holding 6'763 hanzi. KS C 5601 is the Korean DBCS, holding 4'888 hanja and 2'350 Hangul syllables. CNS 11643 Plane 1 is the Taiwanese (Traditional Chinese) standard, although the industry standard "Big5", which is not formally specified, is used more often in Taiwan.
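The shared 94×94 structure can be checked with the EUC codecs that ship with Python. The sketch below assumes the built-in "gb2312", "euc_kr", "euc_jp" and "big5" codecs; the helper name row_cell and the sample characters are picked here for illustration only.

```python
# Sketch: the EUC forms of GB 2312, KS C 5601 and JIS X 0208 all store
# row and cell as two bytes in 0xA1..0xFE. Big5 does not follow this layout.


def row_cell(char, codec):
    """Return the (row, cell) position of char in a 94x94 EUC-style codec."""
    data = char.encode(codec)
    if len(data) != 2 or not all(0xA1 <= b <= 0xFE for b in data):
        raise ValueError(f"{char!r} has no 94x94 position in {codec}")
    return data[0] - 0xA0, data[1] - 0xA0


if __name__ == "__main__":
    samples = [("啊", "gb2312"), ("가", "euc_kr"), ("亜", "euc_jp"),
               ("一", "big5")]
    for ch, codec in samples:
        try:
            print(ch, codec, row_cell(ch, codec))
        except ValueError as err:
            # Big5 trail bytes may fall below 0xA1, outside the 94x94 grid.
            print(ch, codec, "->", err)
```

The first hanzi of GB 2312, the first Hangul syllable of KS C 5601 and the first kanji of JIS X 0208 all turn up at the same row/cell position, while the Big5 sample falls outside the grid.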

The Chinese terminal emulator cxterm can handle GB and Big5 hanzi using its own configurable input dictionary.

Roman Czyborra
November 23, 1998