The SCSU charset

SCSU is Reuters' Standard Compression Scheme for Unicode. It is specified in Unicode Technical Report 6.

SCSU is a character encoding scheme that allows plain ISO-8859-1 text to pass through transparently and all other Unicode text to be stored and transmitted without any significant increase in size. Most alphabetic scripts can be reduced to one byte per character on average.
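To make the size claim concrete, here is a minimal sketch of an encoder. It is not the decoder mentioned on this page, and it covers only a small part of the scheme. It assumes the default dynamic window offsets and the SCn tag values (0x10 + n) as I read them in the technical report. The names DEFAULT_WINDOWS and encode_default_windows are made up for this illustration. Anything that needs Unicode mode, window redefinition or quoting is rejected.

```python
# Default dynamic window offsets (assumed from the technical report).
# Window 0 is the active one at the start of a stream.
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # tag SCn = 0x10 + n selects dynamic window n


def encode_default_windows(text):
    """Hypothetical helper: encode text using only the default windows."""
    out = bytearray()
    active = 0
    for ch in text:
        cp = ord(ch)
        # Printable ASCII, DEL, NUL, TAB, LF and CR are stored as themselves.
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)
            continue
        # Try the active window first, then the others in order.
        order = [active] + [n for n in range(8) if n != active]
        for n in order:
            base = DEFAULT_WINDOWS[n]
            if base <= cp < base + 0x80:
                if n != active:
                    out.append(SC0 + n)  # one extra byte to switch windows
                    active = n
                out.append(0x80 + cp - base)
                break
        else:
            raise ValueError("U+%04X is outside this sketch" % cp)
    return bytes(out)


latin = "caf\u00e9"
cyrillic = "\u041c\u043e\u0441\u043a\u0432\u0430"  # "Moskva" in Cyrillic

print(encode_default_windows(latin) == latin.encode("latin-1"))  # True
print(encode_default_windows(cyrillic).hex())                    # 129cbec1bab2b0
print(len(encode_default_windows(cyrillic)), len(cyrillic.encode("utf-8")))  # 7 12
```

The Latin-1 string comes out byte for byte as ISO-8859-1, and the six Cyrillic letters take seven bytes (one window switch plus one byte each) instead of twelve in UTF-8.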

As a general-purpose processing format, SCSU is much worse than UTF-8: SCSU is stateful and can encode the same text in many different ways. An SCSU byte stream can contain null bytes and 8-bit bytes, as well as bytes that look like ASCII or control characters but have a very different meaning.
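The statefulness can be illustrated with a small decoder sketch. It is again not the decoder mentioned on this page, only a toy that handles single-byte mode with the default windows, the SCn tags (0x10 to 0x17) and the SQU tag (0x0E), as I read them in the technical report. The name decode_subset is made up for this illustration, and every other tag raises an error.

```python
# Default dynamic window offsets (assumed from the technical report).
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_subset(data):
    """Hypothetical helper: decode a small subset of SCSU single-byte mode."""
    out = []
    window = 0  # decoder state: index of the active dynamic window
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # The meaning of a high byte depends on the active window.
            out.append(chr(DEFAULT_WINDOWS[window] + b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))  # passed through unchanged
        elif 0x10 <= b <= 0x17:
            window = b - 0x10  # SCn: select dynamic window n
        elif b == 0x0E:
            # SQU: the next two bytes are one big-endian 16-bit code unit.
            # Surrogate pairs are not combined in this sketch.
            if i + 2 > len(data):
                raise ValueError("truncated SQU sequence")
            out.append(chr((data[i] << 8) | data[i + 1]))
            i += 2
        else:
            raise ValueError("tag 0x%02X is not handled in this sketch" % b)
    return "".join(out)


# The same byte 0xBE means different things depending on the state.
print(decode_subset(b"\xBE") == "\u00be")      # True (vulgar fraction 3/4)
print(decode_subset(b"\x12\xBE") == "\u043e")  # True (Cyrillic small o)

# The same text can be encoded in more than one way.
print(decode_subset(b"A") == decode_subset(b"\x0E\x00\x41"))  # True

# A byte that looks like an ASCII space is really half of a code unit here.
print(decode_subset(b"\x0E\x20\xAC") == "\u20ac")  # True (euro sign)
```

A tool that scans such a stream for ASCII bytes, or that starts reading in the middle, will misinterpret it, which is the main reason to prefer UTF-8 for processing.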

Attached you will find my little decoder from SCSU to UTF-8 and a collection of errors I found in the specification. I have also initiated IANA registration of the MIME label charset=SCSU for it.

Roman Czyborra (roman@czyborra.com)
$Date: 1998/08/15 18:01:23 $