How can we get our Unicode text printed on paper?

Converting Unicode Text To PostScript

On a contemporary Unix installation, formatting text for print means converting it to PostScript.

PostScript is Adobe's device-independent page description language, which first reached a wide audience through desktop publishing with the Apple Macintosh and the LaserWriter printer in the mid-1980s.

PostScript is directly understood by laser printers of various resolutions as well as by phototypesetters. For other printers (cheap laser or ink jet models) that cannot interpret PostScript themselves, the GNU Ghostscript interpreter takes care of converting PostScript into instructions in their proprietary control languages, so that they, too, behave like PostScript printers.

PostScript is a full-fledged Forth-like stack-oriented programming language with many operators to describe graphical shapes.

PostScript is readable ASCII

A sample PostScript program hello.ps to draw "Hello world!" is

%!PS-Adobe-2.0
/Helvetica findfont 72 scalefont setfont
72 72 moveto
(Hello world!) show
showpage

This program owes its shortness to the fact that "Hello world!" is an ASCII-only string and that Times, Helvetica and Courier fonts covering the ASCII letters are pre-installed on every PostScript interpreter.
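To make the ASCII-only case concrete, here is a minimal sketch in Python of a generator that emits the same program for an arbitrary ASCII string. The function names ps_string and hello_program are made up for this illustration. The sketch assumes the text is pure ASCII and escapes parentheses and backslashes, which would otherwise be read as PostScript string syntax.

```python
def ps_string(text: str) -> str:
    """Return text as a PostScript string literal such as (Hello world!).

    Hypothetical helper for illustration; it refuses anything beyond ASCII.
    """
    if not text.isascii():
        raise ValueError("only ASCII text can be shown this way")
    out = []
    for ch in text:
        if ch in "()\\":
            # Escape the string delimiters and the escape character itself.
            out.append("\\" + ch)
        elif ch < " " or ch == "\x7f":
            # Write control characters as three-digit octal escapes.
            out.append("\\%03o" % ord(ch))
        else:
            out.append(ch)
    return "(" + "".join(out) + ")"


def hello_program(text: str, font: str = "Helvetica", size: int = 72) -> str:
    """Build a one-page PostScript program that shows text near the lower left corner."""
    lines = [
        "%!PS-Adobe-2.0",
        f"/{font} findfont {size} scalefont setfont",
        "72 72 moveto",
        f"{ps_string(text)} show",
        "showpage",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(hello_program("Hello world!"), end="")
```

Escaping every parenthesis is more than PostScript strictly requires, since balanced parentheses are allowed inside a string, but it is always safe.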

A nice property of this program is its readability: you can look at the PostScript source and understand a thing or two even without a PostScript interpreter (rasterizer), or extract the plain text with

$ sed -n 's/.*(\(.*\)).*/\1/p' hello.ps
Hello world!

So how do we do the same with Unicode text that contains characters beyond ASCII?

There is good news and bad news. The good news is that PostScript is a flexible programming language. PostScript is powerful enough to draw even the most exotic glyph. Theoretically, it would even be possible to redefine the show operator to render UTF-8 text.

The bad news is that PostScript's string data type is limited to 8-bit characters. The glyphs of the current font can only be accessed through an encoding vector of 256 elements. All speed optimizations such as glyph caching are only for 256-character fonts with constant outline glyphs. Almost all non-Latin1 glyphs are missing from the preinstalled fonts, so your PostScript document will have to begin with a prologue defining the glyphs, which can easily inflate the file size beyond a megabyte for a single printout. Adobe's optimized Portable Document Format (PDF), introduced in 1992, showed no ambition to make Unicode use just as easy as ASCII's, either.
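The 256-element limit can be illustrated with a small Python sketch. It hands every distinct character of a text a slot in one of several 256-slot fonts and then cuts the text into runs that each stay within one font. The names assign_slots and to_runs are hypothetical, and this is only one conceivable strategy under that constraint, not necessarily what any existing converter does.

```python
def assign_slots(text: str, slots_per_font: int = 256) -> dict:
    """Give each distinct character a (font_index, byte_code) pair.

    Characters are numbered in order of first appearance; every
    slots_per_font characters a new font is started.
    """
    assignment = {}
    for ch in text:
        if ch not in assignment:
            assignment[ch] = divmod(len(assignment), slots_per_font)
    return assignment


def to_runs(text: str, assignment: dict) -> list:
    """Cut text into (font_index, bytes) runs, one run per font switch."""
    runs = []
    for ch in text:
        font, code = assignment[ch]
        if runs and runs[-1][0] == font:
            runs[-1][1].append(code)          # same font: extend the current run
        else:
            runs.append((font, bytearray([code])))  # font switch: start a new run
    return [(font, bytes(codes)) for font, codes in runs]


if __name__ == "__main__":
    sample = "\u0410\u0411\u0412 abc"
    slots = assign_slots(sample)
    for font, codes in to_runs(sample, slots):
        print(font, codes.hex())
```

Each run would correspond to one setfont followed by one show in the generated PostScript, and the prologue would have to define the glyphs behind every occupied slot.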

PostScript and PDF use their own traditional series of textual glyph names like "adieresis" (mnemonic) for Latin and Greek, "afii10017" (AFII-numbered) for Cyrillic, Hebrew, Arabic, and "uni12FF" (finally unicoded) for anything else.
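As a rough sketch of this naming scheme in Python: look up a code point in a table of traditional names and fall back to the uniXXXX form for everything else. The table below is a tiny hand-picked excerpt for illustration only, and the function name glyph_name is made up. The sketch assumes characters from the 16-bit range.

```python
# A tiny excerpt of the traditional glyph names, keyed by Unicode code point.
# A real table has far more entries.
KNOWN_NAMES = {
    0x0041: "A",
    0x00E4: "adieresis",   # mnemonic name, Latin
    0x03B1: "alpha",       # mnemonic name, Greek
    0x0410: "afii10017",   # AFII-numbered name, Cyrillic capital A
}


def glyph_name(ch: str) -> str:
    """Return a PostScript glyph name for a single character."""
    cp = ord(ch)
    if cp in KNOWN_NAMES:
        return KNOWN_NAMES[cp]
    if cp > 0xFFFF:
        raise ValueError("this sketch only covers 16-bit code points")
    # Fallback: "uni" followed by four uppercase hexadecimal digits.
    return "uni%04X" % cp


if __name__ == "__main__":
    for ch in "\u00e4\u0410\u12ff":
        print("U+%04X" % ord(ch), glyph_name(ch))
```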

Adobe has been working on a solution to break the 256-character boundary. The second edition of the PostScript Language Reference Manual (PostScript Level 2) introduced composite fonts in 1990: a Type 0 font recursively contains other fonts. With the 8/8 mapping (FMapType 2), the show operator then processes its string argument in chunks of two bytes at a time and uses the first byte as the font number and the second byte as the character code. This could be used to conveniently access a Unicode font split into 256 chunks of 256 characters. Bare UTF-16 text in the show argument would not give the most human-readable PostScript code because UTF-16 does not mix well with the ASCII (UTF-8) characters of the PostScript language. You would more likely resort to the hexadecimal representation:

<0041 0065 006C 006F> show
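Producing such a hexadecimal string from Unicode text is mechanical. The Python sketch below writes each character as four hexadecimal digits and also shows how a code point splits into the font number and character code described above. The names hex_show and font_and_code are hypothetical, and the sketch assumes a composite font laid out as 256 chunks of 256 characters, so it rejects anything above U+FFFF.

```python
def font_and_code(ch: str) -> tuple:
    """Split a character into (font number, character code): high byte, low byte."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("outside the 256 x 256 grid")
    return divmod(cp, 256)


def hex_show(text: str) -> str:
    """Return a PostScript show call with text as a hexadecimal string."""
    units = []
    for ch in text:
        font, code = font_and_code(ch)
        # Two hex digits for the font number, two for the character code.
        units.append("%02X%02X" % (font, code))
    return "<" + " ".join(units) + "> show"


if __name__ == "__main__":
    print(hex_show("Aelo"))
    print(font_and_code("\u0410"))
```

The spaces between the four-digit groups are only there for the human reader; PostScript ignores white space inside a hexadecimal string.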

The undocumented original composite font format led to a large, memory-clogging number of font files for one single composite font and left them tied to one particular charset, which ran counter to East Asian needs. That's why Adobe developed the CID-keyed font format, which allows all the glyphs to be stuffed into one file just like the TrueType font format. Adobe maintains its own glyph numbering scheme: the character identifier is what hides behind the acronym CID. Various character encodings are now easily supported through simple CMap mapping tables.

CID-keyed fonts are still Type 1 (ISO 9541) fonts and will continue to be supported within OpenType (the compatibility merger of Adobe's Type 1 and Microsoft's TrueType font standards).

Until CID-keyed Unicode fonts hit the Unix freeware world, you can already use traditional PostScript techniques to print Unicode text.

I found two programs capable of converting UTF-8 text to PostScript and took a closer look at the PostScript they generated. The first was Plan 9's troff.

Recoding down to traditional charsets

I have never authored or administered TeX/Metafont/troff/FrameMaker/PostScript setups, and I have no exposure to professional typesetting software. There are no courses taught on typography at our school.

Reusing the Paper Renderer For Screen Display

Printing the Screen Rendition


Roman Czyborra
$Date: 1998/11/18 17:56:41 $