Unicode Terminals

For Unicode to gain wide acceptance in the Unix world it will be crucial to provide a Unicode terminal that takes care of the proper display and comfortable input of all needed Unicode characters in the standard UTF-8 format.

Why terminals?

Many of the fastest Unix applications do not use a graphical user interface (GUI) but a character user interface (CUI) and run on a terminal or a terminal emulator like the xterm within the X Window System. These include the editors vi, joe, emacs -nw, pico, the interactive shells bash and tcsh, the browser lynx, the pager less, the mail readers elm, mutt, pine, the news readers nn, trn, tin, the communication programs talk, irc, and minicom, the process monitor top, the GNU interactive tools git, the Midnight Commander mc, the multiple session manager screen, games like gnuchess, the shell script menuing tool dialog, as well as various system and database administration forms.

Why is the CUI still popular? CUI applications start up fast and can be steered quickly through key presses or expect scripts, or a bit slower remotely via modem, telnet, rlogin or ssh. They can run on a console without a graphical window system as well as in a highly customizable terminal emulator in a windowing environment.

How do Unix terminals work?

The Unix terminal is an ancient beast, as you can sometimes tell from error messages like "Not a typewriter", which refers to the teletypewriter devices of the 1970s (more interactive than punch cards) that were connected to the host server through a serial line. Nevertheless, the Unix terminal has grown a complex set of functions to deal with various situations.

A terminal is a serial device that processes the bytes written to it and turns keyboard input into bytes to be read by an application. Between the terminal and the application, there is the flow control and line discipline at kernel level that can be controlled through the POSIX termios ioctl functions.
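As a minimal sketch of that kernel-level control, the following Python fragment uses the standard termios module to switch the line discipline's echo flag off and on again. The pseudo-terminal pair merely stands in for a real serial terminal, and the helper names set_echo and echo_is_on are made up for this illustration.

```python
import os
import pty
import termios

def set_echo(fd, enabled):
    """Turn the line discipline's ECHO flag on or off for the terminal fd."""
    # tcgetattr returns [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs = termios.tcgetattr(fd)
    if enabled:
        attrs[3] |= termios.ECHO
    else:
        attrs[3] &= ~termios.ECHO
    termios.tcsetattr(fd, termios.TCSANOW, attrs)

def echo_is_on(fd):
    """Report whether the kernel currently echoes typed characters."""
    return bool(termios.tcgetattr(fd)[3] & termios.ECHO)

# A pseudo-terminal pair stands in for a real terminal line here.
master_fd, slave_fd = pty.openpty()
set_echo(slave_fd, False)
echo_off_state = echo_is_on(slave_fd)
set_echo(slave_fd, True)
echo_on_state = echo_is_on(slave_fd)
os.close(master_fd)
os.close(slave_fd)
```

Programs that read passwords or do their own screen handling flip flags like this one before they start talking to the terminal.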

To become a character-imaging device, a terminal is free to interpret certain bytes as graphic characters and certain byte sequences as formatting controls to move the cursor, switch colors, clear the screen, etc. The capabilities of each terminal type and the control sequences to access them are stored in the file /etc/termcap or the equivalent terminfo database and indexed by the value of the environment variable TERM.
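To make the split between graphic characters and formatting controls concrete, here is a small sketch that builds such a byte stream by hand. It assumes a VT100/ECMA-48 style terminal such as xterm and hard-codes three of its control sequences, whereas a real application would look them up under its TERM entry. The function names are hypothetical.

```python
ESC = "\x1b"

def cursor_to(row, col):
    """ECMA-48 CUP: move the cursor to a 1-based row and column."""
    return f"{ESC}[{row};{col}H"

def clear_screen():
    """ECMA-48 ED with parameter 2: erase the whole display."""
    return f"{ESC}[2J"

def reverse_video(text):
    """SGR 7 switches reverse video on, SGR 0 resets all attributes."""
    return f"{ESC}[7m{text}{ESC}[0m"

def frame():
    # The letters of "Hello" are drawn as graphic characters,
    # everything starting with ESC is interpreted as a formatting control.
    return clear_screen() + cursor_to(5, 10) + reverse_video("Hello")
```

Writing frame() to such a terminal should clear it and show a highlighted "Hello" on line 5 starting in column 10. A terminal type with different sequences would need different strings, which is exactly what the TERM lookup is for.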

A terminal has many stateful variables (screen contents, cursor position, highlighting modes, ...). As the application has no way of looking at the screen, it has to keep a copy of these variables, track all changes to them, and assume that all its instructions have had the desired effects. There is a standardized library for that sort of screen manipulation called curses with a free GNU implementation called ncurses. The Unix98 specification (successor of the X/Open Portability Guide) specifies curses together with a set of wide character extensions to handle the East Asian character sets and these extensions might perhaps be usable for Unicode. I have not studied them that closely yet.
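The bookkeeping that curses does on behalf of the application can be hinted at with a toy shadow screen. This is only a sketch and not how curses is implemented: the class ShadowScreen and its methods are invented here, the cursor sequence assumes a VT100-style terminal, and each character is assumed to fill exactly one cell.

```python
class ShadowScreen:
    """Application-side copy of what the terminal is assumed to show."""

    def __init__(self, rows=24, cols=80):
        self.rows, self.cols = rows, cols
        self.cells = [[" "] * cols for _ in range(rows)]
        self.row, self.col = 0, 0   # assumed cursor position, 0-based
        self.out = []               # strings that would be sent to the terminal

    def move(self, row, col):
        """Record the new cursor position and queue the matching control."""
        self.row, self.col = row, col
        self.out.append(f"\x1b[{row + 1};{col + 1}H")   # terminal counts from 1

    def put(self, text):
        """Record text at the cursor and queue it for output."""
        text = text[: self.cols - self.col]   # sketch: clip instead of wrapping
        for ch in text:
            self.cells[self.row][self.col] = ch
            self.col += 1
        self.out.append(text)

    def line(self, row):
        return "".join(self.cells[row])

screen = ShadowScreen()
screen.move(2, 4)
screen.put("hello")
```

The copy is only as good as its assumptions. The one-character-per-cell rule built into put() is the kind of assumption that multibyte and double-width characters call into question.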

Why Unicode terminals?

The CUI applications listed above rely on their terminals to render text legible and send the keyboard input as proper character codes.

CUI applications can - while running in the X Window System - delegate the task of rendering and inputting Unicode text to the yudit editor. But launching such an external GUI application for every little piece of text can become quite time-consuming and inconvenient and thus void the advantages of the CUI.

It would also not be smart to try to add a whole set of Unicode rendering and input procedures to each of the CUI applications. They have not been bothering about the character rendering internals before. They do not have the direct access to the display hardware the terminal has. They could only hope that their terminal offers high-resolution bitmap functions such as the Tektronix 4014. Otherwise they would have to resort to transliterated approximations or ASCII graphics. On a console of 80x24 ASCII cells they could only display one word of 10 characters at a time when printing the hexdraw image of 8×16 glyph bitmaps.
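The arithmetic behind that last figure can be sketched in a few lines, assuming one text cell per pixel. The rendering with '-' and '#' is only roughly the kind of picture hexdraw prints, its exact output format is not reproduced here, and the bitmap is a hand-made placeholder rather than a glyph from any real font.

```python
def draw_glyph(hex_bitmap, width=8):
    """Turn a hex-encoded 1-bit glyph into rows of '-' and '#' cells."""
    digits_per_row = width // 4
    rows = []
    for i in range(0, len(hex_bitmap), digits_per_row):
        bits = int(hex_bitmap[i:i + digits_per_row], 16)
        row = format(bits, f"0{width}b")
        rows.append(row.replace("0", "-").replace("1", "#"))
    return rows

def glyphs_per_screen(cols=80, lines=24, glyph_w=8, glyph_h=16):
    """How many whole glyph pictures fit on a screen of text cells."""
    return (cols // glyph_w) * (lines // glyph_h)

# Placeholder 8x16 bitmap (a hollow box), 16 rows of one byte each.
BOX = "0000" + "7E" + "42" * 10 + "7E" + "0000"

picture = draw_glyph(BOX)
```

With the defaults, 80 // 8 gives ten glyphs side by side and 24 // 16 gives a single row of them, hence one word of 10 characters per screen.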

If we get our terminal to render and input Unicode, all the CUI applications can simply rely on that function provided by their operating environment (terminal) and will only need minor changes. We'd also get a consistent look (font) and feel (input method) imposed on all CUI applications.

Why a unicoded xterm?

Programming a unicoded xterm would not require hardware adaptations for N platforms. A unicoded xterm can simply resort to the portable font loading and text drawing functions provided by the X Window System. As an X11 application it will be able to run on all the existing X servers and require no radical redesign of the entire window system nor superuser privileges to install. A unicoded xterm will run in the current environment of choice of most Unix users, regardless of their favorite desktop or window manager. A unicoded xterm can bring Unicode capabilities to a wide range of programs on a wide range of Unix hosts. A unicoded xterm brings us Unicode editors, Unicode mailers, and a Unicode browser with one strike.

What else is needed?

I believe that for a unicoded xterm to become successful, it would need to provide configurable input methods, font and encoding options, and comfortable cut-and-paste functions similar to yudit. Yudit already allows you to cut and paste UTF-8 text in and out of Emacs and into xterm but not back because xterm strips C1 controls like =8F.
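Why a byte like =8F matters can be shown with a short sketch. UTF-8 continuation bytes range over 0x80-0xBF and therefore overlap the C1 control range 0x80-0x9F. The function strip_c1 below is only my stand-in model for a paste path that drops C1 bytes, not a copy of what xterm actually does.

```python
def c1_bytes(text):
    """Return the UTF-8 bytes of text that fall into the C1 range 0x80-0x9F."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]

def strip_c1(data):
    """Drop every byte in 0x80-0x9F, as a C1-stripping paste path would."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)

def is_valid_utf8(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "я"                       # U+044F CYRILLIC SMALL LETTER YA
encoded = sample.encode("utf-8")   # the two bytes D1 8F
damaged = strip_c1(encoded)        # only the lead byte D1 survives
```

The lone lead byte that is left over is no longer valid UTF-8, so the pasted text arrives mutilated.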

Is it possible?

Unfortunately and as much as I would like to have that application for my Unicode font, I haven't found the time to prove my concept of a unicoded terminal emulator while writing my thesis. I still want to patch the rxvt terminal emulator because - unlike the original xterm - rxvt is slim with a clear structure, and already C1 control code clean and prepared for Japanese multibyte characters and kterm emulation, and even has Unicode support on its TODO list. Most of the work would go into the rxvt-2.4.7/src/screen.c.

Without this project started and tested, I can only speculate about the issues that will have to be dealt with.

Unicode does introduce a few problems we haven't had with ISO-8859-1 on terminals: first of all, the character size has to be expanded to UTF-8 variable length. Now you get the problem that the byte count no longer necessarily equals the character count or the display width. Are the line width and the cursor position now to be counted in bytes, characters, number of spacing characters, halfwidth character cells or what?
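For readers who have not seen where the variable length comes from, here is a sketch of a UTF-8 encoder for a single code point. It follows UTF-8 as it is defined today, with at most four bytes per character, and leaves out the validity checks (surrogates, upper limit) a real encoder would need.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes (no validation)."""
    if code_point < 0x80:                      # ASCII stays a single byte
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx and three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

An ASCII letter such as U+0041 stays one byte, å (U+00E5) becomes the two bytes C3 A5, and the euro sign (U+20AC) becomes the three bytes E2 82 AC.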

Applications unaware of UTF-8 will assume that their lines are longer than they really are. They will think, for example, that they have filled a line with 69 printable bytes and break it prematurely, not realizing that these 69 bytes make up only 23 multibyte characters. This leaves the screen looking emptier than necessary but would not be a big problem for Latin text with mainly ASCII and only an occasional UTF-8 character.
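The three competing ways of counting can be put side by side in a short sketch. The cell_width function is a rough approximation of my own: it assumes that combining marks take no cell and that East Asian wide or fullwidth characters take two, which is a convention and not something every terminal has to follow.

```python
import unicodedata

def byte_count(text):
    return len(text.encode("utf-8"))

def char_count(text):
    return len(text)

def cell_width(text):
    """Rough column count: 0 for combining marks, 2 for wide characters, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # drawn on top of the previous cell
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width

line = "あ" * 23          # 23 hiragana letters, three UTF-8 bytes each
accented = "e\u0301"      # e followed by COMBINING ACUTE ACCENT
```

For line the three counts come out as 69 bytes, 23 characters and 46 cells. For accented they are 3 bytes, 2 characters and 1 cell, so none of the three measures can stand in for another.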

Second, is it the terminal's or the application's task to handle bidirectionality, combining characters, and contextual joining? I would leave them out of the terminal at first and then start experimenting with some algorithms.

A look at the existing Unicode terminals

9term

9term, the Plan9 terminal emulator for X, can display and enter UTF-8 text in a simple bitmapped window with mouse editing functions but without cursor control. Its TERMCAP says:

9term: :am:bl=^G:do=^J:nl=^J:

That means that the only control function it has is the linebreak. You can run all your line-oriented tools in 9term such as cat, ed, sed, perl but none of the CUI tools. You can enter any Unicode character by hexadecimal number with Alt + Xabcd and some with mnemonic abbreviations.

uxterm

There has been a unicoded xterm called uxterm from the Multilingual Application Support Service (MASS) group at the Institute of Systems Science of the National University of Singapore who have now formed their own company Star + Globe Technologies. But uxterm is only available in binary form for AIX, HP-UX, IRIX, OSF, and SunOS and not in OpenSource so you cannot port it to Linux nor enhance it.

uxterm greets me with

This is a scaled down demo copy of UXTERM 2.4.1.
(c) COPYRIGHT Institute of Systems Science (ISS), Singapore, 1994, 1995.
For information about the full package, please refer to the WWW page:
    http://www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html

uxterm appears to be based on the original X11 xterm. With its TERM=xterm, its TERMCAP is good enough to run curses applications. I set LESSCHARSET=koi8-r to get less to pass UTF-8 bytes through transparently.

Instead of xterm's [Control]+[MouseButton3] font menu, uxterm comes with a menu:

Code Options
> UTF-8
  UTF-7
  GB
  HZ
  HZX
  Big5
------
> Byte Mode
  Character Mode

Here the horizontal cell counting problem is thus addressed by manually switching between a UTF-8-unaware byte mode and a UTF-8-aware character mode.

uxterm can also be run with newer Unicode fonts than the buggy uni24.bdf it came with:

	uxterm -fn '*-iso10646-1' &

The stripped down uxterm comes without any input methods but you can configure your own input mappings:

UTerm.VT100.Translations:#override:<Key>aring:string("å")

Linux console

Markus Kuhn hacked UTF-8 support into the Linux console which can be used with Yann Dirson's consoletools. The standard escape sequence ESC % G (=1B=25=47) switches the console into UTF-8 mode. All Unicode characters listed as displayable by one of the two loaded VGA fonts (limited to 512 characters) in the corresponding screen font map are displayed as such and all others are rendered as the default U+FFFD REPLACEMENT CHARACTER (black box or question mark).
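Both halves of that behavior can be sketched in a few lines: sending the switch sequence, and falling back to U+FFFD for anything the font map does not list. The font map here is a made-up set of characters, and displayable is only my model of what the console does internally, not its actual code. ESC % @ is the matching sequence that returns the console to its default character set.

```python
import io

UTF8_ON = b"\x1b%G"      # ESC % G, the bytes 1B 25 47
UTF8_OFF = b"\x1b%@"     # ESC % @, back to the default character set

REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER

def enter_utf8_mode(stream):
    """Write the switch sequence to a binary stream such as the console."""
    stream.write(UTF8_ON)

def displayable(text, font_map):
    """Replace every character missing from font_map with U+FFFD."""
    return "".join(ch if ch in font_map else REPLACEMENT for ch in text)

fake_console = io.BytesIO()      # stands in for the console device
enter_utf8_mode(fake_console)

tiny_font_map = set("aя")        # hypothetical, a real map has up to 512 entries
shown = displayable("aя漢", tiny_font_map)
```

On a real console you would write UTF8_ON to the terminal itself, for instance through sys.stdout.buffer, instead of to a BytesIO object.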

Roman Czyborra
November 29, 1998