Unicode Bidirectional Surprise Effects

This page is encoded in UTF-8. If your browser did not detect this, you may need to set the encoding manually. Please refer to the manual of your browser.

Unicode text editors with full bidirectional support must behave as if they implemented the official Unicode Bidirectional Algorithm. This algorithm is a convoluted process in which the logically ordered Unicode text is scanned in several passes and finally reordered into ~~illogical~~ visual order.

This document describes the unexpected effects of the Unicode Bidirectional Algorithm (UAX #9). If your browser does not have full, bug-free support for bidirectional characters, you might not see what I want to show you. You might need to get a compliant browser.

I have no affiliation with the Unicode Consortium. Never had, never will.

The Problem Of Not Having Arabic RLM

Unicode Standard Annex #9 requires:

W2: search backward from each instance of a European number until the first strong type (R, L, AL, or sor) is found. If an AL is found, change the type of the European number to Arabic number.

Apparently nobody considered that sor can never be AL, so a European number at the beginning of a line is never changed to an Arabic number:

X10: The remaining rules are applied to each run of characters at the same level. For each run, determine the start-of-level-run (sor) and end-of-level-run (eor) type, either L or R. This depends on the higher of the two levels on either side of the boundary (at the start or end of the paragraph, the level of the “other” run is the base embedding level). If the higher level is odd, the type is R, otherwise it is L.

I think this is ridiculous. In an Arabic context you will get:

Logical	Visual
-10% TEST ARABIC	TSET CIBARA -10%
ARABIC -10% TEST	TSET %10- CIBARA
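
The interaction of the two rules can be sketched in a few lines of Python. This is only an illustration, not Yudit's code and not the reference implementation: `sor_type` and `apply_w2` are names made up here, explicit embedding codes and all other rules are ignored, and the whole text is assumed to be a single level run that starts at the paragraph boundary.

```python
import unicodedata

def sor_type(outer_level, run_level):
    """X10: the higher of the two levels decides; odd means R, even means L."""
    return "R" if max(outer_level, run_level) % 2 else "L"

def apply_w2(text, sor):
    """Apply only rule W2 to the bidi classes of text (one level run assumed)."""
    last_strong = sor              # sor is only ever L or R, never AL
    result = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("L", "R", "AL"):
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            cls = "AN"             # European number becomes Arabic number
        result.append(cls)
    return result

ALEF = "\u0627"                    # ARABIC LETTER ALEF, bidi class AL
sor = sor_type(1, 1)               # right-to-left paragraph, run at level 1
print(sor)
print(apply_w2("10 " + ALEF, sor)) # number first: the digits stay EN
print(apply_w2(ALEF + " 10", sor)) # number after AL: the digits become AN
```

Because the search in W2 falls back to sor, and sor is R even in a purely Arabic paragraph, the leading number is treated differently from the same number placed after an Arabic letter.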

So what is the solution? The standard says that Higher-Level Protocols can:

Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to conclude that the document was fundamentally Arabic, and that EN should generally be converted to AN.

In Yudit I decided not to do this hack. The reason is this:

When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol...
What a hack...

The Problem Of Characters That Have Global Effects

What are these characters?

Segment Separator	Its effect is well defined, but surprising.
Boundary Neutral	Its location is not defined; it can pop up anywhere.

So let’s see what we get for the one that is at least defined: the Segment Separator, such as Tab. I tried to use RLE in my translation so that I could see in advance what this label will show as its text:

msgstr "‫سلام Gáspár محمد‬"

For this HTML document I have to write it this way:

  msgstr "‫سلام Gáspár	محمد‬"

As you see, I cannot protect the text. If you set Yudit Editor’s Document Text Alignment to the right, you will see what the label will show: something totally different.

Unfortunately the Unicode algorithm requires me to render it this way. UAX #9, rule L1:

On each line, reset the embedding level of the following characters to the paragraph embedding level:
1. Segment Separators.

Well, this means that even though this tab is embedded in our text, I have to reset it to this English document’s embedding level. If you use gettext, please write '\t' instead of a literal Tab.
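
The quoted part of rule L1 can be sketched as follows. This is a simplified illustration that handles only item 1 (Segment Separators); the function name `reset_segment_separators` and the input levels are assumptions made for the example, not output of a real bidi implementation.

```python
import unicodedata

def reset_segment_separators(text, levels, paragraph_level):
    """L1, item 1 only: characters of bidi class S fall back to the paragraph level."""
    return [paragraph_level if unicodedata.bidirectional(ch) == "S" else level
            for ch, level in zip(text, levels)]

text = "\u05d0\t\u05d1"       # HEBREW ALEF, TAB, HEBREW BET
levels = [1, 1, 1]            # assumed: an RLE embedding put everything at level 1
print(reset_segment_separators(text, levels, 0))
```

With a paragraph level of 0 the tab drops out of the embedding, which splits the right-to-left run in two no matter how the text was wrapped in RLE and PDF.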

The Problem Of Having Only One Set Of + - / * . % Characters

You might find it surprising that programs conforming to Unicode Standard Annex #9 must render the following text segments as you see them here. (I just substituted HEBREW with ‫עברית‬ and ARABIC with ‫العربية‬, and I also inserted a right-to-left embedding mark so that you can see what is going on.)

Surprise #1:

Input : HEBREW ~~~23%%% HEBREW abc
Output : ‫עברית ~~~23%%% עברית abc‬

Input : ARABIC ~~~23%%% ARABIC abc
Output : ‫العربية ~~~23%%% العربية abc‬

Surprise #2:

Input: HEBREW 1*5 1-5 1/5 1+5
Output: ‫עברית 1*5 1-5 1/5 1+5‬

Input: ARABIC 1*5 1-5 1/5 1+5
Output: ‫العربية 1*5 1-5 1/5 1+5‬

I have checked this against the Java reference code from the Unicode Consortium

http://www.unicode.org/unicode/reports/tr9/BidiReferenceJava/

so what you see here in Yudit is correct. Did you expect this? I feel that there is a fundamental flaw in the official Unicode Bidirectional Algorithm that cannot be solved unless there are separate character pairs for

+ - / * %

Without that, all you can do is wrap your mathematical equations in explicit direction overrides. And don’t use Tab, because that cannot be embedded. And don’t use Boundary Neutrals. What else? Did I miss something?
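
One way to see why these characters behave so differently next to numbers is to look at their bidirectional classes. The sketch below uses Python's standard `unicodedata` module; the helper name `bidi_classes` is made up for this example, and the classes it reports come from the Unicode data bundled with the interpreter, which may differ from the data that was current when this page was written.

```python
import unicodedata

def bidi_classes(chars):
    """Map each character to its bidirectional class (ES, CS, ET, ON, ...)."""
    return {ch: unicodedata.bidirectional(ch) for ch in chars}

classes = bidi_classes("+-/*.%")
for ch, cls in classes.items():
    print(repr(ch), cls)
```

The operators do not share one class: separators, terminators and plain neutrals are each resolved by different rules, so an expression mixes several behaviours at once.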

The Problem Of Irreversibility

The Unicode Bidirectional Algorithm is irreversible. In other words, the logical text can be reordered into visual order, but there is no way to guess what the logically ordered text is, just by looking at the visual text.

This is a serious problem for digital signatures. If you want to sign a document, what you sign is the bit-stream, but what you see is the text. As no reverse algorithm is provided, you cannot possibly know what you are signing just by looking at the text.
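
The signature problem can be illustrated with a toy example. The renderer below is not UAX #9: `toy_visual` is a hypothetical helper that understands only RLO (U+202E) and PDF (U+202C) and simply reverses the text between them. It is enough to show two different bit-streams that look the same on screen.

```python
import hashlib

RLO, PDF = "\u202e", "\u202c"   # right-to-left override, pop directional formatting

def toy_visual(logical):
    """Toy renderer: reverse each RLO...PDF run and drop the control codes."""
    out, run, in_override = [], [], False
    for ch in logical:
        if ch == RLO and not in_override:
            in_override = True
        elif ch == PDF and in_override:
            out.extend(reversed(run))
            run, in_override = [], False
        elif in_override:
            run.append(ch)
        else:
            out.append(ch)
    out.extend(reversed(run))   # an unterminated override runs to the end
    return "".join(out)

def digest(text):
    """SHA-256 of the UTF-8 bytes, standing in for what a signature covers."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

plain = "pay 100"
tricky = "pay " + RLO + "001" + PDF

print(toy_visual(plain) == toy_visual(tricky))  # same picture on screen
print(digest(plain) == digest(tricky))          # different bytes being signed
```

Both strings display as the same text in this toy model, yet their digests differ, so the visual text alone does not tell you which one you are signing.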

The Problem Of Stateful Encoding

Unicode advocates always laughed at stateful encodings such as ISO-2022-x. In fact, the statefulness they introduced with the explicit bidirectional marks is even worse, and it would make binary editing of Unicode text files with a proper undo operation next to impossible.
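
A small sketch of what the statefulness means for an editor: cutting a slice out of a text can separate an embedding code from its PDF. The function `net_embedding_change` is a made-up name, and counting openers against closers is a simplification (a real implementation also has to deal with depth limits and with PDFs that have no matching opener).

```python
OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}   # LRE, RLE, LRO, RLO
PDF = "\u202c"

def net_embedding_change(fragment):
    """Openers minus PDFs; non-zero means the fragment depends on outside state."""
    change = 0
    for ch in fragment:
        if ch in OPENERS:
            change += 1
        elif ch == PDF:
            change -= 1
    return change

text = "abc \u202bdef\u202c ghi"
print(net_embedding_change(text))       # balanced as a whole
print(net_embedding_change(text[:6]))   # slice keeps the RLE but loses its PDF
print(net_embedding_change(text[6:]))   # slice keeps a PDF with no opener
```

Any cut, paste or undo that operates on such a slice changes how the rest of the file is displayed, which is exactly the property stateless encodings were supposed to avoid.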

Remarks

I tested Yudit and found that it is probably 100% compliant with the full Unicode Bidirectional Algorithm of UAX #9. I cannot prove this, because it is not possible to test it properly (the Unicode algorithm is inherently untestable). However,

I do not think that the UAX #9 algorithm is good.

Moreover, I think that that algorithm should be replaced with one that makes more sense. My clean-room implementation of the implicit algorithm mostly lies in

stoolkit/SBiDi.h stoolkit/SBiDi.cpp

You can use it in your GNU programs. If the Unicode Consortium ever changes its mind, it would be very easy to replace that file.

So how much is:

Input: HEBREW 10-2*5
Output: ‫עברי 10-2*5‬
If you do not see the expected rendering here, your browser either lacks full bidirectional support or is buggy. This means that you saw these pages all wrong. You should download Yudit and type “howto bidi” in the command area of the editor.

Input: ARABIC 10-2*5
Output: ‫العربية 10-2*5‬
If you do not see the expected rendering here, your browser either lacks full bidirectional support or is buggy. This means that you saw these pages all wrong. You should download Yudit and type “howto bidi” in the command area of the editor.

It is your choice. They both have 0 values, literally.

Gaspar Sinai
Last updated: 2002-11-21