WARNING: This file is encoded with Yudit's utf-8-s encoder.

I downloaded Markus Kuhn's UTF-8-demo.txt test file from:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

This file contains purposefully malformed sequences. UTF-8 text files should not contain surrogates. Yudit reads them and indicates that they arrived as surrogates, but displays them as supplementary plane characters. Glyph Info clearly indicates that something is wrong. When you move the cursor after this character: 𐀀
Glyph Info: [sgt:00010000] DC80 DC00 is displayed. For well-formed sequences Glyph Info should never show [sgt:]:
http://www.unicode.org/versions/corrigendum1.html

When such surrogates are written back to disk, Yudit's built-in utf-8 converter writes the shortest form, as required by UTF-8, so they are not written back as surrogates but as shorter supplementary plane characters. To keep the binary integrity of the file, including malformed sequences and surrogate utf-8 characters, use Yudit's built-in utf-8-s converter instead of utf-8.

Using the built-in utf-8-s converter is not recommended; use it only for testing. The utf-8 encoder, on the other hand, always generates the shortest form.

Gáspár Sinai 2002-11-22

5.1 Single UTF-16 surrogates

5.1.1 U+D800 = ed a0 80 = "í €"
5.1.2 U+DB7F = ed ad bf = "í­¿"
5.1.3 U+DB80 = ed ae 80 = "í®€"
5.1.4 U+DBFF = ed af bf = "í¯¿"
5.1.5 U+DC00 = ed b0 80 = "í°€"
5.1.6 U+DF80 = ed be 80 = "í¾€"
5.1.7 U+DFFF = ed bf bf = "í¿¿"

5.2 Paired UTF-16 surrogates

5.2.1 U+D800 U+DC00 = ed a0 80 ed b0 80 = "𐀀"
5.2.2 U+D800 U+DFFF = ed a0 80 ed bf bf = "𐏿"
5.2.3 U+DB7F U+DC00 = ed ad bf ed b0 80 = "í­¿í°€"
5.2.4 U+DB7F U+DFFF = ed ad bf ed bf bf = "í­¿í¿¿"
5.2.5 U+DB80 U+DC00 = ed ae 80 ed b0 80 = "󰀀"
5.2.6 U+DB80 U+DFFF = ed ae 80 ed bf bf = "󰏿"
5.2.7 U+DBFF U+DC00 = ed af bf ed b0 80 = "􏰀"
5.2.8 U+DBFF U+DFFF = ed af bf ed bf bf = "􏿿"
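

The behaviour described in the warning at the top of this file (surrogates are accepted on input, but the utf-8 converter writes the shortest form on output) can be illustrated with a small Python sketch. This is not Yudit's implementation: the function name decode_utf8_s is hypothetical, and Python's "surrogatepass" error handler is assumed here as a stand-in for a lenient utf-8-s style reader.

```python
def decode_utf8_s(raw: bytes) -> str:
    """Leniently decode UTF-8 that may contain encoded surrogates.

    Hypothetical helper for illustration only; not Yudit's converter.
    """
    # "surrogatepass" lets the decoder accept ed a0 80 .. ed bf bf,
    # which a strict UTF-8 decoder must reject.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 so that a high/low surrogate pair is
    # merged into a single supplementary plane code point.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le", errors="surrogatepass")


# Test 5.2.1: U+D800 U+DC00 stored as two 3-byte sequences.
paired = bytes.fromhex("eda080edb080")

try:
    paired.decode("utf-8")          # strict decoding
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False               # surrogates are malformed in UTF-8

merged = decode_utf8_s(paired)      # one code point, U+10000
shortest = merged.encode("utf-8")   # 4 bytes instead of the original 6
print(strict_ok, hex(ord(merged)), shortest.hex(" "))
```

Writing the decoded text back with the strict encoder yields the 4-byte form, so the original 6-byte sequence is not preserved. That loss is the reason the warning points to the utf-8-s converter when the binary integrity of the file matters.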