Richard Suchenwirth
Siemens Dematic PA RC D2, Konstanz, Germany
mailto:Richard.Suchenwirth-Bauersachs@siemens.com
Abstract: Problems and solutions in working with the writing systems used around the world (internationalization, "i18n") are discussed, with emphasis on encodings, input, output, and the quite satisfying solutions in the Tcl scripting/programming language.
A character is not the same as a glyph, the writing element that we see on screen or paper and that represents it: the same glyph can stand for different characters, and the same character can be represented with different glyphs (think e.g. of a font selector).
Also, a character is not the same as a byte, or sequence of bytes, in memory. That again may represent a character, but not unequivocally, once we leave the safe haven of ASCII.
Let's try the following working definition: "A character is the abstract concept of a small writing unit". This often amounts to a letter, digit, or punctuation sign - but a character can be more or less than that. More: Ligatures, groups of two or more letters, can at times be treated as one character (arranged even in more than one line, as seen in Japanese U+337F ㍿ or Arabic U+FDFA ﷺ). Less: little marks (diacritics) added to a character, like the two dots on ü in Nürnberg (U+00FC), can turn that into a new "precomposed" character, as in German; or they may be treated as a separate, "composing character" (U+0308 in the example) which in rendering is added to the preceding glyph (u, U+0075) without advancing the rendering position - a more sensible treatment of the function of these two dots, "trema", in Spanish, Dutch, or even (older) English orthography: consider the spelling "coöperation" in use before c. 1950. Such composition is the software equivalent of "dead keys" on a typewriter.
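The difference is directly visible in Tcl, where the precomposed form counts as one character and the composing form as two (a quick interactive check):

% string length \u00fc     ;# precomposed u-umlaut
1
% string length u\u0308    ;# u followed by combining diaeresis
2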
Although an abstract concept, a character may of course have attributes, most importantly a name: a string which describes its function, usage, pronunciation etc. Various sets of names have been formalized, e.g. in PostScript (/odieresis) or HTML (&ouml;). Very important in technical applications is of course the assignment of a number (typically a non-negative integer) to identify a character - this is the essence of encodings, where the numbers are more formally called code points. Other attributes may be predicates like "is upper", "is lower", "is digit".
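In Tcl, code points and such predicates are directly accessible; a few interactive examples:

% scan \u00fc %c           ;# code point of u-umlaut
252
% format %c 0x20ac         ;# character at code point U+20AC
€
% string is upper \u00dc   ;# Ü
1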
The relations between the three domains are not too complicated: an encoding controls how a sequence of 1..n bytes is interpreted as a character, or vice versa; the act of rendering turns an abstract character into a glyph (typically a pixel or vector pattern). Conversely, making sense of a set of pixels as the sequence of characters it represents is the much more difficult art of OCR, which earns my daily bread, but is not the topic of this talk.
The most important purpose, outside the US, was of course to accommodate more letters required to represent the national writing system - Greek, Russian, or the mixed set of accented or "umlauted" characters used in virtually every country in Europe. Even England needed a code point for the Pound Sterling sign. The general solution was to use the 128 additional positions available when ASCII was implemented as 8-bit bytes, hex 80..FF. A whole flock of such encodings were defined and used:
For example, a character at row 16, column 1 of such a 94x94 code table is encoded as the byte pair

 16 + 32 + 128 = 176 = 0xB0
  1 + 32 + 128 = 161 = 0xA1

This implementation became known as "Extended UNIX Code" (EUC) in the national flavors euc-cn (China), euc-jp (Japan), and euc-kr (Korea). Microsoft did it differently and adopted the similar but different Shift-JIS (Japan) and "Big 5" (Taiwan, Hong Kong) encodings, adding to the "ideograph soup" confusion.
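A tiny helper (name and interface are my own, purely for illustration) makes the arithmetic explicit:

proc eucBytes {row col} {
    # EUC byte pair for position (row, col) in a 94x94 character set,
    # rows and columns counted from 1: add 32 (to skip the ASCII controls
    # and space) and set the high bit (128)
    format "%02X %02X" [expr {$row + 32 + 128}] [expr {$col + 32 + 128}]
}

% eucBytes 16 1
B0 A1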
From Unicode version 3.1, the 16-bit limit was transcended for some rare writing systems, but also for the CJK Unified Ideographs Extension B - apparently, even 65536 code positions are not enough. The total count in Unicode 3.1 is 94,140 encoded characters, of which 70,207 are unified Han ideographs; the next biggest group is over 14,000 Korean Hangul syllables. And the number is growing.
Unicode 4.0.0 is the latest version, reported to contain 96,248 graphic characters, 134 format characters, 65 control characters, 137,468 private-use code points, 2,048 surrogates, and 66 noncharacters; 878,083 code points are reserved for what the future will bring. From www.unicode.org/versions/Unicode4.0.0: "1,226 new character assignments were made to the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts; Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants)."
It follows from this that bytes in UTF-8 encoding fall in distinct ranges:
00..7F - plain old ASCII
80..BF - non-initial bytes of multibyte code
C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
FE, FF - never used (so, free for byte-order marks)

The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. Besides, UTF-8 is independent of byte order (as opposed to UCS-2, see below). Tcl however shields these UTF-8 details from us: characters are just characters, no matter whether 7 bit, 16 bit, or (in the future) more.
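These ranges make byte-level plausibility checks easy to write; a little sketch (the procedure name is mine, the classification mirrors the table above):

proc utf8ByteClass b {
    # classify an integer byte value 0..255 by its possible role in UTF-8
    if {$b <= 0x7F} {return ascii}
    if {$b <= 0xBF} {return continuation}
    if {$b >= 0xC2 && $b <= 0xFD} {return initial}
    return invalid   ;# C0, C1, FE, FF never occur in UTF-8 data
}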
I found out by chance that the byte sequence EF BB BF is the UTF-8 equivalent of \uFEFF, and the humble Notepad editor of Windows 2000 indeed switches to UTF-8 encoding when a file starts with these three bytes. I don't know how widely used this convention is, but I like it - my i18n-aware Tcl code will adopt it for reading and writing files, in addition to the \uFEFF treatment that I already do.
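A sketch of how such a file could be read in Tcl, following the convention just described (the procedure name and the fallback to the system encoding are my own choices):

proc readTextFile path {
    set f [open $path r]
    fconfigure $f -translation binary
    if {[read $f 3] eq "\xEF\xBB\xBF"} {
        # UTF-8 byte-order mark found: read the rest as UTF-8
        fconfigure $f -encoding utf-8 -translation auto
    } else {
        # no BOM: rewind and fall back to the system encoding
        seek $f 0 start
        fconfigure $f -encoding [encoding system] -translation auto
    }
    set text [read $f]
    close $f
    return $text
}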
The UCS-2 representation (in Tcl just called the "unicode" encoding) is much more easily explained: each character code is written as a 16-bit "short" unsigned integer. The practical complication is that the two memory bytes making up a "short" can be arranged in "big-endian" (Motorola, SPARC) or "little-endian" (Intel) byte order. Hence, the following rules were defined for Unicode:

Serialized Unicode text should begin with the byte-order mark U+FEFF ("zero-width no-break space").
The byte-swapped value FFFE is guaranteed never to be a character, so a reader that finds FF FE at the start knows it has to swap the byte order of the following data.
If no byte-order mark is present, big-endian order is assumed.
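The byte order can be seen directly by converting a character with Tcl's "unicode" encoding (the output below is from a little-endian PC; a big-endian machine yields 0041):

% binary scan [encoding convertto unicode A] H* hex; set hex
4100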
Not all i18n issues are therefore automatically solved for the user. One still has to analyze seemingly simple tasks like uppercase conversion (the Turkish dotted/undotted I is an anomaly) or sorting ("collation order" is not necessarily the numeric order of the Unicode code points, which [lsort] would apply by default), and write custom routines if more correct behavior is required. Other locale-dependent i18n issues like number/currency formatting and date/time handling also belong to this group. I recommend starting from the defaults Tcl provides and, if necessary, customizing the appearance as desired. International data exchange is severely hampered if localized numeric data are exchanged, one side using a period, the other a comma as decimal point...
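For example, a comparison procedure handed to [lsort -command] can approximate a locale's collation; this toy version (entirely my own, far from a full collation) treats German umlauts as base vowel + e for a rough phone-book order:

proc deCompare {a b} {
    # map umlauts and sharp s before comparing; crude, but shows the idea
    set map {ä ae ö oe ü ue Ä Ae Ö Oe Ü Ue ß ss}
    string compare [string map $map $a] [string map $map $b]
}

% lsort -command deCompare {Möller Muller Mueller}
Möller Mueller Muller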
Strictly speaking, the Tcl implementation "violates the UTF-8 spec, which explicitly forbids non-canonical representation of characters and requires that malformed UTF-8 sequences in the input be errors. ... I think that to be an advantage. But the spec says 'MUST' so we're at least technically non-compliant." (Kevin B. Kenny in the Tcl chat, 2003-05-13)
If textual data are internal to your Tcl script, all you have to know is the \uxxxx notation, which is substituted with the character at Unicode code point U+xxxx (hexadecimal). This notation can be used wherever Tcl substitution takes place, even in braced regexps and [string map] pairlists; otherwise you can force it by [subst]ing the string in question.
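A short interactive check (the euro sign is just a sample character):

% set s "price: 100 \u20ac"    ;# \uxxxx substituted inside double quotes
price: 100 €
% regexp {\u20ac} $s           ;# the regexp engine understands it inside braces too
1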
To demonstrate that, for instance, [scan] works transparently, here's a one-liner to format any Unicode character as an HTML hex entity:
proc c2html c {format "&#x%4.4x;" [scan $c %c]}

Conversely, it takes a few lines more:
proc html2u string {
    while {[regexp {&#[xX]([0-9A-Fa-f]+);} $string matched hex]} {
        regsub -all $matched $string [format %c 0x$hex] string
    }
    set string
}

% html2u "this is a &#x20ac; sign"
this is a € sign
For all other purposes, two commands basically provide all i18n support: [fconfigure $ch -encoding $enc], which sets the encoding used when reading from or writing to an open channel, and [encoding convertfrom/convertto $enc $string], which converts a string in memory from, or to, the given encoding.
For instance, I typed

 format %x [scan [encoding convertfrom utf-8 \xef\xbb\xbf] %c]

in an interactive tclsh, and found that it stood for the famous byte-order mark FEFF.
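A minimal round trip between the two conversion directions (the string and target encoding are arbitrary examples):

% set bytes [encoding convertto iso8859-1 K\u00e4se]   ;# Unicode string -> 8-bit data
Käse
% binary scan $bytes H* hex; set hex
4be47365
% encoding convertfrom iso8859-1 $bytes                ;# ...and back to Unicode
Käse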
Internally to Tcl, (almost) everything is a Unicode string. All communication with the operating system is done in the "system encoding", which you can query (but best not change) with the [encoding system] command. Typical values are iso8859-1 or iso8859-15 on European Linuxes, and cp1252 on European Windowses.
Finally, the msgcat package supports localization ("l10n") of apps by allowing message catalogs for the translation of strings, typically for GUI display, from a base language (typically English) to a target language selected by the current locale. For example, an app to be localized for France might contain a message catalog file fr.msg with, for simplicity, only the line
msgcat::mcset fr File Fichier

In the app itself, all you need is
package require msgcat
namespace import msgcat::mc
msgcat::mclocale fr ;#(1)
#...
pack [button .b -text [mc File]]

to have the button display the localized text for "File", namely "Fichier", as obtained from the message catalog. For other locales, only a new message catalog has to be produced by translating from the base language. Instead of the explicit setting as in (1), the locale information typically comes from an environment variable (LANG) or a registry setting.
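For the catalog to be found at run time, message files are usually loaded with msgcat::mcload from a directory next to the script; a minimal sketch (the directory name msgs is just a convention I assume here):

package require msgcat
namespace import msgcat::mc
msgcat::mclocale fr
msgcat::mcload [file join [file dirname [info script]] msgs]
pack [button .b -text [mc File]]   ;# shows "Fichier" if msgs/fr.msg defines it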
But having a good font is still not enough. While strings in memory are arranged in logical order, with addresses increasing from beginning to end of text, they may need to be rendered in other ways, with diacritics shifted to various positions of the preceding character, or - most evidently for the languages written from right to left ("r2l"), Arabic and Hebrew - in reversed direction. (Tk still lacks automatic "bidi"rectional treatment, so r2l strings have to be directed "wrongly" in memory to appear right when rendered - see [A simple Arabic renderer] on the Wiki.)
Correct bidi treatment has consequences for cursor movement, line justification, and line wrapping as well. Vertical lines progressing from right to left are popular in Japan and Taiwan - and vertical rendering is mandatory if you have to render traditional Mongolian (whose columns, however, run from left to right).
Indian scripts like Devanagari are alphabets with about 40 characters, but the sequence of consonants and vowels is partially reversed in rendering, and consonant clusters must be rendered as ligatures of the two or more characters involved - the pure single letters would look very ugly to an Indian reader. An Indian font for one writing system already contains several hundred glyphs. Unfortunately, Indian ligatures are not contained in Unicode (while Arabic ones are), so various vendor standards apply for encoding such ligatures.
Finally, a "virtual keyboard" on screen, where characters are selected by mouse click, is especially helpful for non-frequent use of rarer characters, since the physical keyboard gives no hints which key is mapped to which other code. This can be implemented by a set of buttons, or minimally with a canvas that holds the provided characters as text items, and bindings to <1>, so clicking on a character inserts its code into the widget which has keyboard focus.
The term "input methods" is often used for operating-system-specific i18n support, but I have no experiences with this, doing i18n from a German Windows installation. So far I'm totally content with hand-crafted pure Tcl/Tk solutions - see taiku on the Wiki.
The Tclers' Wiki http://mini.net/tcl/ has the members of the Lish family available for copy'n'paste. The ones I most frequently use are
arblish    dby w Abw Zby
greeklish  Aqh'nai
hanglish   se-qul
heblish    irwsliM
ruslish    Moskva i Leningrad
Programming languages are a matter of taste, but I can only testify that the Tcl language is best prepared for dealing with i18n issues: for instance, it took me four lines of code to "upgrade" my English eBook reader iRead so it could also render a Chinese eBook in Big-5 encoding, and it was similarly simple to extend the dictionary browser iDict to accept UTF-8 files. I recommend Tcl for working with many languages. Even if you don't want or need that, Unicode (plus suitable fonts) can offer surprising support, e.g. providing glyphs for chessmen or card suit symbols...