Unicode Text Processing

Unicode is a character standard designed to allow all possible characters from all known writing systems to be represented on computer systems in an interchangeable form.

Under Unicode, every character is assigned a unique numeric value. In this respect, Unicode is similar to ASCII, EBCDIC and other common character standards used on various computer platforms. What distinguishes Unicode is its scope: this single character standard is capable of consistently and unambiguously representing tens of thousands of unique characters from a vast range of different languages and writing systems.

The mapping of characters to numeric values defined by Unicode is also known as the Universal Character Set (UCS).

Note: Besides UCS, the Unicode standard also defines various algorithms and other implementation requirements. In practice, however, the terms "UCS" and "Unicode" are often used interchangeably.

To maximize compatibility with legacy ASCII-based encodings, UCS values 0x20 through 0x7E correspond to the displayable ASCII character set (0x7F is the non-printing DEL control character). It is therefore easy to convert these characters to and from Unicode; the specifics of conversion depend on the precise Unicode encoding used.
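The correspondence between ASCII and the low UCS values can be sketched as follows. This is only an illustration, not part of the ULS API: the function names ascii_to_ucs and ucs_to_ascii and the choice of "?" (0x3F) as a substitution character are assumptions made for the example.

```python
def ascii_to_ucs(data):
    """Map 7-bit ASCII bytes to UCS codepoints.

    The mapping is the identity: each ASCII byte value is also
    the UCS value of the same character.
    """
    codepoints = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("byte 0x%02X is not 7-bit ASCII" % byte)
        codepoints.append(byte)
    return codepoints


def ucs_to_ascii(codepoints, substitute=0x3F):
    """Map UCS codepoints back to ASCII bytes.

    Codepoints above 0x7F have no ASCII equivalent and are replaced
    with a substitution character (hypothetical default: '?').
    """
    return bytes(cp if 0 <= cp <= 0x7F else substitute for cp in codepoints)
```

For example, ascii_to_ucs(b"Hi!") yields the codepoints 0x48, 0x69 and 0x21, and converting the euro sign (U+20AC) back to ASCII yields the substitution character. How the codepoints are actually stored in memory or on disk still depends on the Unicode encoding in use.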

Note: Unicode values are traditionally written in the form U+x, where x is the character's UCS codepoint in hexadecimal.
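As a small illustration of this notation, the following sketch converts between numeric codepoints and their U+x form. The helper names format_codepoint and parse_codepoint are hypothetical, and padding to at least four hexadecimal digits follows common convention rather than anything stated above.

```python
def format_codepoint(cp):
    """Return the U+x notation for a numeric codepoint."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside the Unicode codespace")
    # By convention, at least four hexadecimal digits are shown.
    return "U+%04X" % cp


def parse_codepoint(text):
    """Return the numeric codepoint for a string in U+x notation."""
    if not text.upper().startswith("U+"):
        raise ValueError("expected a value of the form U+x")
    cp = int(text[2:], 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside the Unicode codespace")
    return cp
```

With these helpers, the letter "A" (decimal 65) is written U+0041, and the string "U+20AC" parses to the value 0x20AC.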

Version 1 of the Unicode standard defined a range of 65,536 (2^16) character values (or codepoints), intended to cover most of the world's modern writing systems as well as many other common symbols.

The ultimate goal of Unicode, however, is to support all forms of text, including ancient languages, mathematical and musical symbols, and rare or obscure characters. To this end, the Unicode standard was expanded (as of version 2.0) to 17 planes, each of which contains 65,536 codepoints, for a total codespace of 1,114,112 values (U+0000 through U+10FFFF).
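The plane arithmetic can be checked with a short sketch. Because each plane holds 65,536 (0x10000) codepoints, the plane number is simply the codepoint divided by 0x10000. The constant and function names below are assumptions chosen for the example.

```python
PLANE_SIZE = 0x10000     # 65,536 codepoints per plane
PLANE_COUNT = 17         # planes 0 through 16
MAX_CODEPOINT = PLANE_COUNT * PLANE_SIZE - 1   # last value in the codespace


def plane_of(cp):
    """Return the plane number (0 to 16) that contains a codepoint."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("codepoint outside the Unicode codespace")
    return cp >> 16      # same as cp // PLANE_SIZE


def offset_in_plane(cp):
    """Return the position of a codepoint within its plane."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("codepoint outside the Unicode codespace")
    return cp & 0xFFFF   # same as cp % PLANE_SIZE
```

Here PLANE_COUNT * PLANE_SIZE equals 1,114,112 and MAX_CODEPOINT equals 0x10FFFF, matching the figures given above. A value such as U+1D11E falls in plane 1 at offset 0xD11E.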

The 65,536 codepoints defined by the original standard (U+0000 - U+FFFF) are now known as Plane 0 or the Basic Multilingual Plane (BMP). As of the latest versions of Unicode, the character sets supported by the BMP include:

Arabic
Armenian
Braille patterns
Burmese
Canadian Aboriginal unified syllabary
Cherokee
Chinese, Japanese and Korean (CJK) unified ideographic characters
Chinese scripts and symbols
Coptic
Cyrillic
Diacritics and modifying characters
Ethiopic
Filipino scripts (Tagalog, Hanunoo, Buhid, and Tagbanwa)
Georgian
Greek
Hebrew
Indic scripts (Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, and Sinhala)
International Phonetic Alphabet (IPA)
Japanese scripts and symbols
Khmer
Korean scripts and symbols
Lao
Latin (including Western European, Slavic, Baltic, Nordic, and Turkish)
Limbu
Mongolian
Ogham
Runic
Symbols (punctuation, currency symbols, number forms, mathematical, notational and technical symbols, dingbats, arrows, shapes, and drawing characters)
Syriac
Thai
Tibetan

Planes 1 and 2 contain mostly obscure, historical, and/or specialized notational characters; most of the remaining planes are currently unassigned. See Appendix A for more information.

At present, OS/2 supports only the Basic Multilingual Plane. Within that limit, the Unicode version supported by the ULS API is approximately equivalent to Unicode 2.1. This is principally important with respect to the codepage conversion, text transformation, and character classification functions. Appendix A includes a list of character sets supported by version 2 of Unicode.

Actual rendering of characters defined by later versions of the Unicode standard should be possible, as this support depends on the specific font(s) used, and not the ULS APIs themselves.
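Because only the Basic Multilingual Plane is supported, an application may want to pre-screen text for codepoints above U+FFFF before handing it to BMP-only functions. The following is a minimal sketch of such a check. It does not use or model the ULS API, and the function name non_bmp_characters is hypothetical.

```python
BMP_MAX = 0xFFFF  # last codepoint of Plane 0 (the BMP)


def non_bmp_characters(text):
    """Return (index, codepoint) pairs for characters outside the BMP."""
    return [
        (index, ord(ch))
        for index, ch in enumerate(text)
        if ord(ch) > BMP_MAX
    ]
```

An empty result means every character lies in the range U+0000 through U+FFFF. A string containing, for example, U+1D11E would be reported with that character's position and value.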
