There are several other common encodings for Unicode text besides UCS-2. The encoding which is used in any particular context normally depends on a number of factors, including:

Unicode encodings are either fixed-width (in which each character is represented by a fixed number of bytes), or variable-width (in which the number of bytes required to represent any given character varies depending on the character's value).
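The difference between the two kinds of encoding can be seen by measuring a few characters. The following is a minimal sketch, not part of the ULS API; the sample characters and the helper name `byte_widths` are chosen here purely for illustration.

```python
# Compare how many bytes each sample character occupies in a fixed-width
# encoding (UTF-32) and in two variable-width encodings (UTF-16, UTF-8).
# The "-be" codec variants are used so that no byte-order mark is added.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # illustrative characters only


def byte_widths(ch):
    """Return the encoded length in bytes of one character, per encoding."""
    return {
        "utf-32": len(ch.encode("utf-32-be")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-8": len(ch.encode("utf-8")),
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_widths(ch))
```

UTF-32 reports 4 bytes for every sample, whereas the UTF-8 length grows from 1 to 4 bytes as the character's value increases.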

The most common Unicode encoding formats are listed below. Not all of these are supported by the ULS API; unsupported encodings are indicated by an asterisk (*).

Common Unicode Encodings

UCS-2
UTF-32 *
UTF-16 *
UTF-8
UTF-7 *
UPF-8

Encoding Comparison

UCS-2 was the standard encoding used in the days before Unicode was expanded beyond the Basic Multilingual Plane. UCS-4 was subsequently introduced to allow support for the larger Unicode codespace. However, UCS-4 proved somewhat impractical, partly because of its excessive storage requirements, and partly because it is not backwards-compatible with existing UCS-2 data streams.

The introduction of UTF-16 provided a solution to both of these problems. Consequently, UTF-16 is commonly used for internal Unicode processing on many of the platforms which originally used UCS-2 (although OS/2, as noted, continues to use UCS-2 even today).
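How UTF-16 remains compatible with UCS-2 while still reaching beyond the Basic Multilingual Plane can be sketched as follows. This is an illustrative sketch only; the function name `utf16_code_units` is hypothetical and is not a ULS function.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit code units for one Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # BMP character: a single unit, identical to its UCS-2 value.
        return [code_point]
    # Supplementary character: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # lower 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x20AC)])   # BMP: one code unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # beyond the BMP: two code units
```

Characters within the Basic Multilingual Plane come out unchanged, which is why existing UCS-2 data remains valid UTF-16; only characters outside it need the two-unit form.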

All three of these encodings, however, share several other disadvantages:

The UTF-8 encoding format eliminates all of these disadvantages. However, encoding and decoding UTF-8 takes more computation than the other formats require, so processing overhead can become an issue when handling UTF-8 data.
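To give a sense of the work involved, here is a minimal sketch of a UTF-8 encoder for a single character. The name `utf8_encode` is hypothetical; real code would normally rely on a library routine rather than on a hand-written encoder like this one.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 7 bits: a single byte, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: lead byte 110xxxxx plus one continuation byte 10xxxxxx.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: lead byte 1110xxxx plus two continuation bytes.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: lead byte 11110xxx plus three continuation bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check against Python's built-in codec for a few illustrative values.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Every character costs a range test plus several shifts and masks, whereas a fixed-width format stores the value directly.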

In general, UTF-8 is ideal for data storage and interchange, whereas UCS-2 and UTF-16 are preferred for internal processing.

Note: UTF-8 and UTF-16 also share the property that no sequence of code units (bytes in UTF-8, 16-bit units in UTF-16) representing a valid character can occur within a longer sequence representing a different character; this is important when parsing or searching text. Many other variable-width encodings, including UTF-7 and most legacy multi-byte codepages, do not share this property.
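The practical effect on searching can be shown with a small sketch. Shift-JIS is used here as a stand-in for a legacy multi-byte codepage; that choice, the sample character, and the helper name `contains_byte` are assumptions made for illustration only.

```python
# U+30BD (katakana "so") is a well-known problem character in Shift-JIS:
# its second byte is 0x5C, the same value as an ASCII backslash.
KATAKANA_SO = "\u30bd"
BACKSLASH_BYTE = b"\x5c"


def contains_byte(text, encoding, needle):
    """Report whether a naive byte search finds `needle` in the encoded text."""
    return needle in text.encode(encoding)


# Shift-JIS: the search reports a backslash that is not really there.
print(contains_byte(KATAKANA_SO, "shift_jis", BACKSLASH_BYTE))  # True
# UTF-8: every byte of a multi-byte character is 0x80 or above, so no match.
print(contains_byte(KATAKANA_SO, "utf-8", BACKSLASH_BYTE))      # False
```

A parser that scans Shift-JIS data byte by byte for path separators or escape characters can therefore misfire on ordinary text, whereas the same scan over UTF-8 cannot.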

Naming Conventions

The original fixed-width UCS encoding formats (UCS-2 and UCS-4) were named according to the convention UCS-x, where x represents the fixed number of bytes used for each character.

With the introduction of variable-width encodings (UTF-16, UTF-8 and UTF-7), the naming convention for Unicode encodings was changed to UTF-y, where UTF stands for "Unicode Transformation Format", and y represents the minimum number of bits required to represent a single character.

UCS-4 was accordingly renamed UTF-32. UCS-2 is now treated as a subset of UTF-16; the two terms should not, however, be used interchangeably.

