There are several other common encodings for Unicode text besides UCS-2. The encoding used in any particular context normally depends on a number of factors, including storage requirements, compatibility with existing data, and processing efficiency.
Unicode encodings are either fixed-width (in which each character is represented by a fixed number of bytes), or variable-width (in which the number of bytes required to represent any given character varies depending on the character's value).
The most common Unicode encoding formats are described below. Not all of these are supported by the ULS API; unsupported encodings are noted as such in their descriptions.
Common Unicode Encodings
UCS-2
The main advantage of UCS-2 is its simplicity: a UCS-2 value corresponds directly to its character's UCS codepoint, with no conversion needed. It also requires only two bytes to represent any character within the BMP (characters outside the BMP cannot be represented in UCS-2 at all).
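As an illustration, the following sketch shows this direct correspondence (a minimal standalone example; the UniChar typedef is defined locally here as a 16-bit code unit, and the sample text is arbitrary):

    #include <stdio.h>

    typedef unsigned short UniChar;   /* one 16-bit UCS-2 code unit */

    int main(void)
    {
        /* Each array element is both the stored value and the UCS codepoint. */
        UniChar text[] = { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x0000 };  /* "Hello" */
        int i;

        for (i = 0; text[i] != 0; i++)
            printf("U+%04X\n", text[i]);   /* no decoding step is required */
        return 0;
    }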
UTF-32
UTF-32 (formerly known as UCS-4) is a fixed-width encoding which uses four bytes for every character, and can therefore represent the entire Unicode codespace directly. Its main drawback is its storage requirement, as discussed in the comparison below.
UTF-32 is not supported by ULS.
UTF-16
The main advantage of UTF-16 is that it is backwards-compatible with UCS-2: any valid UCS-2 character is also a valid UTF-16 character. It thus retains most of UCS-2's advantages, while adding the ability to encode Unicode values beyond the BMP.
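To illustrate, the following sketch shows the standard UTF-16 algorithm for splitting a codepoint beyond the BMP into a pair of reserved 16-bit values (a surrogate pair); the function name is illustrative and is not part of the ULS API:

    #include <stdio.h>

    typedef unsigned short UniChar;   /* one 16-bit code unit */
    typedef unsigned long  UCS4;      /* wide enough for any UCS codepoint */

    /* Encode one codepoint as UTF-16; returns the number of code units (1 or 2). */
    int EncodeUtf16(UCS4 cp, UniChar out[2])
    {
        if (cp < 0x10000UL) {                        /* BMP: identical to UCS-2 */
            out[0] = (UniChar)cp;
            return 1;
        }
        cp -= 0x10000UL;                             /* 20 significant bits remain */
        out[0] = (UniChar)(0xD800 + (cp >> 10));     /* high (leading) surrogate */
        out[1] = (UniChar)(0xDC00 + (cp & 0x3FF));   /* low (trailing) surrogate */
        return 2;
    }

    int main(void)
    {
        UniChar u[2];
        int i, n;

        n = EncodeUtf16(0x1D11EUL, u);               /* MUSICAL SYMBOL G CLEF */
        for (i = 0; i < n; i++)
            printf("0x%04X ", u[i]);                 /* prints 0xD834 0xDD1E */
        printf("\n");
        return 0;
    }

Note that any value below 0x10000 passes through unchanged, which is precisely the backwards-compatibility with UCS-2 described above.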
UTF-16 is not supported by ULS.
UTF-8
UTF-8 has several important advantages. First of all, it is backwards-compatible with single-byte ASCII; that is, all ASCII text is also valid UTF-8 text, which makes it ideal for data interchange. In addition, while UTF-8 is capable of representing the entire Unicode codespace, it requires considerably less storage than UTF-32; it may require slightly less than UCS-2 and UTF-16 as well, although this depends on the text being encoded. Finally, every byte in a UTF-8 data stream is instantly identifiable as a lead byte, a following byte, or a single-byte character: this means that a lost or corrupted byte can only compromise the current character, and not subsequent data.
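The following sketch illustrates this classification using the standard UTF-8 bit patterns (generic C, not a ULS API call):

    #include <stdio.h>

    /* Classify a single UTF-8 byte by its high-order bits. */
    static const char *ClassifyUtf8Byte(unsigned char b)
    {
        if ((b & 0x80) == 0x00) return "single-byte character";  /* 0xxxxxxx */
        if ((b & 0xC0) == 0x80) return "following byte";         /* 10xxxxxx */
        return "lead byte";                                      /* 11xxxxxx */
    }

    int main(void)
    {
        /* "A" (one byte), e-acute (two bytes), EURO SIGN (three bytes) in UTF-8 */
        static const unsigned char bytes[] = { 0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
        int i;

        for (i = 0; i < (int)sizeof(bytes); i++)
            printf("0x%02X: %s\n", bytes[i], ClassifyUtf8Byte(bytes[i]));
        return 0;
    }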
The major disadvantage of UTF-8 is its processing overhead: it is significantly more complex to encode and decode than UTF-16, which may impact performance under some circumstances. Consequently, UTF-8 is generally used for storage and interchange rather than for internal processing.
UTF-7
UTF-7 is rarely used, as it is complex and awkward to implement. MIME-encoded or quoted-printable UTF-8 is generally preferred in environments where eight-bit integrity is not guaranteed.
UTF-7 is not supported by ULS.
UPF-8
Because it is an OS/2-specific encoding, UPF-8 should not be used for data interchange.
Encoding Comparison
UCS-2 was the standard encoding used in the days before Unicode was expanded beyond the Basic Multilingual Plane. UCS-4 was subsequently introduced to allow support for the larger Unicode codespace. However, UCS-4 proved somewhat impractical, partly because of its excessive storage requirements, and partly because it is not backwards-compatible with existing UCS-2 data streams.
The introduction of UTF-16 provided a solution to both of these problems. Consequently, UTF-16 is commonly used for internal Unicode processing on many of the platforms which originally used UCS-2 (although OS/2, as noted, continues to use UCS-2 even today).
All three of these encodings, however, share several other disadvantages:
- They are not backwards-compatible with ASCII: even plain ASCII text must be converted before it can be stored or processed in these formats.
- They are relatively wasteful of storage when the text consists largely of ASCII characters.
- Individual bytes within a data stream cannot be identified as lead or following bytes, so a single lost or corrupted byte can compromise all subsequent characters.
The UTF-8 encoding format eliminates all of these disadvantages. However, its encoding algorithm is non-trivial, and so overhead becomes an issue when processing UTF-8 data.
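To give a sense of that algorithm, here is a minimal sketch of the standard UTF-8 encoding of a single codepoint (the function name is illustrative; validation of the input codepoint is omitted):

    #include <stdio.h>

    typedef unsigned long UCS4;

    /* Encode one codepoint as UTF-8; returns the number of bytes written (1-4). */
    int EncodeUtf8(UCS4 cp, unsigned char out[4])
    {
        if (cp < 0x80UL) {                     /* 1 byte:  0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800UL) {                    /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000UL) {                  /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n;

        n = EncodeUtf8(0x20ACUL, buf);         /* EURO SIGN */
        for (i = 0; i < n; i++)
            printf("0x%02X ", buf[i]);         /* prints 0xE2 0x82 0xAC */
        printf("\n");
        return 0;
    }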
In general, UTF-8 is ideal for data storage and interchange, whereas UCS-2 and UTF-16 are preferred for internal processing.
Note: UTF-8 and UTF-16 also share the property that no byte sequence representing a valid character can possibly occur within a longer byte sequence representing a different character; this is important when parsing or searching text. Many other variable-width encodings, including UTF-7 and most legacy multi-byte codepages, do not share this property.
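For example, the following sketch relies on this property: a plain byte-level search with strstr cannot produce a false match beginning in the middle of another character (the strings are arbitrary UTF-8 samples):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* UTF-8 text "caf<e-acute>"; the e-acute is the byte pair 0xC3 0xA9. */
        const char *text   = "caf\xC3\xA9";
        const char *needle = "\xC3\xA9";     /* UTF-8 encoding of e-acute */
        const char *hit;

        /* No character's byte sequence can occur inside another character's
           byte sequence, so a byte-oriented search is safe here. */
        hit = strstr(text, needle);
        if (hit != NULL)
            printf("found at byte offset %d\n", (int)(hit - text));
        return 0;
    }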
Naming Conventions
The original fixed-width UCS encoding formats (UCS-2 and UCS-4) were named according to the convention UCS-x, where x represents the fixed number of bytes used for each character.
With the introduction of variable-width encodings (UTF-16, UTF-8 and UTF-7), the naming convention for Unicode encodings was changed to UTF-y, where UTF stands for "Unicode Transformation Format", and y represents the minimum number of bits required to represent a single character.
UCS-4 was accordingly renamed UTF-32. UCS-2 is now treated as a subset of UTF-16; the two terms should not, however, be used interchangeably.