Codepage Conversion

The ULS conversion functions allow character strings in any particular codepage to be converted into equivalent strings in UCS-2 (UniChar) format, and vice versa.

This conversion is designed to preserve character integrity across different encodings: that is, the logical value of the character is preserved, even though the encoded value used to represent it may be different.

For example, the é character (lowercase e with acute accent) is encoded as the single byte 0x82 under codepage 850, but has the value U+00E9 under Unicode. Converting this character from codepage 850 to UCS-2 therefore turns the one-byte input 0x82 into the two-byte UniChar value 0x00E9.

Take the following string:

    A café in Reykjavík.

Under codepage 850, this consists of the byte sequence (in hexadecimal):

    41 20 63 61 66 82 20 69 6E 20 52 65 79 6B 6A 61 76 A1 6B 2E

Converting this string to UCS-2 would result in the following output (each UniChar is shown with its high-order byte first):

    00 41 00 20 00 63 00 61 00 66 00 E9 00 20 00 69 00 6E 00 20
    00 52 00 65 00 79 00 6B 00 6A 00 61 00 76 00 ED 00 6B 00 2E

The conversion not only expands each single-byte codepage character into a two-byte UniChar value, but also ensures that each output UniChar has the correct value for the character it is intended to represent.
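The byte sequences in the example above can be reproduced with a short sketch. This is an illustration only: it uses Python's built-in cp850 and utf-16-be codecs as stand-ins, not the ULS conversion functions, and the helper names cp850_to_ucs2 and hex_dump are hypothetical. For characters in the Basic Multilingual Plane, as here, UTF-16 and UCS-2 produce the same values.

```python
# Illustration only: Python codecs stand in for a ULS conversion object.

def cp850_to_ucs2(data: bytes) -> bytes:
    """Decode codepage 850 bytes, then re-encode as 16-bit values (high byte first)."""
    return data.decode("cp850").encode("utf-16-be")

def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# "A café in Reykjavík." written with escapes so the source file stays ASCII.
text = "A caf\u00e9 in Reykjav\u00edk."

cp850_bytes = text.encode("cp850")        # one byte per character
ucs2_bytes = cp850_to_ucs2(cp850_bytes)   # two bytes per character

print(hex_dump(cp850_bytes))
print(hex_dump(ucs2_bytes))
```

Note that the é byte 0x82 becomes 00 E9 and the í byte 0xA1 becomes 00 ED, while the plain ASCII characters simply gain a leading zero byte.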

Conversion to and from UCS-2 is symmetric: if a codepage string is converted to UCS-2, and then subsequently back to the original codepage, the final string and the original string should be identical.
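The round-trip property can be checked with a minimal sketch. As an assumption for illustration, Python's cp850 and utf-16-be codecs take the place of the ULS functions, and to_ucs2 and from_ucs2 are hypothetical helper names.

```python
# Illustration only: Python codecs stand in for the ULS conversion functions.

def to_ucs2(data: bytes, codepage: str) -> bytes:
    """Convert codepage-encoded bytes to 16-bit values (high byte first)."""
    return data.decode(codepage).encode("utf-16-be")

def from_ucs2(data: bytes, codepage: str) -> bytes:
    """Convert 16-bit values (high byte first) back to codepage-encoded bytes."""
    return data.decode("utf-16-be").encode(codepage)

# "A café in Reykjavík." as encoded under codepage 850.
original = bytes.fromhex(
    "41 20 63 61 66 82 20 69 6E 20 52 65 79 6B 6A 61 76 A1 6B 2E"
)

intermediate = to_ucs2(original, "cp850")
round_trip = from_ucs2(intermediate, "cp850")

print(round_trip == original)
```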

Strings may also be converted from one multi-byte codepage to another, by converting to UCS-2 as an intermediate step. That is, to convert a string from codepage A to codepage B, one first converts the string from codepage A to UCS-2; one then takes the resulting string and converts it in turn from UCS-2 to codepage B.
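The two-step conversion described above might be sketched as follows. This is an illustration under stated assumptions: Python's codecs stand in for the ULS functions, codepage 1252 is chosen arbitrarily as the target, and convert_codepage is a hypothetical helper name.

```python
# Illustration only: Python codecs stand in for the ULS conversion functions.

def convert_codepage(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one codepage to another via a 16-bit intermediate."""
    # Step 1: codepage A -> 16-bit values (high byte first).
    ucs2 = data.decode(source).encode("utf-16-be")
    # Step 2: 16-bit values -> codepage B.
    return ucs2.decode("utf-16-be").encode(target)

# "café" under codepage 850, where é is the byte 0x82.
cp850_text = b"caf\x82"

# Under codepage 1252, é is the byte 0xE9 instead.
cp1252_text = convert_codepage(cp850_text, "cp850", "cp1252")

print(cp1252_text.hex(" "))
```

The target codepage must contain every character in the string; in this sketch, Python raises UnicodeEncodeError if it does not.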

Note: The term "multi-byte codepage" is used throughout this chapter to refer to non-UCS-2 codepages in general: that is, codepages under which text may be processed byte-by-byte, using the char type (and which are thus compatible with standard APIs). These may correspond to SBCS, DBCS, or variable-width encodings. This is in contrast to the UCS-2 encoding, under which the atomic character is a two-byte (UniChar) value.

Similarly, "multi-byte text" refers to any text encoded for a non-UCS-2 codepage (whether SBCS, DBCS, or variable-width): in other words, text that may be processed as char values without loss of integrity.
