As described, OS/2 represents Unicode characters internally as fixed-width two-byte values (UCS-2 encoding). These are treated as integral values for processing purposes. However, issues arise when it comes time to interpret them as literal characters.
Consider the following string:
Hello world
In standard ASCII, this would consist of the byte sequence:
48 65 6C 6C 6F 20 77 6F 72 6C 64
When converted to a UniChar string, each character value now occupies two bytes instead of one, resulting in the byte sequence:
00 48 00 65 00 6C 00 6C 00 6F 00 20 00 77 00 6F 00 72 00 6C 00 64
Attempting to use such a value as a literal string will fail in most cases: conventional string-handling functions treat each byte as a single character, and the embedded zero bytes will be interpreted (incorrectly) as string terminators.
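To see the failure concretely, here is a minimal C sketch that treats a UCS-2 buffer as an ordinary byte string. (The UniChar typedef is written out inline so the example is self-contained; on OS/2 it is supplied by the toolkit's Unicode headers.)

#include <stdio.h>
#include <string.h>

/* On OS/2, UniChar is provided by the Unicode API headers; the
   equivalent typedef is spelled out here for illustration. */
typedef unsigned short UniChar;

int main(void)
{
    /* "Hello world" as UCS-2: each code unit occupies two bytes */
    UniChar ucs[] = { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x0020,
                      0x0077, 0x006F, 0x0072, 0x006C, 0x0064, 0x0000 };

    /* strlen() stops at the first zero byte it encounters, so it
       reports a length of 0 or 1 (depending on byte order) rather
       than the expected 11 */
    printf("strlen() sees %u byte(s)\n",
           (unsigned)strlen((const char *)ucs));
    return 0;
}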
As mentioned in the previous section, it is sometimes possible to use C wide-character conventions to circumvent this problem, at least with those APIs that support them. Doing so, however, does not necessarily convert the Unicode values into meaningful characters for the current codepage, or vice versa (some compilers support such conversion, but many do not). In practice, therefore, this technique can only be relied upon for basic ASCII text, which is generally the same across different codepages. In any case, not all functions accept wide-character input, and neither do Presentation Manager controls.
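The sketch below illustrates the wide-character approach. It assumes a compiler on which wchar_t is a two-byte type whose values coincide with UCS-2 for ASCII text; this is true of some OS/2 compilers, but it is not guaranteed by the C language.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* A C wide-character literal; if wchar_t is a two-byte type,
       this buffer matches the UCS-2 layout shown above */
    wchar_t wide[] = L"Hello world";

    /* wcslen() counts wide characters rather than bytes, so the
       zero high-order bytes are not mistaken for terminators */
    printf("wcslen() reports %u characters\n",
           (unsigned)wcslen(wide));
    return 0;
}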
In general, character-based text processing falls into one of two categories:
- processing characters as abstract numeric values, for purposes such as comparison and classification; and
- interpreting characters as literal text under a particular codepage, for purposes such as input, output, and display.
ULS provides mechanisms for handling Unicode text in both contexts.
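As a preview of those mechanisms, the following sketch exercises both categories. UniStrlen(), UniCreateUconvObject(), UniStrFromUcs(), and UniFreeUconvObject() are genuine Unicode API functions, but the prototypes and the ULS_SUCCESS return code as used here are recollections that should be verified against the toolkit headers.

#include <stdio.h>
#include <unidef.h>   /* UniChar, UniStrlen() */
#include <uconv.h>    /* UconvObject and the conversion functions */

int main(void)
{
    /* "Hello world" as UCS-2, spelled out numerically since L""
       literals are not UniChar-compatible on every compiler */
    UniChar ucs[] = { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x0020,
                      0x0077, 0x006F, 0x0072, 0x006C, 0x0064, 0x0000 };
    UniChar cpname[1] = { 0 };  /* empty name selects the current codepage */
    UconvObject uconv;
    char local[32];

    /* First category: processing by value. UniStrlen() counts 16-bit
       UniChars, so embedded zero bytes are not taken as terminators. */
    printf("UniStrlen() reports %u characters\n",
           (unsigned)UniStrlen(ucs));

    /* Second category: interpretation as literal text, by converting
       the UCS-2 string into the current codepage for display */
    if (UniCreateUconvObject(cpname, &uconv) == ULS_SUCCESS) {
        if (UniStrFromUcs(uconv, local, ucs, sizeof(local)) == ULS_SUCCESS)
            printf("Converted text: %s\n", local);
        UniFreeUconvObject(uconv);
    }
    return 0;
}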