Conversion specifiers

The codepage identifier passed to UniCreateUconvObject is a specially-formatted UniChar string called a conversion specifier.

The conversion specifier consists of the codepage name, optionally followed by an @ symbol and zero or more comma-separated conversion modifiers.

The codepage name normally takes the form IBM-cpnum, where cpnum is the integer number of the codepage (see Appendix B for a list). This name can be generated from the codepage number using the UniMapCpToUcsCp function. If the codepage name is left blank, the current process codepage will be used. There are also constants defined in uconv.h for several common codepages.
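The specifier format described above can be sketched in portable C. This is an illustration only: build_conv_specifier and the UniCharSketch type are hypothetical names introduced here (UniCharSketch stands in for UniChar, assumed to be a 16-bit unsigned integer), and the helper simply formats the name by hand rather than calling UniMapCpToUcsCp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the UniChar type from the OS/2 Unicode headers (assumed to be
   a 16-bit unsigned integer); defined locally so the sketch compiles anywhere. */
typedef unsigned short UniCharSketch;

/* Hypothetical helper: writes "IBM-<cpnum>" or "IBM-<cpnum>@<modifiers>"
   into out as a null-terminated string of 16-bit characters.
   Returns the number of characters written (excluding the terminator),
   or -1 if the result does not fit. */
int build_conv_specifier(UniCharSketch *out, size_t out_len,
                         unsigned long cpnum, const char *modifiers)
{
    char narrow[128];
    int n;
    int i;

    assert(out != NULL);

    if (modifiers != NULL && modifiers[0] != '\0')
        n = snprintf(narrow, sizeof narrow, "IBM-%lu@%s", cpnum, modifiers);
    else
        n = snprintf(narrow, sizeof narrow, "IBM-%lu", cpnum);

    if (n < 0 || (size_t)n >= sizeof narrow || (size_t)n >= out_len)
        return -1;

    /* The specifier is plain ASCII, so each character widens directly.
       The loop runs to i == n so that the terminator is copied as well. */
    for (i = 0; i <= n; i++)
        out[i] = (UniCharSketch)(unsigned char)narrow[i];

    return n;
}
```

Calling build_conv_specifier(buf, 64, 850, "map=display,path=no") would leave the string IBM-850@map=display,path=no in buf. On OS/2 the buffer would be declared as UniChar and passed to UniCreateUconvObject; that call is omitted here because it requires the OS/2 Unicode headers and libraries.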

The conversion modifiers affect the behaviour of conversions performed using the UconvObject. The following modifiers can be specified:

map

Some byte values (specifically, the ASCII values 0x00 through 0x1F, and 0x7F) may be used either as controls or to represent displayable glyphs, depending on the context. This modifier specifies how these characters should be interpreted during conversion.

The possible values are:

map=data Treat as control characters; leave the values unchanged in the output text. This is the default.

map=display Treat as displayable glyphs; convert according to codepage like any other character.

map=cdra Treat as control codes; attempt to convert to the equivalent control values in the output encoding (using standard IBM conversion rules).

map=crlf Treat CR (carriage return) and LF (line feed) as control codes; treat all others as displayable glyphs.
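The four settings can be summarised as a small decision function. The sketch below only restates the descriptions above; it is not the converter's actual logic. The names map_mode, is_ambiguous_byte and treated_as_control are hypothetical, and the ambiguous range is assumed to be the standard ASCII control range 0x00 through 0x1F, plus 0x7F.

```c
#include <assert.h>

/* Hypothetical names for the four map= settings. */
enum map_mode { MAP_DATA, MAP_DISPLAY, MAP_CDRA, MAP_CRLF };

/* Returns 1 if the byte is one of the ambiguous values governed by the
   map modifier (assumed here: 0x00-0x1F and 0x7F), otherwise 0. */
int is_ambiguous_byte(unsigned char b)
{
    return b <= 0x1F || b == 0x7F;
}

/* Returns 1 if, under the given setting, the byte would be treated as a
   control code rather than as a displayable glyph. */
int treated_as_control(enum map_mode mode, unsigned char b)
{
    assert(mode >= MAP_DATA && mode <= MAP_CRLF);

    if (!is_ambiguous_byte(b))
        return 0;               /* ordinary character; map does not apply */

    switch (mode) {
    case MAP_DATA:              /* control, value left unchanged */
    case MAP_CDRA:              /* control, converted to target equivalent */
        return 1;
    case MAP_CRLF:              /* only CR and LF remain controls */
        return b == 0x0D || b == 0x0A;
    case MAP_DISPLAY:           /* everything is a glyph */
    default:
        return 0;
    }
}
```

Note that map=data and map=cdra both treat the byte as a control; they differ only in whether the value is passed through unchanged or converted, which this function does not model.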

path

Specifies whether or not strings should be assumed to contain path specifications when converting to or from DBCS codepages.

Several Asian DBCS encodings which are otherwise ASCII-compatible replace the ASCII backslash character (0x5C) with an Asian currency symbol (yen, yuan, or won), and the ASCII tilde (0x7E) with an overline character. The purpose of this parameter is to control how these character values should be treated when converting to and from such codepages.

This modifier is only respected when the UniUconvFromUcs and UniUconvToUcs functions are used for the conversion. When the UniStrFromUcs and UniStrToUcs functions are used, the behaviour is always that of path=yes.

The possible values are:

path=yes Assume that strings may contain path names; convert 0x5C as the ASCII backslash, and 0x7E as the ASCII tilde. This has the following effects:

When converting from one of the affected DBCS codepages, byte value 0x5C will be interpreted as the ASCII backslash, and 0x7E will be interpreted as the ASCII tilde.
When converting to one of the affected DBCS codepages, both the backslash and the applicable Asian currency symbol will be converted to byte value 0x5C, and both the tilde and the overline characters will be converted to byte value 0x7E.
This is the default.

path=no Assume that strings do not contain path names; convert 0x5C as the single-byte Asian currency symbol, and 0x7E as the single-byte overline character. This has the following effects:

When converting from one of the affected DBCS codepages, byte value 0x5C will be interpreted as the applicable Asian currency symbol, and byte value 0x7E will be interpreted as an overline character.
When converting to one of the affected DBCS codepages, the applicable Asian currency symbol will be converted to byte value 0x5C, and the overline character will be converted to 0x7E. However, the ASCII backslash and the ASCII tilde will both be treated as nonexistent characters, and replaced by substitution symbols.

Note: All the character values affected by this setting (backslash, tilde, overline, and the various currency symbols) refer to the single-byte ('halfwidth') forms only. DBCS codepages may contain double-byte ('fullwidth') forms of all these characters, which are not affected by the path setting.
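The effect of the two path settings on the single-byte values can be modelled as follows. This is a simplified sketch of the behaviour described above, not the library's implementation: it assumes a Japanese codepage (so the currency symbol is the yen sign, U+00A5), handles only single-byte values in the ASCII range, and uses the hypothetical names decode_path_byte, encode_path_char and ENC_SUBSTITUTED.

```c
#include <assert.h>

/* Unicode codepoints used in this model. */
#define U_BACKSLASH 0x005Cu
#define U_TILDE     0x007Eu
#define U_YEN       0x00A5u  /* currency symbol assumed for a Japanese codepage */
#define U_OVERLINE  0x203Eu

/* Returned by encode_path_char where the text says the character is
   replaced by a substitution symbol. */
#define ENC_SUBSTITUTED (-1)

/* Models converting one single-byte value FROM an affected DBCS codepage.
   path_yes != 0 corresponds to path=yes, 0 to path=no. */
unsigned int decode_path_byte(unsigned char b, int path_yes)
{
    assert(b < 0x80);  /* double-byte lead bytes are outside this model */

    if (!path_yes) {
        if (b == 0x5C) return U_YEN;
        if (b == 0x7E) return U_OVERLINE;
    }
    return b;          /* ASCII interpretation, including 0x5C and 0x7E */
}

/* Models converting one character TO an affected DBCS codepage. */
int encode_path_char(unsigned int ucs, int path_yes)
{
    /* The currency symbol and overline map to 0x5C and 0x7E either way. */
    if (ucs == U_YEN)       return 0x5C;
    if (ucs == U_OVERLINE)  return 0x7E;

    /* Backslash and tilde only exist in the target under path=yes. */
    if (ucs == U_BACKSLASH) return path_yes ? 0x5C : ENC_SUBSTITUTED;
    if (ucs == U_TILDE)     return path_yes ? 0x7E : ENC_SUBSTITUTED;

    assert(ucs < 0x80);    /* other characters are outside this model */
    return (int)ucs;
}
```

The model makes the asymmetry visible: under path=yes the conversion to the DBCS codepage is lossy (two characters share each byte value), while under path=no the backslash and tilde cannot be represented at all.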

endian

Specifies the byte order (endianness) used by UCS-2 strings.

When converting text to Unicode, this byte order will be used in generating the UCS-2 output string.
When converting text from Unicode, this byte order will be used for parsing the UCS-2 input string.

The possible values are:

endian=big Use big-endian byte order.

endian=little Use little-endian byte order.

endian=system Use the system's native byte order. This is the default.

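Different byte orders may also be specified for conversions to and from Unicode. This is done using the format endian=source:target, where source is the byte order (big, little, or system) to apply when converting from Unicode, and target is the byte order to apply when converting to Unicode. For example, endian=big:little parses the UCS-2 input as big-endian when converting from Unicode, and generates little-endian UCS-2 output when converting to Unicode.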
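To make the difference between the byte orders concrete, the following sketch writes UCS-2 code units out as bytes in either order. It illustrates the byte layouts only and does not use the conversion API; ucs2_to_bytes and byte_order are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum byte_order { ORDER_BIG, ORDER_LITTLE };

/* Writes count 16-bit code units from src into dst as a byte sequence in
   the requested order. dst must have room for 2 * count bytes. */
void ucs2_to_bytes(const unsigned short *src, size_t count,
                   enum byte_order order, unsigned char *dst)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < count; i++) {
        unsigned char hi = (unsigned char)((src[i] >> 8) & 0xFF);
        unsigned char lo = (unsigned char)(src[i] & 0xFF);

        if (order == ORDER_BIG) {
            dst[2 * i]     = hi;   /* most significant byte first */
            dst[2 * i + 1] = lo;
        } else {
            dst[2 * i]     = lo;   /* least significant byte first */
            dst[2 * i + 1] = hi;
        }
    }
}
```

For the code unit U+20AC, big-endian order produces the bytes 0x20 0xAC, and little-endian order produces 0xAC 0x20.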

sub

Specifies whether character substitution is enabled. Character substitution means that, when converting text, any characters which do not exist in the target codepage will be replaced by a generic "substitution" character.

This modifier only applies when the UniUconvFromUcs and UniUconvToUcs functions are used for the conversion. Substitution is always performed, regardless of this setting, when the UniStrFromUcs and UniStrToUcs functions are used.

The possible values are:

sub=yes Enable substitution when converting to or from Unicode.

sub=no Disable all substitution.

sub=to-ucs Enable substitution only when converting to Unicode (with UniUconvToUcs).

sub=from-ucs Enable substitution only when converting from Unicode (with UniUconvFromUcs). This is the default.

subchar

Specifies the byte value, within the target codepage, of the substitution character to be used (if substitution is enabled), when converting from Unicode. The possible values are:

subchar=\xXX XX is the hexadecimal codepoint of the desired substitution character within the target codepage.

subchar=\D# # is the decimal codepoint of the desired substitution character within the target codepage.

The default substitution character depends on the codepage. For most single-byte PC codepages it is 0x7F (⌂). Several ISO-8859 codepages such as 819, however, use 0x1A (the ASCII SUB control); this may cause difficulties if displayed under Presentation Manager, as PM does not recognize most ASCII control codes.

Note: Remember to double backslash characters when using C/C++ string notation (e.g. "subchar=\\x7F").
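The escaping rule can be seen in a small formatting helper. This is a sketch under stated assumptions: format_subchar is a hypothetical name, and it always emits two hexadecimal digits to match the \xXX form shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats the modifier "subchar=\xXX" for the given
   byte value. Returns the length written (excluding the terminator),
   or -1 if buf is too small. */
int format_subchar(char *buf, size_t buf_len, unsigned char value)
{
    int n;

    assert(buf != NULL);

    /* "\\x" in the C source yields the two characters \ and x at run time;
       a single backslash here would start a C hexadecimal escape instead. */
    n = snprintf(buf, buf_len, "subchar=\\x%02X", (unsigned int)value);

    if (n < 0 || (size_t)n >= buf_len)
        return -1;
    return n;
}
```

Calling format_subchar(buf, sizeof buf, 0x3F) produces the 12-character modifier subchar=\x3F, which selects the byte value 0x3F as the substitution character.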

subuni

Specifies the Unicode codepoint of the substitution character to be used (if substitution is enabled), when converting to Unicode. The possible values are:

subuni=\xXX\xYY XX and YY are the high- and low-order hexadecimal byte values, respectively, for the UCS codepoint of the desired substitution character.

subuni=\xXXXX XXXX is the hexadecimal UCS codepoint of the desired substitution character.

The default substitution character is normally U+FFFD.

Note: Remember to double backslash characters when using C/C++ string notation (e.g. "subuni=\\xFFFD").
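The two accepted forms of the subuni value encode the same codepoint, as the following sketch shows. format_subuni is a hypothetical helper, not part of the API; it assumes the codepoint fits in 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats the subuni modifier for a UCS codepoint.
   If byte_pair_form is nonzero the \xXX\xYY form is produced (high byte
   first), otherwise the \xXXXX form. Returns the length written
   (excluding the terminator), or -1 if buf is too small. */
int format_subuni(char *buf, size_t buf_len, unsigned int ucs,
                  int byte_pair_form)
{
    int n;

    assert(buf != NULL);
    assert(ucs <= 0xFFFFu);

    if (byte_pair_form)
        n = snprintf(buf, buf_len, "subuni=\\x%02X\\x%02X",
                     (ucs >> 8) & 0xFFu, ucs & 0xFFu);
    else
        n = snprintf(buf, buf_len, "subuni=\\x%04X", ucs);

    if (n < 0 || (size_t)n >= buf_len)
        return -1;
    return n;
}
```

For U+FFFD the helper yields subuni=\xFFFD in the single-value form and subuni=\xFF\xFD in the byte-pair form.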
