
Files Reference

uconvdef Source File Format

Purpose

Defines UCS-2 (Unicode) conversion mappings for input to the uconvdef command.

Description

Conversion mapping values are defined using UCS-2 symbolic character names followed by character encoding (code point) values for the multibyte code set. For example,

<U0020>    \x20

represents the mapping between the <U0020> UCS-2 symbolic character name for the space character and the \x20 hexadecimal code point for the space character in ASCII.
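
The relationship between the symbolic name and the code point can be sketched in Python. This is only an illustration, not part of uconvdef: the function name parse_simple_mapping is hypothetical, and the pattern assumes the default backslash escape character and a single one-byte hexadecimal constant.

```python
import re

# Assumed shape: "<U" + four hex digits + ">", white space, one \xHH constant.
MAPPING_RE = re.compile(r"^<U([0-9A-Fa-f]{4})>\s+\\x([0-9A-Fa-f]{2})\s*$")

def parse_simple_mapping(line):
    """Return (ucs2_code_point, code_set_byte) for a one-byte mapping line."""
    match = MAPPING_RE.match(line)
    if match is None:
        raise ValueError(f"not a simple mapping line: {line!r}")
    return int(match.group(1), 16), int(match.group(2), 16)

code_point, byte_value = parse_simple_mapping(r"<U0020>    \x20")
# Both sides of the mapping name the space character.
print(chr(code_point) == " ", bytes([byte_value]) == b" ")
```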

In addition to the code set mappings, directives are interpreted by the uconvdef command to produce the compiled table. These directives must precede the code set mapping section. They consist of the following keywords surrounded by < > (angle brackets), starting in column 1, followed by white space and the value to be assigned to the symbol:

<code_set_name> The name of the coded character set, enclosed in quotation marks (" "), for which the character set description file is defined.
<mb_cur_max> The maximum number of bytes in a multibyte character. The default value is 1.
<mb_cur_min> An unsigned positive integer value that defines the minimum number of bytes in a character for the encoded character set. The value is less than or equal to <mb_cur_max>. If not specified, the minimum number is equal to <mb_cur_max>.
<escape_char> The escape character used to indicate that the character following is interpreted in a special way. This defaults to a backslash (\).
<comment_char> The character that, when placed in column 1 of a charmap line, is used to indicate that the line is ignored. The default character is the number sign (#).
<char_name_mask> A quoted string consisting of format specifiers for the UCS-2 symbolic names. The value must be AXXXX, indicating an alphabetic character followed by 4 hexadecimal digits. The alphabetic character must be U, and the hexadecimal digits must represent the UCS-2 code point for the character. For example, <U0020> is the symbolic character name for the Unicode space character.
<uconv_class> Specifies the type of the code set. It must be one of the following:
    SBCS             Single-byte encoding
    DBCS             Stateless double-byte, single-byte, or mixed encodings
    EBCDIC_STATEFUL  Stateful double-byte, single-byte, or mixed encodings
    MBCS             Stateless multibyte encoding

The uconvdef command uses this type to determine what kind of table to build. The type is also stored in the table to indicate which processing algorithm the UCS conversion methods use.

<locale> Specifies the default locale name to be used if locale information is needed.
<subchar> Specifies the encoding of the default substitute character in the multibyte code set.
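
As a rough sketch of how the directives and their documented defaults fit together, the following Python reads a header into a dictionary. It is not the uconvdef implementation. The names parse_directives and SAMPLE_HEADER are hypothetical, the sample values are invented for illustration, and the source does not state whether values such as the <uconv_class> type are quoted, so the sketch strips quotation marks wherever they appear.

```python
import re

# A directive: keyword in angle brackets starting in column 1, white space, value.
DIRECTIVE_RE = re.compile(r"^<([a-z_]+)>\s+(.*\S)\s*$")

# Hypothetical header for illustration only; not copied from a real table.
SAMPLE_HEADER = r'''<code_set_name>  "EXAMPLE-1"
<mb_cur_max>     2
<char_name_mask> "AXXXX"
<uconv_class>    DBCS
<subchar>        \x7f
'''

# Defaults stated in the directive descriptions above.
DEFAULTS = {"mb_cur_max": "1", "escape_char": "\\", "comment_char": "#"}

def parse_directives(text):
    """Collect directive values as strings, applying the documented defaults."""
    values = dict(DEFAULTS)
    for line in text.splitlines():
        match = DIRECTIVE_RE.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip('"')
    # <mb_cur_min> equals <mb_cur_max> when it is not specified.
    values.setdefault("mb_cur_min", values["mb_cur_max"])
    return values

header = parse_directives(SAMPLE_HEADER)
print(header["code_set_name"], header["mb_cur_max"], header["mb_cur_min"])
```

Values are kept as strings here; a real tool would also validate them, for example by checking that <mb_cur_min> does not exceed <mb_cur_max>.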

The mapping definition section consists of a sequence of mapping definition lines preceded by a CHARMAP declaration and terminated by an END CHARMAP declaration. Empty lines and lines containing <comment_char> in the first column are ignored.
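
The section boundaries described above can be illustrated with a small Python generator. This is a sketch under assumptions: charmap_lines and SAMPLE are hypothetical names, the sample content is invented, and the comment character is taken to be the default number sign.

```python
def charmap_lines(text, comment_char="#"):
    """Yield the mapping lines found between CHARMAP and END CHARMAP."""
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if not inside:
            # Skip everything (directives included) until the declaration.
            inside = stripped == "CHARMAP"
            continue
        if stripped == "END CHARMAP":
            return
        # Empty lines and lines with the comment character in column 1 are ignored.
        if not stripped or line.startswith(comment_char):
            continue
        yield line

# Hypothetical source text for illustration only.
SAMPLE = r'''<code_set_name> "EXAMPLE-1"
CHARMAP
# comment line, ignored
<U0020>    \x20

<U3004>    \x81\x57
END CHARMAP
'''

mapping_lines = list(charmap_lines(SAMPLE))
print(len(mapping_lines))
```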

Symbolic character names in mapping lines must follow the pattern specified in the <char_name_mask>. The only exception is the reserved symbolic name <unassigned>, which indicates that the associated code points are unassigned.

Each noncomment line of the character set mapping definition must be in one of the following formats:

  1. "%s %s %s\n", <symbolic-name>, <encoding>, <comments>

    For example:

    <U3004>      \x81\x57

    This format defines a single symbolic character name and a corresponding encoding.

    The encoding part is expressed as one or more concatenated decimal, hexadecimal, or octal constants in the following formats:

    Decimal constants are represented by two or more decimal digits preceded by the escape character and the lowercase letter d, as in \d97 or \d143.

    Hexadecimal constants are represented by two or more hexadecimal digits preceded by the escape character and the lowercase letter x, as in \x61 or \x8f.

    Octal constants are represented by two or more octal digits preceded by the escape character, as in \141 or \217.

    Each constant represents a single-byte value. When constants are concatenated for multibyte character values, the last value specifies the least significant octet and preceding constants specify successively more significant octets.

  2. "%s...%s %s %s\n", <symbolic-name>, <symbolic-name>, <encoding>, <comments>

    For example:

    <U3003>...<U3006>   \x81\x56

    This format defines a range of symbolic character names and corresponding encodings. The range is interpreted as a series of symbolic names formed from the alphabetic prefix and all the values in the range defined by the numeric suffixes.

    The listed encoding value is assigned to the first symbolic name, and subsequent symbolic names in the range are assigned corresponding incremental values. For example, the line:

    <U3003>...<U3006>   \x81\x56

    is interpreted as:

    <U3003>      \x81\x56
    <U3004>      \x81\x57
    <U3005>      \x81\x58
    <U3006>      \x81\x59

  3. "<unassigned> %s...%s %s\n", <encoding>, <encoding>, <comments>

    This format defines a range of one or more unassigned encodings. For example, the line:

    <unassigned>   \x9b...\x9c

    is interpreted as:

    <unassigned>   \x9b
    <unassigned>   \x9c
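
The three line formats can be tied together in a short Python sketch. It is an illustration, not the uconvdef parser. The names parse_encoding, symbol_value, add_offset, and expand_line are hypothetical, and the default backslash escape character is assumed. Incrementing is modeled by treating the encoding as a single big-endian number, which matches the examples above; how uconvdef handles a carry between bytes is not stated in the source. Accepting a single <unassigned> encoding without a range is also an assumption.

```python
import re

SYMBOL_RE = re.compile(r"<U([0-9A-Fa-f]{4})>")
# One constant: escape + d + decimal digits, escape + x + hex digits, or escape + octal digits.
CONSTANT_RE = re.compile(r"\\(?:d([0-9]{2,})|x([0-9A-Fa-f]{2,})|([0-7]{2,}))")

def parse_encoding(text):
    """Convert concatenated constants to bytes, most significant octet first."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        match = CONSTANT_RE.match(text, pos)
        if match is None:
            raise ValueError(f"bad constant at offset {pos} in {text!r}")
        dec, hexa, octal = match.groups()
        if dec is not None:
            value = int(dec, 10)
        elif hexa is not None:
            value = int(hexa, 16)
        else:
            value = int(octal, 8)
        if value > 0xFF:
            raise ValueError(f"constant does not fit in one byte: {match.group(0)}")
        out.append(value)
        pos = match.end()
    return bytes(out)

def symbol_value(name):
    """Return the UCS-2 code point named by a <UXXXX> symbolic name."""
    match = SYMBOL_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"bad symbolic name: {name!r}")
    return int(match.group(1), 16)

def add_offset(encoding, offset):
    """Add offset to the encoding viewed as one big-endian number (assumed carry rule)."""
    return (int.from_bytes(encoding, "big") + offset).to_bytes(len(encoding), "big")

def expand_line(line):
    """Return (code_point_or_None, encoding_bytes) pairs for one mapping line."""
    fields = line.split()  # fields after the encoding are treated as comments
    if fields[0] == "<unassigned>":
        first, _, last = fields[1].partition("...")
        low = parse_encoding(first)
        high = parse_encoding(last) if last else low
        count = int.from_bytes(high, "big") - int.from_bytes(low, "big") + 1
        return [(None, add_offset(low, i)) for i in range(count)]
    first, _, last = fields[0].partition("...")
    start = symbol_value(first)
    end = symbol_value(last) if last else start
    base = parse_encoding(fields[1])
    return [(start + i, add_offset(base, i)) for i in range(end - start + 1)]

for code_point, encoding in expand_line(r"<U3003>...<U3006>   \x81\x56"):
    print(f"<U{code_point:04X}>", encoding.hex())
```

Under these assumptions, the range line expands to the same four mappings listed for format 2, and the <unassigned> line expands to the two single-byte entries listed for format 3.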

Related Information

The uconvdef command.

Code Set Overview in AIX 5L Version 5.2 Kernel Extensions and Device Support Programming Concepts.

List of UCS-2 Interchange Converters in AIX 5L Version 5.2 General Programming Concepts: Writing and Debugging Programs.
