National Language Support Guide and Reference

Understanding the Locale Definition Source File

Unlike environment variables, which can be set from the command line, locales can only be modified by editing and compiling a locale definition source file.

If a desired locale is not part of the library, a binary version of the locale can be compiled by the localedef command. Locale behavior of programs is not affected by a locale definition source file unless the file is first converted by the localedef command, and the locale object is made available to the program. The localedef command converts source files containing definitions of locales into a run-time format and copies the run-time version to the file specified on the command line, which usually is a locale name. Internationalized commands and subroutines can then access the locale information. For information on preparing source files to be converted by the localedef command, see Locale Definition Source File Format in AIX 5L Version 5.2 Files Reference.

Multibyte Subroutines

Multibyte subroutines process characters in file-code form. The names of these subroutines usually start with the prefix mb, but not always. For example, the strcoll and strxfrm subroutines process characters in their multibyte form but do not have the mb prefix. The following standard C subroutines operate on bytes and can be used to handle multibyte data: strcmp, strcpy, strncmp, strncpy, strcat, and strncat. The standard C search subroutines strchr, strrchr, strpbrk, strcspn, strspn, strstr, and strtok can be used in the following cases:

Searching or scanning for characters in single-byte code sets
Searching or scanning for unique code-point range characters in multibyte strings
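
As an illustration of processing a string in file-code form, the following minimal sketch counts characters (rather than bytes) with the standard mblen subroutine. The function name mb_char_count is hypothetical, and the result depends on the LC_CTYPE category in effect when the function is called.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters (not bytes) in a multibyte
 * string, according to the code set of the current LC_CTYPE locale.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t remaining = strlen(s);

    (void)mblen(NULL, 0);              /* reset any shift state */
    while (remaining > 0) {
        int n = mblen(s, remaining);   /* bytes in the next character */
        if (n <= 0)                    /* invalid or incomplete sequence */
            return (size_t)-1;
        s += n;
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

In a single-byte code set the result equals strlen(s). In a multibyte code set it can be smaller.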

Wide Character Subroutines

Wide character subroutines process characters in process-code form. Their names usually start with a wc prefix, although there are exceptions. For example, the wide character classification subroutines use an isw prefix. To determine whether a subroutine is a wide character subroutine, check whether its prototype declares characters as the wchar_t data type or as a pointer to wchar_t, or whether the subroutine returns a wchar_t value. This test also has exceptions. For example, the wide character classification subroutines accept wint_t values.
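
The following sketch shows one way a wide character classification subroutine might be applied to a string already in process-code form. The function name count_alpha is hypothetical. Note the cast to wint_t, which is the type that iswalpha accepts.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters in a wide
 * character (process-code) string, as classified by the current
 * LC_CTYPE locale. */
size_t count_alpha(const wchar_t *wcs)
{
    size_t count = 0;

    for (size_t i = 0; wcs[i] != L'\0'; i++) {
        if (iswalpha((wint_t)wcs[i]))   /* isw prefix, wint_t argument */
            count++;
    }
    return count;
}
```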

Bidirectionality and Character Shaping

An internationalized program may need to handle bidirectionality of text and character shaping.

Bidirectionality (BIDI) occurs when texts of different direction orientation appear together. For example, English text is read from left to right. Hebrew text is read from right to left. If both English and Hebrew texts appear on the same line, the text is bidirectional.

Character shaping occurs when the shape of a character is dependent on its position in a line of text. In some languages, such as Arabic, characters have different shapes depending on their position in a string and on the surrounding characters.

For more information about bidirectionality and character shaping, see Layout (Bidirectional Text and Character Shaping) Overview.

Code Set Independence

The system needs certain information about code sets to communicate with the external environment. This information is hidden by the code set-independent library subroutines (NLS library). These subroutines pass information to the code set-dependent functions. Because NLS subroutines handle the necessary code set information, you do not need explicit knowledge of any code set when you write programs that process characters. This programming technique is called code set independence.

To see a sample program that illustrates internationalization through code-set independent programming, see Appendix C. NLS Sample Program.

Determining Maximum Number of Bytes in Code Sets

You can use the MB_CUR_MAX macro to determine the maximum number of bytes in a multibyte character for the code set in the current locale. The value of this macro depends on the current setting of the LC_CTYPE category. Because the locale can differ between processes and can change while a process runs, evaluating the MB_CUR_MAX macro in different processes or at different times may produce different results. The MB_CUR_MAX macro is defined in the stdlib.h header file.

You can use the MB_LEN_MAX macro to determine the maximum number of bytes in any code set that is supported by the system. This macro is defined in the limits.h header file.
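
A common use of the two macros is buffer sizing: MB_LEN_MAX is a compile-time constant suitable for fixed arrays, while MB_CUR_MAX is evaluated at run time. The sketch below assumes two hypothetical helpers, mb_buffer_size and encode_wchar, to show each use.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed to hold nchars characters of the
 * current locale's code set, plus the terminating null byte.
 * MB_CUR_MAX is read at run time because it follows LC_CTYPE. */
size_t mb_buffer_size(size_t nchars)
{
    return nchars * (size_t)MB_CUR_MAX + 1;
}

/* Hypothetical helper: convert one wide character to file-code form.
 * MB_LEN_MAX is a constant, so it can size a fixed array that is
 * large enough for any supported code set.
 * Returns the number of bytes stored, or -1 on failure. */
int encode_wchar(wchar_t wc, char out[MB_LEN_MAX])
{
    return wctomb(out, wc);
}
```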

Determining Character and String Display Widths

The _max_disp_width macro is operating-system-specific, and its use should be avoided in portable applications. If portability is not important, you can use the _max_disp_width macro to determine the maximum number of display columns required by a single character in the code set of the current locale. The value of this macro depends on the current setting of the LC_CTYPE category. If its value is 1 (one), every character in the current code set requires only one display column on output.

When both MB_CUR_MAX and _max_disp_width are set to 1 (one), you can use the strlen subroutine to determine the display column width needed for a string. When MB_CUR_MAX is greater than one, use the wcswidth subroutine to find the display column width of the string.

The wcswidth and wcwidth wide-character display-width subroutines do not have corresponding multibyte functions. The wcswidth subroutine does not indicate how many characters can be displayed in the space available on a display. The wcwidth subroutine is useful for this purpose. This subroutine must be called repeatedly on a wide-character string to find out how many characters can be displayed in the available positions on the display.
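
The repeated call to wcwidth described above can be sketched as follows. The function name chars_that_fit is hypothetical. The _XOPEN_SOURCE definition is an assumption about the build environment: some systems declare wcwidth only when that feature-test macro is defined before the first header is included.

```c
#define _XOPEN_SOURCE 700   /* assumed necessary on some systems for wcwidth */
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: return how many leading characters of wcs fit
 * in the given number of display columns. Scanning stops early at a
 * nonprintable character, for which wcwidth returns -1. */
size_t chars_that_fit(const wchar_t *wcs, int columns)
{
    size_t n = 0;
    int used = 0;

    while (wcs[n] != L'\0') {
        int w = wcwidth(wcs[n]);        /* columns for this character */
        if (w < 0 || used + w > columns)
            break;
        used += w;
        n++;
    }
    return n;
}
```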

Exceptions to Code Set Knowledge: Unique Code-Point Range

Because of the way the supported code sets are organized, there is one exception to the statement: "No knowledge of the underlying code set can be assumed in a program."

When a multibyte character string is searched for any character within the unique code-point range (for example, the . (period) character), it is not necessary to convert the string to process-code form. It is sufficient to look for that character (.) by examining each byte. This exception enables the kernel and utilities to search for the special characters . and / while parsing file names. If a program searches for any of the characters in the unique code-point range, it should use the standard string functions that operate on bytes, such as the strchr subroutine. For a list of the characters in the unique code-point range, see ASCII Characters.
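
For example, a path name can be split on / and . with byte-oriented searches alone. The sketch below uses strrchr. The function names base_name and file_suffix are hypothetical and are not related to any system utility.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the last component of a
 * path name. The / character is in the unique code-point range, so a
 * byte search is safe even when the path holds multibyte characters. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Hypothetical helper: return the text after the last . in the final
 * path component, or NULL if there is none. A leading . (as in a
 * hidden file name) is not treated as a suffix separator. */
const char *file_suffix(const char *path)
{
    const char *base = base_name(path);
    const char *dot = strrchr(base, '.');
    return (dot != NULL && dot != base) ? dot + 1 : NULL;
}
```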

File Name Matching

POSIX.2 defines the fnmatch subroutine for file name matching. An application can use the fnmatch subroutine to read a directory and apply a pattern against each entry. For example, the find utility can use the fnmatch subroutine for this purpose, and the pax utility can use it to process its pattern operands. Applications that must match strings in a similar fashion can also use the fnmatch subroutine.
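
A minimal sketch of applying one pattern to a list of names follows. The function name count_matches is hypothetical, and the names are passed in as an array rather than read from a directory to keep the example short. The FNM_PERIOD flag is one possible choice: it requires a leading period in a name to be matched explicitly.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical helper: count how many of the n names match the given
 * shell-style pattern. fnmatch returns 0 on a match. */
size_t count_matches(const char *pattern,
                     const char *const names[], size_t n)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (fnmatch(pattern, names[i], FNM_PERIOD) == 0)
            hits++;
    }
    return hits;
}
```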

Radix Character Handling

The radix character, as obtained by nl_langinfo(RADIXCHAR), is returned as a pointer to a string. A locale may therefore specify the radix character as a multibyte character or as a string of characters. However, AIX makes the simplifying assumption that RADIXCHAR is a single-byte character.
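
Under that single-byte assumption, a program might read the radix character as shown in this sketch. The function name radix_char and the fallback to a period when the string is empty are illustrative choices, not documented system behavior.

```c
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: return the radix character of the current
 * locale, taking only the first byte of the string that nl_langinfo
 * returns. This relies on the single-byte assumption and would be
 * wrong for a locale with a multibyte radix character. */
char radix_char(void)
{
    const char *r = nl_langinfo(RADIXCHAR);

    /* Fall back to '.' if the locale supplies an empty string. */
    return (r != NULL && r[0] != '\0') ? r[0] : '.';
}
```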

Programming Model

The programming model presented here highlights changes you need to make when an existing program is internationalized or when a new program is developed:

Provide complete internationalization. Do not assume that characters have any specific properties. Determine the properties dynamically by using the appropriate interfaces. Do not assume properties of code sets, except for the ASCII characters with code points in the unique code-point range.
Make programs code set-independent. Programs should not assume single-byte, double-byte, or multibyte encoding of any sort. Data can be processed in either process-code or file-code form by using the appropriate subroutines.
Provide interaction with the kernel in file-code form only. The kernel does not handle process codes.
The NLS subroutine library can handle processing based on file-code as well as processing based on process-code.
Programs can process characters either in process-code form or file-code form. It is possible to write code set-independent programs using both methods.
Note

Several subroutines based on process-code form do not have corresponding subroutines based on file-code form. Due to this asymmetry, it may be necessary to convert strings to process-code form and invoke the appropriate process-code subroutines.
Some libraries may not provide processing in process-code form. An application needing these libraries must use file-codes when invoking functions from them.
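
When a process-code subroutine has no file-code counterpart, the round trip through process code might look like the following sketch. Case conversion is used as the example because towupper operates on wide characters. The function name mb_toupper is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a newly allocated uppercase copy of a
 * file-code (multibyte) string, or NULL on failure. The caller frees
 * the result. No code set knowledge is used: the string is converted
 * to process code, transformed, and converted back. */
char *mb_toupper(const char *mbs)
{
    size_t max = strlen(mbs) + 1;     /* never more characters than bytes */
    wchar_t *wcs = malloc(max * sizeof *wcs);
    char *out = NULL;

    if (wcs == NULL)
        return NULL;

    size_t n = mbstowcs(wcs, mbs, max);            /* file code to process code */
    if (n != (size_t)-1) {
        for (size_t i = 0; i < n; i++)
            wcs[i] = (wchar_t)towupper((wint_t)wcs[i]);

        size_t outsize = n * (size_t)MB_CUR_MAX + 1;
        out = malloc(outsize);
        if (out != NULL && wcstombs(out, wcs, outsize) == (size_t)-1) {
            free(out);                             /* conversion back failed */
            out = NULL;
        }
    }
    free(wcs);
    return out;
}
```

The result follows the LC_CTYPE category in effect at the time of the call, so a program would normally call setlocale before using a helper like this.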