
National Language Support Guide and Reference

Multibyte Code and Wide Character Code Conversion Subroutines

The internationalized environment of NLS blends multibyte and wide character subroutines. The decision of when to use wide character or multibyte subroutines can be made only after careful analysis.

If a program primarily uses multibyte subroutines, it may be necessary to convert the multibyte character codes to wide character codes before certain wide character subroutines can be used. If a program uses wide character subroutines, data may need to be converted to multibyte form when invoking subroutines. Both methods have drawbacks, depending on the program in use and the availability of standard subroutines to perform the required processing. For instance, the wide character display-column-width subroutine has no corresponding standard multibyte subroutine.

If a program can process its characters in multibyte form, this method should be used instead of converting the characters to wide character form.

Attention: The conversion between multibyte and wide character code depends on the current locale setting. Do not exchange wide character codes between two processes, unless you have knowledge that each locale that might be used handles wide character codes in a consistent fashion. With the exception of locales based on the IBM-eucTW codeset, AIX locales use the Unicode character value as a wide character code.

Multibyte Code to Wide Character Code Conversion Subroutines

The following subroutines are used when converting from multibyte code to wide character code:

mblen
Determines the length of a multibyte character. Do not use p++ to increment a pointer in a multibyte string. Use the mblen subroutine to determine the number of bytes that compose a character.
mbstowcs
Converts a multibyte string to a wide character string.
mbtowc
Converts a multibyte character to a wide character.
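The advice about not using p++ can be illustrated with a minimal sketch. The helper name mb_char_count is hypothetical and not part of the NLS library; it assumes the string is encoded in the code set of the current locale.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Count the characters (not bytes) in the multibyte string s.
 * Returns (size_t)-1 if an invalid multibyte sequence is found.
 * Hypothetical helper for illustration only.
 */
size_t mb_char_count(const char *s)
{
    size_t count = 0;

    (void)mblen(NULL, 0);                /* reset any shift state */
    while (*s != '\0') {
        int len = mblen(s, MB_CUR_MAX);  /* bytes in the next character */
        if (len < 0)
            return (size_t)-1;           /* invalid or incomplete character */
        s += len;                        /* step by whole characters, never p++ */
        count++;
    }
    return count;
}
```

In a single-byte locale the result equals strlen(s); in a multibyte locale it can be smaller, because one character may occupy several bytes.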

Wide Character Code to Multibyte Code Conversion Subroutines

The following subroutines are used when converting from wide character code to multibyte character code:

wcslen
Determines the number of wide characters in a wide character string.
wcstombs
Converts a wide character string to a multibyte character string.
wctomb
Converts a wide character to a multibyte character.

Examples
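The following is a minimal sketch of a round trip between the two forms using the mbstowcs and wcstombs subroutines. The helper names mbs_to_wcs_dup and wcs_to_mbs_dup are hypothetical, and the buffer sizing relies on two assumptions stated in the comments: a wide character consumes at least one byte, and produces at most MB_CUR_MAX bytes.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert a multibyte string to a newly allocated wide character string.
 * Returns NULL on an invalid sequence or allocation failure; the caller
 * frees the result. Hypothetical helper for illustration only.
 */
wchar_t *mbs_to_wcs_dup(const char *mbs)
{
    /* Each wide character consumes at least one byte, so this is enough. */
    size_t n = strlen(mbs) + 1;
    wchar_t *wcs = malloc(n * sizeof(wchar_t));

    if (wcs == NULL)
        return NULL;
    if (mbstowcs(wcs, mbs, n) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}

/*
 * Convert a wide character string to a newly allocated multibyte string.
 * Returns NULL on an unconvertible character or allocation failure.
 */
char *wcs_to_mbs_dup(const wchar_t *wcs)
{
    /* Each wide character produces at most MB_CUR_MAX bytes. */
    size_t size = wcslen(wcs) * MB_CUR_MAX + 1;
    char *mbs = malloc(size);

    if (mbs == NULL)
        return NULL;
    if (wcstombs(mbs, wcs, size) == (size_t)-1) {
        free(mbs);
        return NULL;
    }
    return mbs;
}
```

Both conversions depend on the current locale, so a program would normally call setlocale(LC_ALL, "") before using these helpers.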

Wide Character Classification Subroutines

The majority of wide character classification subroutines are similar to traditional character classification subroutines, except that wide character classification subroutines operate on a wchar_t data type argument passed as a wint_t data type argument.

Generic Wide Character Classification Subroutines

In the internationalized environment of National Language Support, you need the ability to create new character class properties. For example, several properties are defined for Japanese characters that are not applicable to the English language. As more languages are supported, a framework enabling applications to deal with a varying number of character properties is needed. The wctype and iswctype subroutines allow handling of character classes in a general fashion. These subroutines are used to allow for both user-defined and language-specific character classes.

The action of wide character classification subroutines is affected by the definitions in the LC_CTYPE category for the current locale.

To create new character classifications for use with the wctype and iswctype subroutines, create a new character class in the LC_CTYPE category and generate the locale using the localedef command. A user application obtains this locale data with the setlocale subroutine. The program can then access the new classification subroutines by using the wctype subroutine to get the wctype_t property handle. It then passes to the iswctype subroutine both the property handle and the wide character code of the character to be tested.

The following subroutines are used for wide character classification:

wctype
Obtains handle for character property classification.
iswctype
Tests for character property.

Standard Wide Character Classification Subroutines

The isw* subroutines determine various aspects of a standard wide character classification. The isw* subroutines also work with single-byte code sets. Use the isw* subroutines in preference to the wctype and iswctype subroutines. Use the wctype and iswctype subroutines only for extended character class properties (for example, Japanese language properties).

When using the wide character subroutines to convert the case in several blocks of data, the application must convert characters from multibyte to wide character code form. Because this can affect performance in single-byte code set locales, consider providing two conversion paths in your application. The traditional path for single-byte code set locales would convert case using the isupper, islower, toupper, and tolower subroutines. The alternate path for multibyte code set locales would convert multibyte characters to wide character code form and convert case using the iswupper, iswlower, towupper, and towlower subroutines. When converting multibyte characters to wide character code form, an application needs to handle special cases where a multibyte character may split across successive blocks.

The following is a list of standard wide character classification subroutines:

iswalnum
Tests for alphanumeric character classification.
iswalpha
Tests for alphabetic character classification.
iswcntrl
Tests for control character classification.
iswdigit
Tests for digit character classification.
iswgraph
Tests for graphic character classification.
iswlower
Tests for lowercase character classification.
iswprint
Tests for printable character classification.
iswpunct
Tests for punctuation character classification.
iswspace
Tests for space character classification.
iswupper
Tests for uppercase character classification.
iswxdigit
Tests for hexadecimal-digit character classification.
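As a rough illustration of how the isw* subroutines are called, the sketch below tallies a few character classes in a wide character string. The structure wc_counts and the function classify_wcs are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Tally of character classes; hypothetical type for illustration only. */
struct wc_counts {
    size_t alpha;
    size_t digit;
    size_t space;
    size_t other;
};

/* Classify each wide character in ws according to the current locale. */
struct wc_counts classify_wcs(const wchar_t *ws)
{
    struct wc_counts c = {0, 0, 0, 0};

    for (; *ws != L'\0'; ws++) {
        wint_t wc = (wint_t)*ws;     /* isw* subroutines take a wint_t */

        if (iswalpha(wc))
            c.alpha++;
        else if (iswdigit(wc))
            c.digit++;
        else if (iswspace(wc))
            c.space++;
        else
            c.other++;
    }
    return c;
}
```

Because the results depend on the LC_CTYPE category, the same string can produce different tallies in different locales.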

Wide Character Case Conversion Subroutines

The following subroutines convert the case of wide characters. The action of wide character case conversion subroutines is affected by the definition in the LC_CTYPE category for the current locale.

towlower
Converts an uppercase wide character to a lowercase wide character.
towupper
Converts a lowercase wide character to an uppercase wide character.
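The two conversion paths recommended for case conversion (a traditional path for single-byte code sets and a wide character path for multibyte code sets) might be sketched as follows. The function name mb_toupper_str and its return convention are assumptions made for this example, and the sketch handles one complete string rather than successive blocks.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Uppercase the multibyte string src into dst (capacity dstsize bytes).
 * Returns 0 on success, -1 on invalid input or insufficient space.
 * Hypothetical helper for illustration only.
 */
int mb_toupper_str(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);
    size_t i;

    if (MB_CUR_MAX == 1) {
        /* Traditional path: single-byte code set, no conversion needed. */
        if (len + 1 > dstsize)
            return -1;
        for (i = 0; i <= len; i++)
            dst[i] = (char)toupper((unsigned char)src[i]);
        return 0;
    } else {
        /* Alternate path: convert to wide character form first. */
        wchar_t *ws = malloc((len + 1) * sizeof(wchar_t));
        size_t written;
        int rc = -1;

        if (ws == NULL)
            return -1;
        if (mbstowcs(ws, src, len + 1) != (size_t)-1) {
            for (i = 0; ws[i] != L'\0'; i++)
                ws[i] = (wchar_t)towupper((wint_t)ws[i]);
            written = wcstombs(dst, ws, dstsize);
            /* written < dstsize guarantees room for the terminator */
            if (written != (size_t)-1 && written < dstsize)
                rc = 0;
        }
        free(ws);
        return rc;
    }
}
```

A program that processes data block by block would also need to carry over any partial multibyte character left at the end of one block to the start of the next.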

Example

The following example uses the wctype subroutine to test for the NEW_CLASS character classification:

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    wint_t    wc;
    int       retval;
    wctype_t  chandle;

    (void)setlocale(LC_ALL, "");
    /*
    ** Obtain the character property handle for the NEW_CLASS
    ** property.
    */
    chandle = wctype("NEW_CLASS");
    if (chandle == (wctype_t)0) {
        /* Invalid property. Handle the error. */
        return EXIT_FAILURE;
    }
    /*
    ** Let wc be the wide character code for a character.
    ** L'a' is only a placeholder value for this example.
    */
    wc = L'a';
    /* Test if wc has the property of NEW_CLASS */
    retval = iswctype(wc, chandle);
    if (retval != 0) {
        /*
        ** wc has the property NEW_CLASS.
        */
    } else {
        /*
        ** The character represented by wc does not have the
        ** property NEW_CLASS.
        */
    }
    return EXIT_SUCCESS;
}

Wide Character Display Column Width Subroutines

When characters are displayed or printed, the number of columns occupied by a character may differ. For example, a Kanji character (Japanese language) may occupy more than one column position. The number of display columns required by each character is part of the National Language Support locale database. The LC_CTYPE category defines the number of columns needed to display a character.

No standard multibyte display-column-width subroutines exist. For portability, convert multibyte codes to wide character codes and use the required wide character display-width subroutines. However, if the __max_disp_width macro (defined in the stdlib.h file) is set to 1 and a single-byte code set is in use, then the display-column widths of all characters (except tabs) in the code set are the same, and are equal to 1. In this case, the strlen (string) subroutine gives the display column width of the specified string, as shown in the following example:

#include <stdlib.h>
#include <string.h>

        int display_column_width;  /* number of display columns */
        char *s;                   /* character string          */
        ....
        if ((MB_CUR_MAX == 1) && (__max_disp_width == 1)) {
                /* single-byte code set: one column per character */
                display_column_width = strlen(s);
        }

The following subroutines find the display widths for wide character strings:

wcswidth
Determines the display width of a wide character string.
wcwidth
Determines the display width of a wide character.

Examples
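A portable way to find the display width of a multibyte string is to convert it to wide character form and call the wcswidth subroutine. The sketch below assumes this approach; the function name mb_display_width is hypothetical, and the _XOPEN_SOURCE definition is an assumption needed on some systems to expose the wcswidth declaration.

```c
#define _XOPEN_SOURCE 700   /* assumption: exposes wcswidth on some systems */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Return the number of display columns needed for the multibyte string
 * mbs, or -1 if the string is invalid, contains a nonprintable character,
 * or memory cannot be allocated. Hypothetical helper for illustration.
 */
int mb_display_width(const char *mbs)
{
    size_t n = strlen(mbs) + 1;          /* upper bound on wide characters */
    wchar_t *ws = malloc(n * sizeof(wchar_t));
    size_t len;
    int width = -1;

    if (ws == NULL)
        return -1;
    len = mbstowcs(ws, mbs, n);
    if (len != (size_t)-1)
        width = wcswidth(ws, len);       /* -1 if any character is nonprintable */
    free(ws);
    return width;
}
```

For a string of printable ASCII characters the result equals the string length; for text in a multibyte code set, the result depends on the column widths defined in the LC_CTYPE category.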

Multibyte and Wide Character String Collation Subroutines

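Strings can be compared in two ways: by the ordinal (binary) values of their characters, or by the locale-specific collation weights assigned to their characters.

National Language Support (NLS) uses the second method.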

Collation is a locale-specific property of characters. A weight is assigned to each character to indicate its relative order for sorting. A character may be assigned more than one weight. Weights are prioritized as primary, secondary, tertiary, and so forth. The maximum number of weights assigned each character is system-defined.

A process inherits the C locale or POSIX locale at its startup time. When the setlocale(LC_ALL, "") subroutine is called, a process obtains its locale based on the LC_* and LANG environment variables. The following subroutines are affected by the LC_COLLATE category and determine how two strings will be sorted in any given locale.

Note
Collation-based string comparisons take a long time because of the processing involved in obtaining the collation values. Perform such comparisons only when necessary. If you need to determine whether two wide character strings are equal, do not use the wcscoll and wcsxfrm subroutines; use the wcscmp subroutine instead.

The following subroutines compare multibyte character strings:

strcoll
Compares the collation weights of multibyte character strings.
strxfrm
Converts a multibyte character string to values representing character collation weights.

The following subroutines compare wide character strings:

wcscoll
Compares the collation weights of wide character strings.
wcsxfrm
Converts a wide character string to values representing character collation weights.

Examples
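One common use of the strcoll subroutine is as the comparison step of a sort. The following is a minimal sketch that sorts an array of multibyte strings with the standard qsort subroutine; the names compare_collated and sort_collated are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort callback: order two strings by the collation rules of the locale. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Sort an array of multibyte strings in place according to the
 * LC_COLLATE category of the current locale.
 * Hypothetical helper for illustration only.
 */
void sort_collated(const char **strings, size_t count)
{
    qsort(strings, count, sizeof(strings[0]), compare_collated);
}
```

When the same strings are compared many times, transforming each one once with the strxfrm subroutine and comparing the results with strcmp can reduce the repeated collation cost.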

Multibyte and Wide Character String Comparison Subroutines

The strcmp and strncmp subroutines determine if the contents of two multibyte strings are equivalent. If your application needs to know how the two strings differ lexically, use the multibyte and wide character string collation subroutines.

The following NLS subroutines compare wide character strings:

wcscmp Compares two wide character strings.
wcsncmp Compares a specific number of wide characters in two wide character strings.

Example

The following example uses the wcscmp subroutine to compare two wide character strings:

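#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Placeholder strings; any two wide character strings can be used. */
    const wchar_t *pwcs1 = L"abc", *pwcs2 = L"abd";
    int retval;

    (void)setlocale(LC_ALL, "");
    /*
    **  pwcs1 and pwcs2 point to two wide character
    **  strings to compare.
    */
    retval = wcscmp(pwcs1, pwcs2);
    /*
    **  retval is 0 if the strings are identical, less than 0 if
    **  pwcs1 compares lower than pwcs2 by wide character code,
    **  and greater than 0 otherwise.
    */
    return (retval == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}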

Wide Character String Conversion Subroutines

The following NLS subroutines convert wide character strings to double, long, and unsigned long values:

wcstod Converts a wide character string to a double-precision floating-point value.
wcstol Converts a wide character string to a signed long integer.
wcstoul Converts a wide character string to an unsigned long integer.

Before calling the wcstod, wcstoul, or wcstol subroutine, the errno global variable must be set to 0. Any error that occurs as a result of calling these subroutines can then be handled correctly.

Examples
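The requirement to set errno to 0 before the call can be sketched as follows for the wcstol subroutine. The wrapper name parse_long_wcs and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Parse a decimal long integer from the wide character string ws.
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * string holds no digits, has trailing characters, or is out of range.
 * Hypothetical helper for illustration only.
 */
int parse_long_wcs(const wchar_t *ws, long *out)
{
    wchar_t *end;
    long value;

    errno = 0;                      /* clear errno so ERANGE is reliable */
    value = wcstol(ws, &end, 10);
    if (end == ws)
        return -1;                  /* no conversion was performed */
    if (errno == ERANGE)
        return -1;                  /* value does not fit in a long */
    if (*end != L'\0')
        return -1;                  /* unconverted characters remain */
    *out = value;
    return 0;
}
```

Without the errno reset, a stale ERANGE value left by an earlier call could be mistaken for an overflow in this one.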

Wide Character String Copy Subroutines

The following NLS subroutines copy wide character strings:

wcscpy Copies a wide character string to another wide character string.
wcsncpy Copies a specific number of characters from a wide character string to another wide character string.
wcscat Appends a wide character string to another wide character string.
wcsncat Appends a specific number of characters from a wide character string to another wide character string.

Example

The following example uses the wcscpy subroutine to copy a wide character string into a wide character array:

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Placeholder source string for this example. */
    const wchar_t *pwcs2 = L"sample";
    wchar_t *pwcs1;

    (void)setlocale(LC_ALL, "");
    /*
    **  Allocate the required wide character array.
    */
    pwcs1 = (wchar_t *)malloc((wcslen(pwcs2) + 1) * sizeof(wchar_t));
    if (pwcs1 == NULL) {
        return EXIT_FAILURE;
    }
    wcscpy(pwcs1, pwcs2);
    /*
    **  pwcs1 contains a copy of the wide character string in pwcs2
    */
    free(pwcs1);
    return EXIT_SUCCESS;
}

Wide Character String Search Subroutines

The following NLS subroutines are used to search for wide character strings:

wcschr Searches for the first occurrence of a wide character in a wide character string.
wcsrchr Searches for the last occurrence of a wide character in a wide character string.
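wcspbrk Searches for the first occurrence of any of several wide characters in a wide character string.
wcsspn Determines the number of wide characters in the initial segment of a wide character string that consists only of wide characters from a specified set.
wcscspn Determines the number of wide characters in the initial segment of a wide character string that contains no wide characters from a specified set.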
wcswcs Searches for the first occurrence of a wide character string within another wide character string.
wcstok Breaks a wide character string into a sequence of separate wide character strings.

Examples
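The sketch below shows how the wcschr, wcsspn, and wcscspn subroutines might be combined for simple parsing. The function names value_after_equals and first_word are hypothetical, as is the choice of blank and tab as separators.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Return a pointer to the text after the first '=' in entry,
 * or NULL if entry contains no '='. Hypothetical helper.
 */
const wchar_t *value_after_equals(const wchar_t *entry)
{
    const wchar_t *eq = wcschr(entry, L'=');

    return (eq != NULL) ? eq + 1 : NULL;
}

/*
 * Find the first word in ws. Stores a pointer to its first wide
 * character in *start and returns its length in wide characters.
 */
size_t first_word(const wchar_t *ws, const wchar_t **start)
{
    const wchar_t *blanks = L" \t";

    ws += wcsspn(ws, blanks);       /* skip leading blanks and tabs */
    *start = ws;
    return wcscspn(ws, blanks);     /* count up to the next blank or tab */
}
```

Both functions work on wide character counts, so they behave the same regardless of how many bytes each character occupied in multibyte form.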

Wide Character Input/Output Subroutines

NLS provides subroutines for both formatted and unformatted I/O.

Formatted Wide Character I/O

The printf and scanf subroutines allow for the formatting of wide characters. The printf and scanf subroutines have two additional format specifiers for wide character handling: %C and %S. The %C and %S format specifiers allow I/O on a wide character and a wide character string, respectively. They are similar to the %c and %s format specifiers, which allow I/O on a multibyte character and string.

The multibyte subroutines accept a multibyte array and output a multibyte array. To convert multibyte output from a multibyte subroutine to a wide character string, use the mbstowcs subroutine.

Unformatted Wide Character I/O

Unformatted wide character I/O subroutines are used when a program requires code set-independent I/O for characters from multibyte code sets. For example, use the fgetwc or getwc subroutine to input a multibyte character. If the program uses the getc subroutine to input a multibyte character, the program must call the getc subroutine once for each byte in the multibyte character.

Wide character input subroutines read multibyte characters from a stream and convert them to wide characters. The conversion is done as if the subroutines call the mbtowc and mbstowcs subroutines.

Wide character output subroutines convert wide characters to multibyte characters and write the result to the stream. The conversion is done as if the subroutines call the wctomb and wcstombs subroutines.

The LC_CTYPE category of the current locale affects the behavior of wide character I/O subroutines.

Reading and Processing an Entire File

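If a program must go through an entire file that must be handled in wide character code form, it can read blocks of data and convert them with the mbstowcs subroutine, handling the special cases where a multibyte character splits across successive blocks, or it can read the file with the fgetwc or fgetws subroutine.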

The decision of which of these methods to use should be made on a per-program basis. The fgetws subroutine option is recommended, as it is capable of optimum performance and the program does not have to handle the special cases.

Input Subroutines

The wint_t data type is required to represent the wide character code value as well as the end-of-file (EOF) marker. For example, consider the case of the fgetwc subroutine, which returns a wide character code value:

wchar_t fgetwc();
If the wchar_t data type is defined as a char value, the y-umlaut symbol cannot be distinguished from the end-of-file (EOF) marker in the ISO8859-1 code set. The 0xFF code point is a valid character (y umlaut). Hence, the return value cannot be the wchar_t data type. A data type is needed that can hold both the EOF marker and all the code points in a code set.
int fgetwc();
On some machines, the int data type is defined to be 16 bits. When the wchar_t data type is larger than 16 bits, the int value cannot represent all the return values.

The wint_t data type is therefore needed to represent the fgetwc subroutine return value. The wint_t data type is defined in the wchar.h file.

The following subroutines are used for wide character input:

fgetwc Gets next wide character from a stream.
fgetws Gets a string of wide characters from a stream.
getwc Gets next wide character from a stream.
getwchar Gets next wide character from standard input.
getws Gets a string of wide characters from standard input.
ungetwc Pushes a wide character onto a stream.

Output Subroutines

The following subroutines are used for wide character output:

fputwc Writes a wide character to an output stream.
fputws Writes a wide character string to an output stream.
putwc Writes a wide character to an output stream.
putwchar Writes a wide character to standard output.
putws Writes a wide character string to standard output.

Examples
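The following is a minimal sketch of unformatted wide character input. It reads a stream with the fgetwc subroutine and stores each result in a wint_t variable so that the end-of-file value (WEOF) can be distinguished from valid characters. The function name count_wide_chars is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Count the wide characters remaining in the stream fp.
 * Returns -1 if a read or conversion error occurs.
 * Hypothetical helper for illustration only.
 */
long count_wide_chars(FILE *fp)
{
    long count = 0;
    wint_t wc;                      /* wint_t so that WEOF is representable */

    while ((wc = fgetwc(fp)) != WEOF)
        count++;
    return ferror(fp) ? -1 : count;
}
```

A program could, for instance, write text to a file with the fputws subroutine, reposition to the start, and call this function to count characters rather than bytes.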

Working with the Wide Character Constant

A wide character constant is introduced by the L specifier and yields the wchar_t value of a character for assignment purposes. Use the L constant for ASCII characters only. For ASCII characters, the L constant value is numerically the same as the code point value of the character; for example, L'a' has the same value as 'a'. The following statement assigns a wide character constant:

wchar_t wc = L'x' ; 

A wide character code corresponding to the character x is stored in wc. The C compiler converts the character x using the mbtowc or mbstowcs subroutine as appropriate. This conversion to wide characters is based on the current locale setting at compile time. Because ASCII characters are part of all supported code sets and the wide character representation of all ASCII characters is the same in all locales, L'x' results in the same value across all code sets. However, if the character x is non-ASCII, the program may not work when it is run on a different code set than used at compile time. This limitation impacts some programs that use switch statements using the wide character constant representation.

wchar.h Header File

The wchar.h header file declares information that is necessary for programming with multibyte and wide character subroutines. The wchar.h header file declares the wchar_t, wctype_t, and wint_t data types, as well as several functions for testing wide characters. Because the number of characters implemented as wide characters exceeds that of basic characters, it is not possible to classify all wide characters into the existing classes used for basic characters. Therefore, it is necessary to provide a way of defining additional classes specific to some locale. The action of these subroutines is affected by the current locale.

The wchar.h header file also declares subroutines for manipulating wide character strings (that is, wchar_t data type arrays). Array length is always determined in terms of the number of wchar_t elements in an array. A null wide character code ends an array. A pointer to a wchar_t data type array or void array always points to the initial element of the array.

Note: If the number of wchar_t elements in an array exceeds the defined array length, unpredictable results can occur.

Internationalized Regular Expression Subroutines

Programs that contain internationalized regular expressions can use the regcomp, regexec, regerror, regfree, and fnmatch subroutines.

The following subroutines are available for use with internationalized regular expressions.

regcomp
Compiles a specified basic or extended regular expression into an executable string.
regexec
Compares a null-terminated string with a compiled basic or extended regular expression that must have been previously compiled by a call to the regcomp subroutine.
regerror
Provides a mapping from error codes returned by the regcomp and regexec subroutines to printable strings.
regfree
Frees any memory allocated by the regcomp subroutine associated with the compiled basic or extended regular expression. The expression is no longer treated as a compiled basic or extended regular expression after it is given to the regfree subroutine.
fnmatch
Checks a specified string to see if it matches a specified pattern. You can use the fnmatch subroutine in an application that reads a dictionary to find which entries match a given pattern. You also can use the fnmatch subroutine to match path names to patterns.
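The usual calling sequence for these subroutines (compile, execute, report errors, free) might be sketched as follows. The wrapper names ere_matches and name_matches are hypothetical, and the choice of extended regular expressions with REG_NOSUB is an assumption made to keep the example short.

```c
#include <fnmatch.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if text matches the extended regular expression pattern,
 * 0 if it does not, and -1 if the pattern cannot be compiled.
 * Hypothetical helper for illustration only.
 */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0) {
        /* Map the error code to a printable string. */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);                   /* release memory held by the pattern */
    return rc == 0;
}

/* Return 1 if name matches the shell-style pattern, 0 otherwise. */
int name_matches(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, 0) == 0;
}
```

Bracket expressions such as [[:alpha:]] are interpreted according to the current locale, so a program would normally call setlocale(LC_ALL, "") before compiling patterns.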
