National Language Support Guide and Reference

Multibyte Code and Wide Character Code Conversion Subroutines

The internationalized environment of NLS blends multibyte and wide character subroutines. The decision of when to use wide character or multibyte subroutines can be made only after careful analysis.

If a program primarily uses multibyte subroutines, it may be necessary to convert the multibyte character codes to wide character codes before certain wide character subroutines can be used. If a program uses wide character subroutines, data may need to be converted to multibyte form when invoking subroutines. Both methods have drawbacks, depending on the program in use and the availability of standard subroutines to perform the required processing. For instance, the wide character display-column-width subroutine has no corresponding standard multibyte subroutine.

If a program can process its characters in multibyte form, this method should be used instead of converting the characters to wide character form.

Attention: The conversion between multibyte and wide character code depends on the current locale setting. Do not exchange wide character codes between two processes, unless you have knowledge that each locale that might be used handles wide character codes in a consistent fashion. With the exception of locales based on the IBM-eucTW codeset, AIX locales use the Unicode character value as a wide character code.

Multibyte Code to Wide Character Code Conversion Subroutines

The following subroutines are used when converting from multibyte code to wide character code:

mblen: Determines the length of a multibyte character. Do not use p++ to increment a pointer in a multibyte string. Use the mblen subroutine to determine the number of bytes that compose a character.
mbstowcs: Converts a multibyte string to a wide character string.
mbtowc: Converts a multibyte character to a wide character.
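
The advice about not using p++ can be illustrated with a short sketch that counts characters by stepping through a string with the mblen subroutine. The function name mb_char_count is hypothetical and is not part of NLS; the sketch assumes a null-terminated string in the code set of the current locale.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the characters (not bytes) in the multibyte string s.
 * Returns -1 if an invalid multibyte sequence is found.
 * mb_char_count is a hypothetical helper name.
 */
long mb_char_count(const char *s)
{
    long count = 0;

    assert(s != NULL);
    (void)mblen(NULL, 0);              /* reset any shift state */
    while (*s != '\0') {
        int len = mblen(s, MB_CUR_MAX);
        if (len <= 0)
            return -1;                 /* invalid multibyte sequence */
        s += len;                      /* advance by a whole character */
        count++;
    }
    return count;
}
```

In a single-byte locale the result equals the value returned by strlen; in a multibyte locale it can be smaller.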

Wide Character Code to Multibyte Code Conversion Subroutines

The following subroutines are used when converting from wide character code to multibyte character code:

wcslen: Determines the number of wide characters in a wide character string.
wcstombs: Converts a wide character string to a multibyte character string.
wctomb: Converts a wide character to a multibyte character.

Examples

The following example uses the mbtowc subroutine to convert a character in multibyte character code to wide character code:

#include <stdlib.h>
#include <locale.h>
 
main()
{
    char     *s;
    wchar_t  wc;
    int      n;
 
    (void)setlocale(LC_ALL,"");
 
    /*
    **  s points to the character string that needs to be
    **  converted to a wide character to be stored in wc.
    */
    n = mbtowc(&wc, s, MB_CUR_MAX);
 
    if (n == -1){
        /*  Error handle  */
    }
    if (n == 0){
        /*  case of s pointing to a null character  */
    }
 
    /*
    **  wc contains the process code for the multibyte character
    **  pointed to by s.
    */ 
}

The following example uses the wctomb subroutine to convert a character in wide character code to multibyte character code:

#include <stdlib.h>              /* for wctomb and wchar_t  */
#include <limits.h>              /* for MB_LEN_MAX          */
#include <locale.h>              /* for setlocale           */
 
main()
{
    char    s[MB_LEN_MAX];       /*  system wide maximum number of
                                 **  bytes in a multibyte character. */
    wchar_t wc;
    int     n;
 
    (void)setlocale(LC_ALL,"");
 
    /*
    **  wc is the wide character code to be converted to
    **  multibyte character code.
    */
    n = wctomb(s, wc);
 
    if(n == -1){
        /* wc is not a valid wide character */
    }
    /*
    ** n has the number of bytes contained in the multibyte
    ** character stored in s.
    */
}

The following example uses the mblen subroutine to find the byte length of a character in multibyte character code:

#include <stdlib.h>
#include <locale.h>
 
main()
{ 
    char *name = "h";
    int  n;
 
    (void)setlocale(LC_ALL,"");
 
    n = mblen(name, MB_CUR_MAX);
    /*
    **  The count returned in n is the multibyte length.
    **  It is always less than or equal to the value of
    **  MB_CUR_MAX in stdlib.h
    */
    if(n == -1){
        /* Error Handling */
    }
}

The following example obtains a previous character position in a multibyte string. If you need to determine the previous character position, starting from a current character position (not a random byte position), step through the buffer starting at the beginning. Use the mblen subroutine until the current character position is reached, and save the previous character position to obtain the needed character position.

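char buf[];     /* contains the multibyte string */
char *cur;      /* points to the current character position */
char *prev;     /* points to previous multibyte character */
char *p;        /* moving pointer */
int  i;         /* byte length of the character at p */
size_t mbcurmax = MB_CUR_MAX;  /* maximum bytes per character in this locale */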

/* initialize the buffer and pointers as needed */
/* loop through the buffer until the moving pointer reaches
** the current character position in the buffer, always
** saving the last character position in prev pointer */
p = prev = buf;

/* cur points to a valid character somewhere in buf */
while(p< cur){
        prev = p;
        if( (i=mblen(p, mbcurmax))<=0){
                /* invalid multibyte character or null */
                /* You can have a different error handling
                ** strategy */
                p++;    /* skip it */
        }else {
                p += i;
        }
}
/* prev will point to the previous character position */

/* Note that if( prev == cur), then it means that there was
** no previous character. Also, if all bytes up to the
** current character are invalid, it will treat them as
** all valid single-byte characters and this may not be what
** you want. One may change this to handle another method of
** error recovery. */

The following example uses the mbstowcs subroutine to convert a multibyte string to a wide character string:

#include <stdlib.h>
#include <string.h>              /* for strlen */
#include <locale.h>
 
main()
{
     char    *s;
     wchar_t *pwcs;
     size_t  retval, n;
 
     (void)setlocale(LC_ALL, "");

     n = strlen(s) + 1;          /*string length + terminating null */
 
     /*  Allocate required wchar array    */
     pwcs = (wchar_t *)malloc(n * sizeof(wchar_t) );
     retval = mbstowcs(pwcs, s, n);
     if(retval == (size_t)-1){
          /*  Error handle  */
     }
     /*
     ** pwcs contains the wide character string.
     */
}

The following example illustrates the problems with using the mbstowcs subroutine on a large block of data for conversion to wide character form. When it encounters a multibyte character that is not valid, the mbstowcs subroutine returns a value of -1 but does not specify where the error occurred. Therefore, the mbtowc subroutine must be used repeatedly to convert one character at a time to wide character code.

Note

Processing in this manner can considerably slow program performance.

During the conversion of single-byte code sets, there is no possibility for partial multibytes. However, during the conversion of multibyte code sets, partial multibytes are copied to a save buffer. During the next call to the read subroutine, the partial multibyte is prefixed to the rest of the byte sequence.

Note

A null-terminated wide character string is obtained. Optional error handling can be done if an instance of an invalid byte sequence is found.

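#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>              /* for strncpy      */
#include <limits.h>              /* for MB_LEN_MAX   */
#include <fcntl.h>               /* for open         */
#include <unistd.h>              /* for read         */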
 
main(int argc, char *argv[])
{
 
    char     *curp, *cure;
    int      bytesread, bytestoconvert, leftover;
    int      invalid_multibyte, mbcnt, wcnt;
    wchar_t  *pwcs;
    wchar_t  wbuf[BUFSIZ+1];
    char     buf[BUFSIZ+1];
    char     savebuf[MB_LEN_MAX];
    size_t   mb_cur_max;
    int      fd;
        /*
        **  MB_LEN_MAX specifies the system wide constant for
        **  the maximum number of bytes in a multibyte character.
        */
 
    (void)setlocale(LC_ALL, "");
    mb_cur_max = MB_CUR_MAX;
 
    fd = open(argv[1], 0);
    if(fd < 0){
       /*  error handle  */
    }
 
    leftover = 0;
    if(mb_cur_max==1){    /*  Single byte code sets case  */
        for(;;){
            bytesread = read(fd, buf, BUFSIZ);
            if(bytesread <= 0)
                break;
            buf[bytesread] = '\0';   /* Null terminate string */
            mbstowcs(wbuf, buf, bytesread+1);
            /*  Process using the wide character buffer   */
        }
                        /*  File processed ...   */
        exit(0);        /*  End of program       */
 
    }else{              /*  Multibyte code sets  */
        leftover = 0;
 
        for(;;) {
            if(leftover)
                strncpy(buf, savebuf ,leftover);
            bytesread=read(fd,buf+leftover, BUFSIZ-leftover);
            if(bytesread <= 0)
                break;
 
            buf[leftover+bytesread] = '\0';
                     /* Null terminate string */
            invalid_multibyte = 0;
            bytestoconvert = leftover+bytesread;
            cure= buf+bytestoconvert;
            leftover=0;
            pwcs = wbuf;
                /* Stop processing when invalid mbyte found. */
            curp= buf;
 
            for(;curp<cure;){
                mbcnt = mbtowc(pwcs,curp, mb_cur_max);
                if(mbcnt>0){
                    curp += mbcnt;
                    pwcs++;
                    continue;
 
                }else{
                    /* More data needed on next read*/
                    if ( cure-curp<mb_cur_max){
                        leftover=cure-curp;
                        strncpy(savebuf,curp,leftover);
                        /* Null terminate before partial mbyte */
                        *curp=0; 
                        break;
 
                    }else{
                            /*Invalid multibyte found */
                        invalid_multibyte =1;
                        break;
                    }
                }
            }
            if(invalid_multibyte){          /*error handle */
            }
            /* Process the wide char buffer */
        }
    }
}

The following example uses the wcstombs and wcslen subroutines to convert a wide character string to multibyte form:

#include <stdlib.h>
#include <locale.h>
 
main()
{
    wchar_t *pwcs; /* Source wide character string */
    char *s;       /* Destination multibyte character string */
    size_t n;
    size_t retval;
 
    (void)setlocale(LC_ALL, "");
    /*
    ** Calculate the maximum number of bytes needed to
    ** store the wide character buffer in multibyte form in the 
    ** current code page and malloc() the appropriate storage,
    ** including the terminating null.
    */
    n = wcslen(pwcs) * MB_CUR_MAX + 1;
    s = (char *) malloc(n);
    retval = wcstombs(s, pwcs, n);
    if( retval == (size_t)-1) {
        /* Error handle */
    }
    /* s points to the multibyte character string. */
}

Wide Character Classification Subroutines

The majority of wide character classification subroutines are similar to traditional character classification subroutines, except that wide character classification subroutines operate on a wchar_t data type argument passed as a wint_t data type argument.

Generic Wide Character Classification Subroutines

In the internationalized environment of National Language Support, you need the ability to create new character class properties. For example, several properties are defined for Japanese characters that are not applicable to the English language. As more languages are supported, a framework enabling applications to deal with a varying number of character properties is needed. The wctype and iswctype subroutines allow handling of character classes in a general fashion. These subroutines are used to allow for both user-defined and language-specific character classes.

The action of wide character classification subroutines is affected by the definitions in the LC_CTYPE category for the current locale.

To create new character classifications for use with the wctype and iswctype subroutines, create a new character class in the LC_CTYPE category and generate the locale using the localedef command. A user application obtains this locale data with the setlocale subroutine. The program can then access the new classification subroutines by using the wctype subroutine to get the wctype_t property handle. It then passes to the iswctype subroutine both the property handle and the wide character code of the character to be tested.

The following subroutines are used for wide character classification:

wctype: Obtains handle for character property classification.
iswctype: Tests for character property.

Standard Wide Character Classification Subroutines

The isw* subroutines determine various aspects of a standard wide character classification. The isw* subroutines also work with single-byte code sets. Use the isw* subroutines in preference to the wctype and iswctype subroutines. Use the wctype and iswctype subroutines only for extended character class properties (for example, Japanese language properties).

When using the wide character functions to convert the case in several blocks of data, the application must convert characters from multibyte to wide character code form. Because this can affect performance in single-byte code set locales, consider providing two conversion paths in your application. The traditional path for single-byte code set locales would convert case using the isupper, islower, toupper, and tolower subroutines. The alternate path for multibyte code set locales would convert multibyte characters to wide character code form and convert case using the iswupper, iswlower, towupper, and towlower subroutines. When converting multibyte characters to wide character code form, an application needs to handle special cases where a multibyte character may split across successive blocks.
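
The two conversion paths might be sketched as follows. The function name mb_upcase is hypothetical, and the sketch assumes that a complete null-terminated string is available, so it does not deal with characters split across blocks.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Convert the null-terminated multibyte string s to uppercase in place.
 * Returns 0 on success, or -1 on an invalid sequence, an allocation
 * failure, or a result that does not fit in the original storage.
 * mb_upcase is a hypothetical helper name.
 */
int mb_upcase(char *s)
{
    size_t bytes, nwc, out, i;
    wchar_t *ws;
    char *tmp;
    int rc = -1;

    assert(s != NULL);
    if (MB_CUR_MAX == 1) {
        /* Traditional path: one byte is one character. */
        for (; *s != '\0'; s++)
            *s = (char)toupper((unsigned char)*s);
        return 0;
    }

    /* Alternate path: convert through wide character code. */
    bytes = strlen(s);
    ws = malloc((bytes + 1) * sizeof(wchar_t));
    tmp = malloc(bytes + 1);
    if (ws != NULL && tmp != NULL) {
        nwc = mbstowcs(ws, s, bytes + 1);
        if (nwc != (size_t)-1) {
            for (i = 0; i < nwc; i++)
                ws[i] = (wchar_t)towupper((wint_t)ws[i]);
            out = wcstombs(tmp, ws, bytes + 1);
            if (out != (size_t)-1 && out <= bytes) {
                memcpy(s, tmp, out + 1);   /* copy text and null */
                rc = 0;
            }
        }
    }
    free(ws);
    free(tmp);
    return rc;
}
```

The single-byte path avoids both allocations and both conversions, which is the performance benefit the preceding paragraph describes.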

The following is a list of standard wide character classification subroutines:

iswalnum: Tests for alphanumeric character classification.
iswalpha: Tests for alphabetic character classification.
iswcntrl: Tests for control character classification.
iswdigit: Tests for digit character classification.
iswgraph: Tests for graphic character classification.
iswlower: Tests for lowercase character classification.
iswprint: Tests for printable character classification.
iswpunct: Tests for punctuation character classification.
iswspace: Tests for space character classification.
iswupper: Tests for uppercase character classification.
iswxdigit: Tests for hexadecimal-digit character classification.

Wide Character Case Conversion Subroutines

The following subroutines convert cases for wide characters. The action of wide character case conversion subroutines is affected by the definition in the LC_CTYPE category for the current locale.

towlower: Converts an uppercase wide character to a lowercase wide character.
towupper: Converts a lowercase wide character to an uppercase wide character.

Example

The following example uses the wctype subroutine to test for the NEW_CLASS character classification:

#include <wchar.h>               /* for wint_t               */
#include <wctype.h>              /* for wctype and iswctype  */
#include <locale.h>
#include <stdlib.h>

main()
{
    wint_t    wc;
    int       retval;
    wctype_t  chandle;
    
    (void)setlocale(LC_ALL,"");
    /*
    ** Obtain the character property handle for the NEW_CLASS
    ** property.
    */
    chandle = wctype("NEW_CLASS") ;
    if(chandle == (wctype_t)0){
        /* Invalid property. Error handle. */
    }
    /* Let wc be the wide character code for a character */
    /* Test if wc has the property of NEW_CLASS */
    retval = iswctype( wc, chandle ); 
    if( retval != 0 ) {
        /*
        ** wc has the property NEW_CLASS. 
        */
    }else{
        /* 
        ** The character represented by wc does not have the 
        ** property NEW_CLASS.
        */
    }
}
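
The NEW_CLASS example depends on a locale that defines that class. A variant that can be tried with any locale passes a standard class name, such as "digit" or "alpha", to the wctype subroutine. The function name count_in_class is hypothetical; this is a sketch, not a definitive implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the wide characters in ws that have the named property.
 * Returns -1 if the current locale does not define the property.
 * count_in_class is a hypothetical helper name.
 */
long count_in_class(const wchar_t *ws, const char *property)
{
    wctype_t handle;
    long count = 0;

    assert(ws != NULL && property != NULL);
    handle = wctype(property);         /* obtain the property handle */
    if (handle == (wctype_t)0)
        return -1;                     /* property not defined */
    for (; *ws != L'\0'; ws++) {
        if (iswctype((wint_t)*ws, handle))
            count++;
    }
    return count;
}
```

Because the property is named at run time, the same function works for user-defined classes once the locale that defines them is in effect.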

Wide Character Display Column Width Subroutines

When characters are displayed or printed, the number of columns occupied by a character may differ. For example, a Kanji character (Japanese language) may occupy more than one column position. The number of display columns required by each character is part of the National Language Support locale database. The LC_CTYPE category defines the number of columns needed to display a character.

No standard multibyte display-column-width subroutines exist. For portability, convert multibyte codes to wide character codes and use the required wide character display-width subroutines. However, if the __max_disp_width macro (defined in the stdlib.h file) is set to 1 and a single-byte code set is in use, then the display-column widths of all characters (except tabs) in the code set are the same, and are equal to 1. In this case, the strlen (string) subroutine gives the display column width of the specified string, as shown in the following example:

#include <stdlib.h>
        int display_column_width;  /* number  of  display  columns  */
        char  *s;                  /*  character  string            */
        ....
        if((MB_CUR_MAX  ==  1)  &&  (__max_disp_width  ==  1)){
                display_column_width  =  strlen(s);
                                   /*  s  is  a  string  pointer    */
        }

The following subroutines find the display widths for wide character strings:

wcswidth: Determines the display width of a wide character string.
wcwidth: Determines the display width of a wide character.

Examples

The following example uses the wcwidth subroutine to find the display column width of a wide character:

#include  <string.h>
#include  <locale.h>
#include  <stdlib.h>
  
main()
{
    wint_t  wc;
    int     retval;
 
    (void)setlocale(LC_ALL,  "");
 
    /*
    **    Let wc be the wide character whose display width is
    **    to be found.
    */
    retval  =  wcwidth(wc);
    if(retval  ==  -1){
        /*
        **  Error handling. Invalid or nonprintable
        **  wide character in wc.
        */
    }
}

The following example uses the wcswidth subroutine to find the display column width of a wide character string:

#include  <string.h>
#include  <locale.h>
#include  <stdlib.h>
 
main()
{
    wchar_t  *pwcs;
    int      retval;
    size_t   n;
 
    (void)setlocale(LC_ALL,  "");
    /*
    **    Let pwcs point to a wide character null
    **    terminated string.
    **    Let n be the number of wide characters
    **    whose display column width is to be determined.
    */
    retval  =  wcswidth(pwcs,  n);
    if(retval  ==  -1){
        /*
        **  Error handling. Invalid or nonprintable wide
        **  character code encountered in the wide
        **  character string pwcs.
        */
    }
}
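
The portable approach described earlier in this section, converting a multibyte string to wide character code and then measuring it, might be sketched as follows. The function name mbs_display_width is hypothetical, and the _XOPEN_SOURCE definition is an assumption about what some systems require before they declare the wcswidth subroutine.

```c
#define _XOPEN_SOURCE 700   /* assumption: some systems need this for wcswidth */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Return the number of display columns needed for the multibyte
 * string s, or -1 if s is not valid, contains a nonprintable
 * character, or memory cannot be allocated.
 * mbs_display_width is a hypothetical helper name.
 */
int mbs_display_width(const char *s)
{
    size_t size, nwc;
    wchar_t *ws;
    int width;

    assert(s != NULL);
    size = strlen(s) + 1;              /* upper bound on wide characters */
    ws = malloc(size * sizeof(wchar_t));
    if (ws == NULL)
        return -1;
    nwc = mbstowcs(ws, s, size);
    if (nwc == (size_t)-1) {
        free(ws);
        return -1;                     /* invalid multibyte sequence */
    }
    width = wcswidth(ws, nwc);
    free(ws);
    return width;
}
```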

Multibyte and Wide Character String Collation Subroutines

Strings can be compared in the following ways:

Using the ordinal (binary) values of the characters.
Using the weights associated with the characters for each locale, as determined by the LC_COLLATE category.

National Language Support (NLS) uses the second method.

Collation is a locale-specific property of characters. A weight is assigned to each character to indicate its relative order for sorting. A character may be assigned more than one weight. Weights are prioritized as primary, secondary, tertiary, and so forth. The maximum number of weights assigned each character is system-defined.

A process inherits the C locale or POSIX locale at its startup time. When the setlocale (LC_ALL, "") subroutine is called, a process obtains its locale based on the LC_* and LANG environment variables. The following subroutines are affected by the LC_COLLATE category and determine how two strings will be sorted in any given locale.

Note

Collation-based string comparisons take a long time because of the processing involved in obtaining the collation values. Perform such comparisons only when necessary. If you need to determine whether two wide character strings are equal, do not use the wcscoll and wcsxfrm subroutines; use the wcscmp subroutine instead.

The following subroutines compare multibyte character strings:

strcoll: Compares the collation weights of multibyte character strings.
strxfrm: Converts a multibyte character string to values representing character collation weights.

The following subroutines compare wide character strings:

wcscoll: Compares the collation weights of wide character strings.
wcsxfrm: Converts a wide character string to values representing character collation weights.
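
As a minimal sketch of the multibyte case, the strcoll subroutine can serve as the comparison step for the qsort subroutine. The names compare_collated and sort_collated are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: order two strings by their LC_COLLATE weights. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Sort an array of multibyte strings into the collation order of
 * the current locale. sort_collated is a hypothetical helper name.
 */
void sort_collated(const char **names, size_t count)
{
    assert(names != NULL);
    qsort(names, count, sizeof names[0], compare_collated);
}
```

In the C locale the result matches a strcmp ordering; after a call to setlocale the same code follows the weights of the selected locale.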

Examples

The following example uses the wcscoll subroutine to compare two wide character strings based on their collation weights:

#include  <stdio.h>
#include  <string.h>
#include  <locale.h>
#include  <stdlib.h>
 
#include  <errno.h>
 
main()
{
    wchar_t  *pwcs1, *pwcs2;
    int      n;
 
    (void)setlocale(LC_ALL,  "");
    
    /*    set it to zero for checking errors on wcscoll    */
    errno  =  0;
    /*
    **    Let pwcs1 and pwcs2 be two wide character strings to
    **    compare.
    */
    n  =  wcscoll(pwcs1, pwcs2);
        /*
        **    If errno is set then it indicates some
        **    collation error.
        */
    if(errno  !=  0){
        /*  error has occurred... handle error ...*/
    }
}

The following example uses the wcsxfrm subroutine to compare two wide character strings based on collation weights:

Note

Determining the size n (where n is a number) of the transformed string, when using the wcsxfrm subroutine, can be accomplished in one of the following ways:

For each character in the wide character string, the number of bytes for possible collation values cannot exceed the COLL_WEIGHTS_MAX * sizeof(wchar_t) value. This value, multiplied by the number of wide character codes, gives the buffer length needed. To the buffer length add 1 for the terminating wide character null. This strategy may slow down performance.
Estimate the byte-length needed. If the previously obtained value is not enough, increase it. This may not satisfy all strings but gives maximum performance.
Call the wcsxfrm subroutine twice: first to find the value of n, and a second time to transform the string using this n value. This strategy slows down performance because the wcsxfrm subroutine is called twice. However, it yields a precise value for the buffer size needed to store the transformed string.

The method you choose depends on the characteristics of the strings used in the program and the performance objectives of the program.

#include  <stdio.h>
#include  <string.h>
#include  <locale.h>
#include  <stdlib.h>
 
main()
{
    wchar_t  *pwcs1, *pwcs2, *pwcs3, *pwcs4;
    size_t   n, retval;
  
    (void)setlocale(LC_ALL, "");
    /*
    **  Let the string pointed to by pwcs1 and pwcs3 be the
    **  wide character arrays to store the transformed wide
    **  character strings. Let the strings pointed to by pwcs2
    **  and pwcs4 be the wide character strings to compare based
    **  on the collation values of the wide characters in these
    **  strings.
    **  Let n be large enough (say,BUFSIZ) to transform the two
    **  wide character strings specified by pwcs2 and pwcs4.
    **
    **  Note:
    **  In practice, it is best to call wcsxfrm if the wide
    **  character string is to be compared several times to
    **  different wide character strings.
    */
 
    do {
        retval = wcsxfrm(pwcs1, pwcs2, n);
        if(retval == (size_t)-1){
            /*  error has occurred.  */
            /*  Process the error if needed  */
            break;
        }
 
        if(retval >= n ){
        /*
        ** Increase the value of n and use a bigger buffer pwcs1.
        */
        }
    }while (retval >= n);
 
    do {
        retval = wcsxfrm(pwcs3, pwcs4, n);
        if (retval == (size_t)-1){
            /*  error has occurred.  */
            /*  Process the error if needed  */
            break;
        }
 
        if(retval >= n){
        /*Increase the value of n and use a bigger buffer pwcs3.*/
        }
    }while (retval >= n);
    retval = wcscmp(pwcs1, pwcs3);
    /*  retval has the result  */
}
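
The third sizing strategy in the preceding note, calling the wcsxfrm subroutine twice, might be sketched as follows. The function name wcs_transform is hypothetical. The sketch relies on the ISO C rule that the destination may be a null pointer when the length argument is 0.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Return a newly allocated transformed copy of src, or NULL on
 * error. The caller frees the result.
 * wcs_transform is a hypothetical helper name.
 */
wchar_t *wcs_transform(const wchar_t *src)
{
    size_t needed, written;
    wchar_t *dst;

    assert(src != NULL);
    needed = wcsxfrm(NULL, src, 0);    /* first call: length only */
    if (needed == (size_t)-1)
        return NULL;
    dst = malloc((needed + 1) * sizeof(wchar_t));
    if (dst == NULL)
        return NULL;
    written = wcsxfrm(dst, src, needed + 1);   /* second call: transform */
    if (written == (size_t)-1 || written > needed) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

Two strings transformed this way can be compared repeatedly with the wcscmp subroutine, and the sign of the result agrees with that of the wcscoll subroutine for the original strings.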

Multibyte and Wide Character String Comparison Subroutines

The strcmp and strncmp subroutines determine if the contents of two multibyte strings are equivalent. If your application needs to know how the two strings differ lexically, use the multibyte and wide character string collation subroutines.

The following NLS subroutines compare wide character strings:

wcscmp: Compares two wide character strings.
wcsncmp: Compares a specific number of wide characters in two wide character strings.

Example

The following example uses the wcscmp subroutine to compare two wide character strings:

#include <string.h>
#include <locale.h>
#include <stdlib.h>
 
main()
{
    wchar_t *pwcs1, *pwcs2;
    int retval;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  pwcs1 and pwcs2 point to two wide character
    **  strings to compare.
    */
    retval = wcscmp(pwcs1, pwcs2);
    /*  retval is less than, equal to, or greater than zero
    **  when pwcs1 is less than, equal to, or greater than pwcs2
    */
}
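
The wcsncmp subroutine is useful when only the leading part of a string matters. The following sketch tests whether one wide character string begins with another; the function name wcs_has_prefix is hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Return nonzero if the wide character string ws begins with prefix.
 * wcs_has_prefix is a hypothetical helper name.
 */
int wcs_has_prefix(const wchar_t *ws, const wchar_t *prefix)
{
    assert(ws != NULL && prefix != NULL);
    /* Compare only as many wide characters as prefix contains. */
    return wcsncmp(ws, prefix, wcslen(prefix)) == 0;
}
```

The comparison uses the ordinal values of the wide character codes, so it tests equality and does not reflect locale-specific collation.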

Wide Character String Conversion Subroutines

The following NLS subroutines convert wide character strings to double-precision floating-point, long integer, and unsigned long integer values:

wcstod: Converts a wide character string to a double-precision floating-point value.
wcstol: Converts a wide character string to a signed long integer.
wcstoul: Converts a wide character string to an unsigned long integer.

Before calling the wcstod, wcstoul, or wcstol subroutine, the errno global variable must be set to 0. Any error that occurs as a result of calling these subroutines can then be handled correctly.

Examples

The following example uses the wcstod subroutine to convert a wide character string to a double-precision floating point:

#include <stdlib.h>
#include <locale.h>
#include <errno.h>
#include <math.h>                /* for HUGE_VAL */
#include <wchar.h>               /* for wcstod   */
 
main()
{
    wchar_t  *pwcs, *endptr;
    double   retval;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Let pwcs point to a wide character null terminated
    **  string containing a floating point value.
    */
    errno = 0;   /*  set errno to  zero  */
    retval = wcstod(pwcs, &endptr);
 
    if(errno != 0){
        /*  errno has changed, so error has occurred */
 
        if(errno == ERANGE){
            /*  correct value is outside range of
            **  representable values. Case of overflow
            **  error
            */
 
             if((retval == HUGE_VAL) ||
                (retval == -HUGE_VAL)){
                /*  Error case. Handle accordingly.  */
            }else if(retval == 0){
                /*  correct value causes underflow   */
                /*  Handle appropriately             */
            }
        }
    }
    /*  retval contains the double.  */
}

The following example uses the wcstol subroutine to convert a wide character string to a signed long integer:

#include <stdlib.h>
#include <locale.h>
#include <errno.h>
#include <stdio.h>
#include <limits.h>              /* for LONG_MAX and LONG_MIN */
#include <wchar.h>               /* for wcstol                */
 
main()
{
    wchar_t   *pwcs, *endptr;
    long int  retval;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Let pwcs point to a wide character null terminated
    **  string containing a signed long integer value.
    */
    errno = 0;  /*  set errno to  zero  */
    retval = wcstol(pwcs, &endptr, 0);
 
    if(errno != 0){
        /*  errno has changed, so error has occurred */
 
        if(errno == ERANGE){
            /*  correct value is outside range of
            **  representable values. Case of overflow
            **  error
            */
 
            if((retval == LONG_MAX) || (retval == LONG_MIN)){
                /*  Error case. Handle accordingly.     */
            }
        }else if(errno == EINVAL){
            /*  The value of base is not supported  */
            /*  Handle appropriately                */
        }
    }
    /*  retval contains the long integer.  */
}

The following example uses the wcstoul subroutine to convert a wide character string to an unsigned long integer:

#include <stdlib.h>
#include <locale.h>
#include <errno.h>
#include <limits.h>              /* for ULONG_MAX */
#include <wchar.h>               /* for wcstoul   */
 
main()
{
    wchar_t    *pwcs, *endptr;
    unsigned long int  retval;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Let pwcs point to a wide character null terminated
    **  string containing an unsigned long integer value.
    */
    errno = 0;   /*  set errno to  zero  */
    retval = wcstoul(pwcs, &endptr, 0);
 
    if(errno != 0){
        /*  error has occurred */
        if(errno == ERANGE && retval == ULONG_MAX){
            /*
            **  Correct value is outside of
            **  representable value. Handle appropriately
            */
        }else if(errno == EINVAL){
            /*  The value of base is not supported  */
            /*  Handle appropriately                    */
        }
    }
    /*  retval contains the unsigned long integer.  */
}
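
The preceding examples store an end pointer but do not inspect it. The following sketch combines the errno check with an end-pointer check so that empty input and trailing characters are also rejected. The function name wcs_to_long is hypothetical, and treating trailing characters as an error is a design choice of this sketch rather than a requirement of the wcstol subroutine.

```c
#include <assert.h>
#include <errno.h>
#include <wchar.h>

/*
 * Convert all of ws to a long integer in the given base.
 * Returns 0 and stores the value in *out on success. Returns -1 if
 * ws has no digits, has trailing characters, is out of range, or
 * the base is not supported.
 * wcs_to_long is a hypothetical helper name.
 */
int wcs_to_long(const wchar_t *ws, int base, long *out)
{
    wchar_t *endptr;
    long value;

    assert(ws != NULL && out != NULL);
    endptr = (wchar_t *)ws;            /* defined even if never stored */
    errno = 0;                         /* set errno to zero before the call */
    value = wcstol(ws, &endptr, base);
    if (errno != 0)
        return -1;                     /* ERANGE or EINVAL */
    if (endptr == ws)
        return -1;                     /* no conversion was performed */
    if (*endptr != L'\0')
        return -1;                     /* trailing unconverted characters */
    *out = value;
    return 0;
}
```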

Wide Character String Copy Subroutines

The following NLS subroutines copy wide character strings:

wcscpy: Copies a wide character string to another wide character string.
wcsncpy: Copies a specific number of characters from a wide character string to another wide character string.
wcscat: Appends a wide character string to another wide character string.
wcsncat: Appends a specific number of characters from a wide character string to another wide character string.

Example

The following example uses the wcscpy subroutine to copy a wide character string into a wide character array:

#include <string.h>
#include <locale.h>
#include <stdlib.h>

main()
{
    wchar_t *pwcs1, *pwcs2;
    size_t  n;
    
    (void)setlocale(LC_ALL, "");
    /*
    **  Allocate the required wide character array.
    */
    pwcs1 = (wchar_t *)malloc( (wcslen(pwcs2) +1)*sizeof(wchar_t));
    wcscpy(pwcs1, pwcs2);
    /*
    **  pwcs1 contains a copy of the wide character string in pwcs2
    */
}
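
When the destination is a fixed-size array, the bounded wcsncpy and wcsncat subroutines can be combined as in the following sketch. The function name wcs_join is hypothetical. The explicit termination after wcsncpy reflects the fact that wcsncpy does not write a terminating null when the source is truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Store first followed by second in dst, which has room for dstlen
 * wide characters including the terminating null. The result is
 * truncated if necessary. Returns the number of wide characters
 * stored, not counting the null.
 * wcs_join is a hypothetical helper name.
 */
size_t wcs_join(wchar_t *dst, size_t dstlen,
                const wchar_t *first, const wchar_t *second)
{
    size_t used;

    assert(dst != NULL && first != NULL && second != NULL);
    assert(dstlen > 0);
    wcsncpy(dst, first, dstlen - 1);
    dst[dstlen - 1] = L'\0';           /* terminate after truncation */
    used = wcslen(dst);
    /* wcsncat appends at most the given count plus a null. */
    wcsncat(dst, second, dstlen - 1 - used);
    return wcslen(dst);
}
```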

Wide Character String Search Subroutines

The following NLS subroutines are used to search for wide character strings:

wcschr: Searches for the first occurrence of a wide character in a wide character string.
wcsrchr: Searches for the last occurrence of a wide character in a wide character string.
wcspbrk: Searches for the first occurrence of any of several wide characters in a wide character string.
wcsspn: Determines the number of wide characters in the initial segment of a wide character string that consists only of specified wide characters.
wcscspn: Determines the number of wide characters in the initial segment of a wide character string that contains none of the specified wide characters.
wcswcs: Searches for the first occurrence of a wide character string within another wide character string.
wcstok: Breaks a wide character string into a sequence of separate wide character strings.
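
As a minimal sketch of the wcstok subroutine, the following function splits a wide character string at a set of delimiters. It assumes the three-argument form of wcstok defined by ISO C; some older systems provide a two-argument form instead. The function name wcs_split is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Split ws in place at any of the wide characters in delims and
 * store up to max token pointers in tokens. Returns the number of
 * tokens stored. wcs_split is a hypothetical helper name.
 */
size_t wcs_split(wchar_t *ws, const wchar_t *delims,
                 wchar_t **tokens, size_t max)
{
    wchar_t *state;                    /* position saved between calls */
    wchar_t *tok;
    size_t count = 0;

    assert(ws != NULL && delims != NULL && tokens != NULL);
    for (tok = wcstok(ws, delims, &state);
         tok != NULL && count < max;
         tok = wcstok(NULL, delims, &state)) {
        tokens[count++] = tok;
    }
    return count;
}
```

The wcstok subroutine overwrites delimiters in ws with null wide characters, so the string passed in must be modifiable.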

Examples

The following example uses the wcschr subroutine to locate the first occurrence of a wide character in a wide character string:

#include <string.h>
#include <locale.h>
#include <stdlib.h>
 
main()
{
    wchar_t *pwcs1, wc, *pws;
    int     retval;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let wc be the wide character to search for.
    **
    */
    pws = wcschr(pwcs1, wc);
    if (pws == (wchar_t *)NULL ){
        /*  wc does not occur in pwcs1  */
    }else{
        /*  pws points to the location where wc is found  */
    }
}

The following example uses the wcsrchr subroutine to locate the last occurrence of a wide character in a wide character string:

#include <string.h>
#include <locale.h>
#include <stdlib.h>
 
main()
{
    wchar_t *pwcs1, wc, *pws;
    int     retval;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let wc be the wide character to search for.
    **
    */
    pws = wcsrchr(pwcs1, wc);
    if (pws == (wchar_t *)NULL ){
        /*  wc does not occur in pwcs1  */
    }else{
        /*  pws points to the location where wc is found  */
    }
}

The following example uses the wcspbrk subroutine to locate the first occurrence of several wide characters in a wide character string:

#include <string.h>
#include <locale.h>
#include <stdlib.h>
 
main()
{
    wchar_t *pwcs1, *pwcs2, *pws;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let pwcs2 be initialized to the wide character string
    **  that contains wide characters to search for.
    */
    pws = wcspbrk(pwcs1, pwcs2);
 
    if (pws == (wchar_t *)NULL ){
        /* No wide character from pwcs2 is found in pwcs1    */
    }else{
        /* pws points to the location where a match is found */
    }
}

The following example uses the wcsspn subroutine to determine the length of the initial segment of a wide character string that consists entirely of wide characters from a second wide character string:

#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wchar_t *pwcs1, *pwcs2;
    size_t count;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let pwcs2 be initialized to the wide character string
    **  that contains wide characters to search for.
    */
    count = wcsspn(pwcs1, pwcs2);
    /*
    **  count contains the length of the segment.
    */
}

The following example uses the wcscspn subroutine to determine the length of the initial segment of a wide character string that consists entirely of wide characters not in a second wide character string:

#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wchar_t *pwcs1, *pwcs2;
    size_t count;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let pwcs2 be initialized to the wide character string
    **  that contains wide characters to search for.
    */
    count = wcscspn(pwcs1, pwcs2);
    /*
    **  count contains the length of the segment consisting
    **  of characters not in pwcs2.
    */
}

The following example uses the wcswcs subroutine to locate the first occurrence of a wide character string within another wide character string:

#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wchar_t *pwcs1, *pwcs2, *pws;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Let pwcs1 point to a wide character null terminated string.
    **  Let pwcs2 be initialized to the wide character string
    **  that contains the wide character sequence to locate.
    */
    pws = wcswcs(pwcs1, pwcs2);
    if (pws == (wchar_t *)NULL){
       /*  wide character sequence pwcs2 is not found in pwcs1 */
    }else{
        /*
        **  pws points to the first occurrence of the sequence
        **  specified by pwcs2 in pwcs1.
        */
    }
}

The following example uses the wcstok subroutine to tokenize a wide character string:

#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    /*  wcstok modifies the string, so use an array, not a literal  */
    wchar_t pwcs1[] = L"?a???b,,,#c";
    wchar_t *pwcs, *state;
 
    (void)setlocale(LC_ALL, "");
    /*  ISO C form: the third argument saves the scan position  */
    pwcs = wcstok(pwcs1, L"?", &state);
    /*  pwcs points to the token:  L"a"  */
    pwcs = wcstok((wchar_t *)NULL, L",", &state);
    /*  pwcs points to the token:  L"??b"  */
    pwcs = wcstok((wchar_t *)NULL, L"#,", &state);
    /*  pwcs points to the token:  L"c"  */
}

Wide Character Input/Output Subroutines

NLS provides subroutines for both formatted and unformatted I/O.

Formatted Wide Character I/O

The printf and scanf subroutines allow for the formatting of wide characters. The printf and scanf subroutines have two additional format specifiers for wide character handling: %C and %S. The %C and %S format specifiers allow I/O on a wide character and a wide character string, respectively. They are similar to the %c and %s format specifiers, which allow I/O on a multibyte character and string.

The multibyte subroutines accept a multibyte array and output a multibyte array. To convert multibyte output from a multibyte subroutine to a wide character string, use the mbstowcs subroutine.
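The conversion described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name format_to_wide, its "name=value" format, and the 256-byte intermediate buffer are assumptions made for this example. The multibyte snprintf subroutine produces the text, and the mbstowcs subroutine then converts it to a wide character string.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
**  Hypothetical helper: formats "name=value" as a multibyte string,
**  then converts it to wide characters in dest, which has room for
**  destlen wchar_t elements.  Returns the number of wide characters
**  stored (not counting the null), or (size_t)-1 on error.
*/
size_t format_to_wide(wchar_t *dest, size_t destlen,
                      const char *name, int value)
{
    char   mbs[256];    /* assumed large enough for this sketch */
    int    n;
    size_t count;

    if (destlen == 0)
        return (size_t)-1;

    n = snprintf(mbs, sizeof(mbs), "%s=%d", name, value);
    if (n < 0 || (size_t)n >= sizeof(mbs))
        return (size_t)-1;      /*  formatting error or truncation  */

    count = mbstowcs(dest, mbs, destlen);
    if (count == (size_t)-1)
        return (size_t)-1;      /*  invalid multibyte sequence  */
    if (count == destlen)
        return (size_t)-1;      /*  no room left for the null  */
    return count;
}
```

Because the mbstowcs subroutine depends on the LC_CTYPE category, a caller would normally call setlocale(LC_ALL, "") before using a helper such as this one.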

Unformatted Wide Character I/O

Unformatted wide character I/O subroutines are used when a program requires code set-independent I/O for characters from multibyte code sets. For example, use the fgetwc or getwc subroutine to input a multibyte character. If the program uses the getc subroutine to input a multibyte character, the program must call the getc subroutine once for each byte in the multibyte character.

Wide character input subroutines read multibyte characters from a stream and convert them to wide characters. The conversion is done as if the subroutines call the mbtowc and mbstowcs subroutines.

Wide character output subroutines convert wide characters to multibyte characters and write the result to the stream. The conversion is done as if the subroutines call the wctomb and wcstombs subroutines.

The LC_CTYPE category of the current locale affects the behavior of wide character I/O subroutines.

Reading and Processing an Entire File

If a program must go through an entire file that must be handled in wide character code form, use one of the following methods:

In the case of multibyte characters, use either the read or fread subroutine to read a block of text data into a buffer. Convert one character at a time in this buffer using the mbtowc subroutine. Handle special cases of multibyte characters crossing block boundaries. For multibyte code sets, do not use the mbstowcs subroutine on this buffer. On an invalid or a partial multibyte character sequence, the mbstowcs subroutine returns -1 without indicating how far it successfully converted the data. You can use the mbstowcs subroutine with single-byte code sets because you will not run into a partial-byte sequence problem with single-byte code sets.
Use the fgetws subroutine to obtain a line from the file. If the returned wide character string contains a wide character <new-line>, then a complete line is obtained. If there is no <new-line> wide character, the line is longer than expected, and more calls to the fgetws subroutine are needed to obtain the complete line. If the program can efficiently process one line at a time, this approach is recommended.
If the fgets subroutine is used to read a multibyte file to obtain one line at a time, a split multibyte character may result. Handle this condition just as in the case of the read subroutine breaking up a multibyte character across successive reads. If you can guarantee that the input line length is not more than a set limit, a buffer of that size (plus 1 for null) can be used, thereby avoiding the possibility of a split multibyte character. If the program can efficiently process one line at a time, this approach may be used. Because of the possibility of split multibyte characters in the buffer, use the fgetws subroutine in preference to the fgets subroutine for multibyte characters.
Use the fgetwc subroutine on the file to read one wide character code at a time. If a file is large, the function call overhead becomes large and reduces the value of this method.
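The first method, converting a block one character at a time with the mbtowc subroutine, can be sketched as follows. The helper name convert_block and its parameters are hypothetical. The sketch assumes a stateless code set, and it treats a failed conversion within the last MB_CUR_MAX - 1 bytes of the block as a possibly split character rather than an error. A caller would move the unconsumed bytes to the front of the buffer before the next read.

```c
#include <stdlib.h>
#include <wchar.h>

/*
**  Hypothetical helper: converts the complete multibyte characters in
**  buf[0..len-1] to wide characters in out (room for outmax elements).
**  *consumed receives the number of bytes converted.  Bytes left over
**  at the end of the block may be the start of a split character.
**  Returns the number of wide characters stored, or (size_t)-1 if an
**  invalid sequence is found.  Assumes a stateless code set.
*/
size_t convert_block(const char *buf, size_t len,
                     wchar_t *out, size_t outmax, size_t *consumed)
{
    size_t pos = 0, count = 0;
    int    n;

    (void)mbtowc(NULL, NULL, 0);        /*  discard any leftover state  */
    while (pos < len && count < outmax) {
        n = mbtowc(&out[count], buf + pos, len - pos);
        if (n < 0) {
            (void)mbtowc(NULL, NULL, 0);
            if (len - pos < (size_t)MB_CUR_MAX)
                break;                  /*  possibly a split character  */
            *consumed = pos;
            return (size_t)-1;          /*  invalid sequence  */
        }
        if (n == 0)
            n = 1;                      /*  a null byte is one byte long  */
        pos += (size_t)n;
        count++;
    }
    *consumed = pos;
    return count;
}
```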

The decision of which of these methods to use should be made on a per-program basis. The fgetws subroutine option is recommended, as it is capable of optimum performance and the program does not have to handle the special cases.
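As a minimal sketch of the fgetws method, the following hypothetical helper, count_wide_lines, counts the lines in a stream and shows how a line longer than the buffer is recognized by the missing <new-line> wide character. The helper name and the deliberately small LINE_CHUNK buffer size are assumptions made for this illustration.

```c
#include <stdio.h>
#include <wchar.h>

#define LINE_CHUNK 8    /* deliberately small, to exercise long lines */

/*
**  Counts the lines in a stream opened for reading.  A line longer than
**  the buffer arrives in several pieces; only the piece that ends in a
**  <new-line> (or the last piece before end-of-file) completes the line.
*/
long count_wide_lines(FILE *fp)
{
    wchar_t buf[LINE_CHUNK];
    long    lines = 0;
    int     partial = 0;    /* nonzero while inside an unfinished line */
    size_t  len;

    while (fgetws(buf, LINE_CHUNK, fp) != (wchar_t *)NULL) {
        len = wcslen(buf);
        if (len > 0 && buf[len - 1] == L'\n') {
            lines++;        /*  complete line obtained  */
            partial = 0;
        } else {
            partial = 1;    /*  more calls are needed for this line  */
        }
    }
    if (partial)
        lines++;            /*  final line had no <new-line>  */
    return lines;
}
```

A real program would process each piece of a long line where this sketch only sets the partial flag.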

Input Subroutines

The wint_t data type is required to represent the wide character code value as well as the end-of-file (EOF) marker. For example, consider the case of the fgetwc subroutine, which returns a wide character code value:

wchar_t fgetwc();	If the wchar_t data type is defined as a char value, the y-umlaut symbol cannot be distinguished from the end-of-file (EOF) marker in the ISO8859-1 code set. The 0xFF code point is a valid character (y umlaut). Hence, the return value cannot be the wchar_t data type. A data type is needed that can hold both the EOF marker and all the code points in a code set.
int fgetwc();	On some machines, the int data type is defined to be 16 bits. When the wchar_t data type is larger than 16 bits, the int value cannot represent all the return values.

The wint_t data type is therefore needed to represent the fgetwc subroutine return value. The wint_t data type is defined in the wchar.h file.

The following subroutines are used for wide character input:

fgetwc	Gets next wide character from a stream.
fgetws	Gets a string of wide characters from a stream.
getwc	Gets next wide character from a stream.
getwchar	Gets next wide character from standard input.
getws	Gets a string of wide characters from standard input.
ungetwc	Pushes a wide character onto a stream.

Output Subroutines

The following subroutines are used for wide character output:

fputwc	Writes a wide character to an output stream.
fputws	Writes a wide character string to an output stream.
putwc	Writes a wide character to an output stream.
putwchar	Writes a wide character to standard output.
putws	Writes a wide character string to standard output.

Examples

The following example uses the fgetwc subroutine to read wide character codes from a file:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wint_t  retval;
    FILE    *fp;
    wchar_t *pwcs;
  
    (void)setlocale(LC_ALL, "");
  
    /*
    **  Open a stream.
    */
    fp = fopen("file", "r");
  
    /*
    **  Error Handling if fopen was not successful.
    */
    if(fp == NULL){
        /*  Error handler  */
    }else{
        /*
        **  pwcs points to a wide character buffer of BUFSIZ.
        */
        while((retval = fgetwc(fp)) != WEOF){
            *pwcs++ = (wchar_t)retval;
               /*  break when buffer is full  */
        }
    }
    /*  Process the wide characters in the buffer  */
}

The following example uses the getwchar subroutine to read wide characters from standard input:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wint_t  retval;
    wchar_t *pwcs;
 
    (void)setlocale(LC_ALL, "");
 
    while((retval = getwchar()) != WEOF){
        /*  pwcs points to a wide character buffer of BUFSIZ.  */
        *pwcs++ = (wchar_t)retval;
        /*  break on buffer full  */
    }
    /*  Process the wide characters in the buffer  */
}

The following example uses the ungetwc subroutine to push a wide character onto an input stream:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    wint_t  retval;
    FILE    *fp;
 
    (void)setlocale(LC_ALL, "");
    /*
    **  Open a stream.
    */
    fp = fopen("file", "r");
 
    /*
    **  Error Handling if fopen was not successful.
    */
    if(fp == NULL){
        /*  Error handler  */
 
    }else{
        retval = fgetwc(fp);
        if(retval != WEOF){
            /*
            **  Peek at the character and return it to the stream.
            */
            retval = ungetwc(retval, fp);
            if(retval == WEOF){
                /*  Error on ungetwc  */
            }
        }
    }
}

The following example uses the fgetws subroutine to read a file, one line at a time:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    FILE    *fp;
    wchar_t *pwcs;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Open a stream.
    */
    fp = fopen("file", "r");
 
    /*
    **  Error Handling if fopen was not successful.
    */
    if(fp == NULL){
        /*  Error handler  */
    }else{
        /*  pwcs points to wide character buffer of BUFSIZ.  */
        while(fgetws(pwcs, BUFSIZ, fp) != (wchar_t *)NULL){
            /*
            **  pwcs contains wide characters with null
            **  termination.
            */
        }
    }
}

The following example uses the fputwc subroutine to write wide characters to an output stream:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    int     index, len;
    wint_t  retval;
    FILE    *fp;
    wchar_t *pwcs;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Open a stream.
    */
    fp = fopen("file", "w");
 
    /*
    **  Error Handling if fopen was not successful.
    */
    if(fp == NULL){
        /*  Error handler  */
    }else{
        /*  Let len indicate number of wide chars to output.
        **  pwcs points to a wide character buffer of BUFSIZ.
        */
        for(index=0; index < len; index++){
            retval = fputwc(*pwcs++, fp);
            if(retval == WEOF)
                break;  /*  write error occurred  */
                        /*  errno is set to indicate the error. */
        }
    }
}

The following example uses the fputws subroutine to write a wide character string to a file:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
 
int main(void)
{
    int     retval;
    FILE    *fp;
    wchar_t *pwcs;
 
    (void)setlocale(LC_ALL, "");
 
    /*
    **  Open a stream.
    */
    fp = fopen("file", "w");
 
    /*
    **  Error Handling if fopen was not successful.
    */
    if(fp == NULL){
        /*  Error handler  */
 
    }else{
        /* 
        **  pwcs points to a wide character string
        **  to output to fp.
        */
        retval = fputws(pwcs, fp);
        if(retval == -1){
            /*  Write error occurred                */
            /*  errno is set to indicate the error  */
        }
    }
}

Working with the Wide Character Constant

Use the L constant for ASCII characters only. For ASCII characters, the L constant value is numerically the same as the code point value of the character. For example, L'a' is the same as 'a'. The L constant obtains the wchar_t value of an ASCII character for assignment purposes. A wide character constant is introduced by the L specifier. For example:

wchar_t wc = L'x' ;

A wide character code corresponding to the character x is stored in wc. The C compiler converts the character x using the mbtowc or mbstowcs subroutine as appropriate. This conversion to wide characters is based on the current locale setting at compile time. Because ASCII characters are part of all supported code sets and the wide character representation of all ASCII characters is the same in all locales, L'x' results in the same value across all code sets. However, if the character x is non-ASCII, the program may not work when it is run on a code set different from the one used at compile time. This limitation impacts some programs that use switch statements with the wide character constant representation.
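The following sketch shows a switch statement that stays within the safe case: every case label is an ASCII wide character constant, so its value does not depend on the locale used at compile time. The command codes and the parse_command name are hypothetical and exist only for this illustration.

```c
#include <wchar.h>

/*  Hypothetical command codes used only for this illustration.  */
enum command { CMD_UNKNOWN, CMD_QUIT, CMD_HELP };

/*
**  Maps a wide character to a command.  Only ASCII constants appear
**  as case labels, so the values are the same in every supported locale.
*/
enum command parse_command(wchar_t wc)
{
    switch (wc) {
    case L'q':
        return CMD_QUIT;
    case L'h':
    case L'?':
        return CMD_HELP;
    default:
        return CMD_UNKNOWN;
    }
}
```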

wchar.h Header File

The wchar.h header file declares information that is necessary for programming with multibyte and wide character subroutines. The wchar.h header file declares the wchar_t, wctype_t, and wint_t data types, as well as several functions for testing wide characters. Because the number of characters implemented as wide characters exceeds that of basic characters, it is not possible to classify all wide characters into the existing classes used for basic characters. Therefore, it is necessary to provide a way of defining additional classes specific to some locale. The action of these subroutines is affected by the current locale.
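As a sketch of locale-defined classification, the following hypothetical helper, count_in_class, uses the wctype and iswctype subroutines to test wide characters against a class selected by name at run time. The helper name is an assumption made for this example. The wctype.h header file is included as well, because ISO C declares these subroutines there.

```c
#include <wchar.h>
#include <wctype.h>

/*
**  Counts the wide characters in ws that belong to the named character
**  class (for example "alpha" or "digit").  Returns -1 if the class name
**  is not defined in the current locale.
*/
long count_in_class(const wchar_t *ws, const char *class_name)
{
    wctype_t cls = wctype(class_name);
    long     count = 0;

    if (cls == (wctype_t)0)
        return -1;          /*  class not known in this locale  */
    for (; *ws != L'\0'; ws++) {
        if (iswctype((wint_t)*ws, cls))
            count++;
    }
    return count;
}
```

Because the class is looked up by name, the same code can use a class that only some locales define.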

The wchar.h header file also declares subroutines for manipulating wide character strings (that is, wchar_t data type arrays). Array length is always determined in terms of the number of wchar_t elements in an array. A null wide character code ends an array. A pointer to a wchar_t data type array or void array always points to the initial element of the array.

Note: If the number of wchar_t elements in an array exceeds the defined array length, unpredictable results can occur.
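One way to stay within the array length is to pass the length in wchar_t elements and truncate when necessary. The following is a minimal sketch of that approach. The helper name copy_bounded is hypothetical.

```c
#include <wchar.h>

/*
**  Copies src into dest, which has room for destlen wchar_t elements
**  (elements, not bytes).  Truncates if necessary and always ends the
**  array with a null wide character code.  Returns the number of wide
**  characters copied, not counting the null.
*/
size_t copy_bounded(wchar_t *dest, size_t destlen, const wchar_t *src)
{
    size_t n;

    if (destlen == 0)
        return 0;
    n = wcslen(src);
    if (n > destlen - 1)
        n = destlen - 1;    /*  leave room for the null  */
    wmemcpy(dest, src, n);
    dest[n] = L'\0';
    return n;
}
```

For an array buf declared in the current scope, the element count can be computed as sizeof(buf) / sizeof(buf[0]).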

Internationalized Regular Expression Subroutines

Programs that contain internationalized regular expressions can use the regcomp, regexec, regerror, regfree, and fnmatch subroutines.

The following subroutines are available for use with internationalized regular expressions.

regcomp: Compiles a specified basic or extended regular expression into an executable string.
regexec: Compares a null-terminated string with a compiled basic or extended regular expression that must have been previously compiled by a call to the regcomp subroutine.
regerror: Provides a mapping from error codes returned by the regcomp and regexec subroutines to printable strings.
regfree: Frees any memory allocated by the regcomp subroutine associated with the compiled basic or extended regular expression. The expression is no longer treated as a compiled basic or extended regular expression after it is given to the regfree subroutine.
fnmatch: Checks a specified string to see if it matches a specified pattern. You can use the fnmatch subroutine in an application that reads a dictionary to find which entries match a given pattern. You also can use the fnmatch subroutine to match path names to patterns.
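The subroutines above can be combined as in the following sketch. The helper names regex_matches and glob_matches are hypothetical, and the choice of the REG_EXTENDED, REG_NOSUB, and FNM_PATHNAME flags is an assumption made for this illustration, not a requirement.

```c
#include <sys/types.h>
#include <regex.h>
#include <fnmatch.h>
#include <stdio.h>

/*
**  Returns 1 if text matches the extended regular expression pattern,
**  0 if it does not (or if regexec fails), and -1 if the pattern cannot
**  be compiled.  On a compile error, the message from regerror is
**  written to standard error.
*/
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    char    msg[128];
    int     rc;

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        (void)regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);           /*  re must not be used after this call  */
    return rc == 0 ? 1 : 0;
}

/*
**  Returns 1 if name matches the shell-style pattern, 0 otherwise.
**  With FNM_PATHNAME, a slash in name is matched only by a slash
**  in pattern.
*/
int glob_matches(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, FNM_PATHNAME) == 0;
}
```

Bracket expressions such as [[:alpha:]] are interpreted according to the current locale, so a caller would normally call setlocale(LC_ALL, "") first.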