
Technical Reference: Base Operating System and Extensions, Volume 2


regcmp or regex Subroutine

Purpose

Compiles and matches regular-expression patterns.

Libraries

Standard C Library (libc.a)

Programmers Workbench Library (libPW.a)

Syntax

#include <libgen.h>


char *regcmp ( String [, String, . . . ], (char *) 0)
const char *String, . . . ;


const char *regex ( Pattern, Subject [, ret, . . . ])
char *Pattern, *Subject, *ret, . . . ;
extern char *__loc1;

Description

The regcmp subroutine compiles a regular expression (or Pattern) and returns a pointer to the compiled form. The regcmp subroutine allows multiple String parameters. If more than one String parameter is given, then the regcmp subroutine treats them as if they were concatenated together. It returns a null pointer if it encounters an incorrect parameter.
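The argument convention described above (any number of strings, closed by a (char *) 0 sentinel, treated as one concatenated pattern) can be illustrated with a small portable sketch. The join_strings function below is a hypothetical helper written for this illustration; it is not part of libc.a or libPW.a and does not compile a pattern. It only shows how a sentinel-terminated string list collapses into a single string.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a library routine): joins a list of strings
   terminated by (char *) 0 into one malloc'd string. The caller frees it. */
char *join_strings(const char *first, ...)
{
    va_list ap;
    size_t len = 0;
    const char *s;
    char *out;

    /* First pass: add up the lengths of all pieces. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        len += strlen(s);
    va_end(ap);

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    out[0] = '\0';

    /* Second pass: append each piece in order. */
    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, const char *))
        strcat(out, s);
    va_end(ap);

    return out;
}
```

Calling join_strings("[A-Za-z]", "[A-Za-z0-9]*", (char *) 0) yields the single string [A-Za-z][A-Za-z0-9]*. The explicit (char *) 0 cast matters in a variable argument list, because a bare 0 is passed as an int and may not have the size of a pointer.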

You can use the regcmp command to compile regular expressions into your C program, frequently eliminating the need to call the regcmp subroutine at run time.

The regex subroutine compares a compiled Pattern to the Subject string. Additional parameters are used to receive values. Upon successful completion, the regex subroutine returns a pointer to the next unmatched character. If the regex subroutine fails, a null pointer is returned. A global character pointer, __loc1, points to where the match began.
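Because __loc1 marks where the match began and the return value marks the first unmatched character, the matched text is the span between those two pointers. The sketch below shows one way to copy such a span into a buffer. The copy_span function is a hypothetical helper, not a library routine, and it takes the two pointers as ordinary parameters so that it does not depend on libgen.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library routine): copies the characters in
   [start, end) into buf, truncating if needed, and null-terminates buf.
   Returns the number of characters copied, not counting the terminator. */
size_t copy_span(const char *start, const char *end, char *buf, size_t size)
{
    size_t n;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    if (start == NULL || end == NULL || end < start)
        return 0;

    n = (size_t)(end - start);
    if (n > size - 1)
        n = size - 1;          /* leave room for the terminator */
    memcpy(buf, start, n);
    buf[n] = '\0';
    return n;
}
```

After a successful regex call, a program would pass __loc1 as start and the returned pointer as end. With the subject 012Testing345 and a match that starts at offset 3 and ends before offset 10, the buffer receives Testing.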

The regcmp and regex subroutines are borrowed from the ed command; however, the syntax and semantics have been changed slightly. You can use the following symbols with the regcmp and regex subroutines:

[ ] * . ^ These symbols have the same meaning as they do in the ed command.
- The minus sign (or hyphen) within brackets used with the regex subroutine means "through," according to the current collating sequence. For example, [a-z] can be equivalent to [abcd . . . xyz] or [aBbCc . . . xYyZz]. You can use the - by itself if the - is the last or first character. For example, the character class expression []-] matches the ] (right bracket) and - (minus) characters.

The regcmp subroutine does not use the current collating sequence, and the minus sign in brackets controls only a direct ASCII sequence. For example, [a-z] always means [abc . . . xyz] and [A-Z] always means [ABC . . . XYZ]. If you need to control the specific characters in a range using the regcmp subroutine, you must list them explicitly rather than using the minus sign in the character class expression.

$ Matches the end of the string. Use the \n character to match a new-line character.
+ A regular expression followed by + (plus sign) means one or more times. For example, [0-9]+ is equivalent to [0-9][0-9]*.
{m} {m,} {m,u} Integer values enclosed in { } (braces) indicate the number of times to apply the preceding regular expression. The m value is the minimum number and the u value is the maximum number. The u value must be less than 256. If you specify only m, it indicates the exact number of times to apply the regular expression. {m,} has no upper limit and matches m or more occurrences of the expression. The + (plus sign) and * (asterisk) operations are equivalent to {1,} and {0,}, respectively.
( . . . )$n This stores the value matched by the enclosed regular expression in the (n+1)th ret parameter. Ten enclosed regular expressions are allowed. The regex subroutine makes the assignments unconditionally.
( . . . ) Parentheses group subexpressions. An operator, such as *, +, or { }, works on a single character or on a regular expression enclosed in parentheses. For example, (a*(cb+)*)$0.

All of the preceding defined symbols are special. You must precede them with a \ (backslash) if you want to match the special symbol itself. For example, \$ matches a dollar sign.
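When a pattern is built from literal text, each special symbol in that text needs a backslash in front of it. The sketch below shows one way to do this. The escape_pattern function is a hypothetical helper, not a library routine. Its list of specials follows the symbols described above, and it also escapes the backslash itself, which is an assumption made for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a library routine): returns a malloc'd copy of
   literal with a backslash in front of every special symbol. The set of
   specials mirrors the symbol list in this section; including the
   backslash itself is an assumption. The caller frees the result. */
char *escape_pattern(const char *literal)
{
    static const char specials[] = "[]*.^-$+{}()\\";
    char *out = malloc(2 * strlen(literal) + 1);  /* worst case: all escaped */
    char *w = out;

    if (out == NULL)
        return NULL;
    for (; *literal != '\0'; literal++) {
        if (strchr(specials, *literal) != NULL)
            *w++ = '\\';
        *w++ = *literal;
    }
    *w = '\0';
    return out;
}
```

For example, the literal text $5.00 becomes \$5\.00, which could then be passed to the regcmp subroutine as a String parameter.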

Note: The regcmp subroutine uses the malloc subroutine to make the space for the vector. Always free the vectors that are not required. If you do not free the unneeded vectors, you can run out of memory if the regcmp subroutine is called repeatedly. Use the following as a replacement for the malloc subroutine to reuse the same vector, thus saving time and space:

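/*  . . . Your Program . . .  */
#include <stddef.h>   /* size_t, NULL */

/* Replaces the library malloc: every request is served from one static buffer. */
void *malloc(size_t n)
{
   static int rebuf[256];

   return (n <= sizeof(rebuf)) ? (void *)rebuf : NULL;
}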

The regcmp subroutine produces code values that the regex subroutine can interpret as the regular expression. For instance, [a-z] indicates a range expression which the regcmp subroutine compiles into a string containing the two end points (a and z).

The regex subroutine interprets the range statement according to the current collating sequence. The expression [a-z] can be equivalent either to [abcd . . . xyz], or to [aBbCcDd . . . xXyYzZ], as long as the character preceding the minus sign has a lower collating value than the character following the minus sign.

The behavior of a range expression is dependent on the collation sequence. If you want to match a specific set of characters, you should list each one. For example, to select letters a, b, or c, use [abc] rather than [a-c].
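The effect of the collating sequence on a range can be explored with the standard strcoll subroutine. The sketch below is only an illustration of collation-based range membership, not a description of how the regex subroutine is implemented. The in_collated_range function is a hypothetical helper.

```c
#include <string.h>

/* Hypothetical helper (not a library routine): returns nonzero if c falls
   between lo and hi, inclusive, in the current LC_COLLATE order. Each
   character is compared as a one-character string with strcoll. */
int in_collated_range(char lo, char c, char hi)
{
    char l[2] = { lo, '\0' };
    char m[2] = { c,  '\0' };
    char h[2] = { hi, '\0' };

    return strcoll(l, m) <= 0 && strcoll(m, h) <= 0;
}
```

In the default C locale, where collation follows character codes, in_collated_range('a', 'B', 'z') is false. In a locale whose collating sequence interleaves uppercase and lowercase letters, the same call may be true, which is why listing the characters explicitly is the safer choice.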

Notes:
  1. No assumptions are made at compile time about the actual characters contained in the range.
  2. Do not use multibyte characters.
  3. You can use the ] (right bracket) itself within a pair of brackets if it immediately follows the leading [ (left bracket) or [^ (a left bracket followed immediately by a circumflex).
  4. You can also use the minus sign (or hyphen) if it is the first or last character in the expression. For example, the expression []-0] matches either the right bracket ( ] ), or the characters - through 0.

Matching a Character Class in National Language Support

A common use of the range expression is matching a character class. For example, [0-9] represents all digits, and [a-zA-Z] represents all letters. This form may produce unexpected results when ranges are interpreted according to the current collating sequence.

Instead of the range expression shown above, use a character class expression within brackets to match characters. The system interprets this type of expression according to the current character class definition. However, you cannot use character class expressions in range expressions.

The following exemplifies the syntax of a character class expression:

[:charclass:] 

that is, a left bracket followed by a colon, followed by the name of the character class, followed by another colon and a right bracket.

National Language Support supports the following character classes:

[:upper:] ASCII uppercase letters.
[:lower:] ASCII lowercase letters.
[:alpha:] ASCII uppercase and lowercase letters.
[:digit:] ASCII digits.
[:alnum:] ASCII uppercase and lowercase letters, and digits.
[:xdigit:] ASCII hexadecimal digits.
[:punct:] ASCII punctuation character (neither a control character nor an alphanumeric character).
[:space:] ASCII space, tab, carriage return, new-line, vertical tab, or form feed character.
[:print:] ASCII printing characters.
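The class names above line up with the classification routines declared in ctype.h (see the ctype subroutine in Related Information). The sketch below pairs each name with its routine so the classes can be tested directly. The in_class function and its table are hypothetical, written for this illustration only, and the results follow the current locale's character class definition.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table pairing each class name with its ctype routine. */
static const struct {
    const char *name;
    int (*test)(int);
} class_table[] = {
    { "upper",  isupper  },
    { "lower",  islower  },
    { "alpha",  isalpha  },
    { "digit",  isdigit  },
    { "alnum",  isalnum  },
    { "xdigit", isxdigit },
    { "punct",  ispunct  },
    { "space",  isspace  },
    { "print",  isprint  }
};

/* Hypothetical helper (not a library routine): returns 1 if c belongs to
   the named class, 0 if it does not, and -1 if the name is unknown. */
int in_class(const char *name, int c)
{
    size_t i;

    for (i = 0; i < sizeof class_table / sizeof class_table[0]; i++) {
        if (strcmp(class_table[i].name, name) == 0)
            return class_table[i].test((unsigned char)c) != 0;
    }
    return -1;
}
```

For example, in_class("digit", '7') returns 1, in_class("upper", 'a') returns 0, and an unrecognized name returns -1.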

Parameters


Subject Specifies a comparison string.
String Specifies the Pattern to be compiled.
Pattern Specifies the expression to be compared.
ret Points to an address at which to store comparison data. The regex subroutine allows multiple ret parameters.

Implementation Specifics

These subroutines are part of Base Operating System (BOS) Runtime.

Related Information

The ctype subroutine, compile, step, or advance subroutine, malloc, free, realloc, calloc, mallopt, mallinfo, or alloca subroutine, regcomp subroutine, regexec subroutine.

The ed command, regcmp command.

Subroutines Overview in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.

