Technical Reference: Base Operating System and Extensions, Volume 2
regcmp or regex Subroutine
Purpose
Compiles and matches regular-expression patterns.
Libraries
Standard C Library (libc.a)
Programmers Workbench Library (libPW.a)
Syntax
#include <libgen.h>
char *regcmp ( String [, String, . . . ], (char *) 0)
const char *String, . . . ;
const char *regex ( Pattern, Subject [, ret, . . . ])
char *Pattern, *Subject, *ret, . . . ;
extern char *__loc1;
Description
The regcmp subroutine compiles
a regular expression (or Pattern) and returns a pointer
to the compiled form. The regcmp subroutine allows multiple String parameters. If more than one String parameter is given, then the regcmp subroutine
treats them as if they were concatenated together. It returns a null pointer
if it encounters an incorrect parameter.
You can use the regcmp command
to compile regular expressions into your C program, frequently eliminating
the need to call the regcmp subroutine at run time.
The regex subroutine compares
a compiled Pattern to the Subject string. Additional parameters are used to receive values. Upon successful
completion, the regex subroutine returns a pointer to
the next unmatched character. If the regex subroutine
fails, a null pointer is returned. A global character pointer, __loc1, points to where the match began.
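The compile-then-match flow described above can be sketched portably with the POSIX regcomp and regexec subroutines listed under Related Information. This is an analog under that assumption, not the regcmp or regex interface itself. The helper name match_span is hypothetical; it reports the offset where the match began (the role played by __loc1) and the offset of the next unmatched character (the role played by the regex return value).

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of libgen: compiles pattern with POSIX
 * regcomp, matches it against subject with regexec, and reports
 *   *start = offset where the match began (analogous to __loc1)
 *   *next  = offset of the next unmatched character
 *            (analogous to the pointer that regex returns)
 * Returns 0 on a match, -1 on a compile error or when nothing matches.
 */
int match_span(const char *pattern, const char *subject,
               size_t *start, size_t *next)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                      /* pattern did not compile */

    rc = regexec(&re, subject, 1, m, 0);
    regfree(&re);                       /* release the compiled form */
    if (rc != 0)
        return -1;                      /* no match */

    *start = (size_t)m[0].rm_so;
    *next  = (size_t)m[0].rm_eo;
    return 0;
}
```

For the pattern [0-9]+ and the subject abc123def, the helper reports a match that starts at offset 3 and leaves offset 6 as the next unmatched character.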
The regcmp and regex subroutines are borrowed from the ed command;
however, the syntax and semantics have been changed slightly. You can use
the following symbols with the regcmp and regex subroutines:
[ ] * . ^ |
These symbols have the same meaning as they do in the ed command. |
- |
The minus sign (or hyphen) within brackets used with the regex subroutine means "through," according to the current collating
sequence. For example, [a-z] can be equivalent to [abcd . . . xyz]
or [aBbCc . . . xYyZz]. You can use the - by itself if the - is
the last or first character. For example, the character class expression
[]-] matches the ] (right bracket) and - (minus) characters.
The regcmp subroutine does not use the current collating sequence, and the
minus sign in brackets controls only a direct ASCII sequence. For example,
[a-z] always means [abc . . . xyz] and [A-Z] always means
[ABC . . . XYZ]. If you need to control the specific characters
in a range using the regcmp subroutine, you must list
them explicitly rather than using the minus sign in the character class expression. |
$ |
Matches the end of the string. Use the \n character to match a new-line
character. |
+ |
A regular expression followed by + (plus sign) means one or more
times. For example, [0-9]+ is equivalent to [0-9][0-9]*. |
{m} {m,} {m,u} |
Integer values enclosed in { } (braces) indicate the number of times
to apply the preceding regular expression. The m value
is the minimum number and the u value is the maximum
number. The u value must be less than 256. If
you specify only m, it indicates the exact number
of times to apply the regular expression. {m,} has no
upper limit and matches m or more occurrences of the expression. The + (plus sign)
and * (asterisk) operations are equivalent to {1,} and {0,}, respectively. |
( . . . )$n |
This stores the value matched by the enclosed regular expression
in the (n+1)th ret parameter.
Ten enclosed regular expressions are allowed. The regex
subroutine makes the assignments unconditionally. |
( . . . ) |
Parentheses group subexpressions. An operator, such as *, +, or { },
works on a single character or on a regular expression enclosed in parentheses.
For example, (a*(cb+)*)$0. |
All of the preceding defined symbols are special.
You must precede them with a \ (backslash) if you want to match the special
symbol itself. For example, \$ matches a dollar sign.
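The repetition and escape rules above can be checked with a short sketch. It assumes POSIX extended regular expressions (regcomp with REG_EXTENDED) as a stand-in, because they share the +, *, brace interval, and backslash-escape behavior described here; it does not exercise regcmp itself, and the $n storage syntax has no counterpart in it. The helper name ere_full_match is hypothetical.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 when the whole subject matches the POSIX
 * extended regular expression in pattern, 0 when it does not, and -1 when
 * the pattern is too long or does not compile.
 */
int ere_full_match(const char *pattern, const char *subject)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Anchor at both ends so that partial matches do not count. */
    n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof(anchored))
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, [0-9]+ and [0-9][0-9]* accept the same strings, [0-9]{1,} behaves like [0-9]+, a{2,3} accepts aa but rejects aaaa, and \$ (written "\\$" in a C string literal) matches a literal dollar sign.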
Note
The regcmp subroutine uses the malloc subroutine to make the space for the vector.
Always free the vectors that are not required. If you do not free the unneeded
vectors, you can run out of memory if the regcmp subroutine is called repeatedly.
Use the following as a replacement for the malloc subroutine
to reuse the same vector, thus saving time and space:
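/* . . . Your Program . . . */
#include <stddef.h>                  /* size_t, NULL */

/* Replaces the library malloc: every request is served from one buffer. */
void *malloc(size_t n)
{
    static int rebuf[256];           /* reused for each compiled expression */

    return (n <= sizeof(rebuf)) ? (void *)rebuf : NULL;
}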
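The effect of that replacement can be observed without overriding the real malloc. The following is a minimal sketch under that assumption; the name rebuf_alloc is hypothetical and stands in for the replacement allocator.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the replacement allocator described above.
 * Every request that fits is served from the same static buffer, so each
 * new result overwrites the previous one; larger requests fail.
 */
void *rebuf_alloc(size_t n)
{
    static int rebuf[256];

    return (n <= sizeof(rebuf)) ? (void *)rebuf : NULL;
}
```

Two successive calls return the same address, so a second compiled expression replaces the first. The approach therefore suits programs that need only one compiled expression at a time.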
The regcmp subroutine produces
code values that the regex subroutine can interpret
as the regular expression. For instance, [a-z] indicates a range expression
which the regcmp subroutine compiles into a string containing
the two end points (a and z).
The regex subroutine interprets
the range statement according to the current collating sequence. The expression
[a-z] can be equivalent either to [abcd . . . xyz], or to
[aBbCcDd . . . xXyYzZ], as long as the character preceding the minus sign has a lower collating value than the character following the minus sign.
The behavior of a range expression is dependent on
the collation sequence. If you want to match a specific set of characters, you should list each one. For example, to select
letters a, b, or c, use [abc] rather than [a-c].
Notes:
- No assumptions are made at compile time about the actual characters contained
in the range.
- Do not use multibyte characters.
- You can use the ] (right bracket) itself within a pair of brackets if
it immediately follows the leading [ (left bracket) or [^ (a left bracket
followed immediately by a circumflex).
- You can also use the minus sign (or hyphen) if it is the first or last
character in the expression. For example, the expression []-0] matches either
the right bracket (]), or the characters - through 0.
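The difference between a range read by character code and a range read by collating order can be sketched with strcoll. Both function names below are hypothetical, and the sketch only illustrates the idea for single-byte characters; this document does not show how the regex subroutine compares collating values internally.

```c
#include <string.h>

/* Range test by character code: the "direct ASCII sequence" reading. */
int in_range_ascii(char lo, char c, char hi)
{
    return (unsigned char)lo <= (unsigned char)c &&
           (unsigned char)c <= (unsigned char)hi;
}

/*
 * Range test by collating order: c is in [lo-hi] when it collates at or
 * after lo and at or before hi in the current LC_COLLATE locale.
 * Single-byte characters only.
 */
int in_range_collate(char lo, char c, char hi)
{
    char l[2] = { lo, '\0' };
    char m[2] = { c,  '\0' };
    char h[2] = { hi, '\0' };

    return strcoll(l, m) <= 0 && strcoll(m, h) <= 0;
}
```

In the default "C" locale the two functions agree, so B is outside a-z for both. After a call such as setlocale(LC_COLLATE, "") in a locale that interleaves uppercase and lowercase letters, in_range_collate may report that B lies inside a-z, which is why listing the characters explicitly is the safer choice.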
Matching a Character Class in National Language Support
A common use of the range expression is matching a
character class. For example, [0-9] represents all digits, and [a-zA-Z]
represents all letters. This form may produce unexpected results when ranges
are interpreted according to the current collating sequence.
Instead of the range expression shown above, use a
character class expression within brackets to match characters. The system
interprets this type of expression according to the current character class
definition. However, you cannot use character class expressions in range expressions.
The following exemplifies the syntax of a character
class expression:
[:charclass:]
that is, a left bracket followed by a colon, followed
by the name of the character class, followed by another colon and a right
bracket.
National Language Support supports the following character
classes:
[:upper:] |
ASCII uppercase letters. |
[:lower:] |
ASCII lowercase letters. |
[:alpha:] |
ASCII uppercase and lowercase letters. |
[:digit:] |
ASCII digits. |
[:alnum:] |
ASCII uppercase and lowercase letters, and digits. |
[:xdigit:] |
ASCII hexadecimal digits. |
[:punct:] |
ASCII punctuation character (neither a control character nor an alphanumeric
character). |
[:space:] |
ASCII space, tab, carriage return, new-line, vertical tab, or form
feed character. |
[:print:] |
ASCII printing characters. |
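The classes listed above correspond by name to the ctype subroutines mentioned under Related Information. The following sketch assumes that correspondence and uses a hypothetical helper, in_charclass, to test one character against a class name. It illustrates the class definitions only and is not how the regex subroutine evaluates them.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 when c belongs to the named character
 * class, 0 when it does not, and -1 for a name that is not in the list.
 * name is given without the surrounding "[:" and ":]".
 */
int in_charclass(const char *name, int c)
{
    unsigned char uc = (unsigned char)c;   /* ctype needs this range */

    if (strcmp(name, "upper") == 0)  return isupper(uc)  != 0;
    if (strcmp(name, "lower") == 0)  return islower(uc)  != 0;
    if (strcmp(name, "alpha") == 0)  return isalpha(uc)  != 0;
    if (strcmp(name, "digit") == 0)  return isdigit(uc)  != 0;
    if (strcmp(name, "alnum") == 0)  return isalnum(uc)  != 0;
    if (strcmp(name, "xdigit") == 0) return isxdigit(uc) != 0;
    if (strcmp(name, "punct") == 0)  return ispunct(uc)  != 0;
    if (strcmp(name, "space") == 0)  return isspace(uc)  != 0;
    if (strcmp(name, "print") == 0)  return isprint(uc)  != 0;
    return -1;
}
```

For example, in_charclass("xdigit", 'f') returns 1, in_charclass("alnum", '_') returns 0, and a name outside the list above returns -1.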
Parameters
Subject |
Specifies a comparison string. |
String |
Specifies the Pattern to be compiled. |
Pattern |
Specifies the expression to be compared. |
ret |
Points to an address at which to store comparison data. The regex subroutine allows multiple ret parameters. |
Related Information
The ctype subroutine, compile, step, or advance subroutine, malloc, free, realloc, calloc, mallopt, mallinfo, or alloca
subroutine, regcomp subroutine, regexec subroutine.
The ed command, regcmp command.
Subroutines Overview in AIX 5L Version 5.2 General Programming Concepts: Writing and Debugging Programs.