[ Bottom of Page | Previous Page | Next Page | Contents | Index | Library Home | Legal | Search ]

General Programming Concepts:
Writing and Debugging Programs

Using the lex Program with the yacc Program

When used alone, the lex program generator makes a lexical analyzer that recognizes simple, one-word input or receives statistical input. You can also use the lex program with a parser generator, such as the yacc command. The yacc command generates a program, called a parser, that analyzes the construction of more than one-word input. This parser program operates well with the lexical analyzers that the lex command generates. The parsers recognize many types of grammar with no regard to context. These parsers need a preprocessor to recognize input tokens such as the preprocessor that the lex command produces.

The lex program recognizes only extended regular expressions and formats them into character packages called tokens, as specified by the input file. When using the lex program to make a lexical analyzer for a parser, the lexical analyzer (created from the lex command) partitions the input stream. The parser (from the yacc command) assigns structure to the resulting pieces. You can also use other programs along with the programs generated by either the lex or yacc commands.

A token is the smallest independent unit of meaning as defined by either the parser or the lexical analyzer. A token can contain data, a language keyword, an identifier, or other parts of a language syntax.

The yacc program looks for a lexical analyzer subroutine named yylex, which is generated by the lex command. Normally, the default main program in the lex library calls the yylex subroutine. However, if the yacc command is loaded and its main program is used, yacc calls the yylex subroutine. In this case, each lex rule should end with:


where the appropriate token value is returned.

The yacc command assigns an integer value to each token defined in the yacc grammar file through a #define preprocessor statement. The lexical analyzer must have access to these macros to return the tokens to the parser. Use the yacc -d option to create a y.tab.h file, and include the y.tab.h file in the lex specification file by adding the following lines to the definition section of the lex specification file:

#include "y.tab.h"

Alternately, you can include the lex.yy.c file in the yacc output file by adding the following line after the second %% (percent sign, percent sign) delimiter in the yacc grammar file:

#include "lex.yy.c"

The yacc library should be loaded before the lex library to get a main program that invokes the yacc parser. You can generate lex and yacc programs in either order.

Creating a Parser with the yacc Program

The yacc program creates parsers that define and enforce structure for character input to a computer program. To use this program, you must supply the following inputs:

Grammar file A source file that contains the specifications for the language to recognize. This file also contains the main, yyerror, and yylex subroutines. You must supply these subroutines.
main A C language subroutine that, as a minimum, contains a call to the yyparse subroutine generated by the yacc program. A limited form of this subroutine is available in the yacc library.
yyerror A C language subroutine to handle errors that can occur during parser operation. A limited form of this subroutine is available in the yacc library.
yylex A C language subroutine to perform lexical analysis on the input stream and pass tokens to the parser. You can generate this lexical analyzer subroutine using the lex command.

When the yacc command gets a specification, it generates a file of C language functions called y.tab.c. When compiled using the cc command, these functions form the yyparse subroutine and return an integer. When called, the yyparse subroutine calls the yylex subroutine to get input tokens. The yylex subroutine continues providing input until either the parser detects an error or the yylex subroutine returns an end-marker token to indicate the end of operation. If an error occurs and the yyparse subroutine cannot recover, it returns a value of 1 to main. If it finds the end-marker token, the yyparse subroutine returns a value of 0 to main.

yacc Grammar File

To use the yacc command to generate a parser, give it a grammar file that describes the input data stream and what the parser is to do with the data. The grammar file includes rules describing the input structure, code to be invoked when these rules are recognized, and a subroutine to do the basic input.

The yacc command uses the information in the grammar file to generate a parser that controls the input process. This parser calls an input subroutine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. The parser organizes these tokens according to the structure rules in the grammar file. The structure rules are called grammar rules. When the parser recognizes one of these rules, it executes the user code supplied for that rule. The user code is called an action. Actions return values and use the values returned by other actions.

Use the C programming language to write the action code and other subroutines. The yacc command uses many of the C language syntax conventions for the grammar file.

main and yyerror Subroutines

You must provide the main and yyerror subroutines for the parser. To ease the initial effort of using the yacc command, the yacc library contains simple versions of the main and yyerror subroutines. Include these subroutines by using the -ly argument to the ld command (or to the cc command). The source code for the main library program is:

#include <locale.h>
     setlocale(LC_ALL, "");

The source code for the yyerror library program is:

#include <stdio.h>
        char *s;
        fprintf( stderr, "%s\n" ,s);

The argument to the yyerror subroutine is a string containing an error message, usually the string syntax error.

These are very limited programs. You should provide more function in these subroutines. For example, keep track of the input line number and print it along with the message when a syntax error is detected. You may also want to use the value in the external integer variable yychar. This variable contains the look-ahead token number at the time the error was detected.

yylex Subroutine

The input subroutine that you supply must be able to:

A token is a symbol or name that tells the parser which pattern is being sent to it by the input subroutine. A nonterminal symbol is the structure that the parser recognizes.

For example, if the input subroutine separates an input stream into the tokens of WORD, NUMBER, and PUNCTUATION, and it receives the following input:

I have 9 turkeys.

the program could choose to pass the following strings and tokens to the parser:

String Token
have WORD
turkeys WORD

The parser must contain definitions for the tokens passed to it by the input subroutine. Using the -d option for the yacc command, it generates a list of tokens in a file called y.tab.h. This list is a set of #define statements that allow the lexical analyzer (yylex) to use the same tokens as the parser.

To avoid conflict with the parser, do not use names that begin with the letters yy.

You can use the lex command to generate the input subroutine or you can write the routine in the C language.

Using the yacc Grammar File

A yacc grammar file consists of three sections:

Two adjacent %% (double percent signs) separate each section of the grammar file. To make the file easier to read, put the %% on a line by themselves. A complete grammar file looks like:


The declarations section may be empty. If you omit the programs section, omit the second set of %%. Therefore, the smallest yacc grammar file is:


The yacc command ignores blanks, tabs and new-line characters in the grammar file. Therefore, use these characters to make the grammar file easier to read. Do not, however, use blanks, tabs or new-lines in names or reserved symbols.

Using Comments

Put comments in the grammar file to explain what the program is doing. You can put comments anywhere in the grammar file you can put a name. However, to make the file easier to read, put the comments on lines by themselves at the beginning of functional blocks of rules. A comment in a yacc grammar file looks the same as a comment in a C language program. The comment is enclosed between /* (backslash, asterisk) and */ (asterisk, backslash). For example:

/* This is a comment on a line by itself. */

Using Literal Strings

A literal string is one or more characters enclosed in '' (single quotes). As in the C language, the \ (backslash) is an escape character within literals, and all the C language escape codes are recognized. Thus, the yacc command accepts the symbols in the following table:

Symbol Definition
'\a' Alert
'\n' New-line
'\r' Return
'\'' Single quote (')
'\"' Double quote (")
'\?' Question mark (?)
'\\' Backslash (\)
'\t' Tab
'\v' Vertical tab
'\b' Backspace
'\f' Form-feed
'\Digits' The character whose encoding is represented by the one-, two-, or three-digit octal integer specified by the Digits string.
'\xDigits' The character whose encoding is represented by the sequence of hexadecimal characters specified by the Digits string.

Never use \0 or 0 (the null character) in grammar rules.

Formatting the Grammar File

Use the following guidelines to help make the yacc grammar file more readable:

Errors in the Grammar File

The yacc command cannot produce a parser for all sets of grammar specifications. If the grammar rules contradict themselves or require matching techniques that are different from what the yacc command provides, the yacc command will not produce a parser. In most cases, the yacc command provides messages to indicate the errors. To correct these errors, redesign the rules in the grammar file, or provide a lexical analyzer (input program to the parser) to recognize the patterns that the yacc command cannot.

yacc Declarations

The declarations section of the yacc grammar file contains:

It is also possible to keep semantic information associated with the tokens currently on the parse stack in a user-defined C language union, if the members of the union are associated with the various names in the grammar file.

A declaration for a variable or constant uses the syntax of the C programming language:

TypeSpecifier Declarator ;

TypeSpecifier is a data type keyword and Declarator is the name of the variable or constant. Names can be any length and consist of letters, dots, underscores, and digits. A name cannot begin with a digit. Uppercase and lowercase letters are distinct.

Terminal (or token) names can be declared using the %token declaration and nonterminal names can be declared using the %type declaration. The %type declaration is not required for nonterminal names. Nonterminal names are defined automatically if they appear on the left side of at least one rule. Without declaring a name in the declarations section, you can use that name only as a nonterminal symbol. The #include statements are identical to C language syntax and perform the same function.

The yacc program has a set of keywords that define processing conditions for the generated parser. Each of the keywords begin with a % (percent sign) and is followed by a list of tokens or nonterminal names. These keywords are:

%left Identifies tokens that are left-associative with other tokens.
%nonassoc Identifies tokens that are not associative with other tokens.
%right Identifies tokens that are right-associative with other tokens.
%start Identifies a nonterminal name for the start symbol.
%token Identifies the token names that the yacc command accepts. Declares all token names in the declarations section.
%type Identifies the type of nonterminals. Type checking is performed when this construct is present.
%union Identifies the yacc value stack as the union of the various type of values desired. By default, the values returned are integers. The effect of this construct is to provide the declaration of YYSTYPE directly from the input.



%} Copies the specified Code into the code file. This construct can be used to add C language declarations and definitions to the declarations section.

Note: The %{ (percent sign, left bracket) and %} (percent sign, right bracket) symbols must appear on lines by themselves.

The %token, %left, %right, and %nonassoc keywords optionally support the name of a C union member (as defined by %union) called a <Tag> (literal angle brackets surrounding a union member name). The %type keyword requires a <Tag>. The use of <Tag> specifies that the tokens named on the line are to be of the same C type as the union member referenced by <Tag>. For example, the declaration:

%token [<Tag>] Name [Number] [Name [Number]]...

declares the Name parameter to be a token. If <Tag> is present, the C type for all tokens on this line are declared to be of the type referenced by <Tag>. If a positive integer, Number, follows the Name parameter, that value is assigned to the token.

All of the tokens on the same line have the same precedence level and associativity. The lines appear in the file in order of increasing precedence or binding strength. For example:

%left '+' '-'
%left '*' '/'

describes the precedence and associativity of the four arithmetic operators. The + (plus sign) and - (minus sign) are left associative and have lower precedence than * (asterisk) and / (slash), which are also left associative.

Defining Global Variables

To define variables to be used by some or all actions, as well as by the lexical analyzer, enclose the declarations for those variables between %{ (percent sign, left bracket) and %} (percent sign, right bracket) symbols. Declarations enclosed in these symbols are called global variables. For example, to make the var variable available to all parts of the complete program, use the following entry in the declarations section of the grammar file:

int var = 0;

Start Conditions

The parser recognizes a special symbol called the start symbol. The start symbol is the name of the rule in the rules section of the grammar file that describes the most general structure of the language to be parsed. Because it is the most general, this is the structure where the parser starts in its top-down analysis of the input stream. Declare the start symbol in the declarations section using the %start keyword. If you do not declare the name of the start symbol, the parser uses the name of the first grammar rule in the grammar file.

For example, when parsing a C language function, the most general structure for the parser to recognize is:


The start symbol should point to the rule that describes this structure. All remaining rules in the file describe ways to identify lower-level structures within the function.

Token Numbers

Token numbers are nonnegative integers that represent the names of tokens. If the lexical analyzer passes the token number to the parser instead of the actual token name, both programs must agree on the numbers assigned to the tokens.

You can assign numbers to the tokens used in the yacc grammar file. If you do not assign numbers to the tokens, the yacc grammar file assigns numbers using the following rules:

  1. A literal character is the numerical value of the character in the ASCII character set.
  2. Other names are assigned token numbers starting at 257.

    Note: Do not assign a token number of 0. This number is assigned to the endmarker token. You cannot redefine it.

To assign a number to a token (including literals) in the declarations section of the grammar file, put a positive integer (not 0) immediately following the token name in the %token line. This integer is the token number of the name or literal. Each token number must be unique. All lexical analyzers used with the yacc command must return a 0 or a negative value for a token when they reach the end of their input.

yacc Rules

The rules section contains one or more grammar rules. Each rule describes a structure and gives it a name. A grammar rule has the form:


where A is a nonterminal name, and BODY is a sequence of 0 or more names, literals, and semantic actions that can optionally be followed by precedence rules. Only the names and literals are required to form the grammar. Semantic actions and precedence rules are optional. The colon and the semicolon are required yacc punctuation.

Semantic actions allow you to associate actions to be performed each time a rule is recognized in the input process. An action can be an arbitrary C statement, and as such, perform input or output, call subprograms, or alter external variables. Actions can also refer to the actions of the parser; for example, shift and reduce.

Precedence rules are defined by the %prec keyword and change the precedence level associated with a particular grammar rule. The reserved symbol %prec can appear immediately after the body of the grammar rule and can be followed by a token name or a literal. The construct causes the precedence of the grammar rule to become that of the token name or literal.

Repeating Nonterminal Names

If several grammar rules have the same nonterminal name, use the | (pipe symbol) to avoid rewriting the left side. In addition, use the ; (semicolon) only at the end of all rules joined by pipe symbols. For example, the grammar rules:

A  :  B  C  D  ;
A  :  E  F  ;
A  :  G  ;

can be given to the yacc command by using the pipe symbol as follows:

A  :  B  C  D
   |  E  F
   |  G

Using Recursion in a Grammar File

Recursion is the process of using a function to define itself. In language definitions, these rules normally take the form:

rule    :        EndCase
        |        rule EndCase

Therefore, the simplest case of the rule is the EndCase, but rule can also consist of more than one occurrence of EndCase. The entry in the second line that uses rule in the definition of rule is the recursion. The parser cycles through the input until the stream is reduced to the final EndCase.

When using recursion in a rule, always put the call to the name of the rule as the leftmost entry in the rule (as it is in the preceding example). If the call to the name of the rule occurs later in the line, such as in the following example, the parser may run out of internal stack space and stop.:

rule    :       EndCase
        |       EndCase rule

The following example defines the line rule as one or more combinations of a string followed by a newline character (\n):

lines   :        line
        |        lines line

line    :        string '\n'

Empty String

To indicate a nonterminal symbol that matches the empty string, use a ; (semicolon) by itself in the body of the rule. To define a symbol empty that matches the empty string, use a rule similar to the following rule:

empty   :  ;
        | x;


empty   :
        | x

End-of-Input Marker

When the lexical analyzer reaches the end of the input stream, it sends an end-of-input marker to the parser. This marker is a special token called endmarker, and has a token value of 0. When the parser receives an end-of-input marker, it checks to see that it has assigned all input to defined grammar rules and that the processed input forms a complete unit (as defined in the yacc grammar file). If the input is a complete unit, the parser stops. If the input is not a complete unit, the parser signals an error and stops.

The lexical analyzer must send the end-of-input marker at the appropriate time, such as the end of a file, or the end of a record.

yacc Error Handling

When the parser reads an input stream, that input stream might not match the rules in the grammar file. The parser detects the problem as early as possible. If there is an error-handling subroutine in the grammar file, the parser can allow for entering the data again, skipping over the bad data, or initiating a cleanup and recovery action. When the parser finds an error, for example, it may need to reclaim parse tree storage, delete or alter symbol table entries, and set switches to avoid generating further output.

When an error occurs, the parser stops unless you provide error-handling subroutines. To continue processing the input to find more errors, restart the parser at a point in the input stream where the parser can try to recognize more input. One way to restart the parser when an error occurs is to discard some of the tokens following the error. Then try to restart the parser at that point in the input stream.

The yacc command has a special token name, error, to use for error handling. Put this token in the rules file at places that an input error might occur so that you can provide a recovery subroutine. If an input error occurs in this position, the parser executes the action for the error token, rather than the normal action.

The following macros can be placed in yacc actions to assist in error handling:

YYERROR Causes the parser to initiate error handling.
YYABORT Causes the parser to return with a value of 1.
YYACCEPT Causes the parser to return with a value of 0.
YYRECOVERING() Returns a value of 1 if a syntax error has been detected and the parser has not yet fully recovered.

To prevent a single error from producing many error messages, the parser remains in error state until it processes 3 tokens following an error. If another error occurs while the parser is in the error state, the parser discards the input token and does not produce a message.

For example, a rule of the form:

stat  :  error ';'

tells the parser that when there is an error, it should skip over the token and all following tokens until it finds the next semicolon. All tokens after the error and before the next semicolon are discarded. After finding the semicolon, the parser reduces this rule and performs any cleanup action associated with it.

Providing for Error Correction

You can also allow the person entering the input stream in an interactive environment to correct any input errors by entering a line in the data stream again. For example:

input : error '\n'
          printf(" Reenter last line: " );
         $$ = $4;

is one way to do this. However, in this example the parser stays in the error state for 3 input tokens following the error. If the corrected line contains an error in the first 3 tokens, the parser deletes the tokens and does not give a message. To allow for this condition, use the yacc statement:


When the parser finds this statement, it leaves the error state and begins processing normally. The error-recovery example then becomes:

input : error '\n'
          printf(" Reenter last line: " );
         $$ = $4;

Clearing the Look-Ahead Token

The look-ahead token is the next token that the parser examines. When an error occurs, the look-ahead token becomes the token at which the error was detected. However, if the error recovery action includes code to find the correct place to start processing again, that code must also change the look-ahead token. To clear the look-ahead token include the following statement in the error-recovery action:

yyclearin ;

Lexical Analysis for the yacc Command

You must provide a lexical analyzer to read the input stream and send tokens (with values, if required) to the parser that the yacc command generates. To build a lexical analyzer that works well with the parser that yacc generates, use the lexlex command.

The lex command generates a lexical analyzer called yylex. The yylexprogram must return an integer that represents the kind of token that was read. The integer is called the token number. In addition, if a value is associated with the token, the lexical analyzer must assign that value to the yylval external variable.

yacc-Generated Parser Operation

The yacc command turns the grammar file into a C language program. That program, when compiled and executed, parses the input according to the grammar specification given.

The parser is a finite state machine with a stack. The parser can read and remember the look-ahead token. The current state is always the state at the top of the stack. The states of the finite state machine are represented by small integers. Initially, the machine is in state 0, the stack contains only 0, and no look-ahead token has been read.

The machine can perform one of four actions:

shift State The parser pushes the current state onto the stack, makes State the current state, and clears the look-ahead token.
reduce Rule When the parser finds a string defined by Rule (a rule number) in the input stream, the parser replaces that string with Rule in the output stream.
accept The parser looks at all input, matches it to the grammar specification, and recognizes the input as satisfying the highest-level structure (defined by the start symbol). This action appears only when the look-ahead token is the endmarker and indicates that the parser has successfully done its job.
error The parser cannot continue processing the input stream and still successfully match it with any rule defined in the grammar specification. The input tokens the parser looked at, together with the look-ahead token, cannot be followed by anything that would result in valid input. The parser reports an error and attempts to recover the situation and resume parsing.

The parser performs the following actions during one process step:

  1. Based on its current state, the parser decides whether it needs a look-ahead token to decide the action to be taken. If the parser needs a look-ahead token and does not have one, it calls the yylex subroutine to obtain the next token.
  2. Using the current state, and the look-ahead token if needed, the parser decides on its next action and carries it out. As a result, states may be pushed onto or popped off the stack, and the look-ahead token may be processed or left alone.


The shift action is the most common action the parser takes. Whenever the parser does a shift, there is always a look-ahead token. For example, consider the following grammar specification rule:

IF shift 34

If the parser is in the state that contains this rule and the look-ahead token is IF, the parser:

  1. Pushes the current state down on the stack.
  2. Makes state 34 the current state (puts it at the top of the stack).
  3. Clears the look-ahead token.


The reduce action keeps the stack from growing too large. The parser uses reduce actions after matching the right side of a rule with the input stream. The parser is then ready to replace the characters in the input stream with the left side of the rule. The parser may have to use the look-ahead token to decide if the pattern is a complete match.

Reduce actions are associated with individual grammar rules. Because grammar rules also have small integer numbers, you can easily confuse the meanings of the numbers in the two actions, shift and reduce. For example, the following action refers to grammar rule 18:

. reduce 18

The following action refers to state 34:

IF shift 34

For example, to reduce the following rule, the parser pops off the top three states from the stack:

A  :  x y z ;

The number of states popped equals the number of symbols on the right side of the rule. These states are the ones put on the stack while recognizing x, y, and z. After popping these states, a state is uncovered, which is the state the parser was in before beginning to process the rule, that is, the state that needed to recognize rule A to satisfy its rule. Using this uncovered state and the symbol on the left side of the rule, the parser performs an action called goto, which is similar to a shift of A. A new state is obtained, pushed onto the stack, and parsing continues.

The goto action is different from an ordinary shift of a token. The look-ahead token is cleared by a shift but is not affected by a goto action. When the three states are popped, the uncovered state contains an entry such as:

A  goto 20

This entry causes state 20 to be pushed onto the stack and become the current state.

The reduce action is also important in the treatment of user-supplied actions and values. When a rule is reduced, the parser executes the code that you included in the rule before adjusting the stack. Another stack running in parallel with the stack holding the states holds the values returned from the lexical analyzer and the actions. When a shift takes place, the yylval external variable is copied onto the stack holding the values. After executing the code that you provide, the parser performs the reduction. When the parser performs the goto action, it copies the yylval external variable onto the value stack. The yacc keywords that begin with $ refer to the value stack.

Using Ambiguous Rules in the yacc Program

A set of grammar rules is ambiguous if any possible input string can be structured in two or more different ways. For example, the grammar rule:

expr : expr '-' expr

states a rule that forms an arithmetic expression by putting two other expressions together with a minus sign between them. Unfortunately, this grammar rule does not specify how to structure all complex inputs. For example, if the input is:

expr - expr - expr

a program could structure this input as either left associative:

( expr - expr ) - expr

or as right associative:

expr - ( expr - expr )

and produce different results.

Parser Conflicts

When the parser tries to handle an ambiguous rule, confusion occurs over which of its four actions to perform when processing the input. Two major types of conflict develop:

shift/reduce conflict A rule can be evaluated correctly using either a shift action or a reduce action, but the result is different.
reduce/reduce conflict A rule can be evaluated correctly using one of two different reduce actions, producing two different actions.

A shift/shift conflict is not possible. The shift/reduce and reduce/reduce conflicts result from a rule that is not completely stated. For example, using the ambiguous rule stated in the previous section, if the parser receives the input:

expr - expr - expr

after reading the first three parts the parser has:

expr - expr

which matches the right side of the preceding grammar rule. The parser can reduce the input by applying this rule. After applying the rule, the input becomes:


which is the left side of the rule. The parser then reads the final part of the input:

- expr

and reduces it. This produces a left associative interpretation.

However, the parser can also look ahead in the input stream. If, when the parser receives the first three parts:

expr - expr

it reads the input stream until it has the next two parts, it then has the following input:

expr - expr - expr

Applying the rule to the rightmost three parts reduces them to expr. The parser then has the expression:

expr - expr

Reducing the expression once more produces a right associative interpretation.

Therefore, at the point where the parser has read only the first three parts, it can take two valid actions: a shift or a reduce. If the parser has no rule to decide between the two actions, a shift/reduce conflict results.

A similar situation occurs if the parser can choose between two valid reduce actions. That situation is called a reduce/reduce conflict.

How the Parser Responds to Conflicts

When shift/reduce or reduce/reduce conflicts occur, the yacc command produces a parser by selecting one of the valid steps wherever it has a choice. If you do not provide a rule that makes the choice, yacc uses two rules:

  1. In a shift/reduce conflict, choose the shift.
  2. In a reduce/reduce conflict, reduce by the grammar rule that can be applied at the earliest point in the input stream.

Using actions within rules can cause conflicts if the action must be performed before the parser is sure which rule is being recognized. In such cases, the preceding rules result in an incorrect parser. For this reason, the yacc program reports the number of shift/reduce and reduce/reduce conflicts resolved using the preceding rules.

Turning on Debug Mode for a yacc-Generated Parser

You can access the debugging code either by invoking the yacc command with the -t option or compiling the y.tab.c file with -DYYDEBUG.

For normal operation, the yydebug external integer variable is set to 0. However, if you set it to a nonzero value, the parser generates a description of:

Set this variable in one of the following ways:

[ Top of Page | Previous Page | Next Page | Contents | Index | Library Home | Legal | Search ]