In the last chapter we investigated LL(1) grammars which allow the construction of top-down parsers that avoid backup altogether. There are several ways that LL(1) parsers can be implemented. They all share the same basic algorithm to parse the input, but implementation details vary. In this chapter we will study in detail recursive descent parsers which is one way to implement an LL(1)-parser.
The method is called recursive descent because the parser is implemented as a set of mutually recursive parsing routines, and because it is a top-down method. The current sentential form is not explicitly represented in a recursive descent parser. Instead, it is implicitly represented by the call stack and the point of execution in the currently active parsing routine.
Recursive descent parsers have the disadvantage that code has to be written specifically tailored to each LL(1) grammar. LL(1) parsers can be implemented more generally in which this disadvantage is eliminated. In these more general parsers, the grammar can be entered as data. On the other hand, recursive descent parsers have their own advantages. They are intuitive, and the parser can be easily understood from the code and related to the grammar. The parser, once written, becomes the foundation of the whole compiler, and the scope analyser, type analyser, and code generator are all relatively easily integrated into the parser. Since the code of a recursive descent parser that corresponds to a given LL(1) grammar can be generated in a totally algorithmic way, one could write a tool which generates the code given the grammar.
In the next section we describe how to write a recursive descent parser. In later sections we discuss error recovery.
In a recursive descent parser each nonterminal of the LL(1) grammar corresponds to an identically named parsing procedure. Thus, if A is a nonterminal, the corresponding parsing procedure will be called also A. The parsing procedure A will be invoked during parsing, whenever the nonterminal A is the leftmost nonterminal of the sentential form. At this stage of the parsing, the string of terminals to the left of A in the sentential form have been matched already against the input, and the parser variable symbol holds the first unmatched token of the input. Thus symbol holds the look-ahead symbol of the input.
Whenever the parser needs another input symbol, it invokes the nextSymbol() routine of the scanner, which reads in the characters corresponding to the next token from the input, identifies the token, and returns it. nextSymbol returns an object of the type Token. We will assume that a Token has the representation discussed in the chapter on the scanner, that is it has a field byte token which contains the enumerated type value of the token, and int value , which contains the value of the numeral or the unique integer associated with a name. Token has accessor methods byte getToken() and int getValue() , so that the parser can access these fields. Thus, symbol.getToken() == Token.BECOMES1 for the token := , and for a numeral 104, we would have, symbol.getToken() == Token.NUMERAL1 and symbol.getValue() == 104. For the identifier temperature, we would have symbol.getToken() = Token.NAME1 and symbol.getValue() = 8, assuming that temperature was the eighth name created by the scanner.
The overall responsibility of the parsing procedure A() is to read in the tokens generated by the nonterminal A in this sentence and then return with symbol containing the next look-ahead symbol, that is the token that comes after the string generated by A in the input. We will now describe the algorithm used by parsing procedure A().
The first step of a derivation from the nonterminal A must
replace A by one of the alternatives of the productions with
left-hand-side A, namely
A -> x1 | x2 | ... | xk
Since the grammar is LL(1), symbol is in the First
set of at most one of these alternatives, therefore the pseudo-code for
A has the form,
void A() {
if ( symbol.getToken() in First(x1) then
parse(x1);
else if symbol.getToken() in First(x2) then
parse(x2);
...
else if symbol.getToken() in First(xk) then
parse(xk);
else ...
}
In the above pseudo-code the sets First(...) must be pre-computed for all alternatives and replaced by the corresponding set of tokens. In practice, very often these sets contain a single token, or just a few tokens, in which case the code can be simplified, and we would not need an implementation for sets of tokens.
What must the procedure do if symbol is in none of the First sets?
There are two cases to consider. If A may not generate the
empty string, then this is an error, consequently the code should be,
void A() {
if (symbol.getToken() in First(x1)) then
parse(x1);
else if (symbol.getToken() in First(x2)) then
parse(x2);
...
else if (symbol.getToken() in First(xk)) then
parse(xk);
else syntaxError(message); // if A may not generate the empty string
}
void A() {
if (symbol.getToken() in First(x1)) then
parse(x1);
else if (symbol.getToken() in First(x2)) then
parse(x2);
...
else if (symbol.getToken() in First(xk)) then
parse(xk);
else return; // if A may generate the empty string
}
Next we consider parse(x), the parsing algorithm for an
alternative x. This must be replaced by the correct code to parse
the corresponding alternative. What parse(x) must
accomplish
overall is to read in whatever was generated by the alternative, and then
return with symbol containing the next look-ahead symbol. Since an
alternative in general consists of a string of nonterminal and/or
terminal symbols X1X2...Xn,
the string generated by this alternative must consist of the string generated by
X1 followed by the string generated by X2
... followed by the string generated by Xn. Consequently,
parse(X1X2...Xn) {
parse(X1);
parse(X2);
...
parse(Xn);
}
It remains to consider the parsing algorithm for a single nonterminal or
terminal symbol. For a nonterminal symbol X all we have to do
is call the parsing procedure X().
parse(X) {
X();
}
parse( [ x ] ) {
if (symbol.getToken() in First(x)) then
parse(x);
}
parse( { x } ) {
while (symbol.getToken() in First(x)) do
parse(x);
}
parse(t) {
match(t);
}
void match( byte t) {
if (symbol.getToken() == t) then {
symbol = scan.nextSymbol();
return;
}
else syntaxError("Unexpected token found");
}
We apply the rules above to construct a recursive descent parser for the
simple LL(1) grammar for expressions from the last chapter.
We assume that the terminal symbols a and b are identifiers,
so both will be mapped to name1 tokens by the scanner.
E -> T [ + E ]
T -> F [ * T ]
F -> a | b | ( E )
void E() {
T();
if (symbol.getToken() == Token.PLUS1) {
match(Token.PLUS1);
E();
}
}
void T() {
F();
if ( symbol.getToken() == Token.MULTIPLY1 ) {
match(Token.MULTIPLY1);
T();
}
}
void F() {
if ( symbol.getToken() == Token.NAME1 )
match(Token.NAME1);
else if ( symbol.getToken() == Token.LEFTPAREN1 ) {
match(Token.LEFTPAREN1);
E();
match(Token.RIGHTPAREN1);
}
else syntaxError("Illegal symbol starting a factor.");
}
package pascalminus.parser;
/**
* The Pascal Minus Parser -- a recursive descent parser
* version 1: no error recovery
*/
import pascalminus.scanner.*;
import plcompiler.util.SymbolSet;
public class Parser {
private static Token symbol;
private static TokenReader tokenReader;
private static String tokenSource = "tokens.tok";
// SymbolSets
private static SymbolSet exprSymbols = new SymbolSet(
Token.MINUS1, Token.NUMERAL1, Token.NAME1, Token.LPAREN1, Token.NOT1);
private static SymbolSet relOpSymbols = new SymbolSet(
Token.LESS1, Token.EQUAL1, Token.GREATER1,
Token.LESSOREQUAL1, Token.GREATOREQUAL1, Token.NOTEQUAL1);
public Parser() {
tokenReader = new TokenReader(tokenSource);
symbol = tokenReader.getSymbol();
}
public Parser(String tokensource) {
tokenReader = new TokenReader(tokensource);
symbol = tokenReader.getSymbol();
tokenSource = tokensource;
}
/**
* syntax: Program -> "program" ProgramName ";" Block "."
*/
public void program() {
match(Token.PROGRAM1);
match(Token.NAME1);
match(Token.SEMICOLON1);
block();
match(Token.PERIOD1);
}
/*
* private void match(byte tok)
* match a token in the input
*/
private void match(byte tok) {
if (symbol.getToken() == tok)
symbol=tokenReader.nextSymbol();
else syntaxError(" unexpected symbol: "+ symbol + " . Expected: " +
Token.getTokenString(tok) + " .");
} // match()
... other parsing procedures ...
} // class Parser
First let us look at some features of the overall design of the class. The Pascal minus parser is part of a multi-pass compiler. Consequently when it needs a token it must read it from a file of tokens previouely output by the scanner. This is done by an object of type TokenReader. When the object tokenReader is created, the constructor for TokenReader reads in the first token automatically and stores it in a variable that always holds the current symbol. TokenReader has an accessor method Token getToken() which returns the current token. The parser stores this in it's own private variable Token symbol. Both tokenReader and symbol are static variables, because the Parser only needs one of each.
The parser only has one public method: the parsing procedure program() for the start symbol Program of the Pascal minus grammar. All other parsing procedures are internal details of the parser, so are private. The code of program() is given above.
Finally we note the use of some sets of tokens used in the Pascal minus parser. For this, the parser relies on a class that implements sets of tokens called SymbolSet. Java has convenient set representations in the library, and the code for SymbolSet relies on java.util.TreeSet to build actual sets. However SymbolSet presents a convenient interface for the parser to construct small sets of tokens directly through constructors, as well as other methods that make the code for the parser more pleasant, readable, and easier to understand. The parser uses for example two of these pre-computed sets in static variables exprSymbols and relOpSymbols which contain the sets for First(Expression) and First(RelationalOperator), because these contain at least 5 tokens.
The code for parsing procedure block() illustrates the use of optional items, and the simplification of the pseudo-code we presented before, when a set of tokens consists of a single token.
/**
* private void block()
*
* syntax: Block -> [ ConstantDefinitions ] [ TypeDefinitions ]
* [ VarDefinitions] [ ProcedureDefinitions ] CompoundStatement
*
*/
private void block(){
if (symbol.getToken() == Token.CONST1)
constantDefinitions();
if (symbol.getToken() == Token.TYPE1)
typeDefinitions();
if (symbol.getToken() == Token.VAR1)
varDefinitions();
if (symbol.getToken() == Token.PROCEDURE1)
procedureDefinitions();
compoundStatement();
} // block()
The code for constantDefinitions() illustrates the use of repetition in the BNF grammar for Pascal minus.
/*
* private void constantDefinitions()
*
* syntax: ConstantDefinitions ->
* "const" ConstantDefinition { ConstantDefinition }
*/
private void constantDefinitions() {
match(Token.CONST1);
constantDefinition();
while (symbol.getToken() == Token.NAME1) {
constantDefinition();
}
} // constantDefinitions()
/*
* private void constantDefinition()
*
* syntax: ConstantDefinition -> ConstantName "=" Constant ";"
*
*/
private void constantDefinition() {
match(Token.NAME1);
match(Token.EQUAL1);
constant();
match(Token.SEMICOLON1);
} // constantDefinition()
Error Recovery
Until now we have not considered what to do if the input does not consist of a valid sentence of the language. The parser will detect this at some point and invoke the procedure syntaxError() but what should this procedure do? A simple solution is for syntaxError() to report the error and then terminate the parser. Typically we want to indicate where the error was encountered, and so for example the current line number might be reported. Some compilers use exactly this approach, that is they abort compilation when the first error is encountered, which is acceptable when only relatively small programs are compiled, so that the cost of recompilation is not too large. This approach can be particularly effective when the compiler is embedded in an integrated development environment, so the error automatically is linked to the source program editor, which imemdiately highlights the offending line of code to be corrected. But if rather large programs might be compiled, recompiling the program for each syntax error encountered becomes unsatisfactory. If a syntax error is encountered, code cannot be generated, but we still would like the compiler to continue compilation after reporting the error, so that as many syntax errors as possible are discovered. It is more efficient to correct a number of mistakes than one at a time. Continuing compilation after an error is encountered is however not as simple as it sounds, because before the parser can continue parsing, it must take the appropriate corrective action. This process is called error recovery.
When an error is encountered what the parser will detect is that the next symbol in the input is not what it should have been. This may be detected either in match() or in a parsing procedure for a nonterminal that may not generate the empty string and for which symbol is not in the First set of any of its alternatives. Before the parser can continue to parse it must decide what to do with the offending symbol. The simplest case to consider is when parsing can resume with the change of a single symbol in the input. However, even in this simplest of cases there are several possibilities. The offending symbol may have been in the input instead of the correct symbol, in which case we could continue parsing if the parser pretended that it saw the correct symbol and continued parsing with the next symbol of the input. This is the case on line 8 and line 13 in the example code below.
1 { syntax errors -- adapted from Brinch Hansen on Pascal Compilers
2 Prentice Hall, 1985.
3 }
4
5 program Test;
6
7 const
8 a := 1; { wrong symbol: := instead of = }
9 b = 2 ;
10 c = ; { symbol missing!!! }
11 d = 4 ;
12 type
13 S = recrod f,g : integer end; { mistyped keyword instead
14 of record }
15 T = array[ a .. b ] of integer;
16 var
17 x : integer;
18 begin
19 if x == 2 then { extra = symbol }
20 x := 1
21 end.
Another possibility would be that the correct symbol is missing from the input, as in line 10, in which case we could continue parsing by pretending that the correct symbol was present in the input and then continue parsing with the current symbol. Still another possibility is that the offending symbol is one that should be cut from the input, as on line 19, in which case we could continue parsing by pretending that the offending symbol was never read, read the next symbol and continue parsing with it as if it were the offending symbol.
The trouble is that for the parser all three cases appear to be identical yet recovery would consist of three different actions. The difficulty is compounded by the fact that the parser sees a single look-ahead symbol, and this is the only context that it sees. There are of course many other more complicated situations that could occur, in which more than one symbol would have to be changed to resume parsing. The important points to understand from the above discussion is that when an error occurs the parser and the input must be resynchronized to permit the continuation of parsing and that the proper resynchronization is not at all obvious from the information available to the parser.
There is no "best" solution known to the problem of error recovery, but good heuristic algorithms have been found which work well in practice under most circumstances though they may sometimes fail. We will present in this section one such algorithm due to Wirth. When an error occurs we will allow the parser to skip 0 or more symbols in the input to enable resynchronization. The symbols which are skipped are not parsed at all, so it is desirable to skip the smallest number of tokens in order to catch the maximum number of syntax errors. The general idea is to skip symbols in the input until a symbol is found which "makes sense" at that point, and which will allow parsing to resume. The set of symbols which "makes sense" needs to be changed as the parsing progresses. The symbols that make sense in the current context are called stop symbols, because they allow skipping to be stopped. Each parsing procedure will receive as a parameter an initial set of stop symbols which make sense in the current context. We will also add a new responsibility to the parsing procedure, namely that it must insure that when it returns, whether an error occurred inside of it or not, symbol contains one of the stop symbols. Inside of the parsing procedure the stop symbols are adjusted at each step to include additional symbols that make sense at that point. Parsing starts with the start symbol which generates the entire input. When this parsing procedure returns symbol will contain the end of sentence marker Token.ENDTEXT1, so the initial set of stop symbols contains the single symbol Token.ENDTEXT1.
When a syntax error occurs, it is reported and then symbols are skipped until a stop symbol is found. However, as we shall see error recovery tends to generate a number of syntax errors before the proper resynchronization takes place. It is desirable not to report multiple errors caused by a single syntax error, so we adopt the rule that at most one error is reported per line, which will tend to suppress these multiple error messages. This is accomplished by setting a flag with each new line, and resetting it when an error is reported. The error reporing routine checks the flag and suppresses the error message if it is reset. The syntaxError() procedure thus can have the following pseudo-code. We assume we have a data type SymbolSet for the set of stop symbols.
void syntaxError( SymbolSet stop, String errorMessage) {
reportError(errorMessage); // if firstError flag is true.
// Also, reset firstError.
while (! (symbol.getToken() in stop )) symbol = scan.nextSymbol();
}
Now we revisit the rules for a recursive descent parser to incorporate
the above ideas for error recovery. Each nonterminal
A has a corresponding parsing procedure A(SymbolSet stop)
which is invoked when A is the leftmost nonterminal of the
current sentential form, and where stop is the current set of
stop symbols. The overall
responsibility of A(stop) is to read the sequence of tokens
generated by the nonterminal A in this sentence and check that
the next look-ahead symbol is in stop. Suppose that the productions
of the grammar with left-hand-side A are
A -> x1 | x2 | ... | xk .
Then if A may generate
the empty string the pseudo code for A(stop) has the form,
void A(SymbolSet stop) {
if ( symbol.getToken() in First(x1 ) then
parse(x1, stop );
else if ( symbol.getToken() in First(x2 ) then
parse(x2, stop );
...
else if ( symbol.getToken() in First(xk ) then
parse(xk, stop );
else // if A may generate the empty string
// make sure symbol is in stop
if ( symbol.getToken() in stop ) then return;
else syntaxError(stop, "Unexpected token" + symbol
+ " after parsing an empty A.");
}
void A(SymbolSet stop) {
if ( symbol.getToken() in First(x1 ) then
parse(x1, stop );
else if ( symbol.getToken() in First(x2 ) then
parse(x2, stop );
...
else if ( symbol.getToken() in First(xk ) then
parse(xk, stop );
else // if A may not generate the empty string
// this is a syntax error
syntaxError(stop, "An A cannot start with "+ symbol);
}
parse(X1X2...Xn, stop) {
parse( X1, stop1 );
parse( X2, stop2 );
...
parse(Xn, stopn );
}
The parsing algorithm for a single nonterminal X consists
of just calling the corresponding parsing procedure with the current stop
symbols,
parse(X,stop) {
X(stop);
}
parse( [ x ], stop ) {
if symbol.getToken() in First(x) then
parse(x,stop );
else if (symbol.getToken() not in stop) then
syntaxError(stop, "Unexpected token" + symbol);
}
parse( { x } , stop ) {
while ( symbol.getToken() in First(x) ) do
parse(x, stop U First(x);
if (symbol.getToken() not in stop) then
syntaxError(stop, "Unexpected token" + symbol);
}
The test at the end is necessary in the case that the loop body is not
executed at all.
Finally, for a single terminal symbol we call match
void match(byte t, SymbolSet stop) {
if ( symbol.getToken() == t ) then {
symbol = scan.nextSymbol();
if ( symbol.getToken() in stop ) return;
else syntaxError(stop, "Unexpected token" + symbol
+ " after " + t);
}
else syntaxError(stop, symbol + " does not match expected token " + t);
}
We now show the Pascal minus parsing procedures above that include error recovery. The reader may want to compare this code with the previously presented version. Notice also the heavier reliance on the services provided by SymbolSet.
package pascalminus.parser;
/**
* The Pascal Minus Parser -- a recursive descent parser
* version 2: with error recovery
*/
import pascalminus.scanner.*;
import plcompiler.util.SymbolSet;
public class Parser {
private static Token symbol;
private static TokenReader tokenReader;
private static String tokenSource = "tokens.tok";
// SymbolSets
private static SymbolSet relOpSymbols = new SymbolSet(
Token.LESS1, Token.EQUAL1, Token.GREATER1,
Token.LESSOREQUAL1, Token.GREATOREQUAL1, Token.NOTEQUAL1);
private static SymbolSet exprSymbols = new SymbolSet(Token.PLUS1,Token.MINUS1,
Token.NUMERAL1, Token.NAME1, Token.LPAREN1, Token.NOT1);
private static SymbolSet termSymbols = new SymbolSet(Token.NUMERAL1,
Token.NAME1, Token.LPAREN1, Token.NOT1);
private static SymbolSet addOpSymbols = new SymbolSet(Token.PLUS1,
Token.MINUS1, Token.OR1);
private static SymbolSet multOpSymbols = new SymbolSet(Token.TIMES1,
Token.DIV1, Token.MOD1, Token.AND1);
private static SymbolSet blockSymbols = new SymbolSet(Token.CONST1,
Token.TYPE1, Token.VAR1, Token.PROCEDURE1, Token.BEGIN1);
private static SymbolSet statementSymbols = new SymbolSet(Token.IF1,
Token.NAME1, Token.WHILE1, Token.BEGIN1);
public Parser() {
tokenReader = new TokenReader(tokenSource);
symbol = tokenReader.getSymbol();
}
public Parser(String tokensource) {
tokenReader = new TokenReader(tokensource);
symbol = tokenReader.getSymbol();
tokenSource = tokensource;
}
/**
* public void program(SymbolSet stop)
*
* syntax: Program -> "program" ProgramName ";" Block "."
*/
public void program(SymbolSet stop) {
SymbolSet
st4 = stop.add(Token.PERIOD1), // stop + .
st3 = st4.add(blockSymbols), // stop + (blockSymbols, .)
st2 = st3.add(Token.SEMICOLON1), // stop + (; , blockSymbols, .)
st1 = st2.add(Token.NAME1); // stop + ( name, ; , blocksymbols, . )
match(Token.PROGRAM1, st1);
match(Token.NAME1, st2);
match(Token.SEMICOLON1,st3);
block(st4);
match(Token.PERIOD1,stop);
} // program()
/*
* private void match(byte tok,SymbolSet stop)
* match a token in the input
*/
private void match(byte tok, SymbolSet stop) {
if (symbol.getToken() == tok) {
symbol=tokenReader.nextSymbol(); // now make sure symbol is in stop
if ( stop.contains(symbol.getToken())) return; // OK
else { // not OK -- skip to a stop symbol then return
syntaxError(
" Illegal symbol after " + Token.getTokenString(tok) + " ." ,stop );
return;
}
}
else syntaxError(" unexpected symbol: "+ symbol + " . Expected: " +
Token.getTokenString(tok) + " .",stop);
} // match
/**
* private void block(SymbolSet stop)
*
* syntax: Block -> [ ConstantDefinitions ] [ TypeDefinitions ]
* [ VarDefinitions] [ ProcedureDefinitions ] CompoundStatement
*
*/
private void block(SymbolSet stop){
SymbolSet
st4 = stop.add(Token.BEGIN1),
st3 = st4.add(Token.PROCEDURE1),
st2 = st3.add(Token.VAR1),
st1 = st2.add(Token.TYPE1);
if (symbol.getToken() == Token.CONST1)
constantDefinitions(st1);
if (symbol.getToken() == Token.TYPE1)
typeDefinitions(st2);
if (symbol.getToken() == Token.VAR1)
varDefinitions(st3);
if (symbol.getToken() == Token.PROCEDURE1)
procedureDefinitions(st4);
compoundStatement(stop);
} // block()
/*
* private void constantDefinitions(SymbolSet stop)
*
* syntax: ConstantDefinitions ->
* "const" ConstantDefinition { ConstantDefinition }
*/
private void constantDefinitions(SymbolSet stop) {
SymbolSet st1 = stop.add(Token.NAME1);
match(Token.CONST1, st1);
constantDefinition(st1);
while (symbol.getToken() == Token.NAME1) {
constantDefinition(st1);
}
} // constantDefinitions()
/*
* private void constantDefinition(SymbolSet stop)
*
* syntax: ConstantDefinition -> Constantname "=" Constant ";"
*
*/
private void constantDefinition(SymbolSet stop) {
SymbolSet
st3 = stop.add(Token.SEMICOLON1),
st2 = st3.add(Token.NAME1, Token.NUMERAL1),
st1 = st2.add(Token.EQUAL1);
match(Token.NAME1, st1);
match(Token.EQUAL1, st2);
constant(st3);
match(Token.SEMICOLON1, stop);
} // constantDefinition()
... other parsing procedures ...
} // class Parser