6.4. Fixing The Grammar

The problem that we've encountered comes up because our definitions of both arithmetic and Boolean factors permit the use of parenthesized expressions. Since the definitions are recursive, we can end up with any number of levels of parentheses, and the parser can't know which kind of expression it's dealing with.

The solution is simple, although it ends up causing profound changes to our grammar. We can only allow parentheses in one kind of factor. The way to do that varies considerably from language to language. This is one place where there is no agreement or convention to help us.

When Niklaus Wirth designed Pascal, the desire was to limit the number of levels of precedence (fewer parse routines, after all). So the OR and exclusive OR operators are treated just like an Addop and processed at the level of a math expression. Similarly, the AND is treated like a Mulop and processed with Term. The precedence levels are

LevelSyntax ElementOperator
0factorliteral, variable
1signed factorunary minus, NOT
2term*, /, AND
3expression+, -, OR

Notice that there is only one set of syntax rules, applying to both kinds of operators. According to this grammar, then, expressions like

x + (y AND NOT z) DIV 3

are perfectly legal. And, in fact, they are … as far as the parser is concerned. Pascal doesn't allow the mixing of arithmetic and Boolean variables, and things like this are caught at the semantic level, when it comes time to generate code for them, rather than at the syntax level.

The authors of C took a diametrically opposite approach: they treat the operators as different, and have something much more akin to our seven levels of precedence. In fact, in C there are no fewer than 17 levels! That's because C also has the operators =, += and its kin, <<, >>, ++, --, etc. Ironically, although in C the arithmetic and Boolean operators are treated separately, the variables are not … there are no Boolean or logical variables in C, so a Boolean test can be made on any integer value.

We'll do something that's sort of in-between. I'm tempted to stick mostly with the Pascal approach, since that seems the simplest from an implementation point of view, but it results in some funnies that I never liked very much, such as the fact that, in the expression

IF (c >= 'A') and (c <= 'Z') then …

the parens above are required. I never understood why before, and neither my compiler nor any human ever explained it very well, either. But now, we can all see that the and operator, having the precedence of a multiply, has a higher one than the relational operators, so without the parens the expression is equivalent to IF c >= ('A' and c) <= 'Z' then which doesn't make sense.

In any case, I've elected to separate the operators into different levels, although not as many as in C.

<b-expression> ::= <b-term> [<orop> <b-term>]*
<b-term>       ::= <not-factor> [AND <not-factor>]*
<not-factor>   ::= [NOT] <b-factor>
<b-factor>     ::= <b-literal> | <b-variable> | <relation>
<relation>     ::= | <expression> [<relop> <expression]
<expression>   ::= <term> [<addop> <term>]*
<term>         ::= <signed factor> [<mulop> factor]*
<signed factor>::= [<addop>] <factor>
<factor>       ::= <integer> | <variable> | (<b-expression>)

This grammar results in the same set of seven levels that I showed earlier. Really, it's almost the same grammar … I just removed the option of parenthesized b-expressions as a possible b-factor, and added the relation as a legal form of b-factor.

There is one subtle but crucial difference, which is what makes the whole thing work. Notice the square brackets in the definition of a relation. This means that the relop and the second expression are optional.

A strange consequence of this grammar (and one shared by C) is that every expression is potentially a Boolean expression. The parser will always be looking for a Boolean expression, but will “settle” for an arithmetic one. To be honest, that's going to slow down the parser, because it has to wade through more layers of procedure calls. That's one reason why Pascal compilers tend to compile faster than C compilers. If it's raw speed you want, stick with the Pascal syntax.