15.6. Scanning And Parsing

The classical compiler architecture consists of separate modules for the lexical scanner, which supplies tokens in the language, and the parser, which tries to make sense of the tokens as syntax elements. If you can still remember what we did in earlier installments, you'll recall that we didn't do things that way. Because we're using a predictive parser, we can almost always tell what language element is coming next, just by examining the lookahead character. Therefore, we found no need to prefetch tokens, as a scanner would do.

But, even though there is no functional procedure called “Scanner,” it still makes sense to separate the scanning functions from the parsing functions. So I've created two more units called, amazingly enough, Scanner and Parser. The Scanner unit contains all of the routines known as recognizers. Some of these, such as IsAlpha, are pure boolean routines which operate on the lookahead character only. The other routines are those which collect tokens, such as identifiers and numeric constants. The Parser unit will contain all of the routines making up the recursive-descent parser. The general rule should be that unit Parser contains all of the information that is language-specific; in other words, the syntax of the language should be wholly contained in Parser. In an ideal world, this rule should be true to the extent that we can change the compiler to compile a different language, merely by replacing the single unit, Parser.

In practice, things are almost never this pure. There's always a small amount of “leakage” of syntax rules into the scanner as well. For example, the rules concerning what makes up a legal identifier or constant may vary from language to language. In some languages, the rules concerning comments permit them to be filtered by the scanner, while in others they do not. So in practice, both units are likely to end up having language-dependent components, but the changes required to the scanner should be relatively trivial.

Now, recall that we've used two versions of the scanner routines: One that handled only single-character tokens, which we used for a number of our tests, and another that provided full support for multi-character tokens. Now that we have our software separated into units, I don't anticipate getting much use out of the single-character version, but it doesn't cost us much to provide for both. I've created two versions of the Scanner unit. The first one, called Scanner1, contains the single-digit version of the recognizers:

unit Scanner1;

interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: char;
function GetNumber: char;

implementation

{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
        IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
        IsDigit := c in ['0'..'9'];
end;

{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
        IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
        IsAddop := c in ['+','-'];
end;

{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
        IsMulop := c in ['*','/'];
end;

{ Match One Character }
procedure Match(x: char);
begin
        if Look = x then GetChar
	else Expected('''' + x + '''');
end;

{ Get an Identifier }
function GetName: char;
begin
        if not IsAlpha(Look) then Expected('Name');
        GetName := UpCase(Look);
        GetChar;
end;

{ Get a Number }
function GetNumber: char;
begin
        if not IsDigit(Look) then Expected('Integer');
        GetNumber := Look;
        GetChar;
end;

end.

The following code fragment of the main program provides a good test of the scanner. For brevity, I'll only include the executable code here; the rest remains the same. Don't forget, though, to add the name Scanner1 to the "uses" clause.

        Write(GetName);
        Match('=');
        Write(GetNumber);
        Match('+');
        WriteLn(GetName);

This code will recognize all sentences of the form:

x=0+y

where x and y can be any single-character variable names, and 0 any digit. The code should reject all other sentences, and give a meaningful error message. If it did, you're in good shape and we can proceed.