10.9. Lexical Scanning

Of course, you know what's next: We have to convert the program so that it can deal with multi-character keywords, newlines, and whitespace. We have just gone through all that in Chapter 7, Lexical Scanning. We'll use the distributed scanner technique that I showed you in that installment. The actual implementation here is a little different, because I'm handling newlines differently.

To begin with, let's simply allow for whitespace. This involves only adding calls to SkipWhite at the end of the three routines, GetName, GetNum, and Match. A call to SkipWhite in Init primes the pump in case there are leading spaces.
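
As a reminder, SkipWhite and a Match with the trailing call in place look something like this (a sketch based on the Chapter 7 versions, assuming the TAB constant declared in earlier installments; your copies may differ in detail):

{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while Look in [' ', TAB] do
      GetChar;
end;

{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;               { new: eat any white space that follows }
end;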

Next, we need to deal with newlines. This is really a two-step process, since the treatment of the newlines with single-character tokens is different from that for multi-character ones. We can eliminate some work by doing both steps at once, but I feel safer taking things one step at a time.

Insert the new procedure:

{ Skip Over an End-of-Line }
procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;

Note that we have seen this procedure before in the form of Procedure Fin. I've changed the name since this new one seems more descriptive of the actual function. I've also changed the code to allow for multiple newlines and lines with nothing but white space.

The next step is to insert calls to NewLine wherever we decide a newline is permissible. As I've pointed out before, this can be very different in different languages. In TINY, I've decided to allow them virtually anywhere. This means that we need calls to NewLine at the BEGINNING (not the end, as with SkipWhite) of the procedures GetName, GetNum, and Match.
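
Here, for example, is what GetNum looks like with both calls in place (a sketch; Match gets a NewLine at its top in the same way, and the multi-character form of GetName appears a bit further on):

{ Get a Number }
function GetNum: integer;
var Val: integer;
begin
   NewLine;                 { newline(s) allowed before the number }
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + (Ord(Look) - Ord('0'));
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;               { and skip trailing white space }
end;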

For procedures that have while loops, such as TopDecls, we need a call to NewLine at the beginning of the procedure and at the bottom of each loop. That way, we can be assured that NewLine has just been called at the beginning of each pass through the loop.
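
To make that concrete, here's a sketch of the single-character version of TopDecls we have at this point (before the keyword handling added below), with the calls in place; the exact error message may differ from your copy:

{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   NewLine;                 { newline(s) allowed before the first declaration }
   while Look <> 'b' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ''' + Look + '''');
      end;
      NewLine;              { and again at the bottom of each pass }
   end;
end;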

If you've got all this done, try the program out and verify that it will indeed handle white space and newlines.

If it does, then we're ready to deal with multi-character tokens and keywords. To begin, add the additional declarations (copied almost verbatim from Chapter 7, Lexical Scanning):

{ Type Declarations }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

{ Variable Declarations }
var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token       }
    Value: string[16];       { Unencoded Token     }
    ST: Array['A'..'Z'] of char;

{ Definition of Keywords and Token Types }
const NKW =   9;
      NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'VAR', 'BEGIN', 'END', 'PROGRAM');

const KWcode: string[NKW1] = 'xilewevbep';

Next, add the three procedures, also from Chapter 7, Lexical Scanning:

{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;
.
.
.
{ Get an Identifier and Scan it for Keywords }
procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
.
.
.
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;
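
One detail worth spelling out: Lookup returns zero when the identifier isn't in the table, and Pascal strings are indexed from one, so the "+ 1" in Scan maps a miss onto the leading 'x' in KWcode, which serves as the token code for an ordinary identifier. A couple of illustrative cases (assuming the declarations above; 'COUNT' is just a made-up identifier):

{ Value = 'IF'     ->  Lookup(Addr(KWlist), Value, NKW) = 1  ->  Token = KWcode[2] = 'i' }
{ Value = 'COUNT'  ->  Lookup(Addr(KWlist), Value, NKW) = 0  ->  Token = KWcode[1] = 'x' }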

Now, we have to make a fairly large number of subtle changes to the remaining procedures. First, we must change the function GetName to a procedure, again as we did in Chapter 7, Lexical Scanning:

{ Get an Identifier }
procedure GetName;
begin
   NewLine;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;

Note that this procedure leaves its result in the global string Value.

Next, we have to change every reference to GetName to reflect its new form. These occur in Factor, Assignment, and Decl:

{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      BoolExpression;
      Match(')');
      end
   else if IsAlpha(Look) then begin
      GetName;
      LoadVar(Value[1]);
      end
   else
      LoadConst(GetNum);
end;
.
.
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := Value[1];
   Match('=');
   BoolExpression;
   Store(Name);
end;
.
.
{ Parse and Translate a Data Declaration }
procedure Decl;
begin
   GetName;
   Alloc(Value[1]);
   while Look = ',' do begin
      Match(',');
      GetName;
      Alloc(Value[1]);
   end;
end;

Note that we're still only allowing single-character variable names, so we take the easy way out here and simply use the first character of the string.

Finally, we must make the changes needed to use Token instead of Look as the test character, and to call Scan at the appropriate places. For the most part, this involves deleting calls to Match, replacing an occasional call to Match with a call to MatchString, and replacing calls to NewLine with calls to Scan. Here are the affected routines:

{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;

{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   MatchString('ENDWHILE');
   Branch(L1);
   PostLabel(L2);
end;

{ Parse and Translate a Block of Statements }
procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
      else Assignment;
      end;
      Scan;
   end;
end;

{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   Scan;
   while Token <> 'b' do begin
      case Token of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Value);
      end;
      Scan;
   end;
end;

{ Parse and Translate a Main Program }
procedure Main;
begin
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end;

{  Parse and Translate a Program }
procedure Prog;
begin
   MatchString('PROGRAM');
   Header;
   TopDecls;
   Main;
   Match('.');
end;

{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   GetChar;
   Scan;
end;

That should do it. If all the changes went in correctly, you should now be parsing programs that look like programs. (If you didn't make it through all the changes, don't despair. A complete listing of the final form is given later.)
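
If you'd like something to feed it, a made-up test source along these lines (written to the grammar as it stands here, with single-character variable names) should go through cleanly:

PROGRAM
VAR A, B
BEGIN
A = 1
B = A + 2
IF B
   A = A + 1
ENDIF
END
.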

Did it work? If so, then we're just about home. In fact, with a few minor exceptions we've already got a compiler that's usable. There are still a few areas that need improvement.