7.9. Getting Fancy

OK, at this point we have a pretty nice lexical scanner that will break an input stream up into tokens. We could use it as it stands and have a servicable compiler. But there are some other aspects of lexical scanning that we need to cover.

The main consideration is <shudder> efficiency. Remember when we were dealing with single-character tokens, every test was a comparison of a single character, Look, with a byte constant. We also used the Case statement heavily.

With the multi-character tokens being returned by Scan, all those tests now become string comparisons. Much slower. And not only slower, but more awkward, since there is no string equivalent of the CASE statement in Pascal. It seems especially wasteful to test for what used to be single characters … the =, +, and other operators … using string comparisons.

Using string comparison is not impossible … Ron Cain used just that approach in writing Small C. Since we're sticking to the KISS principle here, we would be truly justified in settling for this approach. But then I would have failed to tell you about one of the key approaches used in “real” compilers.

You have to remember: the lexical scanner is going to be called a lot! Once for every token in the whole source program, in fact. Experiments have indicated that the average compiler spends anywhere from 20% to 40% of its time in the scanner routines. If there were ever a place where efficiency deserves real consideration, this is it.

For this reason, most compiler writers ask the lexical scanner to do a little more work, by “tokenizing” the input stream. The idea is to match every token against a list of acceptable keywords and operators, and return unique codes for each one recognized. In the case of ordinary variable names or numbers, we just return a code that says what kind of token they are, and save the actual string somewhere else.

One of the first things we're going to need is a way to identify keywords. We can always do it with successive IF tests, but it surely would be nice if we had a general-purpose routine that could compare a given string with a table of keywords. (By the way, we're also going to need such a routine later, for dealing with symbol tables.) This usually presents a problem in Pascal, because standard Pascal doesn't allow for arrays of variable lengths. It's a real bother to have to declare a different search routine for every table. Standard Pascal also doesn't allow for initializing arrays, so you tend to see code like

   Table[1] := 'IF';
   Table[2] := 'ELSE';
   ⋮
   Table[n] := 'END';

which can get pretty old if there are many keywords.

Fortunately, Turbo Pascal 4.0 has extensions that eliminate both of these problems. Constant arrays can be declared using TP's “typed constant” facility, and the variable dimensions can be handled with its C-like extensions for pointers.

First, modify your declarations like this:

{ Type Declarations  }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

Note

The dimension used in SymTab is not real … no storage is allocated by the declaration itself, and the number need only be “big enough.

Now, just beneath those declarations, add the following:

{ Definition of Keywords and Token Types }
const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

Next, insert the following new function:

{ Table Lookup }
{ If the input string matches a table entry, return the entry
  index.  If not, return a zero.  }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;

To test it, you can temporarily change the main program as follows:

{ Main Program }
begin
   ReadLn(Token);
   WriteLn(Lookup(Addr(KWList), Token, 4));
end.

Notice how Lookup is called: The Addr function sets up a pointer to KWList, which gets passed to Lookup.

OK, give this a try. Since we're bypassing Scan here, you'll have to type the keywords in upper case to get any matches.

Now that we can recognize keywords, the next thing is to arrange to return codes for them.

So what kind of code should we return? There are really only two reasonable choices. This seems like an ideal application for the Pascal enumerated type. For example, you can define something like

     SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number,
                    Operator);

and arrange to return a variable of this type. Let's give it a try. Insert the line above into your type definitions.

Now, add the two variable declarations:

    Token: Symtype;          { Current Token  }
    Value: String[16];       { String Token of Look }

Modify the scanner to read:

{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then begin
      Value := GetName;
      k := Lookup(Addr(KWlist), Value, 4);
      if k = 0 then
         Token := Ident
      else
         Token := SymType(k - 1);
      end
   else if IsDigit(Look) then begin
      Value := GetNum;
      Token := Number;
      end
   else if IsOp(Look) then begin
      Value := GetOp;
      Token := Operator;
      end
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;

Note

Notice that Scan is now a procedure, not a function.

Finally, modify the main program to read:

{ Main Program }
begin
   Init;
   repeat
      Scan;
      case Token of
        Ident: write('Ident ');
        Number: Write('Number ');
        Operator: Write('Operator ');
        IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
      end;
      Writeln(Value);
   until Token = EndSym;
end.

What we've done here is to replace the string Token used earlier with an enumerated type. Scan returns the type in variable Token, and returns the string itself in the new variable Value.

OK, compile this and give it a whirl. If everything goes right, you should see that we are now recognizing keywords.

What we have now is working right, and it was easy to generate from what we had earlier. However, it still seems a little “busy” to me. We can simplify things a bit by letting GetName, GetNum, GetOp, and Scan be procedures working with the global variables Token and Value, thereby eliminating the local copies. It also seems a little cleaner to move the table lookup into GetName. The new form for the four procedures is, then:

{ Get an Identifier }
procedure GetName;
var k: integer;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
     Value := Value + UpCase(Look);
     GetChar;
   end;
   k := Lookup(Addr(KWlist), Value, 4);
   if k = 0 then
      Token := Ident
   else
      Token := SymType(k-1);
end;

{ Get a Number }
procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := Number;
end;

{ Get an Operator }
procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := Operator;
end;

{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;