Chapter 7. Lexical Scanning

Table of Contents

7.1. Lexical Scanning
7.2. State Machines and Alternatives
7.3. Some Experiments In Scanning
7.4. White Space
7.5. State Machines
7.6. Newlines
7.7. Operators
7.8. Lists, Commas And Command Lines
7.9. Getting Fancy
7.10. Returning A Character
7.11. Distributed vs Centralized Scanners
7.12. Merging Scanner And Parser
7.13. Conclusion

Installment published 7 November 1988.

In the last chapter, I left you with a compiler that would almost work, except that we were still limited to single-character tokens. The purpose of this session is to get rid of that restriction, once and for all. This means that we must deal with the concept of the lexical scanner.

Maybe I should mention why we need a lexical scanner at all … after all, we've been able to manage all right without one, up till now, even when we provided for multi-character tokens.

The only reason, really, has to do with keywords. It's a fact of computer life that the syntax for a keyword has the same form as that for any other identifier. We can't tell until we get the complete word whether or not it is a keyword. For example, the variable IFILE and the keyword IF look just alike, until you get to the third character. In the examples to date, we were always able to make a decision based upon the first character of the token, but that's no longer possible when keywords are present. We need to know that a given string is a keyword before we begin to process it. And that's why we need a scanner.

In the last session, I also promised that we would be able to provide for normal tokens without making wholesale changes to what we have already done. I didn't lie … we can, as you will see later. But every time I set out to install these elements of the software into the parser we have already built, I had bad feelings about it. The whole thing felt entirely too much like a band-aid. I finally figured out what was causing the problem: I was installing lexical scanning software without first explaining to you what scanning is all about, and what the alternatives are. Up till now, I have studiously avoided giving you a lot of theory, and certainly a lot of alternatives. I generally don't respond well to textbooks that give you twenty-five different ways to do something, but no clue as to which way best fits your needs. I've tried to avoid that pitfall by just showing you one method that works.

But this is an important area. While the lexical scanner is hardly the most exciting part of a compiler, it often has the most profound effect on the general “look & feel” of the language, since after all it's the part closest to the user. I have a particular structure in mind for the scanner to be used with KISS. It fits the look & feel that I want for that language. But it may not work at all for the language you're cooking up, so in this one case I feel that it's important for you to know your options.

So I'm going to depart, again, from my usual format. In this session we'll be getting much deeper than usual into the basic theory of languages and grammars. I'll also be talking about areas other than compilers in which lexical scanning plays an important role. Finally, I will show you some alternatives for the structure of the lexical scanner. Then, and only then, will we get back to our parser from the last installment. Bear with me … I think you'll find it's worth the wait. In fact, since scanners have many applications outside of compilers, you may well find this to be the most useful session for you.