7.11. Distributed vs Centralized Scanners

The structure for the lexical scanner that I've just shown you is very conventional, and about 99% of all compilers use something very close to it. This is not, however, the only possible structure, or even always the best one.

The problem with the conventional approach is that the scanner has no knowledge of context. For example, it can't distinguish between the assignment operator = and the relational operator = (perhaps that's why both C and Pascal use different strings for the two). All the scanner can do is to pass the operator along to the parser, which can hopefully tell from the context which operator is meant. Similarly, a keyword like IF has no place in the middle of a math expression, but if one happens to appear there, the scanner will see no problem with it, and will return it to the parser, properly encoded as an IF.

With this kind of approach, we are not really using all the information at our disposal. In the middle of an expression, for example, the parser “knows” that there is no need to look for keywords, but it has no way of telling the scanner that. So the scanner continues to do so. This, of course, slows down the compilation.

In real-world compilers, the designers often arrange for more information to be passed between parser and scanner, just to avoid this kind of problem. But that can get awkward, and certainly destroys a lot of the modularity of the structure.

The alternative is to seek some way to use the contextual information that comes from knowing where we are in the parser. This leads us back to the notion of a distributed scanner, in which various portions of the scanner are called depending upon the context.

In KISS, as in most languages, keywords only appear at the beginning of a statement. In places like expressions, they are not allowed. Also, with one minor exception (the multi-character relops) that is easily handled, all operators are single characters, which means that we don't need GetOp at all.

So it turns out that even with multi-character tokens, we can still always tell from the current lookahead character exactly what kind of token is coming, except at the very beginning of a statement.

Even at that point, the only kind of token we can accept is an identifier. We need only to determine if that identifier is a keyword or the target of an assignment statement.

We end up, then, still needing only GetName and GetNum, which are used very much as we've used them in earlier chapters.

It may seem at first to you that this is a step backwards, and a rather primitive approach. In fact, it is an improvement over the classical scanner, since we're using the scanning routines only where they're really needed. In places where keywords are not allowed, we don't slow things down by looking for them.