7.6. Newlines

Moving right along, let's modify our scanner to handle more than one line. As I mentioned last time, the most straightforward way to do this is to simply treat the newline characters, carriage return and line feed, as white space. This is, in fact, the way the C standard library routine, isspace, works. We didn't actually try this before. I'd like to do it now, so you can get a feel for the results.

To do this, simply modify the single executable line of IsWhite to read:

   IsWhite := c in [' ', TAB, CR, LF];
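
For reference, here is roughly what the whole function looks like after the change. I'm assuming the cradle declares the control characters as shown below; if your constants are declared differently, keep your own.

   const TAB = ^I;
         CR  = ^M;
         LF  = ^J;

   { Recognize White Space, Now Including the Newline Characters }
   function IsWhite(c: char): boolean;
   begin
      IsWhite := c in [' ', TAB, CR, LF];
   end;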

We need to give the main program a new stop condition, since it will never see a CR. Let's just use:

   until Token = '.';

OK, compile this program and run it. Try a couple of lines, terminated by the period. I used:

now is the time
for all good men.

Hey, what happened? When I tried it, I didn't get the last token, the period. The program didn't halt. What's more, when I pressed the Enter key a few times, I still didn't get the period.

If you're still stuck in your program, you'll find that typing a period on a new line will terminate it.

What's going on here? The answer is that we're hanging up in SkipWhite. A quick look at that routine will show that as long as we're typing null lines, we're going to just continue to loop. After SkipWhite encounters an LF, it tries to execute a GetChar. But since the input buffer is now empty, GetChar's read statement insists on having another line. Procedure Scan gets the terminating period, all right, but it calls SkipWhite to clean up, and SkipWhite won't return until it gets a non-null line.
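
To see the trap, here is SkipWhite as I'm assuming it stands in your cradle, with comments marking the spot where it hangs:

   { Skip Over Leading White Space }
   procedure SkipWhite;
   begin
      while IsWhite(Look) do   { CR and LF now count as white space }
         GetChar;              { once the LF is eaten, this read waits for a new line }
   end;

Nothing in the loop knows that the period has already been delivered; it only stops when Look holds a character that is not white space.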

This kind of behavior is not quite as bad as it seems. In a real compiler, we'd be reading from an input file instead of the console, and as long as we have some procedure for dealing with end-of-files, everything will come out OK. But for reading data from the console, the behavior is just too bizarre. The fact of the matter is that the C/UNIX convention is just not compatible with the structure of our parser, which calls for a lookahead character. The code that the Bell wizards have implemented doesn't use our lookahead convention, which is why they need “ungetc”.

OK, let's fix the problem. To do that, we need to go back to the old definition of IsWhite (delete the CR and LF characters) and make use of the procedure Fin that I introduced last time. If it's not in your current version of the cradle, put it there now.
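
In case you need it, Fin looks something like the following. Take this as a minimal sketch that assumes the usual cradle names (Look, GetChar, CR, LF), not as the definitive version. Some versions also call SkipWhite at the end, which helps when the next line starts with blanks. If your copy differs, trust your copy.

   { Skip a CRLF }
   procedure Fin;
   begin
      if Look = CR then GetChar;   { eat the carriage return, if present }
      if Look = LF then GetChar;   { eat the line feed that follows it }
   end;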

Also, modify the main program to read:

{ Main Program }
begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
      if Token = CR then Fin;
   until Token = '.';
end.

Note the “guard” test preceding the call to Fin. That's what makes the whole thing work, and ensures that we don't try to read a line ahead.

Try the code now. I think you'll like it better.

If you refer to the code we did in the last installment, you'll find that I quietly sprinkled calls to Fin throughout the code, wherever a line break was appropriate. This is one of those areas that really affects the look & feel that I mentioned. At this point I would urge you to experiment with different arrangements and see how you like them. If you want your language to be truly free-field, then newlines should be transparent. In this case, the best approach is to put the following lines at the beginning of Scan:

   while Look = CR do
      Fin;
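
In context, a free-field Scan might look like the sketch below. I'm assuming Scan currently has the simple shape shown here, built on GetName and GetNum; fit the two new lines into whatever your own version looks like.

   { Lexical Scanner, Newlines Transparent }
   function Scan: string;
   begin
      while Look = CR do         { skip any number of line breaks first }
         Fin;
      if IsAlpha(Look) then
         Scan := GetName
      else if IsDigit(Look) then
         Scan := GetNum
      else begin
         Scan := Look;           { anything else is a one-character token }
         GetChar;
      end;
      SkipWhite;
   end;

With this arrangement Scan never returns a CR, so the guard test in the main program simply never fires.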

If, on the other hand, you want a line-oriented language like Assembler, BASIC, or FORTRAN (or even Ada … note that it has comments terminated by newlines), then you'll need Scan to return CRs as tokens. It must also eat the trailing LF. The best way to do that is to use this line, again at the beginning of Scan:

   if Look = LF then Fin;
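
Here is a sketch of how that might sit in a line-oriented Scan, again assuming the simple scanner shape built on GetName and GetNum. The CR falls through to the last branch and comes back as a token. The LF it leaves behind is eaten at the start of the next call.

   { Lexical Scanner, Newlines Returned as Tokens }
   function Scan: string;
   begin
      if Look = LF then Fin;     { eat the LF left over from the last CR token }
      if IsAlpha(Look) then
         Scan := GetName
      else if IsDigit(Look) then
         Scan := GetNum
      else begin
         Scan := Look;           { a CR is returned here like any other character }
         GetChar;
      end;
      SkipWhite;
   end;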

For other conventions, you'll have to use other arrangements. In my example of the last session, I allowed newlines only at specific places, so I was somewhere in the middle ground. In the rest of these sessions, I'll be picking ways to handle newlines that I happen to like, but I want you to know how to choose other ways for yourselves.