10.1. Getting Started

Yet a language called Tiny-anything still carries some baggage inherited from its parent language. I've often wondered if this is a good idea. Granted, a language based upon some parent language will have the advantage of familiarity, but there may also be some peculiar syntax carried over from the parent that may tend to add unnecessary complexity to the compiler. (Nowhere is this more true than in Small C.)

I've wondered just how small and simple a compiler could be made and still be useful, if it were designed from the outset to be both easy to use and to parse. Let's find out. This language will just be called “TINY,” period. It's a subset of KISS, which I also haven't fully defined, so that at least makes us consistent (!). I suppose you could call it TINY KISS. But that opens up a whole can of worms involving cuter and cuter (and perhaps more risque) names, so let's just stick with TINY.

The main limitations of TINY will be because of the things we haven't yet covered, such as data types. Like its cousins Tiny C and Tiny BASIC, TINY will have only one data type, the 16-bit integer. The first version we develop will also have no procedure calls and will use single-character variable names, although as you will see we can remove these restrictions without much effort.

The language I have in mind will share some of the good features of Pascal, C, and Ada. Taking a lesson from the comparison of the Pascal and C compilers in the previous installment, though, TINY will have a decided Pascal flavor. Wherever feasible, a language structure will be bracketed by keywords or symbols, so that the parser will know where it's going without having to guess.

One other ground rule: As we go, I'd like to keep the compiler producing real, executable code. Even though it may not do much at the beginning, it will at least do it correctly.

Finally, I'll use a couple of Pascal restrictions that make sense: All data and procedures must be declared before they are used. That makes good sense, even though for now the only data type we'll use is a word. This rule in turn means that the only reasonable place to put the executable code for the main program is at the end of the listing.

The top-level definition will be similar to Pascal:

     <program> ::= PROGRAM <top-level decl> <main> '.'

Already, we've reached a decision point. My first thought was to make the main block optional. It doesn't seem to make sense to write a "program" with no main program, but it does make sense if we're allowing for multiple modules, linked together. As a matter of fact, I intend to allow for this in KISS. But then we begin to open up a can of worms that I'd rather leave closed for now. For example, the term "PROGRAM" really becomes a misnomer. The MODULE of Modula-2 or the Unit of Turbo Pascal would be more appropriate. Second, what about scope rules? We'd need a convention for dealing with name visibility across modules. Better for now to just keep it simple and ignore the idea altogether.

There's also a decision in choosing to require the main program to be last. I toyed with the idea of making its position optional, as in C. The nature of SK*DOS, the OS I'm compiling for, make this very easy to do. But this doesn't really make much sense in view of the Pascal-like requirement that all data and procedures be declared before they're referenced. Since the main program can only call procedures that have already been declared, the only position that makes sense is at the end, a la Pascal.

Given the BNF above, let's write a parser that just recognizes the brackets:

{  Parse and Translate a Program }
procedure Prog;
begin
   Match('p');
   Header;
   Prolog;
   Match('.');
   Epilog;
end;

The procedure Header just emits the startup code required by the assembler:

{ Write Header Info }
procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
end;

The procedures Prolog and Epilog emit the code for identifying the main program, and for returning to the OS:

{ Write the Prolog }
procedure Prolog;
begin
   PostLabel('MAIN');
end;

{ Write the Epilog }
procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;

The main program just calls Prog, and then looks for a clean ending:

{ Main Program }
begin
   Init;
   Prog;
   if Look <> CR then Abort('Unexpected data after ''.''');
end.

At this point, TINY will accept only one input “program,” the null program:

     PROGRAM .   (or 'p.' in our shorthand.)

Note, though, that the compiler does generate correct code for this program. It will run, and do what you'd expect the null program to do, that is, nothing but return gracefully to the OS.

As a matter of interest, one of my favorite compiler benchmarks is to compile, link, and execute the null program in whatever language is involved. You can learn a lot about the implementation by measuring the overhead in time required to compile what should be a trivial case. It's also interesting to measure the amount of code produced. In many compilers, the code can be fairly large, because they always include the whole run- time library whether they need it or not. Early versions of Turbo Pascal produced a 12K object file for this case. VAX C generates 50K!

The smallest null programs I've seen are those produced by Modula-2 compilers, and they run about 200-800 bytes.

In the case of TINY, we have no run-time library as yet, so the object code is indeed tiny: two bytes. That's got to be a record, and it's likely to remain one since it is the minimum size required by the OS.

The next step is to process the code for the main program. I'll use the Pascal BEGIN-block:

     <main> ::= BEGIN <block> END

Here, again, we have made a decision. We could have chosen to require a “PROCEDURE MAIN” sort of declaration, similar to C. I must admit that this is not a bad idea at all … I don't particularly like the Pascal approach since I tend to have trouble locating the main program in a Pascal listing. But the alternative is a little awkward, too, since you have to deal with the error condition where the user omits the main program or misspells its name. Here I'm taking the easy way out.

Another solution to the “where is the main program” problem might be to require a name for the program, and then bracket the main by

     BEGIN <name>
     END <name>

similar to the convention of Modula 2. This adds a bit of “syntactic sugar” to the language. Things like this are easy to add or change to your liking, if the language is your own design.

To parse this definition of a main block, change procedure Prog to read:

{  Parse and Translate a Program }
procedure Prog;
begin
   Match('p');
   Header;
   Main;
   Match('.');
end;

and add the new procedure:

{ Parse and Translate a Main Program }
procedure Main;
begin
   Match('b');
   Prolog;
   Match('e');
   Epilog;
end;

Now, the only legal program is:

     PROGRAM BEGIN END . (or 'pbe.')

Aren't we making progress??? Well, as usual it gets better. You might try some deliberate errors here, like omitting the 'b' or the 'e', and see what happens. As always, the compiler should flag all illegal inputs.