9.5. The Structure of C

The C language is quite another matter, as you'll see. Texts on C rarely include a BNF definition of the language. Probably that's because the language is quite hard to write BNF for. One reason I'm showing you these structures now is so that I can impress upon you these two facts:

  1. The definition of the language drives the structure of the compiler. What works for one language may be a disaster for another. It's a very bad idea to try to force a given structure upon the compiler. Rather, you should let the BNF drive the structure, as we have done here.
  2. A language that is hard to write BNF for will probably be hard to write a compiler for, as well. C is a popular language, and it has a reputation for letting you do virtually anything that is possible to do. Despite the success of Small C, C is not an easy language to parse.

A C program has less structure than its Pascal counterpart. At the top level, everything in C is a static declaration, either of data or of a function. We can capture this thought like this:

     <program> ::= ( <global declaration> )*
     <global declaration> ::= <data declaration>  |
                              <function>

In Small C, functions can only have the default type int, which is not declared. This makes the input easy to parse: the first token is either “int,” “char,” or the name of a function. In Small C, the preprocessor commands are also processed by the compiler proper, so the syntax becomes:

     <global declaration> ::= '#' <preprocessor command>  |
                              'int' <data list>           |
                              'char' <data list>          |
                              <ident> <function body>     |

Although we're really more interested in full C here, I'll show you the code corresponding to this top-level structure for Small C.

{ Parse and Translate A Program }
procedure Prog;
begin
   while Look <> ^Z do begin
      case Look of
       '#': PreProc;
       'i': IntDecl;
       'c': CharDecl;
      else DoFunction(Int);
      end;
   end;
end;

Note that I've had to use a ^Z to indicate the end of the source. C has no keyword such as END or the '.' to otherwise indicate the end.

With full C, things aren't even this easy. The problem comes about because in full C, functions can also have types. So when the compiler sees a keyword like "int," it still doesn't know whether to expect a data declaration or a function definition. Things get more complicated since the next token may not be a name … it may start with an '*' or '(', or combinations of the two.

More specifically, the BNF for full C begins with:

     <program> ::= ( <top-level decl> )*
     <top-level decl> ::= <function def> | <data decl>
     <data decl> ::= [<class>] <type> <decl-list>
     <function def> ::= [<class>] [<type>] <function decl>

You can now see the problem: The first two parts of the declarations for data and functions can be the same. Because of the ambiguity in the grammar as written above, it's not a suitable grammar for a recursive-descent parser. Can we transform it into one that is suitable? Yes, with a little work. Suppose we write it this way:

     <top-level decl> ::= [<class>] <decl>
     <decl> ::= <type> <typed decl> | <function decl>
     <typed decl> ::= <data list> | <function decl>

We can build a parsing routine for the class and type definitions, and have them store away their findings and go on, without their ever having to “know” whether a function or a data declaration is being processed.

To begin, key in the following version of the main program:

{ Main Program }
begin
   Init;
   while Look <> ^Z do begin
      GetClass;
      GetType;
      TopDecl;
   end;
end.

For the first round, just make the three procedures stubs that do nothing but call GetChar.

Does this program work? Well, it would be hard put not to, since we're not really asking it to do anything. It's been said that a C compiler will accept virtually any input without choking. It's certainly true of this compiler, since in effect all it does is to eat input characters until it finds a ^Z.

Next, let's make GetClass do something worthwhile. Declare the global variable

     var Class: char;

and change GetClass to do the following:

{  Get a Storage Class Specifier }
Procedure GetClass;
begin
   if Look in ['a', 'x', 's'] then begin
      Class := Look;
      GetChar;
      end
   else Class := 'a';
end;

Here, I've used three single characters to represent the three storage classes “auto,” “extern,” and “static.” These are not the only three possible classes … there are also “register” and “typedef,” but this should give you the picture. Note that the default class is “auto.

We can do a similar thing for types. Enter the following procedure next:

{  Get a Type Specifier }
procedure GetType;
begin
   Typ := ' ';
   if Look = 'u' then begin
      Sign := 'u';
      Typ := 'i';
      GetChar;
      end
   else Sign := 's';
   if Look in ['i', 'l', 'c'] then begin
      Typ := Look;
      GetChar;
   end;
end;

Note that you must add two more global variables, Sign and Typ.

With these two procedures in place, the compiler will process the class and type definitions and store away their findings. We can now process the rest of the declaration.

We are by no means out of the woods yet, because there are still many complexities just in the definition of the type, before we even get to the actual data or function names. Let's pretend for the moment that we have passed all those gates, and that the next thing in the input stream is a name. If the name is followed by a left paren, we have a function declaration. If not, we have at least one data item, and possibly a list, each element of which can have an initializer.

Insert the following version of TopDecl:

{ Process a Top-Level Declaration }
procedure TopDecl;
var Name: char;
begin
   Name := Getname;
   if Look = '(' then
      DoFunc(Name)
   else
      DoData(Name);
end;

Note

Note that, since we have already read the name, we must pass it along to the appropriate routine.

Finally, add the two procedures DoFunc and DoData:

{ Process a Function Definition }
procedure DoFunc(n: char);
begin
   Match('(');
   Match(')');
   Match('{');
   Match('}');
   if Typ = ' ' then Typ := 'i';
   Writeln(Class, Sign, Typ, ' function ', n);
end;

{ Process a Data Declaration }
procedure DoData(n: char);
begin
   if Typ = ' ' then Expected('Type declaration');
   Writeln(Class, Sign, Typ, ' data ', n);
   while Look = ',' do begin
      Match(',');
      n := GetName;
      WriteLn(Class, Sign, Typ, ' data ', n);
   end;
   Match(';');
end;

Since we're still a long way from producing executable code, I decided to just have these two routines tell us what they found.

OK, give this program a try. For data declarations, it's OK to give a list separated by commas. We can't process initializers as yet. We also can't process argument lists for the functions, but the “(){}” characters should be there.

We're still a very long way from having a C compiler, but what we have is starting to process the right kinds of inputs, and is recognizing both good and bad inputs. In the process, the natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts more like a compiler. Of course we can. Should we? That's another matter. I don't know about you, but I'm beginning to get dizzy, and we've still got a long way to go to even get past the data declarations.

At this point, I think you can see how the structure of the compiler evolves from the language definition. The structures we've seen for our two examples, Pascal and C, are as different as night and day. Pascal was designed at least partly to be easy to parse, and that's reflected in the compiler. In general, in Pascal there is more structure and we have a better idea of what kinds of constructs to expect at any point. In C, on the other hand, the program is essentially a list of declarations, terminated only by the end of file.

We could pursue both of these structures much farther, but remember that our purpose here is not to build a Pascal or a C compiler, but rather to study compilers in general. For those of you who DO want to deal with Pascal or C, I hope I've given you enough of a start so that you can take it from here (although you'll soon need some of the stuff we still haven't covered yet, such as typing and procedure calls). For the rest of you, stay with me through the next installment. There, I'll be leading you through the development of a complete compiler for TINY, a subset of KISS.

See you then.