The task of the mandatory assignment is to write a compiler, perhaps not a full-blown compiler for a complex language, not even half-blown, but still covering some phases. One phase is the parser, resp. the lexer and the parser, both working hand in hand.

The task is not supposed to be overwhelming. It’s a focused and rather well-specified task using mature tools and techniques. It has been done many times before. Though done by others, for instance earlier participants of this or similar courses.

For someone who does it the first time, for instance for the mandatory assignment, it involves to get an overview over a couple of moving parts that all need to fit together and which are all potential sources of errors. Hopefully, the underlying concepts like regular expressions, BNF, bottom-up parsing, etc. are halfway grasped from the lecture. That helps, but only up-to a point. Now one has to fit it all together into a running program. One has to deal with CUP, with JLEX, or similar tools, one has to deal with ant, if one uses that, etc. all coming with pages and pages of documentation. So, where to start?

Concretely, I describe first steps in the Readme at the repos. They should be followed for quick check, if at least the provided initial checks work out of the box, i.e., Java works, jlex and cup work, ant works, in the provided configuration. And that should be done right away.

But that’s of course really only one initial step. Then the real work starts. In earlier years, sometimes groups or individuals got stuck or overwhelmed. They had invested effort to rig up a parser, defined the grammar, designed some AST, etc., coded it up, but somehow “it did not work”. And then it took quite some time and help to debug it… It happened not often, maybe 3 times I saw that “approach”, but it may be that more people had similar problems without asking for help, perhaps silently throwing the towel and not completing the obligs, or fighting it through with more effort and headache than the oblig is supposed to cause.

I gave, in those cases some advice, which I repeat here, mostly in the “let it grow” (and keep on testing) paragraphs below.

Mainly it’s how I would do it resp. how I do it, when writing a parser, or other things, especially if it’s the first time with some new technology or language I have not much experience with. I have written compilers and parsers a couple of times (also for the compila language and others), mostly in the context of teaching or research, though not for an industrial grade compiler project from scratch. The two mentioned paragraphs are in particular for the parser and the lexer (i.e., oblig 1), though similar strategies I use also in other contexts. Especially the remaining points are more general than for a parser.

The strategy I describe is meant to avoid the scenario from above: Namely to design it and code everything up, only to detect in the end (which is typically shortly before the oblig deadline…) that “it does not work”. There seems many “bugs” of this and that kind, the exact causes are a bit unclear, it’s unclear when they were introduced, some syntactically correct compila programs throw an error, some erroneous programs parse, so there seem also troubles with the grammar (which still has some shift-reduced conflicts that need to be analyzed and removed). And the build-file is not yet properly adapted, so one has to fiddle with it and try to get it running semi-manually, and errors are hard to reproduce. It’s a big mess.

It means the project has grown out of hand and the code has become overwhelming, one has lost overview and no longer knows where to start repairing; there are too many flames to put out and time is running short. It can become rather frustrating. And maybe it was all done with good intentions: “Let’s make first a good battle plan, design the grammar precisely according to what one has learned in the lecture, then carefully design some AST data structure, according to the recipe from the lecture, or according to one’s own ideas, fix all the reserved words as they are specified in the language specification, etc.”. When everything is thought out carefully and seemingly complete, one integrates the parts and and checks if it works, perhaps in our case, trying to parse the provided compila test program.

Only to find out that there seems to be a number of problems, maybe starting from the fact, that the integration fails, it just does not “fit together”, and that’s only the beginning of the bug hunt.

The sketched strategy is not bad per se, it can be seen as textbook top-down development. However, I would not approach the problem like that, in particular, if it’s the first time and if unsure about how lex and yacc actually work, and unsure about other technological aspects.

I am not saying top-down development is bad, it has its place, and most development processes are probably a mixture of top-down and bottom-up anyway. Actually, the overall parser-task (or both obligs) is developed top-down: there is a pretty detailed specification up-front, the language specification, there is some test suite provided up-front etc., before anyone starts to write any line of code. It’s only that the specification is not written by the course participants and coders. And that the specification and implementation phases are done by different people is not unheard of either in software development.

Bottom-up, integrate early, let it grow and stay healthy

The way I would approach it is to integrate early, actually right from the start. I.e., doing a compilable and runnable main program of some sort. The integration should involve a yacc or CUP or whatever specification, a lex specification, and a main-program that joins them to a program, that can read in a file and parse it. Also part of that integration must be the build-file. If you are using ant, you can use the provided build.xml as starting point; or do Makefile or maven or whatever. The task at hand, programming a parser, is rather easy, so any of the build facilities provide more functionality that you need for that, and you should be able to configure one of those without (much) reading their documentation and becoming a build-tool virtuoso. Keep it simple and straight.

Don’t use Eclipse or any fancy IDE for integration. I am not saying you should not use Eclipse or any of those when programming. But don’t rely on their behind-the-scenes wizardry. Perhaps they manage stuff because you configured your tool and set-up your project this and that way, and you don’t even know what is done. Besides that, the end-product of the project, the compiler, has to work independently from any development environment. One cannot expect that someone downloads a piece of program and it’s run within one specific IDE or editor. Users use programs, they do not edit or develop them. Ok, we will expect that they build them, but not more. We are not targeting consumer-level users, but users interested in checking out a compiler …

Since the end product needs to come with a build-mechanisms, in the spirit of early integration, also the build process as one of the several moving parts of the project should be in place right from the start!

That’s important, because when developing the parser, it’s a good idea to check and re-check and re-re-re-check the growing parser. And that testing must go painlessly, fast, and easy. If testing is hard and cumbersome, it’s not done. Building a compiler requires at least a few steps, like invoking yacc, invoking lex, and then compiling the whole code (and perhaps running the compiler).

There is not overly many steps, so the build process is fairly easy. Still you want to automatize it.

What to leave out first and what to ``grow’’?

Now, it seems everything is already in place from almost the start, what is left out? I would start with an almost trivial grammar and perhaps even an almost trivial lexer specification as well.

That means, one does not start with trying to nail down the compila grammar as described in the language spec. Instead, one makes a parser for a different language, maybe for a super trivial start one that corresponds to a grammar like

program ::= <begin> <end>

With the grammar like that, there is only one syntactically correct program, which looks this

  begin
  end

One should test out whether one can have an integrated parser being built without problems.

And one should test if the parser parses correctly according to the current state of the grammar. So one has to (temporarily) provide one’s own test program(s). Also testing should be integrated into the build-process.

Instead of starting with trivial program as first shot and an integrated ``parser’’, one can also work bottom-up in the grammar, starting from numbers first, then doing also expressions, then doing some statements, then adding more etc.

At any rate, if one grows the parser that way, it’s in my eyes important never to break again the integration, the fact that it builds, and that it does what (at that stage) the compiler is supposed to do. It’s not just early integration, is a bit like continuous integration.

At any rate, progressing like that make error localization easier: when it works properly, but after adding, say, loops to the syntax and the keywords suddenly something breaks (maybe the grammar suddenly contains conflicts), it’s more easy to know where to put the blame.

To stay continuously healthy when growing the compiler may also be less frustrating than just coding for a while until discovering that one cannot put it together into something that works. It gives a feeling of steady progress to see that larger and larger parts of the grammar are covered. And actually, each new addition gets easier and more routine anyway, so at some point one simply adds all the rest, since one has gotten a feeling how it all hangs together.

Parse only first, without AST

When letting the grammar (and the lex-specification) grow like that, one may at the same time grow the AST and perhaps the ``pretty-printer’’ alongside. That’s fine.

One can, in my opinion, also do it differently: for a start, ignore the AST (and the pretty printer) when growing the parser. That means, one leaves out the ``action part’’ of the grammar completely. Or rather one makes it to ``stub’’ returning not a a tree, but perhaps an integer 0, or nothing.

The disadvantage is that, when running the compiler on test files, one has not much output, except perhaps warnings from yacc, or when the parse fails, the parser prints some error diagnosis. Instead of doing no action at all, one might add print-statements, something at least is printed.

Once the parser works and seems happy with the test program complexaddition.cmp, one replaces the print-statements with creating ast tree nodes. Again, one may find it useful to follow a gradually growing approach, replacing not all print statements at once. Note that replacing the print-statements with AST generation requires fiddling with the cup or yacc-file: the types of the corresponding tokens have to be changed from void for print to whatever type has been chose for the AST node in a given grammar clause. For example, in the first stage (without AST), a non-terminal say program intended to represent the whole program could be declared as

non terminal            program;

Once the parsing seems to work, one can start thinking about the AST. A plausible name for the class representing the root node of the AST, i.e., the whole program, may be Program. When adding the creation of such a node in the action part of the non-terminal program, the return type is no longer “nothing”, so one needs (re-)declare the non-terminal as

non terminal Program            program;

in the cup-or yacc-file.

What one should not build gradually, I think, is to first do the lexer and then the parser. In a standard lex/yacc set-up, both scanner and parser work so close together that testing the lexer without the parser is not worth the effort (and the lexer is mostly not the source of much trouble anyway).

Further general advice

Testing

The “stay healthy and integrated while growing the project” makes sense only if one continuously tests (and builds and compiles). When new productions and clauses are added, they need to be tested, resp. the test-file(s) need to be changed, since such grammar changes change the (current state) of the language.

Versioning

Well, the handin procedure of the oblig is via git, so it’s natural to use that right from the beginning as well. It’s part of the initial early integration because it’s part of the end-product, the delivery of the compiler. Not that it’s hard, and I guess most people have worked with git or similar tools, but at any rate, you don’t want to figure out how it works at deadline time.

Versioning has other obvious advantages (I won’t preach them here). If you happen to work in a group, and a few do, it’s more or less necessary for sharing the code base anyway. The project is small enough that one can focus on simple use of git, without all bells and whistles and fancy stuff.

Whether one commits often or not is a matter of taste, but in the spirit of staying healthy, one should not commit broken versions (which are those that don’t build) after having reached a first buildable integration, at least not in the main branch. Especially not if one works in a team.

Remembering and learning from mistakes

The project is not very big, and the time is not too long. Still, sometimes some trouble occurs, and one solves it one way or the other. For instance when building the grammar: perhaps one runs into some shift-reduce or reduce-reduce conflict. Sometimes one fiddles around and it goes away, without exactly knowing why, but at least the bug is gone. Better, of course, one realizes what made the conflict go away.

And it might be good to note the problem and the solution down in a way that allows to find it again. It might not be the last time one has to deal with such a conflict. And even if the time-span of the project is not very long, one forgets faster than one would wish…

Bug tracking

This remarks is similar to the previous one: keep an overview over things. I recommended “staying healthy”, that in a way means, don’t let bugs linger around, but deal with them. Especially those that are show-stoppers, breaking the build process. But still, there may be rough edges, or things that are not solved 100%, without stopping the show. They should be dealt with, but one can deal with only one thing at a time. Some some things are more important than others, so one does one thing first and postpone the other. That’s when one should write down what needs to be done later (the more concrete and where exactly, the better, especially if one knows how to deal with it roughly already. As said, one forgets faster than one thinks, and then later, unfortunately, one has to look-into it what the hell is wrong with some thing one stumbled upon earlier. And where was the troubling issue again?

I don’t say one should use bug tracker software, the project is manageable without, but if one postpones (or ignores) relevant stuff, one should take note of that.

What if working in a group?

In a group of, say, two, one may work shoulder-at-shoulder, sitting in front of the same screen. Some recommend that 4-eyes programming style of direct interaction.

But there are other ways of collaborating. As far as the parser task is concerned, splitting up in a way that one person does the parser and one does the lexer does not make sense. The only halfway meaningful split is: parser/lexer for one, and AST (and pretty printer) for another (perhaps also splitting AST and pretty printer, but the AST must be there before the pretty printer, so it not ideal).

When saying splitting, I don’t mean complete independence, one has to coordinate and exchange information. Especially at the beginning, one needs to have the early integrated version running for all members of the group. If one prefers independent development and sharing the load, the AST + pretty printer can somehow be developed independent from the parser. Earlier I have described one can gradually develop the parser without generating and AST but having some dummy actions first.

That way, the AST and the parser development can occur at the same time, and then one has afterwards to integrate them (hoping that this does not lead to pains).

Anyway, I am not sure if I wholeheartedly recommend that, it’s also that, when programming the oblig, it’s instructive to be hands-on and familiar with the whole parser, not just one half, simply for learning how the parser part works and how the AST works. But I don’t prescribe how one internally organizes group work. The only form of “collaboration” I don’t want is, one does all the work, and both put their names under the end product… Everyone has to contribute to some meaningful extent.

If the project would be larger, for instance a semester developing a more realistic compiler, and different groups working on separate phases (one does the type checker, one does the parser, one does the data-flow etc), then the approach, where everyone has one’s finger in the details of all parts is unrealistic, then it has to be divide-and-conquer (and a lot of coordination and interfacing). But a parser for a language of the given size is small enough to be conquered without being divided.

Final remarks

I have been involved (at different universities), with quite many software projects of that kind, like “lab work” or programming projects. Sometimes compiler-related projects, sometimes other things, and often collaborative projects. Collaborative in that not everyone programs the same stuff, or maximally groups with 2 people work jointly on a common piece of code, but a number of groups or individuals collaborating on a joint larger projects, tackling different parts. That requires more “management”, planning, organization, monitoring, coordination, meetings, interfaces, and a development strategy and plan.

My remarks here are mostly based on experience with earlier rounds of the compiler course, or those other programming course work, but also on other projects as well.

Not particularly related, 3 or 4 years ago, I read the book the pragmatic programmer (or listened to the audio book version), and I quite liked it. I don’t remember exactly what advice the book had to offer, but I remember that I agreed on many things, and some of the advice from that book reminded me to things like the ones written up here for the mandatory assignments in the compiler course (or other courses). Undoubtedly the book formulates such things better and deeper and more systematically (and perhaps with more experience), and of course contains much more information.

This post here is more narrow, writing up some of my own advice I sometimes communicated, illustrating a strategy in particular for tackling the parser and lexer systematically, if one is not yet sure-footed with all practicalities that come with the task. There’s nothing wrong with not programming the parser in one blow, and not “growing” it, if one is comfortable with the task an the technology; the task itself is not so big that it requires such a cautious small-step approach. But still it may be helpful, if one is new to the task.

How to write a parser for the first time, e.g. for the oblig

Martin Steffen

Bottom-up, integrate early, let it grow and stay healthy

What to leave out first and what to ``grow’’?

Parse only first, without AST

Further general advice

Testing

Versioning

Remembering and learning from mistakes

Bug tracking

What if working in a group?

Final remarks

You may also enjoy

Worklist algorithms

Regular expressions, context-free grammars, and recursion

Did Fortran kill a Mars Mission?

Full employment theorem for compiler writers