Regular expressions and context-free grammars

(Context-free) grammars specify the syntax of a programming language and also play a central role for parsing: the parser needs to implement the functionality to accept or reject a given input (a token stream coming from the lexer). When the input is syntactically acceptable, the parser also typically builds up an abstract syntax tree to be handed over to subsequent compilation phases.

In their full generality, context-free grammars are too expressive to be used for parsing: general parsing algorithms exist, but they are too inefficient for practical compilers. Thus in practice one works with restricted forms, for instance context-free grammars that can be processed by certain bottom-up parsers (typically with limited look-ahead, though unlimited look-ahead is a theoretical possibility) or by top-down parsers (again, usually with limited look-ahead).

We discussed lexing and parsing as two important early stages of a compiler. The workload is clearly separated: the lexer deals with aspects covered by regular languages and the parser with context-free languages (resp. restrictions thereof, motivated by balancing expressiveness against being pragmatically useful and reasonably efficient). The two classes of languages correspond to two particular notations, namely regular expressions and context-free grammars (perhaps in (E)BNF notation).

Those are declarative notations to specify the corresponding languages, i.e., they do not immediately correspond to a procedural execution mechanism that implements (the core of) a lexer resp. a parser. As is well known, there are machine or automata models that correspond to regular and context-free languages. Those are finite-state automata and push-down automata. The latter are automata equipped with a stack, some unbounded extra memory that captures the additional expressiveness needed for context-free languages and parsing.

Let’s ignore fine points here, for instance the fact that, as said, practical parsing never uses the full expressiveness of context-free languages, and hence has no need for the full power of push-down automata. In the lecture we don’t even formally define push-down automata in their generality. Specifically for bottom-up parsing, though, we will encounter a construction resulting in a mechanism that uses a stack and which can consequently be understood as one particular restricted form of push-down automaton. But we will just look at the construction itself and leave it at that.

Regular languages as a restriction of context-free languages

That’s easy to see. Regular languages are those describable by regular expressions, and context-free languages are those describable by context-free grammars. So to see that regular languages are a restriction of context-free ones, one needs to argue that each regular expression can also be written as a context-free grammar. To additionally see that it’s a real restriction, one would have to find at least one context-free language that cannot be captured by a regular expression (a standard example is the language a^n b^n of n a’s followed by equally many b’s). We focus in this post on the first point and come back to the question

how to represent a regular expression as a context-free grammar

later. And it will be really easy.

Another argument that regular languages are a subset of context-free languages could go like this: finite-state automata are a mechanism that captures exactly the regular languages, and push-down automata are a mechanism that captures exactly the context-free languages. A push-down automaton is nothing much more than a finite-state automaton with an additional stack. No one forces the automaton to make actual use of the additional memory, and ignoring the stack (never pushing anything or consulting the stack for decisions about which step to take) turns it into a standard finite-state automaton. So it’s obvious that regular languages can be captured by push-down automata and thereby are a subset of context-free languages. To see that it’s a real restriction, similarly to above, one would need to find one example of a language where using the stack is needed, i.e., where the language cannot be captured by a finite-state automaton (without stack).
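To make the point concrete, here is a minimal sketch in Python (the states and the input classification are made up for illustration, not taken from the lecture) of a finite-state automaton accepting signed naturals like "+42" or "7". A push-down automaton would carry an additional stack; an automaton that never pushes or pops anything is exactly this kind of plain finite-state machine.

# Transition table of a small deterministic finite-state automaton.
TRANSITIONS = {
    ("start", "sign"):   "signed",
    ("start", "digit"):  "digits",
    ("signed", "digit"): "digits",
    ("digits", "digit"): "digits",
}
ACCEPTING = {"digits"}

def classify(ch):
    return "sign" if ch in "+-" else "digit" if ch.isdigit() else "other"

def accepts(word):
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, classify(ch)))  # no stack involved
        if state is None:
            return False
    return state in ACCEPTING

assert accepts("+42") and accepts("7") and not accepts("4+2")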

Why do automata for context-free languages need a stack?

The lecture does not look into the construction of how to turn an arbitrary CFG into a push-down automaton (we don’t even define exactly what a push-down automaton is and how it works). But still: why, intuitively, a stack?

Context-free grammars work with non-terminal symbols and terminal symbols. The non-terminal symbols can be seen as variables in that it does not matter what we call them. Of course, there are smarter choices and less smart ones: a non-terminal intended to represent expressions is better called EXP or expr or similar, rather than X1, to help human readability, but otherwise the names do not matter. The terminals correspond to tokens.

Languages (described by regular expressions, context-free grammars, or whatever) are interesting only when they are infinite. If one has a “language” consisting of a finite number of “words”, one can simply make a list of all the words or elements (and probably one would not bother to call that a language, though in the terminology of language theory it would be one).

To describe a concept consisting of an infinite number of elements, one needs some scheme or notation that represents repetition: adding an element, and then another, and another, etc. That means some form of recursion, or iteration, or loops.

Indeed, all interesting context-free grammars are recursive. If not, they would describe a finite language, which would be uninteresting, and it would be an extravagant waste to use a context-free grammar for that purpose. The following (small) grammar illustrates that; it’s one of the many variations of capturing expressions we encounter in the lecture:

expr   ::=  expr "+" term     | expr "-" term | term
term   ::=  term "*" factor   | factor
factor ::=  "(" expr ")"      | number 
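To see the recursion at work, here is a derivation sketch in this grammar: repeatedly using the first, recursive alternative of expr makes the derived phrase grow without bound, which is exactly what lets the finite grammar describe an infinite language.

expr  ⇒  expr "+" term  ⇒  expr "+" term "+" term  ⇒  term "+" term "+" term  ⇒  ...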

The syntax of regular expressions in its basic form does not support variables, which in the context of context-free grammars are called non-terminals. Of course, “variables” are a super-convenient mechanism. They allow (at least in declarative notations or mechanisms) giving a name to something. That’s useful in more than one respect. One is that (properly) naming things enhances readability for humans. Another is that it’s a form of abstraction: a name is a short form or abbreviation for something more complex. If one remembers and understands what a name stands for, one can use the name instead of the complex expression or “thing” it represents. Or maybe those are not two different reasons for being useful, more like two aspects of the same thing.

Be that as it may: since it’s useful, extended regular expressions support “naming” (for instance in tools like lex and friends). In the lecture, we had an example using quite a number of variables or names, in connection with giving names to various formats involving “numbers”:

digit     ::= ["0"-"9"]
nat       ::= digit+
signednat ::= ("+"|"-")nat | nat
frac      ::= signednat ("." nat)?
float     ::= frac ("E" signednat)?

Notationally, we use the same symbol ::= that we used for the rules or productions of the context-free grammar from above.

Comparing the two definitions, there are similarities and a crucial difference. Following standard terminology for grammars, let’s call the names or variables non-terminals also for the extended regular expressions. In both examples, each left-hand side of the equation system consists of one non-terminal only. The right-hand sides contain words built from both non-terminals and terminals (and, in the case of extended regular expressions, some other special-purpose meta-level notation, like “?”, “+” etc., but that’s not relevant for the discussion here). So much for the similarities.

What’s different is that the context-free grammar clearly uses recursion (as is expected for context-free grammars): the recursion is direct, <expr> is defined in terms of itself, and indirect: <expr> is defined in terms of <term>, which is defined in terms of <factor>, which in turn is defined in terms of <expr> again. For the numbers example, that’s not the case. Non-terminals are defined in terms of other non-terminals, but not recursively: digit is used in the definition of <nat>, which is used in the definition of signednat, etc. It’s a cascade of definitions, but there is no cycle (= recursion) in the equation system.

That is also the reason why using definitions in such a (non-recursive) way does not enhance the expressive power of the regular-expression formalism. One can simply reduce some concept, say the one called signednat, by replacing the non-terminals on its right-hand side (the two mentions of nat in this case) by their respective definitions, and then replace the non-terminal of that definition (digit) again, until all non-terminals are gone. That replacement or substitution process terminates, since each definition only uses non-terminals defined strictly earlier (as no recursion is allowed).
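That substitution argument can be played through mechanically. The following sketch (using Python’s regular-expression library; the folding of the two signednat alternatives into an optional sign is my own simplification) inlines each definition into the next, until float is one single, variable-free regular expression:

import re

digit     = r"[0-9]"
nat       = rf"{digit}+"                # digit+
signednat = rf"(\+|-)?{nat}"            # ("+"|"-")nat | nat, as an optional sign
frac      = rf"{signednat}(\.{nat})?"
float_re  = rf"{frac}(E{signednat})?"

print(float_re)   # the fully expanded regular expression, no names left
assert re.fullmatch(float_re, "-3.14E+10")
assert not re.fullmatch(float_re, "3.")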

Recursion needs stacks

Functions or procedures calling each other (even without recursion) build up a stack (resp. unwind it when returning from a call). That reflects the LIFO discipline of calls and returns during execution. That call stack is typically “implicit” and managed internally when a program in the given programming language runs. The part of the compiler responsible for (the design of) that stack memory and other memory-management arrangements is called the run-time environment (there will be a chapter about that, too). It’s also known that one can turn a recursive program into one without recursion; perhaps one works with a language so ancient or restricted that it does not support recursion or not even procedure calls. In that case one has to implement a stack data structure oneself, and push and pop arguments oneself (instead of leaving that part to the run-time environment).
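As a tiny illustration of that do-it-yourself stack (a sketch, not lecture material): the recursive function below is rewritten so that the pending work is pushed onto an explicit stack and popped off afterwards, mimicking what the run-time environment otherwise does with the call stack.

def factorial_recursive(n):
    return 1 if n == 0 else n * factorial_recursive(n - 1)   # uses the call stack

def factorial_explicit_stack(n):
    stack = []
    while n > 0:          # the "calls": push pending multiplicands
        stack.append(n)
        n -= 1
    result = 1
    while stack:          # the "returns": pop and combine, LIFO
        result *= stack.pop()
    return result

assert factorial_recursive(5) == factorial_explicit_stack(5) == 120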

That brings us to one of the original questions: what’s the connection between push-down automata (automata with a stack) and parser machines for context-free grammars? Grammars are recursive definitions, respectively describe recursive data structures. A particular grammar describes a tree data structure reflecting the grammar, whose instances are aptly called syntax trees. Working with recursive data structures such as trees involves recursion. That’s most visible in a top-down parsing method called recursive descent, which will be discussed later in the parts about parsing. It involves a realization of the parser where each non-terminal is represented by a procedure responsible for parsing that particular non-terminal. For instance, in our example, there would be a procedure f-expr to parse expressions; that procedure would recursively call f-expr and f-term, because expressions are defined using <expr> and <term> (and f-term would be the function to parse <term>). Since the grammar is recursive, that leads to a parser implementation consisting of a number of mutually recursive parser functions, calling each other when presented with an input to parse. Thus, invoking the parser function leads to a number of recursive calls (determined by the input), and this works with a (call-)stack.
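To preview the idea, here is a minimal recursive-descent sketch in Python for the expr/term/factor grammar from above (an illustration, not the lecture’s parser; the function and token handling are made up). Note that the left-recursive alternatives of the grammar are rendered as loops, a standard adjustment for recursive descent; the mutual recursion between the three procedures is carried by the ordinary call stack.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def expr():                       # expr ::= term (("+" | "-") term)*
        term()
        while peek() in ("+", "-"):
            eat(peek())
            term()

    def term():                       # term ::= factor ("*" factor)*
        factor()
        while peek() == "*":
            eat("*")
            factor()

    def factor():                     # factor ::= "(" expr ")" | number
        if peek() == "(":
            eat("(")
            expr()                    # indirect recursion back to expr
            eat(")")
        elif peek() is not None and peek().isdigit():
            eat(peek())
        else:
            raise SyntaxError(f"unexpected token {peek()!r}")

    expr()
    if peek() is not None:
        raise SyntaxError("trailing input")

parse(["(", "1", "+", "2", ")", "*", "3"])   # accepts; raises SyntaxError otherwise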

No recursion, no stack

What about regular expressions? As explained, a crucial difference between context-free grammars and regular expressions, even if one allows oneself the luxury of defining variables or non-terminals, is the absence of recursion. Indeed, if one consults a book like Compiler Construction: Principles and Practice, one finds it explicitly stated that recursion is absent from regular expressions.

There’s a point to it (and so far we have basically been elaborating on that point). However, as we likewise said, in order to capture infinite collections of things (like languages) in a finite description, one needs a way to express repetition or recursion or iteration.

The standard notation for that in regular expressions is the Kleene star, of course. That’s not part of the core notation of context-free grammars. But one can easily achieve the same by recursion. A regular expression like r* can of course be captured by

A  ::=  r A  |  ε

where A is a non-terminal, ε is the empty word, and the first alternative is (directly) recursive in A. If we have recursion, there’s no need for the Kleene star. So we have also answered the question raised at the beginning of this text, namely demonstrating that any regular expression can be equivalently represented by a context-free grammar (the other operators are no challenge at all).
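For instance, a (hypothetical) regular expression like ("a"|"b")* "c" becomes the grammar

S  ::=  A "c"
A  ::=  "a" A  |  "b" A  |  ε

where A plays the role of the starred sub-expression.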

Right-recursion and iteration

Looking more carefully at the representation of the Kleene star, we see that the translation is of a particular form: it’s a recursion, but the non-terminal which is used recursively occurs only once and, what is more, only at the very end of the right-hand side. Not occurring at all would be fine as well, as then there’s no recursion.

Another name for that form of recursion is right recursion, or actually tail recursion! The above right-recursive example is a situation of direct right-recursion, but one can generalize that also to indirect right-recursion, like when A is defined by a right-hand side with B at the tail position, and B is defined by a right-hand side with A at the end, etc.

The term tail recursion is used more for programs where functions call each other recursively, but only at the very end of the function bodies. But it’s an analogous thing.

What do we know about recursion and tail-recursion? We know that recursion in a programming language involves a stack, as explained, and we know that in a tail-recursive situation, a good compiler or interpreter can execute the program without making use of a stack! That trick, avoiding the stack, is called tail-call optimization. In the context of some programming languages, it’s also conventional to call tail-recursive situations not recursive but iterative (Scheme, for instance), even if there’s no loop or other iterative construct involved.
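A sketch of that observation in code (Python itself performs no tail-call optimization, so the loop version is written by hand): the tail-recursive function does all its work before the recursive call, so no stack frame needs to survive across the call, and the same computation can be run as a plain loop.

def length_tail(xs, acc=0):
    if not xs:
        return acc
    return length_tail(xs[1:], acc + 1)   # tail call: nothing left to do after it

def length_loop(xs):
    acc = 0
    while xs:                             # the same computation, no stack needed
        xs, acc = xs[1:], acc + 1
    return acc

assert length_tail([1, 2, 3]) == length_loop([1, 2, 3]) == 3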

All that discussion comparing regular expressions, finite-state automata, stack automata, and context-free languages can be summarized as follows:

The Kleene star can be seen as an iterative or looping construct. It can be captured by (stack-less) finite-state automata. Recursive definitions, central for context-free grammars, need automata with a stack. Right-recursive grammars (more precisely, right-linear ones, where a non-terminal may occur only at the end of a right-hand side) are analogous to tail recursion and are equivalent to regular expressions (and consequently don’t need a stack).

A small final remark: what about left-recursion? Obviously, the Kleene star of regular expressions could also be captured by left- instead of right-recursion. However, there’s no corresponding recursion scheme for programs (“head recursion” instead of tail recursion), as it makes no sense in general for a function to call itself recursively immediately at the beginning of its body.
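For completeness, the left-recursive counterpart of the earlier grammar for r* would be

A  ::=  A r  |  ε

Bottom-up parsers handle such left-recursive rules without trouble, whereas recursive-descent parsers cannot, which matches the observation that there is no “head recursion” for procedures.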