In the lecture we covered the principles behind recognizing regular languages. That is central to the scanner or lexer, but it has wider applications as well. Indeed, many programming languages offer corresponding functionality in some library or other.

The principles as covered in the lecture are important, good to know, and, hopefully, clear: turn the regular expression into a (non-deterministic) automaton, determinize it, and perhaps minimize it in a last step. That yields an automaton that can be ``directly'' incorporated into an implementation of a regular-expression matcher or a lexer.
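To make the determinization step concrete, here is a minimal sketch of the subset construction, assuming an NFA represented as a dict mapping (state, symbol) pairs to sets of successor states (epsilon moves are left out for brevity; the example NFA and all names are mine, not from the lecture). The toy NFA accepts strings over {a, b} ending in "ab".

```python
# Toy NFA: state 0 loops on a and b, guesses on 'a' that the final
# "ab" has begun (going to state 1), and accepts in state 2.
nfa = {
    (0, 'a'): {0, 1},
    (0, 'b'): {0},
    (1, 'b'): {2},
}
nfa_start, nfa_accept = 0, {2}

def determinize(nfa, start, accepting, alphabet):
    """Subset construction: DFA states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa_trans = {}
    worklist, seen = [start_set], {start_set}
    while worklist:
        current = worklist.pop()
        for sym in alphabet:
            # the DFA successor is the union of all NFA successors
            target = frozenset(s for q in current
                                 for s in nfa.get((q, sym), set()))
            dfa_trans[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_accept = {s for s in seen if s & accepting}
    return start_set, dfa_trans, dfa_accept

def dfa_accepts(word, start, trans, accepting):
    state = start
    for ch in word:
        state = trans.get((state, ch), frozenset())
    return state in accepting

start, trans, accept = determinize(nfa, nfa_start, nfa_accept, 'ab')
print(dfa_accepts('aab', start, trans, accept))  # True
print(dfa_accepts('aba', start, trans, accept))  # False
```

Minimization is omitted here; for this tiny example the constructed DFA is already close to minimal.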

We briefly touched on ``practical'' aspects of implementing the resulting DFA. The coverage, however, was perfunctory at best, pointing out rather obvious things, like that states and transitions can be arranged in a table or a two-dimensional array. Who would have thought.
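Obvious or not, the table-driven scheme is worth seeing once. A minimal sketch (my own example, not from the lecture): rows are states, columns are input symbols, and the matcher is a loop doing nothing but table lookups. The DFA here recognizes binary strings with an even number of 1s.

```python
# Transition table: row = current state, column = input symbol.
#               input '0'  input '1'
table = [
    [0, 1],   # state 0: even number of 1s so far (accepting)
    [1, 0],   # state 1: odd number of 1s so far
]
accepting = {0}

def run(word):
    state = 0
    for ch in word:
        state = table[state][int(ch)]  # one array lookup per symbol
    return state in accepting

print(run('1010'))  # True: two 1s
print(run('111'))   # False: three 1s
```

This is the shape a generated lexer core typically has: constant work per input character, no backtracking.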

The post linked here digs a bit deeper into actual implementations:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, …)

The post’s author is Russ Cox, maybe best known for his involvement in the Go programming language.

The title of the article sums up its core message in one sentence. Obviously, I won't repeat the whole text and argumentation here.

In a way, the ``controversy part'' of the article is a dispute about not much (besides that, there is also technical content). In the lecture, we covered NFAs and DFAs. We remarked that NFAs are ill-suited as a basis for recognizing regular expressions: the inherent non-determinism leads to backtracking. Using recursive backtracking for regular-expression matching (as the libraries of many popular languages do) leads to implementations that can be painfully slow, at least on some pathological examples, and the article shows some. A DFA-based implementation, which requires no backtracking, has no such efficiency problems. And the article laments a bit about that.
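The pathological family from Cox's article is easy to reproduce: match the string a^n against the pattern a?^n a^n. A backtracking matcher tries exponentially many ways of distributing the a's over the optional parts before succeeding.

```python
import re

# Cox's pathological example: 'a?'*n + 'a'*n matched against 'a'*n.
# The match succeeds (all the a? parts match empty), but a recursive
# backtracking engine explores exponentially many distributions.
n = 3  # kept tiny here; for n around 25, Python's re already takes seconds
pattern = 'a?' * n + 'a' * n   # e.g. a?a?a?aaa
text = 'a' * n                 # e.g. aaa
print(bool(re.fullmatch(pattern, text)))  # True
```

An NFA/DFA-based engine (such as RE2 or grep's matcher, both discussed in the article) handles the same input in time linear in the string length.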

Not everyone agreed with the point made in the post. If there are pathological examples that derail, say, Perl's regex matcher, that is of little import, some would say, as long as ordinary regular expressions, such as those from typical applications, are handled with good or acceptable efficiency.

Why, then, would languages like the ones mentioned in that post use a technique, recursive backtracking, when more efficient ones are well known, and on top of that not harder to implement? Well, the answer is simple as well. What Python, Perl, Java, etc. support is called regular-expression matching, but it is more than that. In particular, it includes a useful feature called backreferences. Regular expressions with backreferences no longer describe regular languages; backreferences add expressiveness. The result is ``truly extended regular expressions'', unlike some of the handy notations we had in the lecture, which are mere syntactic sugar and add convenience without making the notation more expressive. Backtracking is a natural way to support backreferences, and no one has found something really better (though one can improve performance by employing techniques like memoization).
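A small illustration of what backreferences buy you (example mine): \1 must match the exact text that group 1 matched, so the pattern below finds an immediately repeated word, something no classical regular expression can express in general.

```python
import re

# \1 refers back to the text captured by group 1, so this finds
# doubled words like "is is" -- a non-regular language.
doubled = re.compile(r'\b(\w+) \1\b')
print(bool(doubled.search('it is is a typo')))  # True ("is is")
print(bool(doubled.search('it is a typo')))     # False
```

The dependence of \1 on what was matched earlier is exactly what defeats a fixed finite automaton and makes backtracking the natural implementation strategy.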

The article also discusses another feature often found in regular-expression implementations, namely submatch extraction. That feature is sometimes used as an argument for the need for recursive backtracking; R. Cox's article describes, however, that for this feature there are known (though under-appreciated) solutions and implementations that work along the lines we discussed in the lecture, avoiding backtracking.
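For concreteness, submatch extraction just means reporting where each parenthesized subexpression matched (example mine, using Python's `re`):

```python
import re

# Capture groups report the text and position of each subexpression.
m = re.fullmatch(r'(\d{4})-(\d{2})-(\d{2})', '2024-05-17')
print(m.groups())  # ('2024', '05', '17')
print(m.span(2))   # (5, 7): position of the month submatch
```

Cox's point is that this bookkeeping can be carried along during an automaton-based search, without resorting to backtracking.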

The article also contains some historical remarks about earlier implementations (with and without recursive backtracking). Finally, I should mention: in the lecture, we said one has to determinize the non-deterministic automaton for efficiency and to avoid backtracking. That determinization can also be done ``virtually'', or on-the-fly. The article by R. Cox mentions implementations which do that: the non-deterministic automaton is not determinized up front; instead, when presented with a non-deterministic alternative, the algorithm explores all alternatives at the same time (in the article called ``simulating'' the DFA).
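The core of that simulation fits in a few lines (a sketch under my own assumptions: the NFA is a dict from (state, symbol) to successor sets, here a toy NFA for strings over {a, b} ending in "ab"): instead of a single current state, keep the *set* of NFA states the input could be in, and advance the whole set on each symbol.

```python
# Toy NFA: strings over {a, b} ending in "ab"; state 2 accepts.
nfa = {
    (0, 'a'): {0, 1},
    (0, 'b'): {0},
    (1, 'b'): {2},
}

def simulate(word, start=0, accepting=frozenset({2})):
    states = {start}
    for ch in word:
        # advance every alternative in lockstep -- no backtracking
        states = {s for q in states for s in nfa.get((q, ch), set())}
    return bool(states & accepting)

print(simulate('aab'))  # True
print(simulate('aba'))  # False
```

Each state set is exactly one DFA state of the subset construction, computed only when the input actually reaches it; caching those sets yields the DFA lazily.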

I said not everyone seemed to agree with the argumentative points made in the article. Some of the ``controversy'', if one can call it that, seems to be more about words: what are regular expressions? If the majority of programmers ``know'' regular expressions in the form offered (under the name regular expression) by the libraries of Perl, Python, Java, etc., then, some argue, that is what regular expressions factually are. I don't share that view: just because many people think regular expressions support backreferences, which is a significant extension of the classical notion, one should not make a very clear terminology more confusing. Regular expressions with backreferences are simply not regular expressions; they are extended regular expressions, or regular expressions with backreferences (or Java regular expressions, Perl regular expressions, etc., for clarity's sake).