In weeks 11 and 12, we take a look at the meta-circular interpreter from Chapter 4 of SICP. There are different parts we discussed, the central one being the mutually recursive procedures eval and apply (resp. mc-eval and mc-apply, to use the names from the slides).

But another part dealt with how to represent expressions, the syntax of “our” Scheme. In the lecture, I also called it abstract syntax, a term that SICP also briefly mentions in that context (on page 364). When browsing through the code of evaluator.scm, it all looks rather detailed and concrete, and not really very abstract. Calling it “abstract” may thus sound puzzling.

But abstract syntax is a well-known concept and terminology for basically every implementation of every programming language, where by implementation we mean an interpreter (meta-circular or an ordinary one) or a compiler for a given language. The user of the programming language normally does not need to bother with what’s known as abstract syntax; only those who implement a compiler or an interpreter need to define and work with it. The standard programmer of course needs to know how to write syntactically correct programs in the chosen language. The syntax the programmer experiences is called surface syntax, user syntax, or concrete syntax. Actually, it’s mostly just called syntax, as users might not even be aware that there is a second, internal level of syntax called abstract syntax, and thus there is no need to call the user-level syntax “concrete”. But an interpreter- or compiler-writer distinguishes the two.

Lisp or Scheme syntax is simple. Scheme is also simple in that it offers only a limited selection of constructs to the user. Of course Scheme or Lisp distributions may ship with large and complex libraries, but the core is pretty slender (a little bit of syntactic sugar, like let, notwithstanding). But that’s not the main reason why, as we said, Scheme’s concrete syntax is simple. Concrete syntax is what the user sees; abstract syntax is an internal data structure or representation of that syntax. What the user sees is the code in the form of a string or a stream of characters (in one or more files). A string is likewise a data structure, but a very basic one, and it’s actually very unstructured in the sense that its structure (being an array of characters, say) has nothing to do with the syntactic structure of programs in a programming language.

Also, a user reading the code (represented by a string) typically does not mentally perceive the code as a large array of characters. The trained eye perceives in a piece of code a procedure, a loop, a further loop nested in the first one that contains a case distinction followed by another loop, etc. Of course, not all strings correspond to proper programs in a given language. For instance, the following code snippet is a string that represents a syntactically correct (part of a) JavaScript program

for (let i = 0; i < 5; i++) {
  text += "The number is " + i + "<br>";
}

but violates the rules for user syntax of many other languages. User syntax is often designed to make the code easy to read for the programmer (though what exactly that entails depends on the taste, preferences, and experience of the programmer).

So, concrete syntax is concerned with which strings are syntactically allowed representations of programs and which are not. While having a program represented as a string may be useful for the programmer (having editors and browsers at hand, and being used to reading texts, even code in text form), strings are a bad idea when running or interpreting a program.

At the beginning of the lecture, we explained what happens when running a Scheme program by the so-called substitution model (that was when we were still in a purely functional setting). That model is based on substitution or replacement, and is illustrated in the book and the slides by replacing, step after step, a formal parameter by the corresponding argument in the body of a procedure. That’s a purely textual manipulation of symbols written on the slide or in the book, and thus it can be seen as illustrating string manipulation (string replacement and string matching). Indeed, one could in principle come up with an interpreter that works like that, massaging the string written in the language’s concrete syntax. That would work for a purely functional setting; for languages supporting side effects (almost every language, that is), interpretation purely based on substitution would break down, and one would need to implement structures like environments and frames. Still, one could use substitution for stepping through the code while maintaining and updating the environments for book-keeping of the (mutable) content of variables.
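
As a side remark, substitution becomes easy to define once expressions are represented as nested lists rather than strings. The following is a minimal sketch (the procedure subst is ours, not from SICP), which naively replaces every occurrence of a symbol and ignores complications like shadowing by nested bindings:

    ;; Sketch: substitute expression val for every occurrence of the
    ;; symbol var in exp, where expressions are nested lists.
    ;; Naive: ignores shadowing by nested bindings.
    (define (subst var val exp)
      (cond ((eq? exp var) val)             ; a variable occurrence: replace
            ((pair? exp)                    ; a compound expression: recurse
             (cons (subst var val (car exp))
                   (subst var val (cdr exp))))
            (else exp)))                    ; numbers, other symbols: keep

    ;; (subst 'x 5 '(+ x (* x 2)))  evaluates to  (+ 5 (* 5 2))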

At any rate, basing interpretation on string substitution is a terrible idea, and thus it’s not done. There are at least two reasons for that. One we have mentioned: strings as a data structure are too unstructured for the purpose. Substitution is about replacing one piece of syntax by another piece of syntax. In our context, that means replacing a variable by an expression, for instance by another procedure, as in the following situation

  ((lambda (f) (lambda (x)  (/ (f x) (f (+ x 1)))))
   (lambda (y) (* y y)))

The expression ultimately stands for the function $\frac{x^2}{(x+1)^2}$. To think of the step from the above lambda expression to the subsequent one, replacing f:

  (lambda (x) (/ ((lambda (y) (* y y)) x) ((lambda (y) (* y y)) (+ x 1))))

as a string manipulation is not helpful, and it’s not what a human reader normally does. The brain, with a little training and experience, is able to perceive the parts of the string as anonymous procedures, and mentally operates by replacing the formal parameter by the argument procedure to understand what’s going on. Thinking of it in terms of an array of characters is just not a useful level of abstraction. Even worse would be to think of it as manipulation of a bit array (thinking of the characters as sequences of bits). On that even lower level, a computation step could be understood as a substitution of one bit sequence inside another, though maybe one should not use the word “understand” in that context… We said that when reading a piece of code like the above string, one perceives its parts (with some training) as procedures, applications, a multiplication expression, etc. One can also say, one parses them into procedures etc. That use of the word “parse” fits the definition of the word (among slightly alternative readings) one finds at Merriam-Webster:

to divide (a sentence) into grammatical parts and identify the parts and their relations to each other.

The grammar of a language is concerned with syntactical aspects of that language. So instead of grammatical parts, one could also say syntactic parts. Parsing a sentence is thus concerned with its syntax (identifying its syntactic parts resp. rejecting the sentence as not following the grammatical rules governing the language). It’s not concerned with the meaning of the sentence (or rejecting it as meaningless, but otherwise syntactically correct).

Parsing, as defined by Merriam-Webster, is a word used by linguists and describes what linguists believe happens when someone is reading a text or a sentence in a natural language. In order to ultimately understand a sentence, one needs to identify its separate parts, subsentences, nouns, verbs, particular words, and how they arrange into a full sentence. That involves figuring out where individual words start and end and determining whether a word is an actual word (maybe as listed in a dictionary). For written text, determining where a word starts and ends is relatively straightforward, as words are typically separated by some space (or maybe by a full stop or comma, which also structures sentences). For spoken language it’s more complex, but separating and identifying individual words and then identifying the syntactical or grammatical structure over the words still needs to be done. As said, linguists call that parsing.

And of course, the problem exists also for programming languages. The task of analysing and breaking up a text, i.e., the source code of a program, into its syntactic constituents is called, as discussed, parsing, and it’s done by the parser.

What the parser thereby does is take the unstructured string, written in the concrete syntax or user syntax, and turn it, as a result of the parsing process, into a data structure representing the abstract syntax. Those data structures are trees, and they are called, unsurprisingly, abstract syntax trees. For a concrete program, the root node of the abstract syntax tree represents the whole program, and each child node in that tree represents an immediate syntactic substructure. For instance, a possible tree for a JavaScript while node may have two children, one representing the loop condition and the other one the loop body. If JavaScript also had other kinds of loops, say a repeat-until loop, the node must also contain the information that it’s indeed a while loop and not a repeat-until loop. Structurally, both syntax constructs may be analogous (two children: one a predicate, another one for the body), but for executing the tree representing the loop, one needs to distinguish them, as the two kinds of loops behave differently.
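
In Scheme, such tree nodes can be represented as tagged lists, in the same style evaluator.scm uses for its expression types. A minimal sketch (the while node is hypothetical, purely for illustration; tagged-list? is as in SICP’s evaluator, with #f for SICP’s false):

    ;; Hypothetical while node as a tagged list; the tag 'while
    ;; distinguishes it from a structurally identical repeat-until node.
    (define (make-while predicate body) (list 'while predicate body))

    (define (tagged-list? exp tag)          ; as in SICP's evaluator
      (if (pair? exp) (eq? (car exp) tag) #f))

    (define (while? exp) (tagged-list? exp 'while))
    (define (while-predicate exp) (cadr exp))
    (define (while-body exp) (caddr exp))

A dispatching evaluator would then check the tag with while? before taking the node apart with the selectors.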

So, abstract syntax refers to trees, and it’s called “abstract” because, in the step from concrete syntax to abstract syntax trees, some details are typically omitted. For instance, the whole “placement” of the individual characters (line numbers) is irrelevant, and newlines and comments are omitted. To stick with the JavaScript example: the fact that, in concrete syntax, the body of the loop is written with { and } marking its beginning and end is irrelevant (as is the fact that one has to write a ; at the end of each statement) and will typically not be represented in the AST. Those details of the concrete syntax are left out, and the abstract syntax tree contains only the syntactic essence of the program, not its concrete textual string.

Now, what about Scheme? Scheme famously uses parenthetic prefix notation, something rarely used in other languages. As mentioned earlier, concrete syntax is meant to aid the coder in writing and reading a piece of code, so to say to make human parsing as easy as possible. Though, as also said, what’s easy and agreeable depends on personal experience and taste, and for Lisp veterans, perhaps writing and reading piles of parentheses is easy and the most agreeable way of writing programs. For parsers, on the other hand, a parenthetic structure and the Lisp syntax is most agreeable indeed. For one, it’s unambiguous. If your language allows one to write a numerical expression like 2 + 3 * 4, most people would understand after some calculation that this represents or results in the number 14, since one has to perform the multiplication before the addition. That’s because most users are trained to read or parse such expressions in that particular way. Such syntactic ambiguities exist not only for mathematical infix symbols; other programming constructs may “suffer” from them as well. For the trained user, it’s hopefully not a problem, but for the parser it is: it needs to be programmed to parse the concrete syntax “the right way”, the way that reflects the intentions of the designer of the language.
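
In Scheme’s prefix notation, the grouping is always explicit, so this particular question does not even arise; the two possible readings of 2 + 3 * 4 are simply two different expressions:

    (+ 2 (* 3 4))   ; => 14, the multiplication done first
    (* (+ 2 3) 4)   ; => 20, the addition done first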

For Lisp or Scheme, there’s no such problem. The parentheses make very clear which things belong together and where an expression starts (at a left parenthesis () and where it ends (at the corresponding right parenthesis )). In combination with the fact that this syntax is uniform, that makes parsing almost trivial. Uniform means there are no special cases like different kinds of brackets ((..) for this purpose and {..}, <..>, and [..] for others), and no notation that is sometimes infix, sometimes postfix. Nor are there many restrictions on what kinds of constructs can be used in combination with others. For instance, in Java one cannot define methods inside methods: one can (since quite some time) nest classes, but not methods; a method can take several arguments, but return only a single result, etc. All those are syntactic restrictions that the parser has to implement, but for Scheme or Lisp, the problem is almost trivial. Almost: as long as every opening parenthesis is matched by a closing one, the program is syntactically correct.
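
In fact, parsing S-expressions is so simple that Scheme has it built in: the procedure read turns a stream of characters into the corresponding list structure. A small illustration (open-input-string as provided by Racket resp. R7RS):

    ;; read parses characters into list structure:
    (read (open-input-string "((lambda (y) (* y y)) 5)"))
    ;; => the list ((lambda (y) (* y y)) 5)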

Even more: the parenthetic notation is not only a syntactic grouping mechanism, it’s also the notation for lists. And lists can be nested, in which case SICP speaks of list structure. But as we discussed in the lecture, nested lists can be seen as trees. What this basically means is that, by coding in parenthetic prefix notation, the programmer directly writes abstract syntax trees. To say it differently, there’s no real distinction between abstract and concrete syntax, and we also don’t have to puzzle over how best to design abstract syntax trees; they are given anyway. All that benefits the parser and the parser writer; whether it benefits the user may be up for debate: some people like the unambiguity and simplicity of Scheme syntax, some may prefer other notations.
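
One can see this directly in Scheme itself: quoting a piece of program text yields that very text as a nested list, which can be taken apart with the ordinary list operations:

    (define exp '((lambda (y) (* y y)) 5))  ; program text as data
    (car exp)    ; => (lambda (y) (* y y)), the operator part
    (cadr exp)   ; => 5, the operand
    (caar exp)   ; => lambda, the tag of the operator expression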

Historically, this peculiar design of the (concrete) syntax is actually sort of an accident. Lisp is quite an old language, after Fortran the second-oldest programming language still in use (Cobol is the third surviving veteran from approximately the same time, but came a bit later). Lisp was a pioneering design and in many ways ahead of its time, for instance in supporting higher-order functions and garbage collection at a time when Fortran did not even support recursion. It was also developed at a time when hardware was seriously limited and programming was not interactive. The Wikipedia article about the IBM 704, the machine used for the first Lisp (and Fortran) implementation, gives an impression. While at it: the instruction format of that particular machine also gave the functions car and cdr their names, as cons-cells were implemented making use of the corresponding parts of a register (the format had an “address” and a “decrement” part).

Not only was hardware limited, concepts of and experience with how to develop a programming language were also missing. No prior languages existed to improve on (except machine-specific assembler), as those were the very first higher-level programming languages ever: no conceptual framework, no textbooks, no computer science curriculum, no nothing, just bare metal… One conceptual thing that was not yet well developed was a clear notion of how to specify and discuss the syntax of a language. That came only a bit later, in the context of Algol 60, where for the first time the syntax of a programming language was clearly written down in a document (using what is known as a (context-free) grammar, written in so-called BNF notation; such a grammar notation is a domain-specific language for describing syntax). Before that, the “syntax” of a language was just what happened to be accepted by the compiler or interpreter. Of course, the developers reflected on it and tried to make good decisions. But there was also not yet a coherent body of parser technology and theory, so one had to write some program that somehow allowed the input of a program (maybe from a bundle of punched cards), and then pick it up from there. The developers of early Lisp (and Fortran and other languages prior to Algol 60) would not even think explicitly about abstract syntax trees and concrete syntax (at least not using those words).


S-expressions and M-expressions

Coming back to Lisp or Scheme syntax: the parenthetic expressions that represent the syntax of programs as well as lists are called S-expressions, and they are the concrete as well as the abstract syntax of Lisp. One way of seeing it is that Lisp was, and still is, actually lacking user syntax, and instead lets the user directly code in abstract syntax. Other languages at that time did not do that, and with time and after Algol 60, people started to think more systematically about how to carefully craft syntax, how to systematically parse it, and what can be done by a parser and what cannot. There were attempts or initiatives to equip Lisp with a user-level syntax on top of the S-expressions as notation for abstract syntax trees. This attempt is known as M-expressions, but it fizzled out. As McCarthy seems to indicate in History of Lisp, users of Lisp had already gotten used to programming in S-expressions and saw no benefit in adopting a different syntax and perhaps porting the growing code base to the newer format. In that sense the language design is a historic “accident”: the abstract syntax came first, the initial and landmark design focused on the hard problems (higher-order functions etc.), not on notational matters, keeping parsing trivial, and despite some (feeble?) attempts to come up afterwards with a more conventional concrete syntax, Lisp had already taken off, and it was too late.

Abstract syntax as abstract data type

Of course, not every input a user types in is a correct program. Trying a few broken inputs at the (Racket) REPL illustrates that inputs can be rejected at different stages:

    > (+ 2 ()))
    . #%app: missing procedure expression;
      probably originally (), which is an illegal empty application in: (#%app)
    > (+ 2 ))))
    2
    . read-syntax: unexpected `)`
    . read-syntax: unexpected `)`
    > (cons 4)
    . . mcons: arity mismatch;
      the expected number of arguments does not match the given number
        expected: 2
        given: 1
        arguments...:

The first input is read as an S-expression but then rejected, since the empty combination () is no legal application. The stray closing parentheses of the second input cannot even be read as an S-expression (though the prefix (+ 2 ) can, and evaluates to 2). And (cons 4) is a syntactically well-formed application that fails only at run time, with an arity mismatch.