Parsing & Abstract Syntax Trees

The first phase of the compiler gave us a list of tokens. Lexical analysis read the input text and broke it down into Gom-aware tokens. In this chapter, we’ll understand parsing - the process of understanding structure in the input program. Structure helps us, the language implementors, know what the programmer desires to do in the program, and then act accordingly.

A language’s structure, also called its syntax, is defined by its grammar. In this course, we are implementing the lexical analyser and the parser by hand, which often is done using tools called parser generators. They take the language grammar as input and produce code for lexer and parser as output. Here, we’ll take the more manual route to get the feel of writing a parser from scratch.

Representing Syntax

A parser’s job is to read the tokens from the lexer and form a representation of the source program that makes it possible to programmatically interprete or operate on the source. It is yet another step in the compiler pipeline and hence the output from a parser will be used by the following steps, e.g. semantic analysis and code generation.

Let’s take a simple example before we come to parsing complete Gom programs. In fact, the following example is exactly the same if you write it in TypeScript.

let a = 1 + 2;

The above code, when passed through a lexical analyser, would produce these tokens:

let, a, =, 1, +, 2, ;

When you write this piece of code, you have a mental model that describes what’s happening here: a variable named a is being initialised with the value resulting from the addition expression 1 + 2. And this is similar to what the language would interprete adhering to its specification (grammar).

In the chapter Introduction to Gom we discussed examples from the Gom grammar, and here is how it represents a “let statement” and subsequent rules that it expands into (I’ve simplified the grammar slightly e.g. the expr rule expands only into addition):

letStatement = "let" , assignment+ , ";";
assignment   = identifier , "=" , expr;
expr         = addition;
addition     = term "+" term;
term         = identifier | numLiteral
identifier   = letter , (letter | digit)*;
numLiteral   = digit+;

Visually, the expansion of rules looks like follows:

Rules and Hierarchy - Introducing Syntax Trees

If you look closely at the figure above, the let statement can be represented as a hierarchical collection of rules and terminals.

This hierarchical representation is called a syntax tree. It is a tree data structure that represents the structure of the source code where each node represents a rule in the grammar, the children of the node represent the expansion of that rule and the leaves represent the terminal symbols. Syntax trees are crucial in the analysis and code generation phases of the compilers.

Parsing & Abstract Syntax Trees

Representing Syntax

Rules and Hierarchy - Introducing Syntax Trees

Code