Stages of a Compiler

A compiler can be viewed as a multi-stage pipeline that takes the source program text as input and produces the desired target code as output. For this course, the output will be LLVM intermediate representation (IR); more on that later.

Think of reading a story: how do you process a paragraph of text? You group adjacent letters into words, group words into sentences, and then work out what the author meant to say. Compiling a program is very similar. The following diagram shows the major steps in compilation:

[Figure: the major steps in compilation]

Lexical Analysis

We start with the source program text and break it down into meaningful words, called tokens. As in English, tokens are often separated by whitespace, although operators and punctuation also mark token boundaries (for example, a+b contains three tokens).
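As a rough illustration of this step, here is a minimal lexer sketch in Python. The token kinds (NUMBER, IDENT, OP) and the tiny expression language they describe are assumptions made for this example, not part of the course's language.

```python
import re

# Hypothetical token kinds for a tiny expression language (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source text into (kind, text) tokens, discarding whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("x = 40 + 2"))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

Note that tokenize("a+b") also yields three tokens even though the input contains no spaces: whitespace is only one of several ways a token boundary can be recognised.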


Parsing

Often also called syntactic analysis, parsing receives the tokens from lexical analysis as input and produces a structure that represents the program. This structure is called the abstract syntax tree (AST). It is abstract because it keeps the meaningful structure of the program while omitting details of the concrete syntax, such as parentheses and semicolons. Consider it similar to identifying sentences and paragraphs in a story.
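To make this concrete, here is a minimal recursive-descent parser sketch in Python. The grammar, the nested-tuple AST shape, and the use of plain strings as tokens are all assumptions chosen to keep the example short.

```python
def parse(tokens):
    """Parse a flat list of token strings into a nested-tuple AST.

    Grammar (an assumption for this sketch):
        expr   := term (("+" | "-") term)*
        term   := factor (("*" | "/") factor)*
        factor := NUMBER | "(" expr ")"
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def factor():
        token = advance()
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse(["1", "+", "2", "*", "3"]))
# ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
print(parse(["(", "1", "+", "2", ")", "*", "3"]))
# ('*', ('+', ('num', 1), ('num', 2)), ('num', 3))
```

In the second call the parentheses do not appear in the resulting tree; their effect is captured entirely by the nesting. That is the sense in which the tree is abstract.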


Semantic Analysis

Now that we have a structure representing the program, it’s time to check whether its meaning is valid, that is, whether our language allows the behaviour the program expresses. One of the major checks is type analysis. Each entity in a program, such as a variable or an expression, is associated with a type, which describes the kind of data the entity can hold or the kind of behaviour it supports.

A common use of types is checking that arithmetic operations are valid. A number can be added to another number, but not to a string (in most languages). Type checking verifies such rules and reports a semantic error if the program violates them.
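The rule above can be sketched as a small type checker in Python. The nested-tuple AST, the two type names ("number" and "string"), and the SemanticError class are assumptions made for this illustration.

```python
class SemanticError(Exception):
    """Raised when a program is syntactically valid but breaks a type rule."""


def type_of(node):
    """Return the type of an AST node, or raise SemanticError.

    Nodes are nested tuples (a hypothetical shape for this sketch):
        ("num", 3), ("str", "hi"), or (operator, left, right).
    """
    kind = node[0]
    if kind == "num":
        return "number"
    if kind == "str":
        return "string"
    if kind in ("+", "-", "*", "/"):
        left = type_of(node[1])
        right = type_of(node[2])
        # Assumed rule: arithmetic is only defined on two numbers.
        if left != "number" or right != "number":
            raise SemanticError(f"cannot apply {kind!r} to {left} and {right}")
        return "number"
    raise SemanticError(f"unknown node kind {kind!r}")


print(type_of(("+", ("num", 1), ("num", 2))))  # number

try:
    type_of(("+", ("num", 1), ("str", "one")))
except SemanticError as error:
    print("semantic error:", error)
# semantic error: cannot apply '+' to number and string
```

Both programs here would pass the parser; only the type check tells them apart, which is why this stage runs on the AST rather than on the raw text.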


Intermediate Code Generation

Once semantic analysis has validated the program, we can finally translate it. We take the AST and generate target code, in our case LLVM intermediate representation (IR). IRs come in various forms: some are graph-based, some stack-based, and some tree-like. LLVM IR is a popular choice because existing LLVM tools can optimize it and compile it to machine code for many architectures, so a single front end can target all of them.
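As a rough sketch of this last step, the Python function below walks a nested-tuple AST and builds textual LLVM IR by hand. The AST shape, the temporary names (%t1, %t2, ...), the choice of 32-bit integers, and wrapping the result in a main function are all assumptions for illustration. A real compiler would more likely build IR through LLVM's own APIs or a binding library than by concatenating strings.

```python
# Map source operators to LLVM integer instructions (signed division assumed).
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "sdiv"}


def generate_ir(tree):
    """Translate a nested-tuple arithmetic AST into textual LLVM IR."""
    lines = []
    counter = 0

    def emit(node):
        nonlocal counter
        if node[0] == "num":
            # Integer constants can be used directly as operands.
            return str(node[1])
        left = emit(node[1])
        right = emit(node[2])
        counter += 1
        name = f"%t{counter}"
        lines.append(f"  {name} = {OPCODES[node[0]]} i32 {left}, {right}")
        return name

    result = emit(tree)
    return "\n".join(
        ["define i32 @main() {", "entry:", *lines, f"  ret i32 {result}", "}"]
    )


print(generate_ir(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
# define i32 @main() {
# entry:
#   %t1 = mul i32 2, 3
#   %t2 = add i32 1, %t1
#   ret i32 %t2
# }
```

Each operator node in the tree becomes one instruction whose result is stored in a fresh temporary, so the nesting of the AST turns into a flat sequence of instructions.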