A compiler can be viewed as a multi-step pipeline that takes the source program text as input and produces target code as output. For this course, we’ll produce LLVM intermediate representation (IR) as the output; more on that later.
Think of reading a story: how do you process a paragraph of text? You read groups of letters as words and groups of words as sentences, and from those you work out what the author meant to say. Compiling a program is very similar. The following diagram shows the major steps in compilation:
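We start with the source program text and first break it down into meaningful words, or tokens. This step is called lexical analysis. As in English, tokens are often separated by spaces, but spaces are not always required: many languages split “x+1” into three tokens even though it contains none.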
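To make the idea concrete, here is a minimal sketch of a tokenizer for a tiny expression language. The token kinds (`NUMBER`, `IDENT`, `OP`) and the `tokenize` function are assumptions made up for this illustration, not the lexer used later in the course.

```python
import re

# Hypothetical token kinds for a tiny expression language (illustration only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),  # whitespace separates tokens but is not a token itself
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split source text into a list of (kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("x = 41+1"))
```

Note that `41+1` is split into three tokens even though it contains no spaces: the token patterns, not the whitespace, decide where one token ends and the next begins.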
Often also called syntactic analysis, parsing receives the tokens from lexical analysis as input and produces a structure that represents the program. This structure is called the abstract syntax tree (AST). It is abstract because it keeps only the meaningful structure of the program and leaves out purely syntactic details such as parentheses and separators. Think of it as identifying the sentences and paragraphs in a story.
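As a rough illustration, the sketch below parses a list of token strings into a tree of nested tuples using recursive descent. The grammar, the tuple layout, and the `parse` function are assumptions chosen for this example; a real parser would also track source positions and handle many more constructs.

```python
def parse(tokens):
    """Parse a list of token strings into a nested-tuple AST (illustrative only)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        # expr := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        # term := factor (('*' | '/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        # factor := NUMBER | '(' expr ')'
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok == "(":
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok.isdigit():
            return ("num", int(tok))
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse(["(", "1", "+", "2", ")", "*", "3"]))
```

The parentheses in the input guide how the tree is built, but they do not appear in the tree itself. The nesting of the tuples already records the grouping.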
Now that we have a structure representing the program, it’s time to check whether its meaning is valid, that is, whether our language allows the behaviour the program expresses. This step is called semantic analysis, and one of its major checks is type analysis. Each entity in a program has a type, which describes the kind of data the entity can hold or the kind of behaviour it supports.
A common use of types is checking that arithmetic makes sense. A number can be added to another number, but not to a string (in most languages). Type checking verifies such rules and reports a semantic error if the program violates them.
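The addition rule can be sketched as a small recursive check over a tree of nested tuples. The node shapes (`"num"`, `"str"`, `"+"`), the type names, and the `SemanticError` class are all assumptions invented for this example; they only mirror the rule described above.

```python
class SemanticError(Exception):
    """Raised when a program is well-formed but breaks a language rule."""


def type_of(node):
    """Return the type name of an expression node, or raise SemanticError."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "+":
        left = type_of(node[1])
        right = type_of(node[2])
        # Assumed rule for this toy language: '+' only adds two numbers.
        if left != "int" or right != "int":
            raise SemanticError(f"cannot add {left} and {right}")
        return "int"
    raise SemanticError(f"unknown node kind {kind!r}")


print(type_of(("+", ("num", 1), ("num", 2))))

try:
    type_of(("+", ("num", 1), ("str", "one")))
except SemanticError as err:
    print("semantic error:", err)
```

Both expressions are syntactically fine, so a parser would accept them. Only the type check can tell that the second one is not allowed.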
Once semantic analysis has checked everything, we can finally translate the program. We take the AST and generate target code, in our case LLVM intermediate representation (IR). IRs come in various forms, for example graph-based, stack-based, or tree-like. LLVM IR is a popular compiler IR because existing LLVM tools can process it and generate code for many architectures.
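To give a feel for this last step, the sketch below walks a nested-tuple tree of integer arithmetic and builds LLVM IR as plain text. It uses no LLVM library at all: the tuple layout, the `emit_function` name, and the `%t1`, `%t2` temporaries are assumptions for illustration, and the course may generate IR quite differently.

```python
import itertools

# LLVM integer instructions for each operator (sdiv is signed division).
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "sdiv"}


def emit_function(ast, name="main"):
    """Return LLVM IR text for a function that computes `ast` and returns it as i32."""
    lines = []
    counter = itertools.count(1)

    def emit(node):
        # A number is used directly as an immediate operand.
        if node[0] == "num":
            return str(node[1])
        op, left, right = node
        lhs = emit(left)
        rhs = emit(right)
        # Each operation stores its result in a fresh temporary.
        result = f"%t{next(counter)}"
        lines.append(f"  {result} = {OPCODES[op]} i32 {lhs}, {rhs}")
        return result

    value = emit(ast)
    header = [f"define i32 @{name}() {{", "entry:"]
    footer = [f"  ret i32 {value}", "}"]
    return "\n".join(header + lines + footer)


print(emit_function(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
```

For the tree of `1 + 2 * 3`, the multiplication is emitted first because it is nested deeper in the tree, and the addition then uses its result. The tree structure decides the evaluation order, so the code generator needs no knowledge of operator precedence.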