Build Your Own Programming Language: Part 6 Intermediate Code Generation

Matthew MacFarquhar
8 min readJun 20, 2024

--

Introduction

In this series, I will be working through this book on building your own programming language. The book is quite large and in depth, but in these articles I will be distilling the big picture pieces of the book into bite size tutorials and walk throughs. The book offers tracks for building our new language using icon — an esoteric language that the author made contributions to — and java. I will be using Java throughout this tutorial.

In this section, we will learn about intermediate code generation. Intermediate code is not specific to any particular programming language or machine architecture, making it a universal and more abstract representation of the program’s logic. The intermediate code can then be further optimized and eventually translated into the machine code that a computer’s processor can execute. The intermediate code will give us a good view into how our variables will be laid out in memory, it will also allow us to see our code’s control flow to ensure everything looks good before we turn this human readable structure into machine code.

The code for this demo is found in this repo at the commit here

Introduction to Intermediate Code

In order to construct intermediate code within our syntax tree, we have to go through a few basics first.

Memory Regions

One of the major things that generating the intermediate code will show us is where our variables live in memory. In the book, we talk about four memory regions.

  • code: read-only and holds just the code
  • static: contains things like global variables and constants
  • stack: contains parameters, saved registers and local variables (they are things whose sizes are known at compile time)
  • heap: contains data objects that are allocated on demand at runtime (size may not be known at compile time)

We further segment these regions into:

“loc” for local variables (within the stack memory)

“global” for global variables (within the static memory)

“strings” for static read-only values (within static memory)

“lab” a unique label pointing to a point in code (within code memory)

“class” a class region (within the heap)

“imm” a pseudo region usually used for temporary variables

Addresses

Now that we understand a little about regions of memory, lets represent an address using a new class.

public class Address {
public String region;
public int offset;

public Address(String region, int offset) {
this.region = region;
this.offset = offset;
}

public String regaddr() {
return region.equals("method") ? "loc" : region;
}

public String str() {
switch (region) {
case "lab": return "L" + offset;
case "loc": case "imm": case "method":
case "global": case "class": case "strings":
return regaddr() + ":" + offset;
}
return region;
}
public void print() { System.out.print(str()); }
}

This class simply takes in a region where the address lives and an offset from the start of that region. We have a nice str() function which will print an informative line about this address when asked.

Now that we have addresses, we should do something with them right? A common set of instructions for intermediate code usually only have three addresses or less. For example, a = b + c relies on addresses a, b and c. We will encapsulate an instruction like this in its own class.

public class ThreeAddressCode {
String opcode;
Address operandOne;
Address operandTwo;
Address operandThree;

public ThreeAddressCode(String opcode, Address operandOne, Address operandTwo, Address operandThree) {
this.opcode = opcode;
this.operandOne = operandOne;
this.operandTwo = operandTwo;
this.operandThree = operandThree;
}

public void print() {
switch (opcode) {
case "proc":
System.out.println(opcode + "\t" + operandOne.region + ",0,0");
break;
case "end":
case ".code":
case ".global":
case ".string":
System.out.println(opcode);
break;
case "LAB": System.out.println("L" + operandOne.offset + ":");
break;
default:
System.out.print("\t" + opcode + "\t");
if (operandOne != null)
System.out.print(operandOne.str());
if (operandTwo != null)
System.out.print("," + operandTwo.str());
if (operandThree != null)
System.out.print("," + operandThree.str());
System.out.println("");
}
}

Our ThreeAddressCode is made up of up to three addresses and an operation being preformed on them. Some operations — assignment being one — can have less than three operands, so this class takes anywhere from 0 to 3 operands.

Adding Addresses

We give the tree nodes the power to create a new ThreeAddressCode during their intermediate code generation.

public ArrayList<ThreeAddressCode> generate(String opcode, Address... addresses) {
switch (addresses.length) {
case 3:
return new ArrayList<>(List.of(new ThreeAddressCode(opcode, addresses[0], addresses[1], addresses[2])));
case 2:
return new ArrayList<>(List.of(new ThreeAddressCode(opcode, addresses[0], addresses[1], null)));
case 1:
return new ArrayList<>(List.of(new ThreeAddressCode(opcode, addresses[0], null, null)));
case 0:
return new ArrayList<>(List.of(new ThreeAddressCode(opcode, null, null, null)));
default:
j0.semerror("generate(): wrong number of arguments");
return new ArrayList<>();
}
}

and we add the following attributes to the tree node (which we will populate later).

ArrayList<ThreeAddressCode> intermediateCode;
Address address;
Address first;
Address follow;
Address onTrue;
Address onFalse;

We will also add an address field to the symtab entry class

Address address;

In the symbol table class, we add a count field to track the offsets we have allocated in memory already, and create a new function to allocate local variables.

public Address generateLocal() {
String tempVariable = "__local$" + count;
insert(tempVariable, false, null, new TypeInfo("int"));
return table.get(tempVariable).address;
}

This function will call our symbol table’s insert function which now looks like this.

public void insert(String s, boolean isConst, symtab sub, TypeInfo typeInfo) {
if (table.containsKey(s)) {
j0.semerror("redeclaration of " + s);
} else {
if (sub != null) {
sub.parent = this;
}
table.put(s, new symtabEntry(s, this, sub, isConst, typeInfo, new Address(scope, count)));
count += 8;
}
}

Summary

  • Intermediate code allows us to visualize control flow and memory allocations in a machine architecture agnostic fashion.
  • Instructions in code can be expressed with an opcode and up to three memory addresses the opcode acts upon.

Generating Intermediate Code

We will now generate our intermediate code using a series of… syntax tree traversals!

Generating Labels for Control Flow

We need an entry point into different code segments, this pointer will be called first and we will generate the first pointer for the nodes which require it using generateFirst.

public void generateFirst() {
if (kids != null) {
for (tree kid : kids) {
kid.generateFirst();
}
}

switch (sym) {
case "UnaryExpr":
if (kids[1].first != null) {
first = kids[1].first;
} else {
first = generateLabel();
}
break;
case "AddExpr":
case "MulExpr":
case "RelExpr":
if (kids[0].first != null) {
first = kids[0].first;
} else if (kids[1].first != null) {
first = kids[1].first;
}else {
first = generateLabel();
}
break;
case "Block": case "WhileStmt":
if (kids[0].first != null) {
first = kids[0].first;
} else {
first = generateLabel();
}
break;
case "BlockStmts":
if (kids[1].first == null) {
kids[1].first = generateLabel();
}
if (kids[0].first != null) {
first = kids[0].first;
} else {
first = kids[1].first;
}
break;
default:
if (kids != null) {
for(tree k : kids) {
if (k.first != null) {
first = k.first;
break;
}
}
}
}
}

We swim down in a post-order traversal to generate the children’s first attribute. Depending on the symbol, we will either synthesize the first attribute from our children, call generateLabel to build our own starting point instruction, or have no first attribute populated.

public Address generateLabel() {
return new Address("lab", serial.getid());
}

We will also need to populate pointers for where to go after a code section is complete — follow.

public void generateFollow() {
switch (sym) {
case "MethodDecl":
kids[1].follow = follow = generateLabel();
break;
case "BlockStmts":
kids[0].follow = kids[1].first;
kids[1].follow = follow;
break;
case "Block":
kids[0].follow = follow;
}

if (kids != null) {
for (tree kid : kids) {
kid.generateFollow();
}
}
}

In all cases we swim down the tree in a pre-order traversal and set the follow of our children using some address that we have in the parent node — be it our follow, first or a new label.

Generating Intermediate Code for Expressions

We will have one method which will delegate individual code generation to helper functions.

public void generateCode() {
if (kids != null) {
for (tree kid : kids) {
kid.generateCode();
}
}

switch (sym) {
case "ClassDecl": { generateClassDecl(); break; }
case "AddExpr": { generateAddExpr(); break; }
case "RelExpr": { generateRelExpr(); break; }
case "WhileStmt": { generateWhileStmt(); break; }
case "MethodCall": { generateMethodCall(); break; }
case "Assignment": { generateAssignment(); break; }
case "MethodDecl": { generateMethodDecl(); break; }
case "QualifiedName": { generateQualifiedName(); break; }
case "token": { generateToken(); break; }
default:
intermediateCode = new ArrayList<>();
if (kids != null) {
for (tree kid : kids) {
intermediateCode.addAll(kid.intermediateCode);
}
}
}
}

We always generate the children’s intermediate code first. In the default case, our intermediate code is a simple concatenation of all our children’s intermediate code.

All the helper functions are found on github, but for an example we will look at generateAddExpr

private void generateAddExpr() {
address = generateLocal();
intermediateCode = new ArrayList<>();
intermediateCode.addAll(kids[0].intermediateCode);
intermediateCode.addAll(kids[1].intermediateCode);
String operation = rule == 1320 ? "ADD" : "SUB";
intermediateCode.addAll(generate(operation, address, kids[0].address, kids[1].address));
}

we add all the children intermediate code for the add expression and then synthesize a new three address code for our AddExpr based on the rule we followed (ADD or SUB).

Essentially we synthesize the intermediate code at the leaves, and then combine the code as we swim up the tree, adding code for the internal nodes as we swim up.

Generating Label Targets For Conditional Expressions

Now that we have created some labels, we can use them for our GOTO instructions when we have branches by populating our onTrue and onFalse targets.

public void generateTargets() {
switch (sym) {
case "IfThenStmt": case "WhileStmt":
kids[0].onTrue = kids[1].first;
kids[0].onFalse = follow;
break;
case "CondAndExpr":
kids[0].onTrue = kids[1].first;
kids[0].onFalse = onFalse;
kids[1].onTrue = onTrue;
kids[1].onFalse = onFalse;
break;
}

if (kids != null) {
for (tree kid : kids) {
kid.generateTargets();
}
}
}

When we see the conditional statements, we populate our onTrue and onFalse targets to point to appropriate code addresses for the logic of the conditional.

For example, this is how the relExpr uses the targets

private void generateRelExpr() {
String operation = "";
switch (kids[1].tok.cat) {
case '<':
operation = "BLT";
break;
case '>':
operation = "BGT";
break;
case parser.LESSTHANOREQUAL:
operation = "BLE";
break;
case parser.GREATERTHANOREQUAL:
operation = "BGE";
break;
}

intermediateCode = new ArrayList<>();
intermediateCode.addAll(kids[0].intermediateCode);
intermediateCode.addAll(kids[2].intermediateCode);
intermediateCode.addAll(generate(operation, onTrue, kids[0].address, kids[2].address));
intermediateCode.addAll(generate("GOTO", onFalse));
}

It adds all the children code, and then creates its own address for checking a condition and going to the onTrue address if true. Then, it creates a GOTO for the false case address.

Testing

We then can call all of this from j0.genCode (which we have also added a call to from the j0gram.y semantic action for the root).

public static void gencode(parserVal r) {
tree root = (tree)(r.obj);
root.generateFirst();
root.generateFollow();
root.generateTargets();
root.generateCode();
if (root.intermediateCode != null) {
for(int i = 0; i < root.intermediateCode.size(); i++) {
root.intermediateCode.get(i).print();
}
}
}

There are quite a few helper generate methods to fill out, but once they are all done, we can run our code against a file like.

public class hello {
public static void main(String argv[]) {
int x;
x = argv.length;
x = x + 2;
while (x > 3) {
System.out.println("hello, jzero!");
x = x - 1;
}
}
}

To get back

.string
L0:
string "hello, jzero!"
.global
global global:8,hello
global global:0,System
.code
proc main,0,0
ASIZE loc:24,loc:8
ASN loc:16,loc:24
ADD loc:32,loc:16,imm:2
ASN loc:16,loc:32
L138:
BGT L139,loc:16,imm:3
GOTO L140
L139:
PARM strings:0
PARM loc:40
CALL class__println,imm:1
SUB loc:48,loc:16,imm:1
ASN loc:16,loc:48
GOTO L138
L140:
RET

Summary

  • We can create labels and targets in our syntax tree to encode and construct our program’s control flow
  • We can preform tree traversals to synthesize the intermediate code of the program from the bottom up or from the top down.

Conclusion

In this section, we preformed some intermediate code generation to create language agnostic program instructions. We talked about how intermediate code is helpful for debugging our compiler and generating a human readable outline of our program before it turns into machine code. We once again used our trusty friend tree traversals to populate control labels and the intermediate code throughout our syntax tree — which then allowed us to print all our intermediate code out from the root.

--

--

Matthew MacFarquhar
Matthew MacFarquhar

Written by Matthew MacFarquhar

I am a software engineer working for Amazon living in SF/NYC.

No responses yet