Build Your Own Programming Language: Part 5 Type Checking
Introduction
In this series, I will be working through this book on building your own programming language. The book is quite large and in depth, but in these articles I will be distilling the big picture pieces of the book into bite size tutorials and walk throughs. The book offers tracks for building our new language using icon — an esoteric language that the author made contributions to — and java. I will be using Java throughout this tutorial.
In this section, we will learn about types and type checking, this is one of the crucial parts of compilation in a statically typed language such as J0 and java. We will discuss how to represent and populate types in our symbol table and syntax tree. We will also see how to check types for operations like assignments, additions and relational checks. Once we have a grasp of basic type checking, we will move on to more complex type checks for arrays, methods and classes.
The code for this demo is found in this repo at the commit here
Type Representation
Our type info for a basic set of atomic types (e.x. int, char, double float) could simply be numerical codes — and that would be nice and easy and this article would be a fraction of the size. However, we also have a requirement to represent complex types (e.g. an array of ints or a class that has a float field and a string field). For that reason, we will encapsulate our types with a class called TypeInfo.
Basic TypeInfo
public class TypeInfo {
String baseType;
public TypeInfo() {
this.baseType = "unknown";
}
public TypeInfo(String baseType) {
this.baseType = baseType;
}
public String getBaseType() {
return baseType;
}
}
Our basic Type Info is extremely simple and has a single field for the name of the type, this can be atomic types like “int” or “boolean”.
Complex Types
Our complex types will extend the base TypeInfo to add extra functionality
public class ArrayType extends TypeInfo {
TypeInfo elementType;
public ArrayType(TypeInfo typeInfo) {
super("array");
this.elementType = typeInfo;
}
}
The ArrayType takes in a type info for what it is an array of, and in our outer ArrayType — which is also a TypeInfo itself — we give it a baseType of “array”.
public class MethodType extends TypeInfo {
TypeInfo[] parameters;
TypeInfo returnType;
MethodType(TypeInfo[] parameters, TypeInfo returnType) {
super("method");
this.parameters = parameters;
this.returnType = returnType;
}
}
Our MethodType has fields for an array of the TypeInfos of its parameters and also the TypeInfo of what it returns.
public class Parameter {
String name;
TypeInfo paramType;
public Parameter(String name, TypeInfo paramType) {
this.name = name;
this.paramType = paramType;
}
}
public class ClassType extends TypeInfo {
String name;
symtab symbolTable;
Parameter[] methods;
Parameter[] fields;
TypeInfo[] constructors;
public ClassType(String name) {
super("class");
this.name = name;
}
public ClassType(String name, symtab symbolTable) {
super("class");
this.name = name;
this.symbolTable = symbolTable;
}
public ClassType(String name, symtab symbolTable, Parameter[] methods, Parameter[] fields, TypeInfo[] constructors) {
super("class");
this.name = name;
this.symbolTable = symbolTable;
this.methods = methods;
this.fields = fields;
this.constructors = constructors;
}
}
The class type is the most intricate of the three complex types, it has its own symbol table and utilizes a new class called Parameter which is a wrapper around TypeInfo to house the name of the Parameter.
Where Types Live?
So, where will these types live? Like all our previous work, type checking will be done via a tree traversal so it makes sense to store the type info inside the tree nodes with a new typeInfo field in our tree class. We also may need to look up a type when accessing a symbol in our execution flow, for that reason we also add a typeInfo field in the symtabEntry class.
The symtabEntry type info will be given at construction time — so we add it to the constructor as a parameter. The tree node’s typeInfo will need to be synthesized during a tree traversal.
Summary
- Basic types are really just the name of the type i.e. int, bool, etc…
- Complex types extend TypeInfo and add new properties to completely represent all the types their signatures are made up of
- TypeInfo lives in our syntax tree nodes and also in our symbol table entries for our declared symbols.
Generating Types
We will generate types in a post-order traversal (i.e. leaf nodes have types created first and then internal nodes use those types to synthesize their types).
Leaf Node Types
Our leaf nodes are tokens, so the type generation for tokens naturally lives in our Token class constructor.
public token(int c, String s, int l) {
cat = c; text = s; lineno = l;
id = serial.getid();
switch (cat) {
case parser.INT:
typeInfo = new TypeInfo("int");
break;
case parser.DOUBLE:
typeInfo = new TypeInfo("double");
break;
case parser.BOOL:
typeInfo = new TypeInfo("boolean");
break;
case parser.VOID:
typeInfo = new TypeInfo("void");
break;
case parser.INTLIT:
typeInfo = new TypeInfo("int");
break;
case parser.DOUBLELIT:
typeInfo = new TypeInfo("double");
break;
case parser.STRINGLIT:
typeInfo = new ClassType("String");
break;
case parser.BOOLLIT:
typeInfo = new TypeInfo("boolean");
break;
case parser.NULLVAL:
typeInfo = new TypeInfo("null");
break;
case '=':
case '+':
case '-':
typeInfo = new TypeInfo("n/a");
break;
}
}
The token class gets a new field for type info, which we will populate on construction using the above case logic.
Since our Tokens can also be Identifiers, a simple field get is not enough for retrieving the token’s type.
public TypeInfo type(symtab symbolTable) {
symtabEntry result;
if (typeInfo != null) {
return typeInfo;
} else if (cat == parser.IDENTIFIER && (result = symbolTable.lookup(text)) != null) {
typeInfo = result.typeInfo;
return typeInfo;
} else {
System.out.println(symbolTable.table.keySet());
j0.semerror("cannot check the type of " + text + " " + cat);
return null;
}
}
If the Token instance has a type already, we can just return it. If the token is an IDENTIFIER — meaning it could be a symbol in the provided symbol table — we look up our token’s text in the table to see if it has an entry from which we can grab the type (every entry in the symbol table will have a type after we update the populateSymbolTables method to do so as seen here). We also memoize the type info into our token so we don’t need to look it up again if requested and finally we return the type info.
Synthesizing Variable Types
Now, we must figure out how to add types to variable declarations, these can be for local variables, method parameters, field declarations and methods themselves.
We will first add a semantic action in the j0gram.y for the rules which would create variables by inserting a new j0.calcType method like below.
FieldDecl: Type VarDecls ';' {
$$=j0.node("FieldDecl",1030,$1,$2);
j0.calcType($$);};
The code for j0’s calcType relies on a tree node’s calculation of its type and then assigns the type to our symbol.
public static void calcType(parserVal parserVal) {
tree tree = (tree) parserVal.obj;
tree.kids[0].calcType();
tree.kids[1].assignType(tree.kids[0].typeInfo);
}
These rules synthesize down to the form Type IDENTIFIER. So we extract the Type from the first child and give it to the second child.
Let’s look at how assignment is done.
public void assignType(TypeInfo typeInfo) {
this.typeInfo = typeInfo;
switch (sym) {
case "VarDeclarator":
kids[0].assignType(new ArrayType(typeInfo));
return;
case "MethodDeclarator":
TypeInfo[] paramList;
if (kids[1] != null) {
paramList = kids[1].makeSignature();
} else {
paramList = new TypeInfo[0];
}
this.typeInfo = new MethodType(paramList, typeInfo);
kids[0].typeInfo = this.typeInfo;
return;
case "token":
if (tok.cat != parser.IDENTIFIER) {
j0.semerror("eh?" + tok.cat);
} else {
return;
}
default:
j0.semerror("don't know how to assign type " + sym);
}
for (tree kid : kids) {
kid.assignType(typeInfo);
}
}
We swim down the “to be assigned” side and assign the type info to the children stopping when we see a token VarDeclarator or MethodDeclarator. We will discuss the complex type assignment for MethodDeclarators later in this article.
Now let’s venture down the other side of the tree to calculate the type.
public void calcType() {
for (int i = 0; i < nkids; i++) {
kids[i].calcType();
}
switch (sym) {
case "FieldDecl":
typeInfo = kids[0].typeInfo;
return;
case "token":
if ((typeInfo = tok.typeInfo) != null) {
return;
}
switch (tok.cat) {
case parser.IDENTIFIER:
typeInfo = new ClassType(tok.text);
return;
case parser.INT:
typeInfo = new TypeInfo(tok.text);
return;
default:
j0.semerror("don't know the type of " + tok.text);
}
default:
j0.semerror("don't know the type of " + sym);
}
}
Calculating the type will be a post-order traversal, we will calculate the type of our leaves (hopefully they would have already been populated on token construction). If our token is an identifier or int then we construct and return the created type (we would expand this to include other atomics in a full language to include other token types).
Summary
- Leaf nodes (i.e. tokens) can have their types easily constructed based on their category
- Variable declarations require us to synthesize types from one half of the sub tree and pass it down the other half
Checking Types
Now that we have created and synthesized types for our tree nodes, it is now time to check types are compliant in cases like assignments, additions and relational operations.
Check Types Traversal
Like with so many of our compilation operations, type checking is also a tree traversal starting from j0.semantic with this checkType on the root.
root.checkType(false);
public void checkType(boolean inCodeBlock) {
if (checkKids(inCodeBlock) || !inCodeBlock) {
return;
}
switch (sym) {
case "Assignment":
typeInfo = checkTypes(kids[0].typeInfo, kids[2].typeInfo);
break;
case "AddExpr":
typeInfo = checkTypes(kids[0].typeInfo, kids[1].typeInfo);
break;
case "ArgList":
case "Block":
case "BlockStmts":
typeInfo = null;
break;
case "MethodCall":
if (rule == 1290) {
symtabEntry symtabEntry;
MethodType methodType;
if (kids[0].sym.equals("QualifiedName")) {
methodType = (MethodType) kids[0].dequalify();
checkSignature(methodType);
} else {
if (!kids[0].sym.equals("token")) {
j0.semerror("can't check type of " + kids[0].sym);
}
if (kids[0].tok.cat == parser.IDENTIFIER) {
symtabEntry = stab.lookup(kids[0].tok.text);
if (symtabEntry != null) {
if (!(symtabEntry.typeInfo instanceof MethodType)) {
j0.semerror("method expected, got " + symtabEntry.typeInfo.getBaseType());
}
methodType = (MethodType) symtabEntry.typeInfo;
checkSignature(methodType);
}
} else {
j0.semerror("can't typecheck token " + kids[0].tok.cat);
}
}
}
break;
case "QualifiedName":
if (kids[0].typeInfo instanceof ClassType) {
ClassType classType = (ClassType) (kids[0].typeInfo);
typeInfo = (classType.symbolTable.lookup(kids[1].tok.text)).typeInfo;
} else {
j0.semerror("illegal . on " + kids[0].typeInfo.getBaseType());
}
break;
case "ArrayCreation":
typeInfo = new ArrayType(kids[0].typeInfo);
break;
case "ArrayAccess":
if (kids[0].typeInfo.getBaseType().startsWith("array ")) {
if (kids[1].typeInfo.getBaseType().equals("int")) {
typeInfo = ((ArrayType)(kids[0].typeInfo)).elementType;
} else {
j0.semerror("subscripting array with " + kids[1].typeInfo.getBaseType());
}
}
break;
case "ReturnStmt":
symtabEntry symtabEntry = stab.lookup("return");
if (symtabEntry == null) {
j0.semerror("stab did not find a return type");
}
TypeInfo returnType = symtabEntry.typeInfo;
if (kids[0].typeInfo != null) {
typeInfo = checkTypes(returnType, kids[0].typeInfo);
} else if (!returnType.getBaseType().equals("void")) {
j0.semerror("void return from non-void method");
}
break;
case "InstanceCreation":
symtabEntry resultValue = stab.lookup(kids[0].tok.text);
if (resultValue == null) {
j0.semerror("unknown type " + kids[0].tok.text);
}
assert resultValue != null;
typeInfo = resultValue.typeInfo;
if (typeInfo == null) {
j0.semerror(kids[0].tok.text + " has unknown type");
}
break;
case "token":
typeInfo = tok.type(stab);
break;
default:
j0.semerror("cannot check type of " + sym);
}
}
Each of these cases synthesizes the type of the node based on what it is and its children nodes. Some of these will utilize the symbol table as well to determine their types.
We short circuit this logic if a call to checkKids returns false
private boolean checkKids(boolean inCodeBlock) {
switch (sym) {
case "MethodDecl":
kids[1].checkType(true);
return true;
case "LocalVarDecl":
kids[1].checkType(false);
return true;
case "FieldAccess":
kids[0].checkType(inCodeBlock);
return true;
case "QualifiedName":
kids[0].checkType(inCodeBlock);
return false;
default:
if (kids != null) {
for (tree kid : kids) {
kid.checkType(inCodeBlock);
}
}
return false;
}
}
In our default case we fully traverse the children checking the types along the way. For the non-default cases in our switch, we only want to traverse one side of the tree. We also return false if we should skip the further evaluations in checkType (this is the case for when we recurse our kids or when we are looking at a qualified name).
Our real type checking logic lives inside checkTypes.
private TypeInfo checkTypes(TypeInfo operand1, TypeInfo operand2) {
String operator = getOp();
if (operand1 instanceof MethodType) {
operand1 = ((MethodType) operand1).returnType;
}
if (operand2 instanceof MethodType) {
operand2 = ((MethodType) operand2).returnType;
}
switch (operator) {
case "=":
case "+":
case "-":
case "param":
case "return":
tree tokenKid;
if ((tokenKid = findToken()) != null) {
System.out.print("line " + tokenKid.tok.lineno + ": ");
}
if (operand1.getBaseType().equals(operand2.getBaseType()) && ( operand1.getBaseType().equals("int") || operand1.getBaseType().equals("String"))) {
System.out.println("typecheck " + operator + " on a " + operand2.getBaseType() + " and a " + operand1.getBaseType() + " -> OK");
return operand1;
} else if (operand1.getBaseType().equals("class") && operand2.getBaseType().equals("class") && operator.equals("param") &&
((ClassType)operand1).name.equals(((ClassType)operand2).name)) {
System.out.println("typecheck " + operator + " on a " + ((ClassType) operand2).name + " and a " + ((ClassType) operand1).name + " -> OK");
return operand1;
} else if (operand1.getBaseType().equals("array") && operand2.getBaseType().equals("array") && operator.equals("=") &&
checkTypes(((ArrayType)(operand1)).elementType, ((ArrayType)(operand2)).elementType) != null) {
return operand1;
} else {
j0.semerror("typecheck " + operator + " on a " + operand2.getBaseType() + " and a " + operand1.getBaseType() + " -> FAIL");
}
default:
j0.semerror("don't know how to check " + operator);
return null;
}
}
This method determines what operation we are preforming on our two operands (we would expand these operators for a full set if this was a full language implementation). Using the operation, we explicitly dictate what operands are valid and what the output type of that evaluation should be (e.x. int + int should be int but int < int should be bool).
Our findToken helper swims down our tree to find the first token in the subtree (this is purely for generating a nice print message pointing back to the line number).
Our get op function is a simple map (with some rule checks on certain symbols).
private String getOp() {
switch (sym) {
case "Assignment":
return "=";
case "AddExpr":
return rule == 1320 ? "+" : "-";
case "MethodCall":
return "param";
case "ReturnStmt":
return "return";
default:
return sym;
}
}
Testing
Now we can build and test our type checker. Below is an example file
public class hello {
public static void main(String argv[]) {
int x;
x = 0;
x = x + "hello";
}
}
and the resulting type checker output
line 4: typecheck = on a int and a int -> OK
line 5: semantic error: typecheck + on a class and a int -> FAIL
Summary
- We generate internal node types by running a post-order traversal on our syntax tree
- We generate and verify types for internal nodes using operations preformed on operands
Complex Types
We have three complex types that will need some extra logic in order to properly synthesize and check their types.
Array Type
For the first time in a while, we will need to add a new token to our javalex.l file for the “new” keyword.
"new" { return j0.scan(parser.NEW); }
We will also create some new rules for arrays in our j0gram.y
Primary: Literal | FieldAccess | MethodCall | ArrayAccess | '(' Expr ')' {
$$=$2;} | ArrayCreation | InstanceCreation;
ArrayCreation: NEW Type '[' Expr ']' {
$$=j0.node("ArrayCreation", 1260, $2, $4);};
LeftHandSide: Name | FieldAccess | ArrayAccess ;
ArrayAccess: Name '[' Expr ']' {
$$=j0.node("ArrayAccess", 1390, $1, $3);};
In checkType inside of tree, we must add some cases for array creation and array accesses.
case "ArrayCreation":
typeInfo = new ArrayType(kids[0].typeInfo);
break;
case "ArrayAccess":
if (kids[0].typeInfo.getBaseType().startsWith("array ")) {
if (kids[1].typeInfo.getBaseType().equals("int")) {
typeInfo = ((ArrayType)(kids[0].typeInfo)).elementType;
} else {
j0.semerror("subscripting array with " + kids[1].typeInfo.getBaseType());
}
}
break;
Array creation just creates an ArrayType based on the element type, array access makes sure we are only accessing with an int and then returns the elementType.
Method Type
We already have a MethodHeader rule, we just need to make sure it calls calcType in its semantic actions.
MethodHeader: PUBLIC STATIC MethodReturnVal MethodDeclarator {
$$=j0.node("MethodHeader",1070,$3,$4);
j0.calcType($$);};
We need to add MethodDeclarator to the cases for assignType
case "MethodDeclarator":
TypeInfo[] paramList;
if (kids[1] != null) {
paramList = kids[1].makeSignature();
} else {
paramList = new TypeInfo[0];
}
this.typeInfo = new MethodType(paramList, typeInfo);
kids[0].typeInfo = this.typeInfo;
This creates our method type using the type synthesize from our sibling branch as the return value, we also have a list of parameters which we generate a TypeInfo array for using the makeSignature helper function.
private TypeInfo[] makeSignature() {
switch (sym) {
case "FormalParm":
return new TypeInfo[]{kids[0].typeInfo};
case "FormalParmList":
TypeInfo[] typeInfo1 = kids[0].makeSignature();
TypeInfo[] typeInfo2 = kids[1].makeSignature();
TypeInfo[] typeInfos = new TypeInfo[typeInfo1.length + typeInfo2.length];
System.arraycopy(typeInfo1, 0, typeInfos, 0, typeInfo1.length);
System.arraycopy(typeInfo2, 0, typeInfos, typeInfo1.length, typeInfo2.length);
return typeInfos;
default:
return null;
}
}
This function basically compiles a list of all our method params’ type infos and returns them.
The next piece of the puzzle is to check the provided parameter types when a call is made.
case "MethodCall":
if (rule == 1290) {
symtabEntry symtabEntry;
MethodType methodType;
if (kids[0].sym.equals("QualifiedName")) {
methodType = (MethodType) kids[0].dequalify();
checkSignature(methodType);
} else {
if (!kids[0].sym.equals("token")) {
j0.semerror("can't check type of " + kids[0].sym);
}
if (kids[0].tok.cat == parser.IDENTIFIER) {
symtabEntry = stab.lookup(kids[0].tok.text);
if (symtabEntry != null) {
if (!(symtabEntry.typeInfo instanceof MethodType)) {
j0.semerror("method expected, got " + symtabEntry.typeInfo.getBaseType());
}
methodType = (MethodType) symtabEntry.typeInfo;
checkSignature(methodType);
}
} else {
j0.semerror("can't typecheck token " + kids[0].tok.cat);
}
}
}
break;
This code works to drill down to the method type for the node we are at. For QualifiedNames like System.out.println() it uses a helper function dequalify, to get to get down to the actual method.
Once we have the methodType, we call the helper function checkSignature.
private void checkSignature(MethodType signature) {
int expectedNumberOfParams = signature.parameters.length;
int actualNUmberOfParams = 1;
tree kid = kids[1];
if (kid == null && expectedNumberOfParams != 0) {
j0.semerror("0 params, expected " + expectedNumberOfParams);
} else if (kid != null) {
while (kid.sym.equals("ArgList")) {
actualNUmberOfParams++;
kid = kid.kids[0];
}
if (actualNUmberOfParams != expectedNumberOfParams) {
j0.semerror(actualNUmberOfParams + " parameters, expected " + expectedNumberOfParams);
}
kid = kids[1];
int paramCursor = expectedNumberOfParams - 1;
while (kid.sym.equals("ArgList")) {
checkTypes(kid.kids[1].typeInfo, signature.parameters[paramCursor]);
kid = kid.kids[0];
paramCursor--;
}
checkTypes(kid.typeInfo, signature.parameters[0]);
}
typeInfo = signature.returnType;
}
This method verifies that we have provided the correct number of parameters first. Then, it compares the type of the expected parameters and what we have provided to ensure the types all match up. Finally, we return the actual return type of the method.
Along with type checking the call signature for the method, we must make sure the return value is a valid type for whatever operation we want to do with it.
case "ReturnStmt":
symtabEntry symtabEntry = stab.lookup("return");
if (symtabEntry == null) {
j0.semerror("stab did not find a return type");
}
TypeInfo returnType = symtabEntry.typeInfo;
if (kids[0].typeInfo != null) {
typeInfo = checkTypes(returnType, kids[0].typeInfo);
} else if (!returnType.getBaseType().equals("void")) {
j0.semerror("void return from non-void method");
}
break;
This ReturnStmt case lives inside our checkType function, as part of our populateSymbolTables, we insert a fake “return” symbol for our methods. This allows us to grab the method return type to see if it matches what we expect it to be for our operation with ease.
Class Type
For our class type, we will need a new symbol called Instance Creation
Primary: Literal | FieldAccess | MethodCall | ArrayAccess | '(' Expr ')' {
$$=$2;} | ArrayCreation | InstanceCreation;
InstanceCreation: NEW Name '(' ArgListOpt ')' {
$$=j0.node("InstanceCreation", 1261, $2, $4);};
We will also create a makeClass method on the tree class, we will call this on the root in j0.semantic between the checkSymbolTables call and the checkType call.
public static void semantic(parserVal rootParserVal) {
//...
root.checkSymbolTables();
root.makeClass();
root.checkType(false);
}
public void makeClass() {
if (sym.equals("ClassDecl")) {
// counts number of fields and methods this class has
int numMethods = 0;
int numFields = 0;
symtabEntry resultEntry = stab.lookup(kids[0].tok.text);
for (String key : resultEntry.subscopeSymbolTable.table.keySet()) {
symtabEntry symtabEntry = resultEntry.subscopeSymbolTable.table.get(key);
if (symtabEntry.typeInfo.getBaseType().startsWith("method ")) {
numMethods++;
} else {
numFields++;
}
}
//populates our fields and methods
Parameter fields[] = new Parameter[numFields];
Parameter methods[] = new Parameter[numMethods];
int methodCur = 0;
int fieldCur = 0;
for (String key : resultEntry.subscopeSymbolTable.table.keySet()) {
symtabEntry symtabEntry = resultEntry.subscopeSymbolTable.table.get(key);
if (symtabEntry.typeInfo.getBaseType().startsWith("method ")) {
methods[methodCur++] = new Parameter(key, symtabEntry.typeInfo);
} else {
fields[fieldCur++] = new Parameter(key, symtabEntry.typeInfo);
}
}
resultEntry.typeInfo = new ClassType(kids[0].tok.text, resultEntry.subscopeSymbolTable, fields, methods, new TypeInfo[0]);
} else {
for (int i = 0; i < nkids; i++) {
if (kids[i].nkids > 0) {
kids[i].makeClass();
}
}
}
}
In this tree traversal, we swim down the tree finding all the ClassDecl nodes. For each of these nodes, we sum up the methods and fields the class has and then populate them using the symbol table entries’ type infos. We then create our ClassType for these nodes using that information.
Our class method calls are already type checked thanks to our MethodCall logic and our dequalify function. All that we need to type check specifically for classes is the creation of them.
case "InstanceCreation":
symtabEntry resultValue = stab.lookup(kids[0].tok.text);
if (resultValue == null) {
j0.semerror("unknown type " + kids[0].tok.text);
}
assert resultValue != null;
typeInfo = resultValue.typeInfo;
if (typeInfo == null) {
j0.semerror(kids[0].tok.text + " has unknown type");
}
break;
We assume the type info is already generated and in the symbol table because we have called the makeClass function on the root. Therefore, we can just grab that type knowing it is already present.
Summary
- Creating complex types requires us to write out how types are checked and synthesized or assigned.
- Array Types are extremely easy and can pull their type directly from what they are an array of.
- Class types required us to swim through the tree and create the ClassType in their symbol tables. But after that, checking the types became a simple lookup.
- Method Types are the most complex, they require an assignment step to swim down the return and parameter types to generate their type signature. On checking calling methods, we have to match the parameter types given to the signature and dequalify any instance methods. We also need to populate and check our symbol table for the method return type.
Conclusion
In this section, we talked about and implemented basic and advanced type synthesis by using tree traversals. We then used these generated types to check operations preformed on our operands and flag any illegal actions. At this point we have finished the front half of compilation where the source code is analyzed, we have a good set of semantic validation checks and have created some useful data structures for our next steps of generating lower level code.