Build Your Own Programming Language: Part 4 The Symbol Table
Introduction
In this series, I will be working through this book on building your own programming language. The book is quite large and in depth, but in these articles I will be distilling the big picture pieces of the book into bite size tutorials and walk throughs. The book offers tracks for building our new language using icon — an esoteric language that the author made contributions to — and java. I will be using Java throughout this tutorial.
In this section, we will learn about symbol tables. Symbol tables will be the first thing we build by traversing our syntax tree, they will be very useful for determining things like scope and values of variables as well as identifying errors like re-declarations of variables or usage of undeclared ones.
The code for this demo is found in this repo at the commit here
Building The Symbol Table
Before we can use the symbol tables we build for anything useful, we will first have to determine their structures and create them. One mistake that beginner language designers make that the book calls out is creating a single symbol table for our entire program. This is an anti-pattern and we will instead be building a symbol table per scope and linking them together with parent and child pointers.
Symbol table structure
At the end of the day, symbol tables are really just HashMaps with the key being the symbol and the value being an entry containing data about that symbol.
We will start looking at the entry.
public class symtabEntry {
String sym;
symtab parentSymbolTable;
symtab subscopeSymbolTable;
boolean isConst;
public symtabEntry(String sym, symtab parentSymbolTable, symtab subscopeSymbolTable, boolean isConst) {
this.sym = sym;
this.parentSymbolTable = parentSymbolTable;
this.subscopeSymbolTable = subscopeSymbolTable;
this.isConst = isConst;
}
public symtabEntry(String sym, symtab parentSymbolTable, boolean isConst) {
this.sym = sym;
this.parentSymbolTable = parentSymbolTable;
this.isConst = isConst;
}
public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
System.out.print(sym);
if (isConst) {
System.out.print(" (const)");
}
System.out.println();
if (subscopeSymbolTable != null) {
subscopeSymbolTable.print(level + 1);
}
}
}
Our entry is quite simple for now: it has the symbol, a reference to the table it is in, a boolean to tell us if this symbol is readOnly or not and — if this symbol has some sub-scope like if it were a class or method — a link to the sub scope symbol table.
The isConst is a good example of what you would store in a symbol table entry, other examples could be access modifies (public/private) or static modifiers. Our language is simple so we won’t need to add this extra information.
Now we can build the symbol table, it is really just a light wrapper around a hash map.
import java.util.HashMap;
public class symtab {
String scope;
symtab parent;
HashMap<String, symtabEntry> table;
symtab(String sc) {
this.scope = sc;
this.table = new HashMap<String,symtabEntry>();
}
public symtab(String scope, symtab parent) {
this.scope = scope;
this.parent = parent;
this.table = new HashMap<String, symtabEntry>();
}
//NOTE: in the github version this function does not
// check parent scopes when doing lookups, I fixed that in later revisions
// to match this
public symtabEntry lookup(String sym) {
symtabEntry result;
result = table.get(sym);
if (result != null) {
return result;
} else if (parent != null) {
return parent.lookup(sym);
} else {
return null;
}
}
public void insert(String scope, boolean isConst, symtab sub) {
if (table.containsKey(scope)) {
j0.semerror("redeclaration of " + scope);
} else {
sub.parent = this;
table.put(scope, new symtabEntry(scope, this, sub, isConst));
}
}
public void insert(String scope, boolean isConst) {
if (table.containsKey(scope)) {
j0.semerror("redeclaration of " + scope);
} else {
table.put(scope, new symtabEntry(scope, this, isConst));
}
}
public void print() {
print(0);
}
public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
System.out.println(scope + " - " + table.size() + " symbols");
for (symtabEntry entry : table.values()) {
entry.print(level + 1);
}
}
}
We have a scope name which will let us encode if this table is for some class, method, global or some other user defined scope. We also have a reference to our parent symbol table — which comes in handy when we traverse up our parent chain to preform a symbol lookup.
Our insertion logic is literally just a HashMap insert with some wrapper to produce an error if we try to re-declare the symbol within a single table.
Making The Symbol Tables
Now that we have the structure set up, we are ready to build up the symbol tables in our code.
For things like built-ins and global functions, we can create symbol tables before we even start parsing our tree.
public static void semantic(parserVal rootParserVal) {
tree root = (tree)(rootParserVal.obj);
symtab outSymTab, systemSymTab;
globalSymTab = new symtab("global");
systemSymTab = new symtab("class");
outSymTab = new symtab("class");
outSymTab.insert("println", false);
systemSymTab.insert("out", false, outSymTab);
globalSymTab.insert("System", false, systemSymTab);
root.makeSymbolTables(globalSymTab);
root.populateSymbolTables();
root.checkSymbolTables();
globalSymTab.print();
}
Inside of J0.semantic, we have created symbol tables for global, system and out. We have linked global -> System -> out tables together and then inserted a println symbol into out. This allows us to recognize the classic java System.out.println syntax when checking for undeclared variable usage.
We also call makeSymbolTables on our root syntax tree node — passing in the global symbol table. This will build out our symbol tables that are created in our code.
public void makeSymbolTables(symtab curr) {
stab = curr;
switch (sym) {
case "ClassDecl":
curr = new symtab("class", curr);
break;
case "MethodDecl":
curr = new symtab("method", curr);
break;
}
for (int i = 0; i < nkids; i++) {
kids[i].makeSymbolTables(curr);
}
}
This function will recurse through our syntax tree, we assign each tree node the symbol table we are currently on when we arrive at the node. We create new symbol tables if we have a sub-scope like a class or method. When we create these sub-scopes, we pass in our current scope for them to use as their parent and continue down the tree. At the end, we will have a syntax tree where each node has a reference to its symbol table that it can use to preform symbol lookups.
Populating The Symbol Tables
Now our symbol tables are built and hooked up to our syntax tree nodes, but they have no symbols in them.
public void populateSymbolTables() {
switch (sym) {
case "ClassDecl":
stab.insert(kids[0].tok.text, false, kids[0].stab);
break;
case "FieldDecl":
case "LocalVarDecl":
tree kid = kids[1];
while (kid != null && kid.sym.equals("VarDecls")) {
insertVarDeclarator(kid.kids[1]);
kid = kids[0];
}
return;
case "MethodDecl":
stab.insert(kids[0].kids[1].kids[0].tok.text, false, kids[0].stab);
break;
case "FormalParam":
insertVarDeclarator(kids[1]);
return;
}
for (int i = 0; i < nkids; i++) {
tree kid = kids[i];
kid.populateSymbolTables();
}
}
private void insertVarDeclarator(tree varDeclarator) {
if (varDeclarator.tok != null) {
stab.insert(varDeclarator.tok.text, false);
} else {
insertVarDeclarator(varDeclarator.kids[0]);
}
}
These two methods will traverse our tree once again, this time inserting values into the current tree node’s symbol table when it sees any declarations of methods, variables, classes, etc… After this is complete, we have a completely populated symbol table with all of our declarations properly stored at each of their scope levels.
Summary
- We should have multiple symbol tables — one for each level of scope
- A symbol table is just a HashMap where the key is the declared name of the thing and the value is an entry we create which can hold information about the item we declared
- Symbol tables should link to their children and parent tables to make lookups of symbols easy
- We can create and populate our symbol tables by traversing our syntax tree and preforming operations on our tables when we see specific node types.
Symbol Analysis
Lets put our hard work to good use and actually do something with our symbol tables.
Undeclared Variable Usage
We have already seen re-declaration checking, it was a very simple guard that lived inside our symbol table class’s insert method and would complain if the key already exists.
The other half of the equation is making sure we are not using any symbols which do not actually exist.
public void checkSymbolTables() {
checkCodeBlocks();
}
private void checkCodeBlocks() {
tree kid;
if (sym.equals("MethodDecl")) {
kids[1].checkBlock();
} else {
for (int i = 0; i < nkids; i++) {
kid = kids[i];
kid.checkCodeBlocks();
}
}
}
These functions will be called starting at the root, we will recurse on our children nodes until we find a method declaration, which will prompt us to check the scope of its code block execution for usage of undeclared variables.
private void checkBlock() {
switch (sym) {
case "IDENTIFIER":
if (stab.lookup(tok.text) == null) {
j0.semerror("undeclared variable " + tok.text);
}
break;
case "FieldAccess":
case "QualifiedName":
kids[0].checkBlock();
break;
case "MethodCall":
kids[0].checkBlock();
if (rule == 1290) {
kids[1].checkBlock();
} else {
kids[2].checkBlock();
}
break;
case "LocalVarDecl":
break;
default:
for (int i = 0; i < nkids; i++) {
kids[i].checkBlock();
}
}
}
This function also traverses the tree, but this time within an execution block. If we find a declaration, we break since this would be when we are putting something in the symbol table which is not what we are checking for. We only traverse the nodes which may preform accesses of symbols: MethodCall, QualifiedName and FieldAccess.
When we drill down to the underlying IDENTIFIER — which in essence is what our symbols are — we do a lookup to ensure that that symbol is somewhere in our hierarchy of symbol tables. If it is not, we will report the undeclared variable error.
Visualizing the Tables
It is always nice to get a visual understanding of what our data structures look like. For our symbol tables we have the following function in our symbol table class.
public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
System.out.println(scope + " - " + table.size() + " symbols");
for (symtabEntry entry : table.values()) {
entry.print(level + 1);
}
}
Which will print some spaces for each level we go into the scope hierarchy, then print some of its information and finally ask the entries to print their information.
public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
System.out.print(sym);
if (isConst) {
System.out.print(" (const)");
}
System.out.println();
if (subscopeSymbolTable != null) {
subscopeSymbolTable.print(level + 1);
}
}
The entries will print their information, as well as setting off prints for any sub-scope symbol tables they have.
Below is what we get for our helloWorld.java symbol table print out.
global - 2 symbols
hello
class - 1 symbols
main
method - 0 symbols
System
class - 1 symbols
out
class - 1 symbols
println
Summary
- Symbol tables can be used to analyze re-declared or undeclared variables
- We can traverse our symbol tables in a similar manner as we would our syntax tree
Conclusion
In this article we used our syntax tree to build another useful data structure for our compilation process — the symbol table. We used our symbol table to store information about our variable, class and method declarations and learned how to use the table to verify important semantic checks like re-declarations or usages of undeclared symbols.