Build Your Own Programming Language: Part 3 The Syntax Tree

Matthew MacFarquhar
9 min readJun 4, 2024

--

Introduction

In this series, I will be working through this book on building your own programming language. The book is quite large and in depth, but in these articles I will be distilling the big picture pieces of the book into bite size tutorials and walk throughs. The book offers tracks for building our new language using icon — an esoteric language that the author made contributions to — and java. I will be using Java throughout this tutorial.

In this section, we will learn about the Syntax Tree. The Syntax tree is the main data structure to be used by our later stages in our compiler. It is an easily readable and traversable data structure making it a great representation of our source code.

The code for this demo is found in this repo at the commit here

Make File

Before we jump into building out our tree, I want to make our lives a little easier to build our code. Previously, we would run a series of growing commands every time we want to create our lexer and parser. This is kind of annoying. Instead, we can specify all our build processes in a makefile which will only build targets when the dependencies of the target change.

all: java
BYSRC=parser.java parserVal.java Yylex.java
JSRC=j0.java tree.java token.java yyerror.java serial.java $(BYSRC)
BYJOPTS= -Jclass=parser
BYJIMPS= -Jyylex=j0.yylex -Jyyerror=yyerror.yyerror
java: j0.class

j: java
java j0 hello.java
dot -Tpng hello.java.dot > hello.png

j0.class: $(JSRC)
javac $(JSRC)

parser.java parserVal.java: j0gram.y
yacc.irix $(BYJOPTS) $(BYJIMPS) j0gram.y

Yylex.java: javalex.l
jflex javalex.l

Our makefile builds our j0.class. J0 depends on the JSRC — and transitively the BSRC — files (j0.java tree.java token.java yyerror.java serial.java parser.java parserVal.java Yylex.java).

Our Yylex.java needs to be recompiled every time javalex.l is changed, and is done so by running jflex.

Our parser and parserVal java files depend on j0gram.y and use yacc along with some arguments to generate themselves.

This makefile will make our lives a lot easier going forward when building our executables.

Creating The Syntax Tree

With that process improvement out of the way, let’s jump into building our Syntax Tree!

Defining Our Tree

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class tree {
int id;
String sym;
int rule;
int nkids;
token tok;
tree[] kids;

public tree(String s, int r, token t) {
id = serial.getid();
sym = s;
rule = r;
tok = t;
}

public tree(String s, int r, tree[] t) {
id = serial.getid();
sym = s;
rule = r;
nkids = t.length;
kids = t;
}

public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
if (tok != null) {
System.out.println(id + " " + tok.text + " (" + tok.cat + "): " + tok.lineno);
} else {
System.out.println(id + " " + sym + " (" + rule + "): " + nkids);
for (int i=0; i < nkids; i++) {
kids[i].print(level + 1);
}
}
}

}

(Note: we have a very simple static serial class to give the nodes a unique id)

Our Tree is incredibly simple, it has two constructors: one which takes a token for the leaf nodes and one which takes a list of other child subtrees for the non leaf nodes (aka internal nodes).

We have a helper function to print out a node, as you can see this print is recursive and prints the nodes children, so printing the root will print out our entire tree.

Building The Tree

Now we have to discuss how the tree is built during parsing.

First, we will update our scan method in J0

public static int scan(int cat) {
par.yylval = new parserVal(new tree("token", cat, new token(cat, yytext(), yylineno)));
return cat;
}

This wraps our tokens in a tree object (in this case they are the leaf nodes).

We have our leaves, now we will also expose a static function called node in j0, which our parser will use to build the internal nodes.

public static parserVal node(String s, int r, parserVal... p) {
tree[] t = new tree[p.length];
for (int i = 0; i < t.length; i++) {
t[i] = (tree)(p[i].obj);
}
return new parserVal(new tree(s,r,t));
}

This function will take in the symbol, rule # (which we will use in later sections) and a list of the children nodes (wrapped in parserVals). It will create a new tree node for this symbol and construct a list of its children nodes. Finally, it will return the resulting tree node wrapped inside a parserVal.

Lastly, we will create a print function that our root rule in our CFG will call

public static void print(parserVal root) {
((tree)root.obj).print(0);
}

We update our Context Free Grammar to call these methods in its semantic actions.

%token BREAK DOUBLE ELSE FOR IF INT RETURN VOID WHILE
%token IDENTIFIER CLASSNAME CLASS STRING BOOL
%token INTLIT DOUBLELIT STRINGLIT BOOLLIT NULLVAL
%token LESSTHANOREQUAL GREATERTHANOREQUAL
%token ISEQUALTO NOTEQUALTO LOGICALAND LOGICALOR
%token INCREMENT DECREMENT PUBLIC STATIC
%%
ClassDecl: PUBLIC CLASS IDENTIFIER ClassBody {
$$=j0.node("ClassDecl",1000,$3,$4);
j0.print($$);
} ;
ClassBody: '{' ClassBodyDecls '}' { $$=j0.node("ClassBody",1010,$2); }
| '{' '}' { $$=j0.node("ClassBody",1011); };
ClassBodyDecls: ClassBodyDecl
| ClassBodyDecls ClassBodyDecl {
$$=j0.node("ClassBodyDecls",1020,$1,$2); };
ClassBodyDecl: FieldDecl | MethodDecl | ConstructorDecl ;
FieldDecl: Type VarDecls ';' {
$$=j0.node("FieldDecl",1030,$1,$2); };
Type: INT | DOUBLE | BOOL | STRING | Name ;

Name: IDENTIFIER | QualifiedName ;
QualifiedName: Name '.' IDENTIFIER {
$$=j0.node("QualifiedName",1040,$1,$3);};

VarDecls: VarDeclarator | VarDecls ',' VarDeclarator {
$$=j0.node("VarDecls",1050,$1,$3); };
VarDeclarator: IDENTIFIER | VarDeclarator '[' ']' {
$$=j0.node("VarDeclarator",1060,$1); };

MethodReturnVal : Type | VOID ;
MethodDecl: MethodHeader Block {
$$=j0.node("MethodDecl",1380,$1,$2);
};
MethodHeader: PUBLIC STATIC MethodReturnVal MethodDeclarator {
$$=j0.node("MethodHeader",1070,$3,$4); };
MethodDeclarator: IDENTIFIER '(' FormalParmListOpt ')' {
$$=j0.node("MethodDeclarator",1080,$1,$3); };

FormalParmListOpt: FormalParmList | ;
FormalParmList: FormalParm | FormalParmList ',' FormalParm {
$$=j0.node("FormalParmList",1090,$1,$3); };
FormalParm: Type VarDeclarator {
$$=j0.node("FormalParm",1100,$1,$2);
};

ConstructorDecl: MethodDeclarator Block {
$$=j0.node("ConstructorDecl",1110,$1,$2); };

Block: '{' BlockStmtsOpt '}' {$$=j0.node("Block",1200,$2);};
BlockStmtsOpt: BlockStmts | ;
BlockStmts: BlockStmt | BlockStmts BlockStmt {
$$=j0.node("BlockStmts",1130,$1,$2); };
BlockStmt: LocalVarDeclStmt | Stmt ;

LocalVarDeclStmt: LocalVarDecl ';' ;
LocalVarDecl: Type VarDecls {
$$=j0.node("LocalVarDecl",1140,$1,$2); };

Stmt: Block | ';' | ExprStmt | BreakStmt | ReturnStmt
| IfThenStmt | IfThenElseStmt | IfThenElseIfStmt
| WhileStmt | ForStmt ;

ExprStmt: StmtExpr ';' ;

StmtExpr: Assignment | MethodCall ;

IfThenStmt: IF '(' Expr ')' Block {
$$=j0.node("IfThenStmt",1150,$3,$5); };
IfThenElseStmt: IF '(' Expr ')' Block ELSE Block {
$$=j0.node("IfThenElseStmt",1160,$3,$5,$7); };
IfThenElseIfStmt: IF '(' Expr ')' Block ElseIfSequence {
$$=j0.node("IfThenElseIfStmt",1170,$3,$5,$6); }
| IF '(' Expr ')' Block ElseIfSequence ELSE Block {
$$=j0.node("IfThenElseIfStmt",1171,$3,$5,$6,$8); };

ElseIfSequence: ElseIfStmt | ElseIfSequence ElseIfStmt {
$$=j0.node("ElseIfSequence",1180,$1,$2); };
ElseIfStmt: ELSE IfThenStmt {
$$=j0.node("ElseIfStmt",1190,$2); };
WhileStmt: WHILE '(' Expr ')' Stmt {
$$=j0.node("WhileStmt",1210,$3,$5); };

ForStmt: FOR '(' ForInit ';' ExprOpt ';' ForUpdate ')' Block {
$$=j0.node("ForStmt",1220,$3,$5,$7,$9); };
ForInit: StmtExprList | LocalVarDecl | ;
ExprOpt: Expr | ;
ForUpdate: StmtExprList | ;

StmtExprList: StmtExpr | StmtExprList ',' StmtExpr {
$$=j0.node("StmtExprList",1230,$1,$3); };

BreakStmt: BREAK ';' | BREAK IDENTIFIER ';' {
$$=j0.node("BreakStmt",1240,$2); };
ReturnStmt: RETURN ExprOpt ';' {
$$=j0.node("ReturnStmt",1250,$2); };

Primary: Literal | FieldAccess | MethodCall | '(' Expr ')' {
$$=$2;};
Literal: INTLIT | DOUBLELIT | BOOLLIT | STRINGLIT | NULLVAL ;

ArgList: Expr | ArgList ',' Expr {
$$=j0.node("ArgList",1270,$1,$3); };
FieldAccess: Primary '.' IDENTIFIER {
$$=j0.node("FieldAccess",1280,$1,$3); };

ArgListOpt: ArgList | ;
MethodCall: Name '(' ArgListOpt ')' {
$$=j0.node("MethodCall",1290,$1,$3); }
| Primary '.' IDENTIFIER '(' ArgListOpt ')' {
$$=j0.node("MethodCall",1291,$1,$3,$5); }
;

PostFixExpr: Primary | Name ;
UnaryExpr: '-' UnaryExpr {
$$=j0.node("UnaryExpr",1300,$1,$2); }
| '!' UnaryExpr {
$$=j0.node("UnaryExpr",1301,$1,$2); }
| PostFixExpr ;
MulExpr: UnaryExpr
| MulExpr '*' UnaryExpr {
$$=j0.node("MulExpr",1310,$1,$3); }
| MulExpr '/' UnaryExpr {
$$=j0.node("MulExpr",1311,$1,$3); }
| MulExpr '%' UnaryExpr {
$$=j0.node("MulExpr",1312,$1,$3); };
AddExpr: MulExpr
| AddExpr '+' MulExpr {
$$=j0.node("AddExpr",1320,$1,$3); }
| AddExpr '-' MulExpr {
$$=j0.node("AddExpr",1321,$1,$3); };
RelOp: LESSTHANOREQUAL | GREATERTHANOREQUAL | '<' | '>' ;
RelExpr: AddExpr | RelExpr RelOp AddExpr {
$$=j0.node("RelExpr",1330,$1,$2,$3); };

EqExpr: RelExpr
| EqExpr ISEQUALTO RelExpr {
$$=j0.node("EqExpr",1340,$1,$3); }
| EqExpr NOTEQUALTO RelExpr {
$$=j0.node("EqExpr",1341,$1,$3); };
CondAndExpr: EqExpr | CondAndExpr LOGICALAND EqExpr {
$$=j0.node("CondAndExpr", 1350, $1, $3); };
CondOrExpr: CondAndExpr | CondOrExpr LOGICALOR CondAndExpr {
$$=j0.node("CondOrExpr", 1360, $1, $3); };

Expr: CondOrExpr | Assignment ;
Assignment: LeftHandSide AssignOp Expr {
$$=j0.node("Assignment",1370, $1, $2, $3); };
LeftHandSide: Name | FieldAccess ;
AssignOp: '=' | INCREMENT | DECREMENT ;

After we completely swim up the production rules, we would have generated a complete syntax tree, which we also print out. We can now run java j0 hello.java (after having run make). It should print out the following ASCII representation of the tree.

63 ClassDecl (1000): 2
6 hello (266): 1
62 ClassBody (1010): 1
59 MethodDecl (1380): 2
32 MethodHeader (1070): 2
14 void (264): 2
31 MethodDeclarator (1080): 2
16 main (266): 2
30 FormalParm (1100): 2
20 String (266): 2
27 VarDeclarator (1060): 1
22 argv (266): 2
58 Block (1200): 1
53 MethodCall (1290): 2
46 QualifiedName (1040): 2
41 QualifiedName (1040): 2
36 System (266): 3
40 out (266): 3
45 println (266): 3
50 "hello, j0" (273): 3

Summary

  • Leaves are tree nodes with tokens, internal nodes are tree nodes with children nodes
  • We can create our tree nodes using semantic actions interleaved into the scanning and parsing stages
  • We can traverse our tree recursively to do whatever we want — in this case we are just printing it out.

Printing an Image of the Tree using GraphViz

Now we will work on making some modifications to the tree class to generate the image at the top of this article.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class tree {
int id;
String sym;
int rule;
int nkids;
token tok;
tree[] kids;

public tree(String s, int r, token t) {
id = serial.getid();
sym = s;
rule = r;
tok = t;
}

public tree(String s, int r, tree[] t) {
id = serial.getid();
sym = s;
rule = r;
nkids = t.length;
kids = t;
}

public void print(int level) {
for (int i = 0; i < level; i++) {
System.out.print(" ");
}
if (tok != null) {
System.out.println(id + " " + tok.text + " (" + tok.cat + "): " + tok.lineno);
} else {
System.out.println(id + " " + sym + " (" + rule + "): " + nkids);
for (int i=0; i < nkids; i++) {
kids[i].print(level + 1);
}
}
}

public void printGraph(String filename) {
try {
PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename)));
pw.printf("digraph {\n");
int j = 0;
printGraph(pw);
pw.printf("}\n");
pw.close();
} catch (IOException ioException) {
System.err.println("printgraph exception");
System.exit(1);
}
}

int j;
private void printGraph(PrintWriter pw) {
if (tok != null) {
printLeaf(pw);
return;
}
printBranch(pw);

for (int i = 0; i < nkids; i++) {
if (kids[i] != null) {
pw.printf("N%d -> N%d;\n", id, kids[i].id);
kids[i].printGraph(pw);
} else {
pw.printf("N%d -> N%d_%d;\n", id, id, j);
pw.printf("N%d%d [label=\"Empty rule\"];\n", id, j);
j++;
}
}
}

private void printLeaf(PrintWriter pw) {
String s = parser.yyname[tok.cat];
printBranch(pw);
pw.printf("N%d [shape=box style=dotted label=\" %s \\n", id, s);
pw.printf("text = %s \\l lineno = %d \\l\"];\n", escape(tok.text), tok.lineno);
}

private void printBranch(PrintWriter pw) {
pw.printf("N%d ",id);
pw.printf("[shape=box label=\"%s",prettyPrintName());
if (tok != null)
pw.printf("struct token* leaf %d", tok.id);
pw.printf("\"];\n");
}

private String escape(String s) {
if (s.charAt(0) == '\"')
return "\\"+s.substring(0, s.length()-1)+"\\\"";
else return s;
}

private String prettyPrintName() {
if (tok == null) {
return sym +"#"+(rule%10);
} else {
return escape(tok.text)+":"+tok.cat;
}
}
}

The functions printGraph, printBranch and printLeaf construct a .dot file corresponding to the DOT language (since we will be using the DOT executable to generate our png).

We also have to switch out our logic in our j0 print function to be

public static void print(parserVal root) {
((tree) root.obj).printGraph(yyfilename + ".dot");
}

After we make and run j0 against our hello.java we will get a dot file that looks like this.

Which we can then turn into a png by running

dot -Tpng hello.java.dot > hello.png

Conclusion

In this article we enhanced our parser and scanner to generate nodes for our Syntax Tree by utilizing semantic actions. We went through a basic traversal of our tree to print it out and generate the dot language file which could be fed into the DOT tool to create a visual representation of our tree. The Syntax Tree is the foundation on which we will build the rest of our compiler logic as we go through this book.

--

--

Matthew MacFarquhar
Matthew MacFarquhar

Written by Matthew MacFarquhar

I am a software engineer working for Amazon living in SF/NYC.

Responses (1)