Build Your Own Programming Language: Part 1 The Scanner
Introduction
In this series, I will be working through this book on building your own programming language. The book is quite large and in-depth, but in these articles I will distill its big-picture pieces into bite-size tutorials and walkthroughs. The book offers tracks for building our new language in either Icon — an esoteric language that the author contributed to — or Java. I will be using Java throughout this tutorial.
In this section, we will learn about building a source code scanner. A scanner — often also referred to as a lexical analyzer or lexer for short — has the job of taking in our source code and breaking it into meaningful tokens which contain useful information about the source code we will use later on.
A token will usually hold at least an integer category — which identifies whether the token is a variable, a particular keyword, a function name, or something else — and the actual text and/or value the token represents (e.g. “myVariable”).
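Conceptually, a token can be as small as a category plus the matched text. Here is a minimal illustrative sketch (the class and field names are my own, not from the book):

```java
// Minimal illustrative token: an integer category plus the matched text.
public class MiniToken {
    final int cat;      // e.g. 1 = identifier, 2 = integer literal
    final String text;  // the exact characters the scanner matched

    MiniToken(int cat, String text) {
        this.cat = cat;
        this.text = text;
    }

    public static void main(String[] args) {
        MiniToken t = new MiniToken(1, "myVariable");
        System.out.println(t.cat + " " + t.text); // prints "1 myVariable"
    }
}
```

We will build a much richer token class later in this article; this is just the core idea.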
The code for this demo is found in this repo at the commit here
A Simple Scanner
The Lex File
The book starts us off with a dead-simple lex file, nnws.l:
%%
%int
%%
[a-zA-Z]+ { return 1; }
[0-9]+ { return 2; }
[ \t\r\n]+ { }
. { simple.lexErr("unrecognized character"); }
The lex file contains a set of regular expressions that tell the scanner what token category to return when the input matches each pattern. When we see a sequence of alphabetic characters, we return a 1; when we see a sequence of digits, we return a 2. Whitespace is skipped over, and any other character is treated as an error. The header section declares what the lexer will return (an int).
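To build intuition for what the generated scanner does, here is a rough Java sketch of the same three rules using java.util.regex. This is only a hand-rolled approximation — JFlex compiles the rules into a much faster DFA, and unlike the real scanner, find() here silently skips unrecognized characters instead of raising an error:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NnwsSketch {
    // Mirrors nnws.l: letters -> category 1, digits -> category 2,
    // whitespace -> matched but producing no token.
    static final Pattern TOKEN =
        Pattern.compile("([a-zA-Z]+)|([0-9]+)|[ \\t\\r\\n]+");

    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            if (m.group(1) != null) tokens.add("token 1: " + m.group(1));
            else if (m.group(2) != null) tokens.add("token 2: " + m.group(2));
            // whitespace matches fall through: skipped, no token emitted
        }
        return tokens;
    }

    public static void main(String[] args) {
        scan("Dorrie is 1 fine puppy").forEach(System.out::println);
    }
}
```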
Generating Lexer Code
This book does not go over writing a scanner — which is quite an involved undertaking. Instead, we will use JFlex and our lex file to generate some code under a new class called Yylex.
To generate our code, we will run
jflex nnws.l
This will generate a simple Yylex for us. However, it is missing one thing: in our lex file, when we hit an unrecognized character, we call simple.lexErr. The generated Yylex expects a simple class with such a method, so we must now build it.
Supporting Semantic Actions
When our lexer finds a match, it executes whatever is between the {}. For most of our rules these were simple returns and required no custom code. For error handling, however, we delegated to a function, which we will encapsulate in a new class called simple.
import java.io.FileReader;

public class simple {
    static Yylex lex;

    public static void main(String[] argv) throws Exception {
        lex = new Yylex(new FileReader(argv[0]));
        int i;
        while ((i = lex.yylex()) != Yylex.YYEOF) {
            System.out.println("token " + i + ": " + yytext());
        }
    }

    public static String yytext() {
        return lex.yytext();
    }

    public static void lexErr(String s) {
        System.err.println(s + ": " + yytext());
        System.exit(1);
    }
}
This class will also be our entry point for running our lexical analysis.
First, let’s take a look at our lexErr function. We make it static so Yylex can call it without holding a simple object. All it does is print the error message and the text that caused it before exiting.
When we call this class’s main function, it reads the provided source code file into our Yylex constructor, then repeatedly scans for the next sequence our lexer matches, printing out a list of tokens.
Running this on something like
Dorrie is 1 fine puppy
would result in us printing
token 1: Dorrie
token 1: is
token 2: 1
token 1: fine
token 1: puppy
Summary
- We can create .l lex files, which map regexes over our source code to semantic actions in our host code
- We can use a code generation tool like JFlex to create a lexer for us in Java by providing the .l lex file; the generated class will be called Yylex
- We can provide our own semantic actions for use by Yylex, such as instructing it how to print our scanning errors
- To scan the source code, we continuously call Yylex to get the next matching token until we reach the end or encounter an error
J0 Scanner
This book focuses on making a language called j0, which is a subset of Java.
J0 Lex File
%%
%int
%%
"/*"([^*]|"*"+[^/*])*"*"+"/" { j0.comment(); }
"//".*\r?\n { j0.comment(); }
[ \t\r\f]+ { j0.whitespace(); }
\n { j0.newline(); }
"break" { return j0.scan(parser.BREAK); }
"double" { return j0.scan(parser.DOUBLE); }
"else" { return j0.scan(parser.ELSE); }
"false" { return j0.scan(parser.BOOLLIT); }
"for" { return j0.scan(parser.FOR); }
"if" { return j0.scan(parser.IF); }
"int" { return j0.scan(parser.INT); }
"null" { return j0.scan(parser.NULLVAL); }
"public" { return j0.scan(parser.PUBLIC); }
"return" { return j0.scan(parser.RETURN); }
"static" { return j0.scan(parser.STATIC); }
"string" { return j0.scan(parser.STRING); }
"true" { return j0.scan(parser.BOOLLIT); }
"bool" { return j0.scan(parser.BOOL); }
"void" { return j0.scan(parser.VOID); }
"while" { return j0.scan(parser.WHILE); }
"class" { return j0.scan(parser.CLASS); }
"(" { return j0.scan(j0.ord("("));}
")" { return j0.scan(j0.ord(")"));}
"[" { return j0.scan(j0.ord("["));}
"]" { return j0.scan(j0.ord("]"));}
"{" { return j0.scan(j0.ord("{"));}
"}" { return j0.scan(j0.ord("}"));}
";" { return j0.scan(j0.ord(";"));}
":" { return j0.scan(j0.ord(":"));}
"!" { return j0.scan(j0.ord("!"));}
"*" { return j0.scan(j0.ord("*"));}
"/" { return j0.scan(j0.ord("/"));}
"%" { return j0.scan(j0.ord("%"));}
"+" { return j0.scan(j0.ord("+"));}
"-" { return j0.scan(j0.ord("-"));}
"<" { return j0.scan(j0.ord("<"));}
"<=" { return j0.scan(parser.LESSTHANOREQUAL);}
">" { return j0.scan(j0.ord(">"));}
">=" { return j0.scan(parser.GREATERTHANOREQUAL);}
"==" { return j0.scan(parser.ISEQUALTO);}
"!=" { return j0.scan(parser.NOTEQUALTO);}
"&&" { return j0.scan(parser.LOGICALAND);}
"||" { return j0.scan(parser.LOGICALOR);}
"=" { return j0.scan(j0.ord("=")); }
"+=" { return j0.scan(parser.INCREMENT); }
"-=" { return j0.scan(parser.DECREMENT); }
"," { return j0.scan(j0.ord(",")); }
"." { return j0.scan(j0.ord(".")); }
[a-zA-Z_][a-zA-Z0-9_]* { return j0.scan(parser.IDENTIFIER); }
[0-9]+ { return j0.scan(parser.INTLIT); }
[0-9]*"."[0-9]*([eE][+-]?[0-9]+)? { return j0.scan(parser.DOUBLELIT); }
([0-9]+)([eE][+-]?([0-9]+)) { return j0.scan(parser.DOUBLELIT); }
\"[^\"]*\" { return j0.scan(parser.STRINGLIT); }
. { j0.lexErr("unrecognized character"); }
This lex file is quite a lot larger than our first one; it contains every regex match we will need to build our language.
A couple of things to note are
- Every semantic action now delegates to j0, which allows us to build more complex and useful tokens than plain ints
- We have semantic actions for dealing with comments, whitespace, and newlines, which we will need to implement
- The j0.scan semantic action uses constant values defined in parser, plus a j0.ord function, to indicate the numerical category of the matched token
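The ord idea is simple: a single-character token can use the character’s own code point as its category, so no explicit constant is needed for punctuation. A quick sketch of the same function we will write in j0:

```java
public class OrdSketch {
    // Mirrors j0.ord: the category of a one-character token is its char code.
    static short ord(String s) {
        return (short) s.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(ord("(")); // prints 40
        System.out.println(ord(";")); // prints 59
        System.out.println(ord("{")); // prints 123
    }
}
```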
Like before, we can generate our Yylex by running
jflex javalex.l
J0 Dependencies
Before we jump into writing J0, we will need to set up some classes it depends on.
First, we will need the parser constants we reference from the lexer
public class parser {
    public final static short BREAK=257;
    public final static short PUBLIC=258;
    public final static short DOUBLE=259;
    public final static short ELSE=260;
    public final static short FOR=261;
    public final static short IF=262;
    public final static short INT=263;
    public final static short RETURN=264;
    public final static short VOID=265;
    public final static short WHILE=266;
    public final static short IDENTIFIER=267;
    public final static short CLASSNAME=268;
    public final static short CLASS=269;
    public final static short STATIC=270;
    public final static short STRING=273;
    public final static short BOOL=274;
    public final static short INTLIT=275;
    public final static short DOUBLELIT=276;
    public final static short STRINGLIT=277;
    public final static short BOOLLIT=278;
    public final static short NULLVAL=280;
    public final static short LESSTHANOREQUAL=298;
    public final static short GREATERTHANOREQUAL=300;
    public final static short ISEQUALTO=301;
    public final static short NOTEQUALTO=302;
    public final static short LOGICALAND=303;
    public final static short LOGICALOR=304;
    public final static short INCREMENT=306;
    public final static short DECREMENT=307;
    public final static short YYERRCODE=256;
}
Notice how the constants start above 255, leaving the values 0–255 free to hold the ordinal char -> int categories of single-character tokens
We will also be utilizing a more complex token and therefore will encapsulate this complexity in its own token class.
public class token {
    public int cat;
    public String text;
    public int lineno;
    public int colno;
    public int ival;
    public String sval;
    public double dval;

    public token(int c, String s, int ln, int col) {
        cat = c;
        text = s;
        lineno = ln;
        colno = col;
        switch (cat) {
            case parser.INTLIT:
                ival = Integer.parseInt(s);
                break;
            case parser.DOUBLELIT:
                dval = Double.parseDouble(s);
                break;
            case parser.STRINGLIT:
                sval = deEscape(s);
                break;
        }
    }

    private String deEscape(String sin) {
        StringBuilder sout = new StringBuilder();
        sin = sin.substring(1, sin.length() - 1);
        while (sin.length() > 0) {
            char c = sin.charAt(0);
            if (c == '\\') {
                sin = sin.substring(1);
                if (sin.length() < 1) {
                    j0.lexErr("malformed string literal");
                } else {
                    c = sin.charAt(0);
                    switch (c) {
                        case 't':
                            sout.append('\t');
                            break;
                        case 'n':
                            sout.append('\n');
                            break;
                        default:
                            j0.lexErr("unrecognized escape");
                    }
                }
            } else {
                sout.append(c);
            }
            sin = sin.substring(1);
        }
        return sout.toString();
    }
}
This token stores the token text, the category int, the line and column number where the token was found, as well as the value it holds if it is a literal: int, double, or String. Ints and doubles can be parsed directly, but for strings we create a dedicated method to strip the quotation marks and process backslash escapes.
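To see what deEscape does in isolation, here is a standalone sketch of the same logic, with the error handling simplified to an exception rather than a call to j0.lexErr:

```java
public class DeEscapeSketch {
    // Strips the surrounding quotes and processes \t and \n escapes,
    // mirroring token.deEscape.
    static String deEscape(String sin) {
        StringBuilder sout = new StringBuilder();
        sin = sin.substring(1, sin.length() - 1); // drop the quotes
        while (sin.length() > 0) {
            char c = sin.charAt(0);
            if (c == '\\') {
                sin = sin.substring(1);
                if (sin.length() < 1)
                    throw new IllegalArgumentException("malformed string literal");
                c = sin.charAt(0);
                switch (c) {
                    case 't': sout.append('\t'); break;
                    case 'n': sout.append('\n'); break;
                    default:
                        throw new IllegalArgumentException("unrecognized escape");
                }
            } else {
                sout.append(c);
            }
            sin = sin.substring(1);
        }
        return sout.toString();
    }

    public static void main(String[] args) {
        // The scanner hands us the literal with its quotes still attached.
        System.out.println(deEscape("\"a\\tb\"").equals("a\tb")); // prints true
    }
}
```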
J0
Let’s remind ourselves of what J0 must do for us.
- It must provide functions to deal with errors in scanning: lexErr()
- It must provide functions to deal with comments, whitespaces and newlines: comment(), whitespace(), and newline()
- It must provide an ord function to take in some char and return an int for it
- It must provide a scan function, which will be the workhorse of the semantic actions we perform
import java.io.FileReader;

public class j0 {
    static Yylex lex;
    public static int yylineno;
    public static int yycolno;
    public static token yylval;

    public static void main(String[] argv) throws Exception {
        lex = new Yylex(new FileReader(argv[0]));
        yylineno = 1;
        yycolno = 1;
        int i;
        while ((i = lex.yylex()) != Yylex.YYEOF) {
            System.out.println("token " + i + ":" + yylineno + " " + yytext());
        }
    }

    public static String yytext() {
        return lex.yytext();
    }

    public static void lexErr(String s) {
        System.err.println(s + ": line " + yylineno + ": " + yytext());
        System.exit(1);
    }

    public static int scan(int cat) {
        yylval = new token(cat, yytext(), yylineno, yycolno);
        yycolno += yytext().length();
        return cat;
    }

    public static short ord(String s) {
        return (short) s.charAt(0);
    }

    public static void newline() {
        yylineno++;
        yycolno = 1;
    }

    public static void whitespace() {
        yycolno += yytext().length();
    }

    public static void comment() {
        String s = yytext();
        int len = s.length();
        for (int i = 0; i < len; i++) {
            if (s.charAt(i) == '\n') {
                yylineno++;
                yycolno = 1;
            } else {
                yycolno++;
            }
        }
    }
}
We have now defined all the helper functions needed by our lex file’s semantic actions. You may notice we are not doing much with the token yet — just saving it to yylval. It will come in handy later as we move further along in our language’s pipeline.
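The line and column bookkeeping deserves a closer look. The sketch below replicates how comment() walks a matched comment and updates the counters, as standalone code with the counters as static state:

```java
public class PositionSketch {
    static int yylineno = 1;
    static int yycolno = 1;

    // Mirrors j0.comment(): advance the column for each character,
    // resetting to column 1 whenever the comment spans a newline.
    static void comment(String matched) {
        for (int i = 0; i < matched.length(); i++) {
            if (matched.charAt(i) == '\n') {
                yylineno++;
                yycolno = 1;
            } else {
                yycolno++;
            }
        }
    }

    public static void main(String[] args) {
        comment("/* one\n two */");
        // 6 chars before the newline, then line 2 and 7 chars after it
        System.out.println(yylineno + " " + yycolno); // prints "2 8"
    }
}
```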
To see everything we just did, we can scan some example Java code using our J0 scanner.
public class hello {
    public static void main(String[] argv) {
        System.out.println("hello, jzero!");
    }
}
Then we can run the following to build and run J0 on this Java file
jflex javalex.l
javac j0.java Yylex.java
javac token.java parser.java
java j0 hello.java
You should see a list of tokens printed out alongside their line number and text.
Summary
- Helper functions in the semantic actions of a lex file are very helpful and allow us to perform complex actions on a match
- Tokens should store lots of useful information like literal values, line numbers, and the token text
Conclusion
In this article we explored building a scanner for our very own programming language! We used tooling to transform our regex-based lex file into a full-fledged Java lexer, into which we injected helper functionality. While this auto-generation works well for small toy examples like this, there are cases where regular expressions will not be enough to handle the edge cases in your language. In such cases, you will need to build your scanner by hand. A hand-written scanner can be more readable and also gives you more power to create interesting quirks and features in your language.