Part 4 of kscript (Parsing actual expressions): Writing a dynamic, interpreted, duck-typed language

 

This is part 4 of my series on kscript.

This post will deal with beginning to parse expressions. Now, things start to get wormy as we invent our own syntax for the language

Up until now, we’ve just implemented things in C. But, we really want to start implementing things outside of C, in our new language, kscript.

Luckily, we can dust off some parsing algorithms, and we can transform text into instructions for the computer. At this point in the conversation, many people will suggest using yacc, bison, lex, flex, etc, etc, and start talking about EBNF grammars.

I’m not going to do that at all. We’re going to be writing the parser, scanner, tokenizer, etc all from scratch. It’ll support functions, operators, operator precedence, errors, helpful error messages, types, and more, and using no external libraries or tools other than what we’ve written so far, and the C standard library.

We’ll also have to start an AST implementation so that we can have have it in a computer readable format.

In order to do all of this, we will use a modified Shunting Yard Algorithm, by the programming-god Dijkstra himself. It is a shame how few people know of this algorithm. It is so beautiful, simple, and efficient, it’s truly a shame to see people shy away from using it for compilers.

In any case, we’ll get to implementing it in a bit. First, we need some definitions:

  • an ident: anything that can be an identifier, such as a name, variable, function, etc (regex: [a-zA-Z_][a-zA-Z0-9_]*), can start with any alphabetical character (or underscore), followed by alphanumerics, digits, or underscore
  • Constant: can be integer, float, or string. Examples: 234, 2.34, "hello2", etc

Implementing an AST library

What Even is an AST?

An AST (short for Abstract Syntax Tree) is a tree that represents computations. This means that an AST doesn’t do any of the computation, it just is in a computer (and somewhat human) readable format.

An ast can contain other ASTs, or it can be a terminal AST (like a constant integer, a variable name, etc). ASTs that contain others (like operators, function calls, assignment) must recursively evaluate their children, and then use their values to do something with. L

For example, take the expression: a = 3 * b ** 5 + c. We want to represent this in a way a computer can understand.

Here’s a way both us and our computers can understand:

AST

The top of the tree represents the last operation to be done. The nodes at the bottom represent our building blocks. Think of how PEMDAS works, you evaluate whatever you can first. Take a concrete example like 3 + 4 * 2. You start evaluating 4 * 2 == 8, then 3 + (4 * 2) == 3 + (8) == 3 + 8 == 11.The computer needs a way to know how to get the 8 for the right hand side of the + operator. But, in order to do that, it needs to first find the value of the sub expression 4 * 2. Eventually, we find that we can do this, because we know how to calculate a number times a number. Then the result “bubbles” back up, and combines with the literal 3, since we know how to add a number times a number.

So, the process for humans and computers to evaluate an expression isn’t too complicated after all, it just uses a bit of recursion. Go back to our picture, and realize that now the computer can start at the bottom, replacing what it knows. Here’s a (very bad) illustration of that:

AST

Eventually, I want to make a video explaining this, and I can show the process by hand, but for now you can check out the drawing. Just work your way upwards, and bubble up the result recursively to the next highest node, and then repeat your steps until you are at the top.

AST Library

Let’s begin by implementing an AST library. It’ll be very similar to our object interface via ks_obj. It will use tagged anonymous unions, as well as an internal struct ks_ast, but the main ks_ast type will just be a pointer.;

Here’s what we’re starting out with:

// src/kscript.h

// we do the same as with `ks_obj`, having a pointer as the main type
typedef struct ks_ast* ks_ast;

enum {
    // none-type, shouldn't exist
    KS_AST_NONE = 0,

    // a constant int literal
    KS_AST_CONST_INT,

    // a constant float literal
    KS_AST_CONST_FLOAT,

    // a constant string literal
    KS_AST_CONST_STR,

    // assigns a value to an AST
    KS_AST_ASSIGN,

    // a variable reference
    KS_AST_VAR,

    // a 'call', i.e. some sort of method like A(B, C)
    KS_AST_CALL

};

// internal structure, use `ks_ast` as a pointer to it
struct ks_ast {

    uint16_t type;

    // anonymous tagged union
    union {
        // if type==KS_AST_CONST_INT, this is that value
        ks_int _int;
        // if type==KS_AST_CONST_FLOAT, this is that value
        ks_float _float;
        // if type==KS_AST_CONST_STR, this is that value
        ks_str _str;
        // if type==KS_AST_VAR, this is the name the code references
        ks_str _var;

        // if type==KS_AST_CALL, this contains the relevant details
        struct {
            // example: A(B, C)
            // lhs = A
            // args = B, C (len=2)
            
            // the object being called
            ks_ast lhs;

            // number of arguments
            int args_n;

            // an array of arguments
            ks_ast* args;

        } _call;

        // if type==KS_AST_ASSIGN, this contains the relevant details
        struct {
            // example: A = B
            // lhs = A
            // rhs = B

            // the thing being assigned to.
            // NOTE: This has some restrictions, it must be a valid assignable expression
            //   such as a KS_AST_VAR
            ks_ast lhs;

            // any value-holding type of AST
            ks_ast rhs;

        } _assign;

    };
};

We have support for constants, variables, and function calls. Some ASTs will have binary operators (like +, -, etc), but we will just translate these into functions as AST’s, as follows:

  • a+b will be __add__(a, b)
  • a-b will be __sub__(a, b)
  • a*b will be __mul__(a, b)
  • a/b will be __div__(a, b)
  • a%b will be __mod__(a, b)
  • a**b will be __pow__(a, b)
  • etc…

We won’t describe them all just yet, but essentially these will be handled via function overloads of these function names, similar to how python handles operator overloading.

A call is just a function call. But, keep in mind that functions are objects too, and so (a[i])(2, 3) could be valid, as a could be a list of functions. So, the lhs of a function call could be an AST itself.

We also have support for assignment, to represent A=B. This is a special case that can’t just be handled with a function, as it changes what the symbol type and value is. In the future, we’ll be able to test whether or not the left handed side is valid (think about it, is 3=a+b a valid expression? We can’t assign to a constant, and we also can’t assign to things like a+b=3).

In order to construct these AST’s, we’ll define some utility functions:

// src/kscript.h

// constructs new ast's
ks_ast ks_ast_new_int(ks_int val);
ks_ast ks_ast_new_float(ks_float val);
ks_ast ks_ast_new_str(ks_str val);
ks_ast ks_ast_new_var(ks_str name);
ks_ast ks_ast_new_call(ks_ast lhs, int args_n, ks_ast* args);
ks_ast ks_ast_new_assign(ks_ast lhs, ks_ast rhs);

// frees resources
void ks_ast_free(ks_ast ast);

To see the full implementation, check https://github.com/ChemicalDevelopment/kscript/blob/master/src/ast.c.

Now, let’s actually define a way to run these AST’s. Our (first) method will be called “walking” the AST, because we will basically just recursively walk down our tree.

We’ll add a few things to our sources to implement basic arithmetic.

// src/kscript.c
...

// add(A, B) == A+B
ks_obj ks_std_add(int args_n, ks_obj* args) {
    if (args_n != 2) {
        ks_error("add takes %d args, was given %d", 2, args_n);
        return ks_obj_new_none();
    }

    // for now, just implement integers, and assume all arguments are integers
    return ks_obj_new_int(args[0]->_int + args[1]->_int);
}

// mul(A, B) == A*B
ks_obj ks_std_mul(int args_n, ks_obj* args) {
    if (args_n != 2) {
        ks_error("add takes %d args, was given %d", 2, args_n);
        return ks_obj_new_none();
    }

    // for now, just implement integers, and assume all arguments are integers
    return ks_obj_new_int(args[0]->_int * args[1]->_int);
}
// pow(A, B) == A**B
ks_obj ks_std_pow(int args_n, ks_obj* args) {
    if (args_n != 2) {
        ks_error("add takes %d args, was given %d", 2, args_n);
        return ks_obj_new_none();
    }

    // for now, just implement integers, and assume all arguments are integers
    return ks_obj_new_int((ks_int)(pow(args[0]->_int, args[1]->_int)));
}


// evaluates an AST, returning the result as an object
ks_obj ks_eval(ks_ast ast) {
    if (ast->type == KS_AST_CONST_INT) {
        return ks_obj_new_int(ast->_int);
    } else if (ast->type == KS_AST_CALL) {
        // evaluate arguments
        ks_obj* args = malloc(sizeof(ks_obj) * ast->_call.args_n);

        // compute all arguments first, then call the function
        int i;
        for (i = 0; i < ast->_call.args_n; ++i) {
            args[i] = ks_eval(ast->_call.args[i]);
        }

        // evaluate the function
        ks_obj func = ks_eval(ast->_call.lhs);

        // call the function
        ks_obj res = func->_cfunc(ast->_call.args_n, args);

        free(args);
        return res;
    } else if (ast->type == KS_AST_CONST) {
        return ast->_const;
    } else {
        ks_error("Unknown type!");
    }
}

...

And, since we use pow(), we will need math.h, so:

// src/kscript.h
...
#include <math.h>
...

Let’s also change our main method to test out our expression:

// src/kscript.c
...

    // numbers
    ks_ast n2 = ks_ast_new_int(2);
    ks_ast n3 = ks_ast_new_int(3);
    ks_ast n5 = ks_ast_new_int(5);

    // our other variables (but here, represented as ASTs, not objects)
    ks_ast b = ks_ast_new_int(2);
    ks_ast c = ks_ast_new_int(3);

    // the functions we use (we wrap them as an AST, but as a constant, so they just
    //   evaluate to the function object when `eval` is called on the AST)
    ks_ast func_add = ks_ast_new_const(ks_obj_new_cfunc(ks_std_add));
    ks_ast func_mul = ks_ast_new_const(ks_obj_new_cfunc(ks_std_mul));
    ks_ast func_pow = ks_ast_new_const(ks_obj_new_cfunc(ks_std_pow));
    ks_ast func_print = ks_ast_new_const(ks_obj_new_cfunc(ks_std_print));

    // construct expression manually... we'll get to parsing in a bit
    ks_ast expr = ks_ast_new_call(func_print, 1, (ks_ast[]){
        ks_ast_new_call(func_add, 2, (ks_ast[]){ 
            ks_ast_new_call(func_mul, 2, (ks_ast[]){
                n3,
                ks_ast_new_call(func_pow, 2, (ks_ast[]){
                    b,
                    n5
                })
            }),
            c
        }
    )});

    // evaluate it
    ks_eval(expr);

    // since this expression contains all others, the `free` works on all at once
    ks_ast_free(expr);

    return 0;
}
...

If we run with make && ./kscript, we’ll have an error:

$ make && ./kscript
cc -O3 -std=c99 -L./ src/kscript.o -lkscript -o kscript
src/kscript.o: In function `ks_std_pow':
kscript.c:(.text+0x11c): undefined reference to `pow'
collect2: error: ld returned 1 exit status
Makefile:68: recipe for target 'kscript' failed
make: *** [kscript] Error 1

This is because we used the math library for the pow function, but we did not link it (ahh, C).

So, hop into the makefile, and edit the last rule, the one that builds kscript executable:

# Makefile
...
# rule to build the executable (no extension) from the library and it's `.o`'s
#   since we require a library, and object files, we don't use `$^`, but just build
#   explicitly
$(kscript_exe): $(kscript_o) $(libkscript_so)
	$(CC) $(CFLAGS) -L./ $(kscript_o) -lkscript -lm -o $@
...

See how -lm was added after -lkscript. Now it will be linked. Building and running now works:

$ make && ./kscript
cc -O3 -std=c99 -fPIC src/kscript.c -c -o src/kscript.o
cc -O3 -std=c99 -L./ src/kscript.o -lkscript -lm -o kscript
99

And it is correct! our expression prints as 99.

If you want to see the code at this point, please see here.

Next time, we’ll implement a little bit of parsing, so we don’t have to construct long expressions manually