Part 2 of kscript (Object interface, the meat): Writing a dynamic, interpreted, duck-typed language

 

This is part 2 of my series on kscript.

In this episode, we’ll be implementing a generic object type in C! So, buckle up and get ready for some structs and unions.

Basic Types

To shorten our code (and thus become more 1337), we will abbreviate kscript with ks. So, we use ks_info(message), etc. The type names for our library will also begin with ks_.

Before we have a generic object type, we must define a few specific types. In our language, I think we should have a few base classes:

  • int for integer values (whole numbers)
  • float for floating-point values (decimal numbers)
  • str for strings (like "hello world", "cade\nbrown", etc)

Since C’s NULL-terminated strings kind of suck to work with, we’ll be defining our own string type! We’ll call it ks_str, and it will internally hold a NULL-terminated string, but we will also keep the length with it, as well as a max length (the largest length the buffer has held so far), so that when the string grows we only need to resize the buffer if the new length exceeds that max.

In C, to define a type like this, you would type:

// src/kscript.h

// string type for `kscript`. It's based off code I wrote for EZC, almost exactly
typedef struct {

    // this is the internal C-style null-terminated string (which may be `NULL`)
    char* _;

    // this is the length of the string (not including NULL-terminator)
    int len;

    // this is the maximum length the string has been, so we don't resize until necessary
    int max_len;

} ks_str;

We’re also going to define some macros for an empty string, string views, and string constants (KS_STR_CONST uses strlen, so #include <string.h> is needed):

// src/kscript.h

// this represents the `NULL` string, which is also valid as a starting string
#define KS_STR_EMPTY ((ks_str){ ._ = NULL, .len = 0, .max_len = 0 })

// represents a view of a C-string. i.e. nothing is copied, and any modifications made
//   stay in the original string
#define KS_STR_VIEW(_charp, _len) ((ks_str){ ._ = (char*)(_charp), .len = (int)(_len), .max_len = (int)(_len) })

// useful for string constants, like `KS_STR_CONST("Hello World")`
#define KS_STR_CONST(_charp) KS_STR_VIEW(_charp, strlen(_charp))
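
To make the "view" idea concrete, here is a small sketch of a view writing through to the buffer it was made from. The helper demo_capitalize is made up for this example and is not part of kscript; the struct and macros are repeated only so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char* _;
    int len;
    int max_len;
} ks_str;

#define KS_STR_EMPTY ((ks_str){ ._ = NULL, .len = 0, .max_len = 0 })
#define KS_STR_VIEW(_charp, _len) ((ks_str){ ._ = (char*)(_charp), .len = (int)(_len), .max_len = (int)(_len) })
#define KS_STR_CONST(_charp) KS_STR_VIEW(_charp, strlen(_charp))

// hypothetical helper (not part of kscript): upper-cases the first byte of a
// view in place. Nothing is copied, so the original buffer is what changes.
void demo_capitalize(ks_str view) {
    assert(view._ != NULL || view.len == 0);
    if (view.len > 0 && view._[0] >= 'a' && view._[0] <= 'z') {
        view._[0] = (char)(view._[0] - 'a' + 'A');
    }
}
```

Calling demo_capitalize(KS_STR_VIEW(buf, 5)) on a writable array such as char buf[] = "hello"; changes buf itself. One caveat: a view created with KS_STR_CONST points at a string literal, and writing to a string literal is undefined behavior in C, so such views should be treated as read-only.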

And finally, we are going to introduce a bunch of string functions for basic manipulation:

// src/kscript.h

// copies `len` bytes from charp, and then NULL-terminate
void ks_str_copy_cp(ks_str* str, char* charp, int len);
// copies a string to another string
void ks_str_copy(ks_str* str, ks_str from);
// concatenates two strings into another one
void ks_str_concat(ks_str* str, ks_str A, ks_str B);
// appends an entire string
void ks_str_append(ks_str* str, ks_str A);
// appends a character to the string
void ks_str_append_c(ks_str* str, char c);
// frees a string and its resources
void ks_str_free(ks_str* str);
// compares two strings, should be equivalent to `strcmp(A._, B._)`
int ks_str_cmp(ks_str A, ks_str B);
// whether or not the two strings are equal
#define ks_str_eq(_A, _B) (ks_str_cmp((_A), (_B)) == 0)

We might add more later. The implementation details are pretty basic, but you can check them out here if you want to see how they work.
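
If you’d rather not click through, here is roughly what a few of these functions could look like. This is only a minimal sketch, not the actual code from the repository: ks_str_reserve is a helper name invented for this example, the code assumes the string owns its buffer (it started out as KS_STR_EMPTY, not as a KS_STR_VIEW), and it simply aborts if allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char* _;
    int len;
    int max_len;
} ks_str;

#define KS_STR_EMPTY ((ks_str){ ._ = NULL, .len = 0, .max_len = 0 })

// hypothetical internal helper: make sure `str` can hold `new_len` bytes plus
// the terminator, reallocating only when the old capacity is too small
static void ks_str_reserve(ks_str* str, int new_len) {
    assert(str != NULL && new_len >= 0);
    if (str->_ == NULL || new_len > str->max_len) {
        char* grown = realloc(str->_, (size_t)new_len + 1);
        if (grown == NULL) abort(); // sketch: give up on allocation failure
        str->_ = grown;
        str->max_len = new_len;
    }
}

// copies `len` bytes from charp, and then NULL-terminates
void ks_str_copy_cp(ks_str* str, char* charp, int len) {
    ks_str_reserve(str, len);
    if (len > 0) memcpy(str->_, charp, (size_t)len);
    str->_[len] = '\0';
    str->len = len;
}

// appends a single character, keeping the terminator in place
void ks_str_append_c(ks_str* str, char c) {
    ks_str_reserve(str, str->len + 1);
    str->_[str->len] = c;
    str->len += 1;
    str->_[str->len] = '\0';
}

// releases the buffer and resets the string to the empty state
void ks_str_free(ks_str* str) {
    free(str->_);
    *str = KS_STR_EMPTY;
}

// compares by content, then by length; uses `len` rather than the terminator,
// so it also works on views that are not NULL-terminated
int ks_str_cmp(ks_str A, ks_str B) {
    int n = A.len < B.len ? A.len : B.len;
    int c = n > 0 ? memcmp(A._, B._, (size_t)n) : 0;
    if (c != 0) return c;
    return (A.len > B.len) - (A.len < B.len);
}
```

Notice that max_len only ever grows: copying a shorter string into a ks_str that has already held a longer one reuses the existing buffer instead of reallocating.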

For completeness, we’ll also typedef some integers and floating-point types so we can call them ks_int and ks_float from now on. I’m going to #include <stdint.h> now, so we can choose exactly what size integer we want:

// src/kscript.h

#include <stdint.h>

// the main integer type in `kscript` (signed, 64 bit)
typedef int64_t ks_int;

// main floating point type in `kscript`. I like `double` because it has more accuracy
//   by default
typedef double ks_float;

Okay, now we have a few base types! Obviously, in the end, we’ll want to have user-defined types, generated types, parent and child types with inheritance. But, we are starting with the basics, and building up.

Everything will be an object in this language, so it’s about time to define an object in this language!

Object-type

Now, if you’ve ever made a language in C or C++, you know that once you leave a function or local scope, the local variables are destroyed. For example:


#include <stdio.h>

char* getmessage() {
  // `msg` is a local array, so it lives in getmessage()'s stack frame
  char msg[] = "Hello!";
  return msg;
}

int main() {
  printf ("%s\n", getmessage());
}

This is undefined behavior, because getmessage() returns a pointer to the local array msg, which is garbage as soon as getmessage() returns. So, getmessage() can generate a message, but once it returns a reference to it, that reference is not valid! (Returning the string literal "Hello!" directly would actually be fine, since string literals live for the entire run of the program; the problem only arises for data stored in local variables.) If you’re coming from Python, or even C++ using std::string, or most other higher-level languages, it may not be obvious why this won’t work. This is just how things work in C. Since a string is really just a pointer, a pointer to a local variable is only valid as long as the data it points to is valid (see https://stackoverflow.com/questions/1496313/returning-c-string-from-a-function).

Think now of writing functions in our programming language. Say we want to create an object and add it to a linked list. If we create a string locally, we would need to use malloc(N) to allocate our memory (malloc is a C function that returns a pointer to a block of memory that can be used as a string, list of numbers, or anything else), and then return that pointer. And whoever receives that string must call free(ptr), or there would be what is called a memory leak. If this were done in a loop, memory would be allocated over and over without ever being released. Think about if you turned on the water in your sink, and never turned it off! Eventually, you will run out of room and it will overflow. In C, you will run out of memory, and your program will crash.
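
Here is a minimal sketch of that malloc/free contract. The function name make_message is made up for illustration; the point is only that the memory outlives the function that created it, and that the caller becomes responsible for releasing it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// hypothetical helper: returns a heap-allocated copy of `text`.
// The copy stays valid after this function returns, and the caller
// must eventually pass it to free().
char* make_message(const char* text) {
    assert(text != NULL);
    size_t n = strlen(text);
    char* msg = malloc(n + 1);    // +1 for the terminator
    if (msg == NULL) return NULL; // out of memory
    memcpy(msg, text, n + 1);     // copies the terminator too
    return msg;
}
```

A caller would write char* m = make_message("Hello!");, use m for as long as it likes, and then call free(m) exactly once. Forgetting the free is the leak described above; calling it twice is a different bug that is just as bad.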

I’ve said all this to prepare you for a design choice whose reasoning may not be obvious at first. Essentially, here’s the definition for a kscript object:

// make the name `ks_obj` be a pointer to an internal structure
typedef struct ks_obj* ks_obj;

// the internal storage of an object. However, most code should just use
//   `ks_obj` (no struct), as it will be a pointer.
struct ks_obj {
    ...
};

By default, when you declare a ks_obj, you will be declaring a pointer to a struct ks_obj, the internal data.

ks_obj is NOT the same as struct ks_obj

When you see code in this tutorial, it is important to note that struct ks_obj is the literal data of the object, whereas ks_obj represents a pointer to that data. So, you can pass around the pointer (just ks_obj), so there’s just one copy of the data, that will always be updated. So, when we implement lists, the list just stores a ks_obj, so if any changes are made, the list doesn’t have to do anything.
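
A tiny sketch of what that sharing looks like in practice. The struct here is a deliberately simplified stand-in with a single integer field (not the real layout), and demo_new and demo_increment are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// `ks_obj` is the handle (a pointer); `struct ks_obj` is the data itself
typedef struct ks_obj* ks_obj;

// simplified stand-in for the real struct: just one integer payload
struct ks_obj {
    int64_t _int;
};

// hypothetical helper: allocate an object holding `val` (caller frees it).
// Note the size is that of the struct, not of the pointer type `ks_obj`.
ks_obj demo_new(int64_t val) {
    ks_obj ret = malloc(sizeof(struct ks_obj));
    assert(ret != NULL);
    ret->_int = val;
    return ret;
}

// The handle is passed by value, but only the pointer is copied, so the
// change is visible through every other handle to the same object.
void demo_increment(ks_obj obj) {
    obj->_int += 1;
}
```

If a handle a is stored in an array (standing in for a list) and demo_increment(a) is called afterwards, reading the value through the array element sees the incremented value too, because both handles point at the same struct ks_obj.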

I’ll explain more as we go on, but here are the full type definitions:

// src/kscript.h

// make the name `ks_obj` be a pointer to an internal structure
typedef struct ks_obj* ks_obj;

// types of objects
enum {
    // the none-type, null-type, etc
    KS_TYPE_NONE = 0,

    // builtin integer type
    KS_TYPE_INT,

    // builtin floating point type
    KS_TYPE_FLOAT,

    // builtin string type
    KS_TYPE_STR,


    // this isn't a type, but is just the starting point for custom types. So you can test
    //   if `obj->type >= KS_TYPE_CUSTOM` to determine whether or not it is a built-in type
    KS_TYPE_CUSTOM
    
};

// the internal storage of an object. However, most code should just use
//   `ks_obj` (no struct), as it will be a pointer.
struct ks_obj {

    // one of the `KS_TYPE_*` enum values
    uint16_t type;

    // These will be used in the future; they will hold various info
    //   about the object, for GC, reference counting etc, but for now, will be 0
    uint16_t flags;

    // an anonymous tagged union
    union {

        // if type==KS_TYPE_INT, the value
        ks_int _int;
        // if type==KS_TYPE_FLOAT, the value
        ks_float _float;
        // if type==KS_TYPE_STR, the value
        ks_str _str;

        // misc. usage
        void* _ptr;

    };
};

Seems pretty simple, right? We have 3 builtin types so far (int, float, str), and room for 2**16 (65536) type IDs, which should be plenty. We also reserve 16 bits for flags, as we’ll use these to write a garbage-collection algorithm in the future.

Tagged Union

We also have a union (see https://www.tutorialspoint.com/cprogramming/c_unions.htm). Specifically, I have an anonymous tagged union. This means that the union itself has no name, so its members are accessed directly on the object, and that which entry is valid in the union depends on a tag (in our case, the type member tells which one is currently used). We use this to “sneak in” the values for some types, instead of having a pointer to some data. This means that if we have an object that’s an int, we don’t have to dereference a second pointer to get its data; it’s stored right with the object. This is more efficient for the types whose values are stored inline.

To demonstrate, consider objects A and B. A is of type int, and B is of type str.

As a result, A->_int gives the integer value, whereas B->_str gives the string value (remember, a ks_obj is a pointer, so we use -> rather than a dot). What does B->_int give you? Garbage, because it’s not an integer! You need to be careful when using these unions, as they can be a big source of bugs. Always make sure the type of the object matches whatever field you are accessing.
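
One way to make that check hard to forget is to hide the field access behind a small accessor that looks at the tag first. The function ks_obj_get_int below is a hypothetical helper for illustration, not something kscript defines; the type definitions are repeated so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t ks_int;
typedef double ks_float;
typedef struct { char* _; int len; int max_len; } ks_str;

typedef struct ks_obj* ks_obj;

enum { KS_TYPE_NONE = 0, KS_TYPE_INT, KS_TYPE_FLOAT, KS_TYPE_STR, KS_TYPE_CUSTOM };

struct ks_obj {
    uint16_t type;
    uint16_t flags;
    union {
        ks_int _int;
        ks_float _float;
        ks_str _str;
        void* _ptr;
    };
};

// hypothetical accessor: writes the integer value to `*out` and returns true
// only if `obj` really is an int; otherwise leaves `*out` untouched
bool ks_obj_get_int(ks_obj obj, ks_int* out) {
    assert(out != NULL);
    if (obj == NULL || obj->type != KS_TYPE_INT) {
        return false;
    }
    *out = obj->_int;
    return true;
}
```

With this, code that wants an integer writes if (ks_obj_get_int(obj, &val)) { ... } and can never read _int from a string object by accident. Whether the extra call is worth it is a design choice; direct field access after an explicit type check works just as well.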

Why would we use this, you may ask? Well, to save space! If we implemented this type without unions, the size of the data fields would be sizeof(ks_int)+sizeof(ks_float)+sizeof(ks_str)+sizeof(void*), which is 40 bytes on a typical 64-bit platform (8+8+16+8). This will only grow as more fields are added.

With unions, the size is just the size of the largest one: max(sizeof(ks_int), sizeof(ks_float), sizeof(ks_str), sizeof(void*)), which is only 16 bytes! So, now every object will require less than half the memory to store its data. This is very good, and will only get better the more builtin-types we add. So, we’ll just have to deal with the danger in the meantime. That’s okay though, we want our language to be efficient!
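
You don’t have to take my word for those numbers. Here is a small sketch that lets the compiler report them; the names payload_separate, payload_shared, and payload_bytes_saved are invented for this example, and the exact sizes depend on your platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t ks_int;
typedef double ks_float;
typedef struct { char* _; int len; int max_len; } ks_str;

// hypothetical: every payload stored side by side (what we are avoiding)
struct payload_separate {
    ks_int _int;
    ks_float _float;
    ks_str _str;
    void* _ptr;
};

// the same payloads sharing one slot, as in `struct ks_obj`
union payload_shared {
    ks_int _int;
    ks_float _float;
    ks_str _str;
    void* _ptr;
};

// prints both sizes and returns how many bytes the union saves per object
size_t payload_bytes_saved(void) {
    size_t separate = sizeof(struct payload_separate);
    size_t shared = sizeof(union payload_shared);
    assert(shared <= separate);
    printf("separate: %zu bytes, shared: %zu bytes\n", separate, shared);
    return separate - shared;
}
```

On a typical 64-bit platform this should report 40 and 16, matching the arithmetic above; on other platforms the numbers may differ, but the union can never be larger than the side-by-side struct.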

Object functions

So, now we need to know how to create, manage, and destroy objects, irrespective of their type. Let’s define some signatures of functions that will construct some primitives:

// src/kscript.h

// returns a new integer with specified value
ks_obj ks_obj_new_int(ks_int val);
// returns a new float with specified value
ks_obj ks_obj_new_float(ks_float val);
// returns a new string with specified value
ks_obj ks_obj_new_str(ks_str val);
// frees an object and its resources
void ks_obj_free(ks_obj obj);

And, to actually implement these functions, we have:


// src/obj.c

#include <stdlib.h>

#include "kscript.h"

ks_obj ks_obj_new_int(ks_int val) {
    ks_obj ret = (ks_obj)malloc(sizeof(struct ks_obj));
    ret->type = KS_TYPE_INT;
    // malloc doesn't zero memory, so clear the flags explicitly
    ret->flags = 0;
    ret->_int = val;
    return ret;
}

ks_obj ks_obj_new_float(ks_float val) {
    ks_obj ret = (ks_obj)malloc(sizeof(struct ks_obj));
    ret->type = KS_TYPE_FLOAT;
    ret->flags = 0;
    ret->_float = val;
    return ret;
}

// NOTE: the returned object has its string copied from `val`, i.e.
//   they no longer share the same buffer
ks_obj ks_obj_new_str(ks_str val) {
    ks_obj ret = (ks_obj)malloc(sizeof(struct ks_obj));
    ret->type = KS_TYPE_STR;
    ret->flags = 0;
    // start from the empty string so ks_str_copy sees a valid state
    ret->_str = KS_STR_EMPTY;
    ks_str_copy(&ret->_str, val);
    return ret;
}

void ks_obj_free(ks_obj obj) {

    // do nothing if given NULL
    if (obj != NULL) {
        // some types (int, float) don't need to be free'd, so do nothing
        if (obj->type == KS_TYPE_STR) {
            ks_str_free(&obj->_str);
        }

        free(obj);
    }
}

So, whenever we construct an object, we tell it what type it is, then set the appropriate data for it.

When freeing, we first check that the object is not NULL (if it is, we do nothing). If it’s a type that owns extra data, we free that first (currently, only strings do), and then we free the object itself (which we malloc'd in the new functions).

Eventually, we’ll have a way for extensions & types to be written in C and loaded dynamically (like how packages such as numpy are written in C and interface with Python). These will describe types generically, but for now we do it very concretely.

Just to test this, let’s modify our src/kscript.c file to make sure it is working:

// kscript.c

#include "kscript.h"

int main(int argc, char** argv) {

    ks_obj sconst = ks_obj_new_str(KS_STR_CONST("I AM OBJECT"));

    if (sconst->type == KS_TYPE_STR) {
        ks_info("%s", sconst->_str._);
    } else {
        ks_error("Type wasn't str!");
    }

    // always free when you're done!
    ks_obj_free(sconst);

    return 0;
}

Remember to add obj.c to the makefile:

...
libkscript_src := $(addprefix src/, log.c str.c obj.c)
...

Now, run make && ./kscript:

$ make && ./kscript
cc -O3 -std=c99 -fPIC src/kscript.c -c -o src/kscript.o
cc -O3 -std=c99 -L./ src/kscript.o -lkscript -o kscript
INFO : I AM OBJECT

And it worked correctly!

So, we understand how objects can work in C, and how kscript’s objects can be used, including string literals.

Next, we’ll look at functions that operate on those objects!

Source for this part: https://github.com/ChemicalDevelopment/kscript/tree/ceeb27a6cc40e0922c0c9b3cc7128082baffeb4f