cade.site -- Kata (#0) - Setup

Home | About | Contact | Archive

Kata (#0) - Setup

Let’s learn how to implement a system programming language. This series will cover the design (including documentation, standardization, etc.), as well as the practical implementation. The first part will just be defining some terms and getting a project set up. We’ll be using the C++ programming language and the LLVM library to implement our compiler.

I will be calling my language Kata, but you should of course think of your own name. And, feel free to make slight adjustments if you are following along – I’ll try and explain how you could change the syntax, semantics, or adding other features. Feel free to contact me if you have any problems, preferably by creating an issue on the GitHub repo.

Creating A Language

Creating a language is a big undertaking, but not so big that you can’t do it alone. To follow this series, you should be proficient in some programming language (I’ll be using C++), but the concepts covered are general and can be implemented in your language of choice. By the end of the series, you’ll be a master of another language as well (hopefully, since you created it!). Since you can’t start by writing a self-hosting compiler, I recommend starting by following this guide and implementing a compiler in C++. Then, you can write a self-hosting compiler in the language you develop (if you want – you don’t have to make a self-hosting one at all).

I’m writing this tutorial because most tutorials I’ve seen online have either been about a toy language that wouldn’t be used in any real applications, or most tutorials have not covered some of the more auxiliary parts of language design (including design patterns, documentation, and standardization). So, if you’ve already created a language, or want a more fully featured tutorial, this is the series for you!

I’m choosing C++ as the initial implementation language, since it is very ubiquitous, and it is similar to what Kata will be like. That way, once I implement the compiler in C++, I can basically translate the code from C++ to Kata without a huge paradigm shift. If you’re using a higher level language like Python, then you’ll have to completely rewrite the compiler in the language itself, without all of the nice memory management of Python that you’ll have initially. However, you can definitely do this tutorial in Python, it’s just a personal preference. Of course, you don’t have to write a self hosting compiler (and indeed, most of this series will be using LLVM for a middle- and back-end instead)

Purpose

When creating a language, you should have a well defined purpose or use case for the language. If you’re just doing it for educational purposes, that is a fine purpose, but you should still think of actual use cases of your language. For example, if its just a subset of C, then it probably doesn’t need to be a language – since C already exists.

Kata’s purpose is to be a system programming language, with static typing and templates. It will support modules (which are similar to namespaces in C++). Kata will also have strong meta-programming abilities, such as looping over structure members, type inspection, and better systems of compile-time checking than C++ has. It also seeks to support device offloading of computational kernels (like CUDA/HIP/OpenMP) within the language. I’m also correcting some awkward syntax elements present in C/C++, including the horrible precedence choices, and weirdness in declarations (int *x, y;). Kata will use a more natural operator precedence, and declarations will have the type completely on the left hand side. Further, pointer types will have the & operator before it. So, a pointer to an int is declared &int x;.

On a personal note, I am planning on eventually writing my own operating system from scratch, and I’d like to be able to write the kernel in this language, as well as the standard utilities.

I’ll be honest: I really really dislike C++. Everything about it, from the way it is too backwards compatible with C, to the never ending standards updates (which are not portable across compilers, since it takes 10+ years for most compilers to adopt newer standards), template errors and standard types resulting in indecipherable error messages, and the general compilation system being too C-like, it is a slow, clunky, and inelegant language. And, elegance is a great reason to create a programming language. With that being said, I didn’t just write this to rip on C++, so we’ll move on from that.

Going forward, we need to realize that most features in C++/Rust/other system programming languages existed for a reason, and we need to make sure we fulfill those use case in order to be a successful system language ourself. Otherwise, there is a big risk that no one will want to use our language because it doesn’t do what they need. I advise that you think about your language (if you want it to be used by other people) and consider the use case and purpose. Once you’ve got that, follow along.

Of course, if you’re just doing this for educational purposes, you can follow along and copy me anyway. You might learn something that can help you in your language

Setting Up

I’m hosting my progress on my GitHub, which you can view all the source code on, and build for yourself.

My directory structure for the project is as follows:

./
├── ckc
│   └── Makefile
├── docs           
│   └── Makefile
└── README.md

ckc will be the directory of the initial implementation of the compiler, written in C++

docs will be for documenting our language and standard library

docs - Documentation

Documenting our language will be important for users who wish to write compliant programs which can be ported across CPUs, arcitectures, operating systems, and even compilers (think if someone else wrote a Kata compiler… we’d want to know existing programs could be compiled by both). We’ll be using Texinfo, which is the official format for GNU projects. I like Texinfo because it can output to HTML (single monolithic page, or a directory structure), PDF, DVI, and more without changing any of the input. This will make it easy to release PDFs as well as link the same material as an HTML page on your blog or docs site.

Here’s what I started with:

kata.texi

\input texinfo
@set VERSION 0.0.1-rc1
@set UPDATED 2020-10-20
@settitle Kata

@copying
Copyright @copyright{} 2020 ChemicalDevelopment.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation.

@end copying

@documentdescription
Official documentation of the Kata programming language
@end documentdescription

@finalout
@setchapternewpage on

@node The Kata Programming Language
@ifnottex
@top The Kata Programming Language
@end ifnottex

@titlepage
@title The Kata Programming Language
@subtitle System programming language
@subtitle Version: @value{VERSION}
@subtitle Updated: @value{UPDATED}
@author Cade Brown @email{cade@@cade.site}
@c @author Other Name @email{a@@b.c}

@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@ifnothtml
@contents
@end ifnothtml

@menu
* Introduction::       Rationale, goals, and target audience.
* Index::              Complete index.
@end menu

@node Introduction
@chapter Introduction

Kata is a system programming language meant to be both extensible and portable across architectures and operating systems.

@node Index
@unnumbered Index

@printindex cp

@bye

(I posted this as a good ‘Hello World’ that you can use for any project – use Texinfo, it’s so easy!)

makefile

# makefile - builds Kata documentation

MAKEINFO       ?= makeinfo

src_TEXI       := kata.texi

out_PDF        := kata.pdf
out_HTML_MONO  := kata.html
out_HTML_DIR   := kata


out_ALL        := $(out_PDF) $(out_HTML_MONO) $(out_HTML_DIR)

default: all

clean:
	rm -rf $(wildcard $(out_ALL) *.aux *.cp *.cps *.index *.log *.toc *.info)

all: $(out_ALL)

.PHONY: default clean all

$(out_PDF): $(src_TEXI)
	$(MAKEINFO) --pdf $^ -o $@

$(out_HTML_MONO): $(src_TEXI)
	$(MAKEINFO) --no-split --html $^ -o $@

$(out_HTML_DIR): $(src_TEXI)
	$(MAKEINFO) --html $^ -o $@

From here, you can use texi commands in the content body.

To build, first cd docs, and then run make kata.html. This will build a monolithic single page documentation (currently just an intro) and store it. The problem with this method is that you have to re-run make kata.html every time you change it. If you’re on Debian/Ubuntu, you can sudo apt install inotify-tools, and run: while inotifywait -e close_write kata.texi; do make kata.html; done. This will automatically re-run the command every time you save the kata.texi file.

Then, you can open that HTML file directly in the browser. For me, the URL is file:///home/cade/projects/Kata/docs/kata.html, but that will depend on your system. Try dragging and dropping it on a new tab (NOTE: using the full directory structure, this method may not work since pages link to other pages. You can use python3 -mhttp.server to serve it from a local directory)

Also, try make kata.pdf. The output should be a professional(albeit empty – we’ll populate it throughout the series) PDF document.

Compiler Structure

We need to decide how programs and libraries will work in our language. This includes how files should be organized, as well as how individual files should be compiled, and how they should be linked together. It also includes how libraries and binaries can be exported and included in other projects, and distributed.

We’ll wait on this a bit, but just be thinking of how you’d like to structure it. Since there are going to be dynamic templates, we can’t compile to object files and link them together – we’ll need to include the source code from the file that defines the template in all the files that use it. We can only compile and link the parts that are not templates, or those that only use instantiated templates. However, I don’t expect that our compiler will take as long as g++ compiling C++ code, since there are huge header files that are needed, as well as 30 years of Frankenstein’s-monster-level standards (among other things) causing it to run slowly.

Once we have basic expressions compiling (in a few parts), we’ll add example projects and libraries, completely written in Kata, and from there we can see exactly how to compile them.

ckc - First Compiler

The first compiler implementation, which we’ll write in C++, we’ll call ckc for C++ Kata Compiler. We’ll set up the structure in this blog but actually start with the heavy lifting next time.

The ckc/ directory will hold a makefile as well as the source for the initial compiler.

ckc/makefile

# makefile - builds the C++ Kata compiler


# -- Programs --

LLVM_CONFIG      ?= llvm-config


# -- Compiler Flags --

FPIC             ?= -c -fPIC

CXXFLAGS         += -I. -g -std=c++11 $(shell $(LLVM_CONFIG) --cxxflags $(LLVM_CONFIG_ARGS))

LDFLAGS          +=

LIBS             += $(shell $(LLVM_CONFIG) --libs --ldflags core native) -lm


# -- Sources --

src_CC         := $(wildcard *.cc)
src_HH         := $(wildcard *.hh)


# -- Outputs --

src_O          := $(patsubst %.cc,%.o,$(src_CC))


# -- Rules --

ckc: $(src_O)
	$(CXX) $(LDFLAGS) $^ $(LIBS) -o $@

%.o: %.cc $(src_HH)
	$(CXX) $(CXXFLAGS) $(CFLAGS) $(FPIC) $< -o $@

clean: FORCE
	rm -f $(wildcard $(src_O) ckc)

.PHONY: clean FORCE

This will be familiar to anyone who’s written their own C/C++ projects. If you aren’t familiar with the above, see A Simple Makefile Tutorial. Essentially, this file describes compiling all of the .cc source files into .o object files, then combining them to create the ckc binary.

I’ve already included LLVM configuration details, even though we won’t be using LLVM in the first few parts of the tutorial – just make sure you have some version of LLVM installed for your OS, which you can find instructions for on Google. I will be testing on multiple versions of LLVM, but I’ll default to LLVM 9.

ckc/kc.hh

/* kc.hh - Kata C++ Compiler Header
 *
 * @author:    Cade Brown <cade@cade.site>
 */

#pragma once
#ifndef KC_HH__
#define KC_HH__

/* C std */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

#include <string.h>
#include <assert.h>

/* C++ std */
#include <string>
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include <utility>

namespace kc {

/* TODO: implement */

}; /* namespace kc */

#endif /* KC_HH__ */

This is the header file – it will be included by all the source files, and merely defines the data types and system libraries we are using. Right now, it’s pretty empty, but in the future this will be where we define our types, functions, and other components of our compiler. Put all your includes, data structures, and so forth in this file (within the namespace kc). Other files will include this, and have a statement using namespace kc, so we won’t need to use kc:: before every function/type, but it’s always good practice to put things in the correct namespace.

ckc/main.c

/* main.c - main entry point for the compiler
 *
 * @author:    Cade Brown <cade@cade.site>
 */
#include <kc.hh>

int main(int argc, char** argv) {
    if (argc != 1 + 1) {
        fprintf(stderr, "usage: %s [file]\n", argv[0]);
        return 1;
    }
    char* file = argv[1];

    fprintf(stderr, "compiling kata source in '%s'\n", file);
	return 0;
}

This defines the actual entry point; we ensure that the user called the utility correctly, and if not, we print a usage message and exit.

Building and running this example is done in the shell:

$ make
g++ -I. -g -std=c++11 ... -c -fPIC main.cc -o main.o
g++ -I. -g -std=c++11 ... -c -fPIC types.cc -o types.o
g++ ... main.o types.o ... -lLLVM-9 -lm -o ckc
$ ./ckc
usage: ./ckc [file]
$ ./ckc src.kata
compiling kata source in 'src.kata'

(the outputs may not exactly match, since this will depend on your LLVM version, and environment)

Obviously, it doesn’t do anything yet, but we’ll solve that in the next post, when we actually start tokenizing and parsing


Cade Brown's personal blog, licensed under CC-BY 4.0.