dotLang compiler with LLVM – Shunting-yard algorithm

This is the third post in a series of posts explaining about my adventure in writing a compiler for my own language: dotLang. You can see previous posts here.

Initially, I was relying heavily on EBNF notation to explain the notation for the language. This proved to be very useful in all areas except for expressions (which is an important part of the language). The reason is that, we need to pay attention to precedence and associativity in an expression (how are we going to parse 1+2*3?) So, a normal EBNF rule to explain all different operators that can be combined would not be helpful. You will need to write operators in different levels to take into account their precedence.

This is the EBNF notation for an expression (excluding some advanced operators which are not the focus of the project at the moment):

Expression         = EqExpression     { ("and"|"or"|"xor") EqExpression }
EqExpression       = CmpExpression    { ("="|"!=") CmpExpression }
CmpExpression      = ShiftExpression  { (">"|"<"|">="|"<=") ShiftExpression }
ShiftExpression    = AddExpression    { (">>"|"<<"|"^") AddExpression }
AddExpression      = MulExpression    { ("+"|"-") MulExpression }
MulExpression      = UnaryExpression  { ("*"|"/"|"%"|"%%") UnaryExpression }
UnaryExpression    = ["not"|"-"]      PrimaryExpression
PrimaryExpression  = ( BINDING_NAME | "(" Expression ")" | ExpressionLiteral )
                     {  "(" Expression* ")" | "." Expression | "[" Expression "]" }
                     (*  function call      / struct access    / seq/map access    *)

As you can see I have to define seven different rules so that I can have a higher precedence for not compared to >. And this is continued in the code of the compiler, making it full of different structs (Yes, I’m using C) nested into each other. Although the recursive nature of this structure makes it easy to traverse into the data, but fixing a bug or adding something new is very difficult.

Because of that, I have now switched to another algorithm and another notation. I use Reverse Polish Notation (RPN) to represent an expression (which is essentially a linked list of elements) and Shunting-yard algorithm to translate a textual representation of an expression into RPN notation. So for example a+b*c will become a b c * +. The advantage of this representation is that it’s much more easier to parse and generate code because we don’t need to keep a state of the current status. Just push operands, and pop them when you see an operator and push the result again.

This is the data structure that I have for an expression:

typedef struct Expression
{
 ExpressionNode *first_node;
} Expression;

typedef struct ExpressionNode
{
 char token[32];
 TokenKind kind;
 struct ExpressionNode* next;
} ExpressionNode;

Basically, an expression is just a singly linked list. Each node has a token and it’s kind. So the token can be a and kind can be IDENTIFIER.

I have tried to make the parser as dumb as possible. So the code would be as simple as possible. This allows me to enhance/optimize/refactor the code in future more easily.

Here you can see the code that parses an expression from the input file. and here is the code that given the linked list representing an expression, invokes appropriate LLVM API to generate executable code.

One of the complexities involved in parsing the expression is a function call. Naturally, function calls are in “prefix” format where name of the function comes before operands. On the other hand, normal math expressions are infix. And when I want to convert all of these to postfix notation, I have to use different approaches to handle math expressions and function calls.

As a result, I chose to treat comma , as a special operator which means “join previous two operands as function arguments”. And have two types of function calls: simple function call (where there is no argument) and normal function call (which has at least one argument).

So process(x,y,z) becomes x y , z , process. So when processing the postfix notation, after seeing a comma, we know that previous two operands were function arguments.

In the next post, I will explain how function declaration is done in LLVM.

 

 

dotLang – Writing a Compiler using LLVM – Part 2

This is the second post in a series which explains how I am writing a compiler for dotLang in C using LLVM.

So the basic compiler is supposed to compile a simple function (called main) with simple expressions and possibly bindings.

Generally, a compiler can be considered like a data pipeline. At the beginning of the pipeline we have the source code file written in dotLang, and at the end of the pipeline, we have a compiled executable.

The important steps that happen in that pipeline are:

  1. Lexer: We need a module to read from input file and give us tokens. Tokens are smallest units of the source code (identifiers, operators, numbers, …).
  2. Parser: This component receives data from lexer and creates a syntax tree. A syntax tree is an abstract representation of the source code which is simpler to process and generate native code.
  3. Code generator: This part, receives a syntax tree and parses the tree, node by node. For each node, it makes calls to LLVM library to generate appropriate native code.

As the simplest example, suppose we have this source code:

 

main := () -> 1+2

Now the first part (Lexer), will give us these tokens: main, :=, (), ->, 1, +, 2

The second part will create this syntax tree:

main (Binding) ------> Operator: Plus |-------> 1
                                      |-------> 2

The third part will make these calls (ignoring the boilerplate we need to do to define and setup an LLVM module):

LLVMValueRef a = LLVMConstInt(int_type, 1, true);
LLVMValueRef b = LLVMConstInt(int_type, 2, true);
LLVMBuildAdd(builder, a, b, "add");

So at the end, the problem can be decomposed into writing a lexer, parser, and a code generator.

Of course, the devil is in the details. So there are some issues that you have to think about:

  1. What are your operators and their precedence?
  2. Where/how are you going to allocate memory for local variables?
  3. How are you going to manage and reclaim their memory?
  4. There might be some ambiguous cases where you really don’t know what does a specific token mean.

So you will need some kind of grammar for your language. This will help you with items #1 and 4 on the above issues. For #2 and #3, you will need to make some decisions and evaluate different pros/cons for each option that you have.

For dotLang, I decided to go with Reference-Counting garbage collector and use stack memory for primitive local variables (int, char, float, and bool).

This is another example of a more complex source code that dotLang can compile right now:

main := () -> int
{
  :: betha()/11
}

betha := () -> int
{
  #this will return 33
  ddd := 3
  :: (2+gamma()+20)+alpha()+ddd-3
}

alpha := () -> int
{
  #this will return 1
  :: 11-gamma()
}

gamma := () -> int
{
  #this will return 10
  :: 1+2+3+(1*4)
}

You can see the full source code in v0.2.0 release of the project.

One of the interesting points when parsing source code similar to above example, is that by default, you cannot call a function before defining it. But as you can see I call gamma inside alpha or betha inside main. So how does that work?

The answer is using a two-phase code generation. In the first phase, I just add functions (their name, input and output type), and in the second step, I add instructions to their body. This way, when I want to compile a call to gamma inside alpha I have some reference for that function which I can use to generate the call.

To keep track of local variables, I use a hashtable. One good thing about LLVM is that, almost everything is stored as LLVMValueRef. Functions, values, operation results, … are all accessible as instances of this type. So when you add two integers, each of them is LLVMValueRef and the result of add is also of the same type. So you can easily mix different operations together.

In the source code, basic_parsers.c implements basic Lex operations. I named them as parsers because Lex and Parse steps are two connected and mixed steps. The actual parsers are in parsers.c file. Code generation is in compilers.c and expression_compiler.h.

dotLang – Writing a Compiler using LLVM – Part 1

So, I have been working on “dotLang” for more than 2 years now. It has been a great journey up to now. I have been working mostly on the language spec (memory model, type system, function call dispatch, operators, …). But it’s time to move on to the next step which is writing the compiler. I will try to update this series regularly with the latest status on writing the compiler.

I chose to write the compiler in C because of its simplicity. That is my main goal in creating dotLang so it made sense for me. Also I will be using LLVM to generate native code because I am not going to invent the wheel again!

The first step is writing the grammar of the language. Now this grammar will definitely get updated during the compiler implementation project but right now it can be viewed here.

I have used a modified EBNF notation to describe the grammer. First I was thinking that a grammar is not really an important part but soon I found out that it will act like a map in an uncharted territory. It helps you maintain the big picture while working on the smallest details. Another important advantage is that it makes you think really carefully about the notations and semantics in your language and this helps you find out if there are any inconsistencies or ambiguities in the rules of your language.

So the basic building block of the language is a Module. A module can contain three different elements:

  1. Named type definitions: Assigning a name for a specific type (For example “MyInt := int” ).
  2. Imports: Including definitions from another module (“_ := @{"module"}“, here underscore means we want to filter the output of the command and want to import everything).
  3. Bindings: Defining a function or a value to be used/invoked later (Bindings are immutable data definitions, e.g. “PI := 3.1415” or “inc := (x:int) -> x+1“).

A binding in dotLang can be any immutable value. Note that even functions are considered values in dotLang.

I will start with manually writing a lexer and parser. The parser will be a Recursive Descent parser. I tried to use Flex/Bison before but did not really like the huge volume of code they generated. Also, it was difficult to track errors in the code being compiled.

About LLVM one thing that really shocked me was how scarce the documentation/reference material is for C bindings. So all I got was something like this:

Well, does not say a lot of things about how the function works, what are its inputs and outputs. There is also a ton of blog posts for writing a simple tiny compiler using LLVM C bindings but they don’t compiler with the current version of LLVM. I guess they are so busy adding new functionalities or removing/deprecating stuff in LLVM that they do not have enough time to document their code.

However, I managed to find some sort of documentation/manual which proves to be very useful when implementing the compiler.

Here is the list:

In the next post, I will start to explain the process of writing the compiler.