dotLang compiler with LLVM – Shunting-yard algorithm

This is the third post in a series explaining my adventure in writing a compiler for my own language: dotLang. You can see the previous posts here.

Initially, I was relying heavily on EBNF notation to describe the grammar of the language. This proved to be very useful in all areas except for expressions (which are an important part of the language). The reason is that we need to pay attention to precedence and associativity when parsing an expression (how do we parse 1+2*3?), so a single EBNF rule listing all the operators that can be combined would not be helpful. You need to split the operators across different levels of rules to take their precedence into account.

This is the EBNF notation for an expression (excluding some advanced operators which are not the focus of the project at the moment):

Expression         = EqExpression     { ("and"|"or"|"xor") EqExpression }
EqExpression       = CmpExpression    { ("="|"!=") CmpExpression }
CmpExpression      = ShiftExpression  { (">"|"<"|">="|"<=") ShiftExpression }
ShiftExpression    = AddExpression    { (">>"|"<<"|"^") AddExpression }
AddExpression      = MulExpression    { ("+"|"-") MulExpression }
MulExpression      = UnaryExpression  { ("*"|"/"|"%"|"%%") UnaryExpression }
UnaryExpression    = ["not"|"-"]      PrimaryExpression
PrimaryExpression  = ( BINDING_NAME | "(" Expression ")" | ExpressionLiteral )
                     {  "(" Expression* ")" | "." Expression | "[" Expression "]" }
                     (*  function call      / struct access    / seq/map access    *)

As you can see, I have to define seven different rules just so that not has a higher precedence than >. And this continues into the code of the compiler, filling it with different structs (yes, I’m using C) nested into each other. Although the recursive nature of this structure makes it easy to traverse the data, fixing a bug or adding something new is very difficult.

Because of that, I have now switched to another algorithm and another notation. I use Reverse Polish Notation (RPN) to represent an expression (which is essentially a linked list of elements) and the shunting-yard algorithm to translate the textual representation of an expression into RPN. So, for example, a+b*c becomes a b c * +. The advantage of this representation is that it is much easier to parse and generate code for, because we don’t need to keep track of the current state: just push operands, pop them when you see an operator, and push the result back.
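As a sketch of both steps (a Python illustration, not the dotLang C implementation), here is a minimal shunting-yard conversion plus an RPN evaluator, covering only a few left-associative binary operators:

```python
import operator

# Operator precedence (higher binds tighter) and operator semantics.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}

def to_rpn(tokens):
    """Shunting-yard: convert infix tokens to RPN (postfix) order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind at least as tightly (left-associative).
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operands go straight to the output
    while stack:
        output.append(stack.pop())
    return output

def eval_rpn(tokens, env):
    """Push operands; on an operator, pop two, apply, push the result."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(env[tok])
    return stack.pop()

print(to_rpn(["a", "+", "b", "*", "c"]))  # ['a', 'b', 'c', '*', '+']
print(eval_rpn(to_rpn(["a", "+", "b", "*", "c"]),
               {"a": 1, "b": 2, "c": 3}))  # 7
```

Notice how the evaluator needs no lookahead or state machine: the stack discipline alone recovers the precedence that the grammar rules above had to encode explicitly.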

This is the data structure that I have for an expression:

typedef struct ExpressionNode {
    char token[32];
    TokenKind kind;
    struct ExpressionNode* next;
} ExpressionNode;

typedef struct Expression {
    ExpressionNode *first_node;
} Expression;

Basically, an expression is just a singly linked list. Each node has a token and its kind: for example, the token can be `a` and the kind can be IDENTIFIER.

I have tried to make the parser as dumb as possible, so that the code stays as simple as possible. This allows me to enhance/optimize/refactor the code more easily in the future.

Here you can see the code that parses an expression from the input file, and here is the code that, given the linked list representing an expression, invokes the appropriate LLVM API to generate executable code.

One of the complexities involved in parsing an expression is a function call. Naturally, function calls are in “prefix” format, where the name of the function comes before the operands. On the other hand, normal math expressions are infix. And when I want to convert all of these to postfix notation, I have to use different approaches to handle math expressions and function calls.

As a result, I chose to treat the comma , as a special operator which means “join the previous two operands as function arguments”, and to have two types of function calls: a simple function call (with no arguments) and a normal function call (with at least one argument).

So process(x,y,z) becomes x y , z , process. When processing the postfix notation, after seeing a comma we know that the previous two operands were function arguments.
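To illustrate this convention (a sketch, not the compiler’s actual C code; the helper names are mine), here is how the postfix form of process(x,y,z) can be walked to recover the argument list:

```python
def group_call(postfix_tokens):
    """Walk a postfix function call; on ',', join the previous two items
    into (or onto) an argument list. The function name arrives last."""
    stack = []
    for tok in postfix_tokens:
        if tok == ",":
            right = stack.pop()
            left = stack.pop()
            # left may already be a joined argument list
            args = left + [right] if isinstance(left, list) else [left, right]
            stack.append(args)
        else:
            stack.append(tok)
    func = stack.pop()                  # last token is the function name
    args = stack.pop() if stack else []  # simple call: no arguments at all
    return func, args

print(group_call(["x", "y", ",", "z", ",", "process"]))
# ('process', ['x', 'y', 'z'])
```

The same stack discipline used for math operators handles the comma, which is exactly why treating it as an operator keeps the parser dumb.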

In the next post, I will explain how function declaration is done in LLVM.



dotLang – Writing a Compiler using LLVM – Part 2

This is the second post in a series which explains how I am writing a compiler for dotLang in C using LLVM.

So the basic compiler is supposed to compile a simple function (called main) with simple expressions and possibly bindings.

Generally, a compiler can be considered like a data pipeline. At the beginning of the pipeline we have the source code file written in dotLang, and at the end of the pipeline, we have a compiled executable.

The important steps that happen in that pipeline are:

  1. Lexer: We need a module to read the input file and give us tokens. Tokens are the smallest units of the source code (identifiers, operators, numbers, …).
  2. Parser: This component receives tokens from the lexer and creates a syntax tree. A syntax tree is an abstract representation of the source code which is simpler to process and to generate native code from.
  3. Code generator: This part receives a syntax tree and walks it, node by node. For each node, it makes calls to the LLVM library to generate the appropriate native code.

As the simplest example, suppose we have this source code:


main := () -> 1+2

Now the first part (the lexer) will give us these tokens: main, :=, (), ->, 1, +, 2
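A toy lexer that produces exactly these tokens could look like this (a Python sketch for illustration; the real compiler does this in C, and the regex only covers this tiny subset of the language):

```python
import re

# One alternation per token kind: operators first, then identifiers and numbers.
TOKEN_RE = re.compile(r"\s*(:=|->|\(\)|[A-Za-z_]\w*|\d+|[+\-*/])")

def lex(source):
    """Split a line of source into tokens, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(lex("main := () -> 1+2"))
# ['main', ':=', '()', '->', '1', '+', '2']
```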

The second part will create this syntax tree:

main (Binding) ------> Operator: Plus |-------> 1
                                      |-------> 2

The third part will make these calls (ignoring the boilerplate we need to do to define and setup an LLVM module):

LLVMValueRef a = LLVMConstInt(int_type, 1, true);
LLVMValueRef b = LLVMConstInt(int_type, 2, true);
LLVMBuildAdd(builder, a, b, "add");

So at the end, the problem can be decomposed into writing a lexer, parser, and a code generator.

Of course, the devil is in the details. So there are some issues that you have to think about:

  1. What are your operators and their precedence?
  2. Where/how are you going to allocate memory for local variables?
  3. How are you going to manage and reclaim their memory?
  4. There might be some ambiguous cases where you really don’t know what does a specific token mean.

So you will need some kind of grammar for your language. This will help you with items #1 and #4 on the above list. For #2 and #3, you will need to make some decisions and evaluate the pros/cons of each option that you have.

For dotLang, I decided to go with a reference-counting garbage collector and to use stack memory for primitive local variables (int, char, float, and bool).

This is another example of a more complex source code that dotLang can compile right now:

main := () -> int
  :: betha()/11

betha := () -> int
  #this will return 33
  ddd := 3
  :: (2+gamma()+20)+alpha()+ddd-3

alpha := () -> int
  #this will return 1
  :: 11-gamma()

gamma := () -> int
  #this will return 10
  :: 1+2+3+(1*4)

You can see the full source code in v0.2.0 release of the project.

One of the interesting points when parsing source code similar to the above example is that, by default, you cannot call a function before defining it. But as you can see, I call gamma inside alpha, and betha inside main. So how does that work?

The answer is a two-phase code generation. In the first phase, I just add the functions (their name, input and output types), and in the second phase, I add the instructions of their bodies. This way, when I want to compile a call to gamma inside alpha, I already have a reference for that function which I can use to generate the call.
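The two-phase idea can be sketched like this (illustrative Python with made-up names, not the LLVM API): phase 1 registers every function up front, so in phase 2 a body can reference a function that appears later in the source file.

```python
def compile_module(functions):
    """functions: list of (name, list_of_called_function_names)."""
    declared = {}
    for name, _calls in functions:           # phase 1: declare signatures
        declared[name] = f"func_ref:{name}"
    compiled = {}
    for name, calls in functions:            # phase 2: compile bodies
        # every callee already has a reference, regardless of source order
        compiled[name] = [declared[callee] for callee in calls]
    return compiled

# `main` calls `betha` even though `betha` is defined after `main`
module = [("main", ["betha"]), ("betha", ["gamma", "alpha"]),
          ("alpha", ["gamma"]), ("gamma", [])]
print(compile_module(module)["main"])  # ['func_ref:betha']
```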

To keep track of local variables, I use a hashtable. One good thing about LLVM is that almost everything is stored as an LLVMValueRef. Functions, values, operation results, … are all accessible as instances of this type. So when you add two integers, each of them is an LLVMValueRef and the result of the add is of the same type. So you can easily mix different operations together.

In the source code, basic_parsers.c implements the basic lexing operations. I named them parsers because the lexing and parsing steps are closely connected here. The actual parsers are in the parsers.c file. Code generation is in compilers.c and expression_compiler.h.

dotLang – Writing a Compiler using LLVM – Part 1

So, I have been working on “dotLang” for more than 2 years now. It has been a great journey up to now. I have been working mostly on the language spec (memory model, type system, function call dispatch, operators, …). But it’s time to move on to the next step which is writing the compiler. I will try to update this series regularly with the latest status on writing the compiler.

I chose to write the compiler in C because of its simplicity. That is also my main goal in creating dotLang, so it made sense to me. Also, I will be using LLVM to generate native code because I am not going to reinvent the wheel!

The first step is writing the grammar of the language. Now this grammar will definitely get updated during the compiler implementation project but right now it can be viewed here.

I have used a modified EBNF notation to describe the grammar. At first I thought that a grammar is not really an important part, but I soon found out that it acts like a map in uncharted territory. It helps you maintain the big picture while working on the smallest details. Another important advantage is that it makes you think really carefully about the notations and semantics of your language, and this helps you find out whether there are any inconsistencies or ambiguities in its rules.

So the basic building block of the language is a Module. A module can contain three different elements:

  1. Named type definitions: Assigning a name for a specific type (For example “MyInt := int” ).
  2. Imports: Including definitions from another module (“_ := @{"module"}“, here underscore means we want to filter the output of the command and want to import everything).
  3. Bindings: Defining a function or a value to be used/invoked later (Bindings are immutable data definitions, e.g. “PI := 3.1415” or “inc := (x:int) -> x+1“).

A binding in dotLang can be any immutable value. Note that even functions are considered values in dotLang.

I will start with manually writing a lexer and parser. The parser will be a Recursive Descent parser. I tried to use Flex/Bison before but did not really like the huge volume of code they generated. Also, it was difficult to track errors in the code being compiled.

About LLVM, one thing that really shocked me was how scarce the documentation/reference material is for the C bindings. So all I got was something like this:

Well, that does not say a lot about how the function works or what its inputs and outputs are. There are also a ton of blog posts about writing a simple tiny compiler using the LLVM C bindings, but they don’t compile with the current version of LLVM. I guess the LLVM developers are so busy adding new functionality or removing/deprecating things that they don’t have enough time to document their code.

However, I managed to find some documentation/manuals which proved to be very useful when implementing the compiler.

Here is the list:

In the next post, I will start to explain the process of writing the compiler.

Some useful .bashrc tips

When cleaning up my .bashrc file, I noticed some useful shortcuts that might be useful for others. Here they are:

SSH with .bashrc

Sometimes I need to SSH to a new machine where I will need my .bashrc file. This short function transfers my local .bashrc file to the remote server and then does the SSH:

function xs() {
    if [ "$#" -eq 0 ]; then
        echo "usage: xs user@host other_args"
        return 1
    fi

    host=$1
    shift

    echo "Setting up remote machine..."
    scp ~/.bashrc "$host":/tmp/.bashrc_temp > /dev/null

    echo "Connecting to $host"
    ssh -A -t "$@" "$host" "bash --rcfile /tmp/.bashrc_temp ; rm /tmp/.bashrc_temp"
}

Aliases to invoke the Perl debugger, and a REPL for running Perl code snippets:

alias perld='PERLDB_OPTS="windowSize=20 arrayDepth=300 hashDepth=300 dumpDepth=100" perl -MPerlIO='"'"'via;$DB::deep=9999'"'"' -d $@'
alias repl='PERLDB_OPTS="windowSize=20 arrayDepth=300 hashDepth=300 dumpDepth=5" perl \
-MPerlIO='"'"'via;$DB::deep=9999; \
'"'"' -de0'


This simple function can be piped after a command and highlights a specific keyword. Useful when you are looking for a specific text but want to see the whole context in addition to the keyword:

#highlight the given text in the input (cat log.txt | highlight ERROR)
highlight () {
    perl -pe "s/$1/\e[1;31;21m$&\e[0m/g"
}

Open the last output using vim:

You can use the fc -s command to re-execute the last command and feed its output to vim. This can be useful when you are looking for a file and then want to open it:

alias viml='vim $(fc -s)'

Single letter aliases

Due to frequent usage, I have these 3 single-letter aliases:

alias g='git'

alias k='kubectl'

alias d='docker'

Code snippet: How to manually validate a Bitcoin block?

The other day, I was trying to understand how Bitcoin works. There are a lot of references that give you a general description of the cryptocurrency, but few of them get into the real details of the internals.

The blog post on [1] was really helpful, although it did not work exactly as I wanted. After a little searching here and there, I finally managed to prepare a Python code snippet you can use to manually make sure a Bitcoin block is valid.

This code reflects the underlying design of the Bitcoin mining algorithm. Miners try different arguments (most importantly, the nonce) to come up with a hash result which starts with enough zeros.

Here is how you can use this code:

You need to provide six inputs to the code:

  1. version: The version number (normally 2, but it can differ in some cases)
  2. prev_block: The hash of the previous block
  3. merkle_root: The hash of the Merkle root
  4. date_time: The time at which the block was mined
  5. bits: A parameter related to the difficulty level
  6. nonce: A random number generated as part of mining the block

You can find these numbers easily for each block on a block explorer website. If you replace their values in the code below, the printed result should be equal to the “Hash” parameter shown there.

This code snippet shows how a block’s header is hashed to produce a hash value which, if it has a specific structure, indicates that the block is valid.

#Current values here are related to block number 474650
import hashlib, struct
import time

def reverse(a):
    return "".join(reversed([a[i:i+2] for i in range(0, len(a), 2)]))

version = "00000020"
prev_block = "0000000000000000012d11c46a420474875d0e3cfcdba19aac18df597fbb6d21"
merkle_root = "2f4f3f91ce5ffa9edcf728f9574f1e5ee1f97fe6142ee612da1db14f35f0d2db"
date_time = '07.07.2017 15:22:06' 
bits = 402754864
nonce = 999740600

pattern = '%d.%m.%Y %H:%M:%S'
epoch = int(time.mktime(time.strptime(date_time, pattern)))

header_hex = (version +
              reverse(prev_block) +
              reverse(merkle_root) +
              reverse(hex(epoch).split('x')[-1]) +
              reverse(hex(bits).split('x')[-1]) +
              reverse(hex(nonce).split('x')[-1]))

header_bin = bytes.fromhex(header_hex)
hash = hashlib.sha256(hashlib.sha256(header_bin).digest()).digest()

print(hash[::-1].hex())



Build your own website: Hugo + Github Pages + Namecheap


Did you know that you can have your own website without paying for hosting?

OK, the purpose of this post is not to promote a cost-saving method, but to show you how you can set up your website using a static content generator (Hugo) and host it on GitHub Pages, connected to your own domain. So you don’t need to pay for and manage a hosting service.

What you need for this:

  1. A registered domain name (you can still do this with a sub-domain, but the process will be a bit different)
  2. An account on GitHub
  3. A terminal application (like iTerm on Mac or the command line on Windows)

Here is the list of steps you need to take:

  1. Set up DNS for your domain
  2. Set up your GitHub account
  3. Download Hugo on your computer and write something
  4. Update your GitHub account with the new content

And now the details!

Setup DNS

Details of this step depend on your domain registrar. For me, it is NameCheap.

You need to enter the address of GitHub’s servers as the hosting system for your domain, so that if anyone types your domain name into their browser, they will be redirected to GitHub to show the content.

First, you need to set “NAMESERVERS” option in the domain management page to “Namecheap BasicDNS”.

Screen Shot 2017-06-29 at 04.43.47

After that you can go to “Advanced DNS” page and update “HOST RECORDS” like below:

Screen Shot 2017-06-29 at 04.44.18

Here the numbers are specific to GitHub, so you will need to enter them exactly like this. “mahdix” is my username on the GitHub website. You will need to replace this with your own username.

Set up GitHub

After setting up your DNS records, you need to tell GitHub about your domain, so that when it is queried for that domain name, it knows what to show.

Go to your account, add a new repository by clicking the “+” button on the top right of the screen, and selecting “New repository”. It’s better if you name this new repository the same as your domain name, but you can choose whatever name you like. Make sure “Initialise this repository with a README” is checked, so your repository will be created with some basic content.

Screen Shot 2017-06-29 at 04.47.32

Get Hugo and write

First, you will need to open your console application. Make sure `git` is installed on your system. If it is not, follow instructions here to do the installation.

In the console, run `git clone` command to fetch a copy of your newly created repository.

Then you will need to run Hugo and write some content. You can follow instructions here as a starting guide.

When you are finished writing some content, you will need to push your changes to GitHub. This step informs GitHub about the new content you have just generated. This is done via `git push origin` command. Note that you may need to set up SSH keys if you are doing this for the first time (explained here).

Finish GitHub set up

Now, you need to tell GitHub about your domain name. Go to your GitHub repository, and go to “Settings” tab. There is a section called “Custom Domain”. You will need to enter your domain name there:

Screen Shot 2017-06-29 at 04.56.25

After you have done this step, your new content will be visible in a few minutes (This takes a bit longer only for the first time). Note that you only need to do this step once.

How to automate the publish process

There is a way to use some online services to automate the Hugo part. So you only need to make changes on GitHub website interface (write content, create new pages, …) and as soon as you submit your changes, they will be published for you. I will explain this in a follow-up post.


Must-have vim plugins

The other day I was cleaning up my .vimrc file and thought about writing a post about the plugins that I use and why.

I believe in simplicity, especially in my work environment. That’s why I always try to customize as little as possible and to use existing features. During the past couple of years since I started using vim, I have found the below plugins really useful.


This is the most important plugin that I use. It helps me install other plugins more easily and conveniently.

You just need to start your `.vimrc` file with a specific format and then, for each plugin you want to have, add a line in this format:

Plugin 'tpope/vim-commentary'

The text inside the single quotes is the GitHub address of the plugin’s repository.

But having so few plugins, I think I may be able to get rid of this one in the future.


This is the plugin which helps me comment/uncomment a line or a block of text. You simply need to press `gc` and the currently selected block will be commented out (or uncommented if it is already commented). `gcc` does this for the current line.

You can combine this command with text movement verbs (e.g. `gc10j` to comment the current line and the next 10 lines).

The good thing about this plugin is that it detects the correct commenting syntax so it doesn’t matter if you are working on a bash script or Python source code file. Just press the shortcut and this plugin will handle the rest.


I am always editing files and switching between buffers in vim. Unfortunately, the standard features of vim are never enough for me (I wonder why they don’t provide better options built-in). That’s why I use this excellent plugin.

It uses the `fzf` command line utility as a back-end and lets you search for a buffer, file, MRU, or text in the current buffer or all buffers. It is a pretty handy tool for me.

Plugins that I used before but not now


This is a great plugin which lets you use the <TAB> key in insert mode to invoke the autocomplete feature of vim. But I find the default shortcuts enough for my job, especially considering that, vim being a text editor, I cannot expect any sophisticated autocomplete, which makes this feature less useful for me.


This was the plugin I was using before fzf. It’s a very powerful plugin and lets you search for buffers, files, text, and some other handy things.

The reason I switched was that fzf is also a stand-alone command line utility, which can be handy when I am working inside the terminal.


In a (failed) effort to replace CtrlP with a simpler alternative, I tried this one but gave up and returned to my favorite searcher plugin.

This is a useful plugin which helps you search the most recently opened files and open them.


Another plugin which was supposed to replace fzf (CtrlP at that time) but did not succeed. It lets you search the currently open buffers and switch to another buffer.


This plugin provides autocompletion for delimiters like single-quote, double-quote, braces, etc.

The reason I no longer use this plugin is that during my edits, there are lots of times when I need to insert a single delimiter, but this plugin gets in my way and forces the insertion of an extra delimiter which I then have to delete. Sometimes this can be frustrating, although at other times it can be handy.


This plugin highlights all operators (`()+-*/` and similar) in your file. This can help you read a source code file more easily, but my problem with this plugin was that it also highlights the comment sections of the code, which makes reading comments annoying. Also, having a lot of different colors in your text editor can sometimes be confusing.


This plugin checks the syntax of the current source code file and shows you the error messages (if any). It highlights error lines and lets you easily navigate between errors. Personally, I prefer to use a terminal to check/compile my source code. And also, having experience working with dynamically typed languages, there is not much use for syntax checking, as almost everything is evaluated at runtime (which I don’t like, but that’s another story).

My first experiment writing code in Scala

(This post was written on Nov 17, 2016. For some reason I forgot to publish it at that time).

For the past few days, I have been working on a small programming task which should be done in Scala. It is part of a hiring process: I have been asked to implement the server side of a tic-tac-toe game in Scala, providing a REST API. This is the first time I am writing code in the Scala language, so although I am not a professional Scala developer yet, I thought I should write about my experience here.

The Good

I really like the language. It is simple and straightforward. The principle of least astonishment is obeyed, and I had very few surprises when I wanted to see how something is done, which is really good.

One of the positive points was the fact that there is no separation between primitive and Object-based data types. So we have “Int” and not “int” + “Integer” (Like Java). This helps a lot.

I also liked the way objects are defined: Like a function which accepts a set of inputs:

class Game(val player1: String, val player2: String) { ... }

Also, the `var` and `val` separation really encourages the developer to make a conscious decision about the immutability of the data he is working with. It makes you think in a new way.

In Scala, you can use Java libraries easily without any pain. That gives you access to thousands of valuable frameworks and modules.

The Bad

Well, this section is mostly personal preference.

In Scala (unlike C, Java, and similar languages), you follow the “variable name : type” order, which is not natural for me, considering that I have worked for many years with C-like languages which use the “type variable_name” notation.

The “=” notation for defining a function which returns a value does not seem natural to me. Usually, you use “=” when you want to assign a value to a variable, but functions are not variables!

Same for the “<-” notation which is used to iterate over a collection.

Also, I noticed there are no “break/continue” keywords in Scala. Well, it is possible to “simulate” them using a set of built-in keywords and exceptions, but seriously! Exceptions were not invented to break out of a for loop! You should not use them to control program flow.

In defense of deterministic password managers

tl;dr: Although they have some shortcomings, deterministic password management is a great idea with a lot of benefits!

Today I read this post, which tries to convince people to switch to stateful password managers, which store all passwords in some central data storage, locally or in the cloud. I have to admit that the article has some valid points, but overall it did a poor job of convincing me that deterministic password managers are bad.

First let me explain what a non-deterministic password manager is.

Normally, when you register on a website, you use a password manager to keep track of all of your passwords. So after signing up for a new email account, you will store the email id and its password (which is chosen by you) in the password manager’s database.

How a password manager works
Obviously, in this scheme, the central database becomes the most important component and needs extra protection. Where will this database be stored? Either on the service provider’s servers, which are out of your control, or locally on your laptop or mobile phone. So if there is a data breach, or your mobile phone is infected with a virus, your passwords can end up in the wrong hands! Of course, there are a number of security measures to protect the confidentiality of this data, but as long as you have no control over it, there is a risk.

What is the solution? There has recently been a new approach to managing passwords which I call deterministic password management (or DPM for short). This means that there is no central password database. Everything is inside your head! There is no need for a cloud-based server to store your passwords. Generally, you have a basic passphrase which I call the “master password”. All other passwords are generated uniquely using a formula based on the “master password” and some other public information. As a result, you don’t need to memorize 100 passwords or save them anywhere! Just memorize the master password and the rest can be automated.

For example, for a given website and the username “mahdix”, the password will be calculated using a formula like this:

password = Hash(master_password, '', 'mahdix')
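A minimal Python sketch of this idea (illustrative only: the site and username values are placeholders of mine, and a real DPM should use a slow KDF such as Argon2 or scrypt rather than a single SHA-256):

```python
import base64
import hashlib

def derive_password(master_password, site, user_id, length=16):
    # Mix the public inputs with the secret master password...
    material = f"{master_password}|{site}|{user_id}".encode()
    digest = hashlib.sha256(material).digest()
    # ...and encode the digest so it looks like a typical strong password.
    return base64.b64encode(digest).decode()[:length]

# Same inputs always give the same password; no database needed.
p1 = derive_password("my master pass", "example.com", "mahdix")
p2 = derive_password("my master pass", "example.com", "mahdix")
print(p1 == p2, len(p1))  # True 16
```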


The most important advantage of this scheme is that there is no need to store sensitive data anywhere! I would like to stress the term “sensitive” because I will refer to this feature later. The other advantage is that your passwords will all be strong, because they are generated using a hash function, which makes the result semi-random. There will no longer be passwords like “letmein” or “admin” in this scheme.

What are flaws of Deterministic Password Managers?

Now, back to the initial point! The blog post discusses some reasons why this type of password management is not secure and, as the author claims, “has fatal flaws”.

There are four points discussed in the article:

  1. DPMs cannot accommodate varying password policies without keeping state: I disagree with this one. As long as the hash function is complex enough (which it normally is), the generated password can conform to every sane password policy. The examples the author provides, “password must be at least 8 characters long” or “password must be lowercase only”, are either satisfied by default or don’t make sense! I have never seen a website which insists that a password must be lowercase.
  2. DPMs cannot handle revocation of exposed passwords without keeping state: I agree with this one, but see the point below.
  3. DPMs cannot store existing secrets: Agreed, but it can be fixed without losing security. See the point below.
  4. An exposed master password alone exposes all of your site passwords: I disagree with this point because it is not an issue of DPMs per se. If your email account’s password is exposed, it will probably have a similar effect, since that is how email-based password recovery works! This is the natural side-effect of relying on a single secret.

State vs Sensitive data

I would like to differentiate between the state of a password manager and sensitive data (e.g. passwords). Of course, for a traditional password manager like 1Password, these are the same: the state of the password manager is the collection of stored passwords, so it is sensitive data by definition.

But for DPMs this is not the case. You can have some basic and minimal state which is not sensitive. What does that mean? It means that this state can be publicly exposed without losing the security of any of your accounts. The only important security property of state in this sense is integrity: it should not be altered without your consent. If this property is satisfied, then there is no problem with a DPM having publicly available state.

For example, you can store this state in your Google Drive or Dropbox, or even on GitHub! If you want to maintain a higher level of security you can encrypt this information using the “master password”, but that is not required.

Now, how can using state address the two problems stated above: Password revocation and storing existing secrets?

How to store existing secrets in a DPM?

Suppose that you already have an account with password EP. Now you want to switch to a DPM. Can you keep this password? It is possible, using state.

Suppose that the DPM expects the password of this account to be CP, where CP is the result of the system’s hash formula. It is enough to store “XP=CP^EP” in the state storage of the system (^ represents the XOR operation). Now, it is impossible to recover any information about EP from XP alone, but the user can easily calculate EP by computing CP and then “EP=CP^XP” (you can use any other function with a similar property).
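The XOR trick above can be sketched in a few lines of Python (the password values here are made up for illustration):

```python
def xor_bytes(a, b):
    # XOR two equal-length byte strings
    return bytes(x ^ y for x, y in zip(a, b))

CP = b"computedpass"      # what the DPM formula would generate
EP = b"existingpass"      # the pre-existing password to keep using
XP = xor_bytes(CP, EP)    # this is the state: safe to store publicly

# Knowing the master password lets you recompute CP, and then recover EP:
assert xor_bytes(CP, XP) == EP
print(xor_bytes(CP, XP))  # b'existingpass'
```

Because XP alone reveals nothing about EP without CP, the stored state only needs integrity, not confidentiality.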

Using this method, a user can store his existing password without revealing any sensitive information. The same approach can be used to renew a password: just keep a public parameter (like a counter) in the state data. It won’t expose anything sensitive, and it can be used to renew the password. Just increment the counter and the password will be:

Password = Hash(master_password, website, user_id, counter)

This can easily be incorporated into any DPM to provide features for renewing passwords or adding pre-existing passwords.

Of course, there will always be a trade-off. So alongside the above-mentioned advantages, there are some drawbacks too. Most notably, if your master password is exposed or stolen, all of your accounts will be at risk. This is the natural consequence of using a single “master password”.