Languages Archive

gmc - Hacking a toy VM

gmc: Hacking a toy VM

If anyone remembers me from back in the “old code” days (circa 2006), there was a scripting language I used to use called GameMonkey Script. It was a lua-like language that was designed to be more familiar to C++ programmers. Back in the day, I used to contribute minor changes to the core language itself as well as write about it on GameDev.net.

GameMonkey itself is long dead, having been put into a maintence-only mode by one of the original authors (see Greg’s Github and abandoned completely by Matt, the other author. Lua has since hit version 5 and left it behind in terms of raw performance and adoption in the game industry.

However for me, GameMonkey (herein referred to as GM) was the project that kick-started my interest in virtual machines, compilers and other such things that certain programmers get a kick out of. I’ve previously implemented a version of the GM Virtual Machine in C# greenbeanscript and have many long lost projects where I’ve created some small VM to play around with things.

Newer languages such as Rust, Go and even Typescript have intrigued me with how they’ve approached syntax and other language design questions. As a result, I felt the stirrings to mess around in this space again after a long haitus.

First goals

Rather than do as some people would and jump right into LLVM, language theory and other such gubbins, I decided to take things slowly and get my head around building a basic VM and assembler. I largely see this work as throwaway, so I’m free to make mistakes and not feel so bad about it.

So for the first goals, it’d be creating a VM that can run the psedudocode program:

int add(int a, int b)
{
    return a + b
}

int a = 100
int b = 200
int c = add(a, b)
  1. Primitive types
  2. Static typing
  3. Local Variables
  4. Function definitions
  5. Function calls with parameters
  6. Return values
  7. Bytecode formats

So for now, I’m going to support only the int type only. I need to decide on how to manage variables, hold script functions and deal with passing of arguments to them.

Bytecode Format

I decided to think about bytecode first as this generally dicates the type of patterns you use up front; specifically whether your VM is stack-based or register-based.

The original GameMonkey script was stack-based; which meant that the majority of the bytecode instructions were simple and took their operands from the stack and pushed their results back to the stack.

For example, adding a = 1 and 2 would be something like:

push 1
push 2
add             // 2 values popped from stack, result pushed back
setlocal @a     // local variable a set from top of stack

Whereas a register-based VM (such as Lua 5) would encode the operands into their instructions itself.

add 1, 2 -> @a

Register-based VMs typically have larger bytecode ‘scripts’ as their instructions are larger (typically 3264 bits instead of 8), however you can see from the simple example above that in register-based machines, each instruction tends to do more “work”, meaning that the VM spends lets time in the interpreter and so can be much faster.

One of my original thoughts was to speculate about converting the original GameMonkey from being stack-based to being register-based to see the impact; but as that largely meant rewriting the codegen and VM, I decided against it at this point in time (although it would be cool to do that).

As a result of this, I’ve decided that my toy VM will be register-based.

With this decision made, I looked at some descriptions on how Lua does it. I actually like Lua’s implementation here as it makes sense. We’re a VM, so we don’t have to map VM registers onto hardware registers - as a result, they’re basically part of the stackframe.

Stackframes

The stackframe layout for the toy (register-based) VM would be something like this:

0: Parameters [pN]
pN: Constants [kN]
kN: Variable registers [rN]
rN: Top of working stack

With the simple pseudo script of:

int a = 100
int b = 200
int c = a + b

We’d end up with a stackframe that looked something like this:

    [No params]
k0: int -> 100
k1: int -> 200
r0: int -> a
r1: int -> b
r2: int -> c

We can therefore assign variable registers from the constant table such as:

loadk k0 r0     ; load value from k0 (100) into variable register r0
loadk k1 r1

The add instruction would take the two source registers and the destination, eg:

add r0 r1 r2    ; add r0 and r1, store in r2

Assembler

Now I have a couple of instructions and a general idea how registers work, I think it’d be a good idea for the first parser to be an assembler. The reason for this is that it would speed up my iteration time on scripts and force me with a data-driven way to initialise things like functions and such. The first pass would likely start in code, but I’d like to write the assembler early to dust off my parsing skills before it comes to think about a language.

With this in mind, we could specify the above program as something like this:

.func _main
.params 0
.consts 2
.const 0 int 100
.const 1 int 200
.locals 3
.local 0 int
.local 1 int
.local 2 int

loadk 0 0
loadk 1 1
add 0 1 2
ret 0

It’s not the most readable of syntax, so perhaps we could consider some sort of symbol table in the assembler:

.func _main
.var a int
.var b int
.var c int

loadk 100 a
loadk 200 b
add a b c
ret 0

The consts are now inlined, with the assembler being required to keep track of the parsed constants and assign them to a prototype for the function. Variables are referenced by identifier now, requiring a symbol table to be created.

Seems like creating this assembler would be a logical first step.

Now the question is do I write this in C++ or C99?

Until next time!

Introduction to GameMonkey Script: Part 1 Introduction to GameMonkey Script: Part 2 Continuing GameMonkey Script: Advanced Use