
Building the Lox Language

Published at 05:06 PM

Author: jtara1

Introduction

In the world of computer science, few endeavors are as enlightening as creating your own programming language. Robert Nystrom’s “Crafting Interpreters” provides a masterful guide to this process. In this article, I’ll explore the key components of an interpreter and the essential implementation patterns and techniques behind them.

As in the book, the language we are implementing is called Lox, which was designed by Mr. Nystrom. We’ll be implementing it in Rust, sharing snippets throughout.

Breaking Down the Problem

Before diving into details, let’s understand the overall process: we read source code and transform it in stages until we can act on it:

source code: 1 + 1 == 2                                           # string input
scanned:     token(1), token(+), token(1), token(==), token(2)    # lexical analysis
parsed:      (== (+ (1) (1)) (2))                                 # syntax analysis
interpret:   true                                                 # traverses the tree, evaluating leaves upward (DFS)

This pipeline transforms human-readable code into something the computer can execute. Let’s examine each stage in detail.
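As a rough sketch of how the stages chain together in code for a single expression (names like run, Scanner::new, Parser::new, and LoxError are illustrative, not this implementation’s exact API):

// Hypothetical end-to-end driver; each stage feeds the next,
// mirroring the pipeline above.
fn run(source: &str) -> Result<Value, LoxError> {
    let tokens = Scanner::new(source).scan_tokens()?; // lexical analysis
    let expr = Parser::new(tokens).parse()?;          // syntax analysis
    let value = Interpreter::new().evaluate(&expr)?;  // tree-walking evaluation
    Ok(value)
}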

The Scanner: Turning Text into Tokens

The scanner (or lexer) is the first phase of language processing. It takes raw source code as a string and converts it into a sequence of tokens - the smallest meaningful units in our language.

Key Responsibilities (see the sketch after this list):

  1. Identify language elements (numbers, operators, keywords, identifiers)
  2. Discard irrelevant information (whitespace, comments)
  3. Track source positions for error reporting
  4. Detect lexical errors (invalid characters)
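To see how these responsibilities fit together, here is a hedged sketch of the scanner’s driver loop; the field names (start, current, line_number, tokens) and the TokenType::Eof and Literal::None variants are assumptions rather than this implementation’s exact code.

pub fn scan_tokens(&mut self) -> Result<Vec<Token>, ScanError> {
    while !self.is_at_end() {
        self.start = self.current; // mark the beginning of the next lexeme
        self.scan_token()?;
    }

    // a trailing EOF token simplifies the parser's end-of-input checks
    self.tokens.push(Token {
        token_type: TokenType::Eof,
        lexeme: String::new(),
        literal: Literal::None,
        line: self.line_number,
    });

    Ok(std::mem::take(&mut self.tokens))
}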

Scanner Sample

The scanner works character by character, maintaining internal state about its position in the source string. It produces tokens that track their type, the actual text they represent (lexeme), a literal value, and source location.

fn scan_token(&mut self) -> Result<(), ScanError> {
    let char = self.advance()?;
    
    match char {
        '(' => {
            self.add_token_type(TokenType::LeftParen);
        }
        // ...
        '=' => {
            let token_type = if self.is_it('=') {
                TokenType::EqualEqual
            } else {
                TokenType::Equal
            };
            self.add_token_type(token_type);
        }
        // ...
        _ => {
            return Err(ScanError::UnexpectedCharacter(
                char,
                self.line_number,
                self.current,
            ));
        }
    };

    Ok(())
}

pub struct Token {
    pub token_type: TokenType,
    pub lexeme: String,
    pub literal: Literal,
    pub line: u32,
}

In this snippet, we push a Token, which may be a plain symbol of one or more characters (e.g. (), while others carry an inner value in their literal (e.g. 1). We need to scan greedily, matching == as a single token rather than two separate = tokens when it appears in the source.
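The greedy match is done with a conditional advance. A minimal sketch of what is_it might look like, assuming the scanner stores the source string and a current index (the book names this helper match):

// Consume the next character only when it is the expected one, so that
// "==" is scanned as one token rather than two "=" tokens.
fn is_it(&mut self, expected: char) -> bool {
    match self.source.chars().nth(self.current) {
        Some(c) if c == expected => {
            self.current += 1; // greedy: claim the character
            true
        }
        _ => false, // end of input or mismatch: leave the position untouched
    }
}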

The Parser: Building Syntax Trees

Once we have tokens, the parser transforms them into a structured representation of the program - typically an Abstract Syntax Tree (AST). This structure reflects the grammatical relationships between tokens according to the language’s grammar rules.

A quick peek at a simplified Lox grammar:

program        → exprStmt* EOF ;
exprStmt       → expression ";" ;
expression     → equality ;
equality       → comparison ( ( "!=" | "==" ) comparison )* ;
comparison     → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term           → factor ( ( "-" | "+" ) factor )* ;
factor         → unary ( ( "/" | "*" ) unary )* ;
unary          → ( "!" | "-" ) unary | primary ;
primary        → "true" | "false" | "nil"
               | NUMBER | STRING
               | "(" expression ")"
               | IDENTIFIER ;

Grammars are explained later, but the idea is that this grammar generates every possible sequence of tokens the language, Lox, allows. It also tells us how to structure our recursive descent parser: each rule corresponds to a function in our parser, and the structure of those rules defines the precedence and associativity of our language constructs.

Snippet of the parser:

fn equality(&mut self) -> ParsedExpr {
    let mut expr = self.comparison()?;
    while self.is_match(vec![TokenType::BangEqual, TokenType::EqualEqual]) {
        let operator: Token = self.previous();
        let right = self.comparison()?;
        expr = Expr::binary(Box::new(expr), operator, Box::new(right));
    }
    Ok(expr)
}

// ...

fn primary(&mut self) -> ParsedExpr {
    if self.is_match(vec![
        TokenType::False,
        TokenType::True,
        TokenType::Nil,
        TokenType::Number,
        TokenType::String,
    ]) {
        let value: Value = self.previous().literal.into();
        Ok(Expr::literal(value))
    } else if self.is_match(vec![TokenType::LeftParen]) {
        let expr = self.expression()?;
        self.consume(TokenType::RightParen, "Expected ')' after expression.")?;
        Ok(Expr::grouping(Box::new(expr)))
    } else if self.is_match(vec![TokenType::Identifier]) {
        Ok(Expr::variable(Box::new(self.previous())))
    } else {
        Err(ParseError::ExpectedExpression {
            token: self.peek().clone(),
        })
    }
}

We implement a function for each grammar rule. Using recursive descent (explained below), the parser completes the bottom grammar rule first, then builds structure upward, deriving meaning from a series of tokens as a tree of expressions.
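To connect the two snippets, here is a hedged sketch of one of the middle rules, unary; the Expr::unary constructor and the Bang/Minus token variants are assumed by analogy with the code above.

// Each rule tries its own operators first, then falls through to the
// next-higher-precedence rule, mirroring the grammar line for line.
fn unary(&mut self) -> ParsedExpr {
    if self.is_match(vec![TokenType::Bang, TokenType::Minus]) {
        let operator: Token = self.previous();
        let right = self.unary()?; // right-associative, so recurse into unary
        return Ok(Expr::unary(operator, Box::new(right)));
    }
    self.primary()
}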

Parser Sample

For our interpreter, we’ll use a technique called recursive descent parsing. This is a top-down parsing technique where the parser starts with the highest-level grammar rule and recursively works its way down to the lowest-level rules.

Key points:

Some rules may need to recurse back down again. Take Token(a) Token(==) Token(1): the first leaf we reach is a Token that is a primary expression, Identifier(a); then equality recurses back down to parse the next token as another primary expression, Number(1). This gives us the struct we set out to parse, BinaryExpr { left: Identifier(a), operator: EqualEqual, right: Number(1) }, as seen in figure 1.1 below.

graph TD
    A[equality] --> |1| B[comparison]
    B --> C[term]
    C --> D[factor]
    D --> E[unary]
    E --> F1[primary 'a']
    A --> |2| G["Token('==')"]
    A --> |3| H[comparison]
    H --> I[term]
    I --> J[factor]
    J --> K[unary]
    K --> L1[primary '1']
    F1 --> F2[Identifier 'a']
    L1 --> L2[Number '1']
    class G token;
    class F2,L2 token;

Figure 1.1: Recursive descent for a == 1

An illustration of 1 + 1 == 2 would be

graph TD
    A[equality] --> |1| B[comparison]
    B --> |2| C[term]
    C --> |3| D[factor]
    D --> |4| E[unary]
    E --> |5| F1[primary '1']
    C --> |6| G["Token('+')"]
    C --> |7| H[factor]
    H --> |8| I[unary]
    I --> |9| J1[primary '1']
    A --> |10| K["Token('==')"]
    A --> |11| L[comparison]
    L --> |12| M[term]
    M --> |13| N[factor]
    N --> |14| O[unary]
    O --> |15| P1[primary '2']
    F1 --> Q1[Number '1']
    J1 --> R1[Number '1']
    P1 --> S1[Number '2']
    class G,K token;
    class Q1,R1,S1 token;

Figure 1.2: Recursive descent for 1 + 1 == 2

The parsed structure is

BinaryExpr { 
    left: BinaryExpr {
        left: Number(1),
        operator: Plus,
        right: Number(1)
    }, 
    operator: EqualEqual, 
    right: Number(2) 
}

Or represented directly in an AST,

graph TD A["Binary(==)"] --> B["Binary(+)"] A --> C["Number(2)"] B --> D["Number(1)"] B --> E["Number(1)"] classDef operator fill:#f96,stroke:#333,stroke-width:2px classDef value fill:#9cf,stroke:#333,stroke-width:2px class A,B operator class C,D,E value

Figure 1.3: Abstract syntax tree for 1 + 1 == 2

Grammar Rules and Sets

I first learned the concept of a “grammar rule” in my Finite Automata class. A grammar is a set of rules, each of which produces another rule (a nonterminal), a terminal symbol, the empty string (epsilon, ε), or a combination of these.

For example, the grammar:

A -> aA
A -> aB
B -> b

The idea is we start with A, choose one of its rules to derive aB, then expand the B to get ab. So ab is one of the strings this grammar can produce.

Some strings this grammar can produce include ab, aab, aaab, …, or a+b in regex. Things that can make grammars more complex include rules with alternatives (often denoted with a regex-like |), more recursion, a larger set of rules, or a larger symbol set. Note that A is implicitly the starting rule in my grammar.
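To make derivation concrete, here is a toy recognizer for this grammar, written as a small sketch independent of the Lox code; each rule becomes a function, which is exactly the shape recursive descent takes.

// A -> aA | aB: consume one 'a', then try both productions on the rest.
fn rule_a(input: &str) -> bool {
    match input.as_bytes() {
        [b'a', rest @ ..] => {
            let rest = std::str::from_utf8(rest).unwrap(); // safe: 'a' is ASCII
            rule_a(rest) || rule_b(rest)
        }
        _ => false,
    }
}

// B -> b: exactly one 'b' must remain.
fn rule_b(input: &str) -> bool {
    input == "b"
}

fn main() {
    assert!(rule_a("ab"));   // A -> aB -> ab
    assert!(rule_a("aaab")); // A -> aA -> aaA -> aaaB -> aaab
    assert!(!rule_a("a"));   // never terminates without the final 'b'
}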

More formally, a small collection of regex-like operators is used in practice to define grammars more concisely:

| - alternation (choose one of several productions)
* - zero or more repetitions
+ - one or more repetitions
() - grouping
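Using these operators, the toy grammar from above collapses into a single rule:

A → "a"+ "b" ;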

Visitor Pattern

The Visitor design pattern allows us to add new operations to our AST without modifying the node classes themselves. This separation of concerns enables a clean implementation where:

  1. AST nodes define the structure
  2. Visitors define operations on that structure (like interpretation)

We can see an example of this pattern in my implementation of Lox, in which I define a trait that I implement in a few places:

// expr.rs
pub trait ExprVisitor<R, E> {
    fn visit_literal_expr(&mut self, expr: &Literal) -> Result<R, E>;
    fn visit_grouping_expr(&mut self, expr: &Grouping) -> Result<R, E>;
    fn visit_unary_expr(&mut self, expr: &Unary) -> Result<R, E>;
    fn visit_binary_expr(&mut self, expr: &Binary) -> Result<R, E>;
    fn visit_variable_expr(&mut self, expr: &Variable) -> Result<R, E>;
    fn visit_assign_expr(&mut self, expr: &Assign) -> Result<R, E>;
    fn visit_logical_expr(&mut self, expr: &Logical) -> Result<R, E>;
    fn visit_call_expr(&mut self, expr: &Call) -> Result<R, E>;
}

impl Expr {
    pub fn accept<R, E>(&self, visitor: &mut dyn ExprVisitor<R, E>) -> Result<R, E> {
        match self {
            Expr::Literal(l) => visitor.visit_literal_expr(l),
            Expr::Grouping(g) => visitor.visit_grouping_expr(g),
            Expr::Unary(u) => visitor.visit_unary_expr(u),
            Expr::Binary(b) => visitor.visit_binary_expr(b),
            Expr::Variable(v) => visitor.visit_variable_expr(v),
            Expr::Assign(a) => visitor.visit_assign_expr(a),
            Expr::Logical(l) => visitor.visit_logical_expr(l),
            Expr::Call(c) => visitor.visit_call_expr(c),
        }
    }
}

// interpreter.rs
impl Interpreter {
    fn evaluate(&mut self, expr: &Expr) -> Result<Value, RuntimeError> {
        expr.accept(self)
    }
    // ...
}

This augments our code, letting us layer new operations (evaluation, pretty-printing, name resolution) over the same AST without modifying the node types.
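For instance, a second visitor such as an AST pretty-printer slots in without touching any node type. This is a hedged sketch, not part of my codebase; it assumes Binary exposes left, right, and operator (as the interpreter code below does) and that Literal is debug-printable.

// A second operation over the same AST: printing instead of evaluating.
struct AstPrinter;

impl ExprVisitor<String, RuntimeError> for AstPrinter {
    fn visit_literal_expr(&mut self, expr: &Literal) -> Result<String, RuntimeError> {
        Ok(format!("{:?}", expr)) // any textual rendering of the literal works
    }

    fn visit_binary_expr(&mut self, expr: &Binary) -> Result<String, RuntimeError> {
        Ok(format!(
            "({} {} {})",
            expr.operator.lexeme,
            expr.left.accept(self)?,
            expr.right.accept(self)?
        ))
    }

    // ... the remaining visit_* methods follow the same shape
}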

The Interpreter: Bringing Code to Life

Finally, the interpreter traverses the AST and evaluates the expressions, producing the final result.

fn visit_binary_expr(&mut self, expr: &Binary) -> Result<Value, RuntimeError> {
    let left = self.evaluate(&expr.left)?;
    let right = self.evaluate(&expr.right)?;

    match expr.operator.token_type {
        TokenType::Plus => add(left, right),
        TokenType::Minus => subtract(left, right),
        TokenType::Slash => divide(left, right),
        TokenType::Star => multiply(left, right),
        TokenType::Greater => greater_than(left, right),
        TokenType::GreaterEqual => greater_than_or_equal(left, right),
        TokenType::Less => less_than(left, right),
        TokenType::LessEqual => less_than_or_equal(left, right),
        TokenType::BangEqual => Ok(!same_value(left, right)?),
        TokenType::EqualEqual => same_value(left, right),
        _ => Err(RuntimeError::InvalidBinaryOperator {
            msg: "Expected one of: +, -, /, *, >, >=, <, <=, !=, ==".to_string(),
            token: expr.operator.clone(),
        }),
    }
}

fn add(left: Value, right: Value) -> Result<Value, RuntimeError> {
    match (left, right) {
        (Value::Number(n1), Value::Number(n2)) => Ok(Value::Number(n1 + n2)),
        (Value::String(s1), Value::String(s2)) => Ok(Value::String(format!("{}{}", s1, s2))),
        (left, right) => Err(RuntimeError::OperandIsNotNumberOrString {
            msg: "Can only add numbers or strings".to_string(),
            values: vec![left, right],
        }),
    }
}

The snippet above is most of what we need to understand how the interpreter takes an expression structure and returns the value it yields. It can also evaluate statements, such as variable declaration and assignment, functions, and classes, which give the interpreter memory.

Given our previous example input string, 1 + 1 == 2, the interpreter would be given the BinaryExpr we saw above and return Bool(true). Following the AST in figure 1.3 above, we evaluate the whole expression: handle the outermost ==, evaluate its left expression (which recurses to yield 2), evaluate its right expression, and yield true from the original binary expression visit call.

While evaluate is the entry point for yielding a value from an expression in the interpreter, it makes recursive calls, matching each node we visit in the AST against the grammar and recursing further as needed.

sequenceDiagram
    participant Interpreter
    participant Binary_EqualEqual as Binary(==)
    participant Binary_Plus as Binary(+)
    participant Literal_1a as Number(1)
    participant Literal_1b as Number(1)
    participant Literal_2 as Number(2)
    Interpreter->>Binary_EqualEqual: evaluate(root) ... visit_binary_expr(==)
    Binary_EqualEqual->>Binary_Plus: evaluate(left) ... visit_binary_expr(+)
    Binary_Plus->>Literal_1a: evaluate(left) ... visit_literal_expr(1)
    Literal_1a-->>Binary_Plus: Value::Number(1)
    Binary_Plus->>Literal_1b: evaluate(right) ... visit_literal_expr(1)
    Literal_1b-->>Binary_Plus: Value::Number(1)
    Binary_Plus-->>Binary_EqualEqual: Value::Number(2)
    Binary_EqualEqual->>Literal_2: evaluate(right) ... visit_literal_expr(2)
    Literal_2-->>Binary_EqualEqual: Value::Number(2)
    Binary_EqualEqual-->>Interpreter: Value::Bool(true)

Figure 2.1: Interpreter evaluate sequence for 1 + 1 == 2

The interpreter implements the visitor interface for expressions, providing a method to evaluate each type of expression. It performs the same recursive descent we saw in the parser.
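The recursion bottoms out at literal expressions. A minimal sketch of that base case, assuming the Literal node holds the Value produced at parse time (as Expr::literal suggested above; the value field name is an assumption):

// Base case: a literal already carries its runtime value, so evaluation
// just clones it out; no further traversal is needed.
fn visit_literal_expr(&mut self, expr: &Literal) -> Result<Value, RuntimeError> {
    Ok(expr.value.clone())
}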

Variable Store

A simple hashmap storing identifier -> Value is enough to get started implementing the variable store, referred to in the context of Lox as the environment. Building on top of this, we turn it into a linked list of environments to implement lexical blocks in Lox. We later add a special variable store, called global, for defining our Lox std lib.

graph TD
    Global[Global Environment]
    Block1[Block Environment]
    Block2[Block Environment]
    Block1 -->|enclosing| Global
    Block2 -->|enclosing| Block1
    Global -.->|"clock -> Function(...)"| G1[Hashmap]
    Global -.->|"print -> Function(...)"| G1
    Block1 -.->|"count -> Number(3)"| B1[Hashmap]
    Block2 -.->|"my_var -> Boolean(true)"| B2[Hashmap]
    style Global fill:#d9f7be,stroke:#389e0d
    style Block1 fill:#bae7ff,stroke:#1890ff
    style Block2 fill:#bae7ff,stroke:#1890ff
    style G1 fill:#f5f5f5,stroke:#d9d9d9
    style B1 fill:#f5f5f5,stroke:#d9d9d9
    style B2 fill:#f5f5f5,stroke:#d9d9d9
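A minimal sketch of such an environment chain follows; the field names and the assumption that Value is Clone are mine, kept consistent with the globals snippet below.

use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

pub type SharedEnvironment = Rc<RefCell<Environment>>;

pub struct Environment {
    values: HashMap<String, Value>,       // identifier -> Value
    enclosing: Option<SharedEnvironment>, // parent scope, None for globals
}

impl Environment {
    pub fn new(enclosing: Option<SharedEnvironment>) -> Environment {
        Environment { values: HashMap::new(), enclosing }
    }

    pub fn define(&mut self, name: String, value: Value) {
        self.values.insert(name, value);
    }

    // Walk outward through enclosing scopes until the name is found.
    pub fn get(&self, name: &str) -> Option<Value> {
        match self.values.get(name) {
            Some(value) => Some(value.clone()),
            None => self
                .enclosing
                .as_ref()
                .and_then(|env| env.borrow().get(name)),
        }
    }
}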

While functions and closures are implemented, I don’t go into detail on them in this article.

Lox Standard Library and Native Functions

To provide built-in functionality, an interpreter typically implements native functions - functions implemented in the host language (Rust in our case) rather than in the interpreted language.

I’m skimming over Lox function implementation and design, but to capture the common properties of every callable, I define a Callable trait. Rust’s std lib provides the Fn trait, which lets us supply a Lox function’s implementation natively in Rust as an anonymous closure (the thing starting with |_args| { ... }). Our NativeFunc implementation of call then boils down to (self.func)(arguments).

// globals.rs
use std::cell::RefCell;
use std::rc::Rc;
use std::time::{SystemTime, UNIX_EPOCH};
pub struct Globals {
    pub env: SharedEnvironment,
}

impl Globals {
    pub fn new() -> Globals {
        let mut environment = Environment::new(None);

        // Lox std lib
        let clock = NativeFunc::new("clock", 0, |_args| {
            Ok(Value::Number(
                SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap()
                    .as_secs_f64(),
            ))
        });
        environment.define(clock.name.clone(), Value::NativeFunc(clock));

        Globals {
            env: Rc::new(RefCell::new(environment)),
        }
    }
}

// function.rs
pub trait Callable {
    fn call(
        &self,
        interpreter: &mut Interpreter,
        arguments: Vec<Value>,
    ) -> Result<Value, RuntimeError>;
    fn arity(&self) -> usize;
    fn name(&self) -> String;
}
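To make the call path concrete, here is a hedged sketch of what NativeFunc might look like; it is consistent with the snippets above but not copied from my repository.

use std::rc::Rc;

#[derive(Clone)]
pub struct NativeFunc {
    pub name: String,
    pub arity: usize,
    // the native behavior, stored as a Rust closure behind an Rc so the
    // struct stays cheap to clone into the environment
    func: Rc<dyn Fn(Vec<Value>) -> Result<Value, RuntimeError>>,
}

impl NativeFunc {
    pub fn new<F>(name: &str, arity: usize, func: F) -> NativeFunc
    where
        F: Fn(Vec<Value>) -> Result<Value, RuntimeError> + 'static,
    {
        NativeFunc {
            name: name.to_string(),
            arity,
            func: Rc::new(func),
        }
    }
}

impl Callable for NativeFunc {
    fn call(
        &self,
        _interpreter: &mut Interpreter,
        arguments: Vec<Value>,
    ) -> Result<Value, RuntimeError> {
        (self.func)(arguments) // delegate straight to the Rust closure
    }

    fn arity(&self) -> usize {
        self.arity
    }

    fn name(&self) -> String {
        self.name.clone()
    }
}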

Native functions provide a bridge between the interpreter and the host language, enabling access to system resources and libraries. While not implemented here, a foreign function interface (FFI) extends this concept by allowing dynamic binding to external libraries or code; Python, for example, commonly uses FFI to call into compiled C/C++ libraries.

Additional Topics

These are topics covered in the book which I didn’t go over or go into depth on here, among them control flow, resolving and binding, classes, and inheritance.

Future Work

Conclusion

Building an interpreter is a fascinating journey that provides deep insight into how programming languages work. From scanning tokens to parsing syntax trees and interpreting code, each step in the process reveals the elegant machinery that powers our daily programming tools.

The techniques we’ve explored are foundational concepts that apply across a wide range of language implementations.

Robert Nystrom’s “Crafting Interpreters” expands on these concepts in much greater detail, offering a comprehensive guide to building robust, efficient interpreters from scratch. By understanding these principles, you’ll gain a deeper appreciation for the languages you use every day and the tools that bring them to life.

Remember that this exploration covers just the first part of building an interpreter - the front-end components that handle parsing and evaluation. A full language implementation would also include more advanced features like type systems, optimization, and compilation to bytecode or machine code.
