Introduction
In the world of computer science, few endeavors are as enlightening as creating your own programming language. Robert Nystrom’s “Crafting Interpreters” provides a masterful guide to this process. In this article, I’ll explore the key components of an interpreter, along with the essential implementation patterns and techniques.
As in the book, the language we are implementing is called Lox, which was designed by Mr. Nystrom. We’ll be implementing it in Rust, sharing snippets throughout.
Outline:
- Introduction
- Breaking Down the Problem
- The Scanner: Turning Text into Tokens
- The Parser: Building Syntax Trees
- The Interpreter: Bringing Code to Life
- Lox Standard Library and Native Functions
- Additional Topics
- Future Work
- Conclusion
Breaking Down the Problem
Before diving into details, let’s understand the overall process: we read source code and transform it in stages until it can be executed:
source code: 1 + 1 == 2 # string input
scanned: token(1), token(+), token(1), token(==), token(2) # lexical analysis
parsed: (== (+ (1) (1)) (2)) # syntax analysis
interpret: true # traverses the tree, evaluating leaves upwards (DFS)
This pipeline transforms human-readable code into something the computer can execute. Let’s examine each stage in detail.
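To make the hand-off between stages concrete, a top-level driver could look roughly like the sketch below. Every name here (run, Scanner, Parser, LoxError, and the method names) is an assumption for illustration, not the actual API of my implementation.
// A hypothetical driver tying the three stages together.
fn run(source: &str) -> Result<Value, LoxError> {
    let tokens = Scanner::new(source).scan_tokens()?; // "1 + 1 == 2" -> tokens
    let expr = Parser::new(tokens).parse_expression()?; // tokens -> AST
    Interpreter::new().evaluate_expr(&expr) // AST -> Value::Bool(true)
}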
The Scanner: Turning Text into Tokens
The scanner (or lexer) is the first phase of language processing. It takes raw source code as a string and converts it into a sequence of tokens - the smallest meaningful units in our language.
Key Responsibilities:
- Identify language elements (numbers, operators, keywords, identifiers)
- Discard irrelevant information (whitespace, comments)
- Track source positions for error reporting
- Detect lexical errors (invalid characters)
Scanner Sample
The scanner works character by character, maintaining internal state about its position in the source string. It produces tokens that track their type, the actual text they represent (lexeme), a literal value, and source location.
fn scan_token(&mut self) -> Result<(), ScanError> {
let char = self.advance()?;
match char {
'(' => {
self.add_token_type(TokenType::LeftParen);
}
// ...
'=' => {
let token_type = if self.is_it('=') {
TokenType::EqualEqual
} else {
TokenType::Equal
};
self.add_token_type(token_type);
}
// ...
_ => {
return Err(ScanError::UnexpectedCharacter(
char,
self.line_number,
self.current,
));
}
};
Ok(())
}
pub struct Token {
pub token_type: TokenType,
pub lexeme: String,
pub literal: Literal,
pub line: u32,
}
In this snippet, we see that we push a Token, which may be just a symbol of one or more characters (e.g. the left parenthesis), while other tokens carry an inner value in their literal (e.g. the number 1). We also need to scan greedily, matching == rather than two separate = tokens when == appears in the source.
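That greedy matching relies on a one-character lookahead helper, is_it in my scanner. Here is a minimal sketch of what it could look like, assuming the scanner stores the source as a String and current as a usize index:
// Consume the next character only if it matches `expected`. This is what
// lets the scanner emit a single EqualEqual token for '==' instead of two
// separate Equal tokens.
fn is_it(&mut self, expected: char) -> bool {
    match self.source.chars().nth(self.current) {
        Some(c) if c == expected => {
            self.current += 1; // fold the character into the current token
            true
        }
        _ => false, // end of input, or a different character: leave it alone
    }
}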
The Parser: Building Syntax Trees
Once we have tokens, the parser transforms them into a structured representation of the program - typically an Abstract Syntax Tree (AST). This structure reflects the grammatical relationships between tokens according to the language’s grammar rules.
A quick peek at a simplified Lox grammar (grammar.txt):
program → exprStmt* EOF ;
exprStmt → equality ";" ;
equality → comparison ( ( "!=" | "==" ) comparison )* ;
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term → factor ( ( "-" | "+" ) factor )* ;
factor → unary ( ( "/" | "*" ) unary )* ;
unary → ( "!" | "-" ) unary ;
primary → "true" | "false" | "nil"
| NUMBER | STRING
| "(" exprStmt ")"
| IDENTIFIER ;
Grammars are explained in more detail later, but the idea is that this grammar generates every sequence of tokens that is valid in the language, Lox. It also tells us how to structure our recursive descent parser: each rule corresponds to a function in our parser, and the structure of those rules defines the precedence and associativity of our language constructs.
Snippet of the parser:
fn equality(&mut self) -> ParsedExpr {
let mut expr = self.comparison()?;
while self.is_match(vec![TokenType::BangEqual, TokenType::EqualEqual]) {
let operator: Token = self.previous();
let right = self.comparison()?;
expr = Expr::binary(Box::new(expr), operator, Box::new(right));
}
Ok(expr)
}
// ...
fn primary(&mut self) -> ParsedExpr {
if self.is_match(vec![
TokenType::False,
TokenType::True,
TokenType::Nil,
TokenType::Number,
TokenType::String,
]) {
let value: Value = self.previous().literal.into();
Ok(Expr::literal(value))
} else if self.is_match(vec![TokenType::LeftParen]) {
let expr = self.expression()?;
self.consume(TokenType::RightParen, "Expected ')' after expression.")?;
Ok(Expr::grouping(Box::new(expr)))
} else if self.is_match(vec![TokenType::Identifier]) {
Ok(Expr::variable(Box::new(self.previous())))
} else {
Err(ParseError::ExpectedExpression {
token: self.peek().clone(),
})
}
}
We implement a function for each grammar rule. We use recursive descent (explained below), building the lowest-level expressions first and then composing them upwards, deriving meaning from a series of tokens as a tree of expressions.
Parser Sample
For our interpreter, we’ll use a technique called recursive descent parsing. This is a top-down parsing technique where the parser starts with the highest-level grammar rule and recursively works its way down to the lowest-level rules.
Key points:
- Each grammar rule becomes a function in the parser
- The parser handles precedence and associativity explicitly
- It’s intuitive and maps clearly to the language grammar
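For instance, the unary → ( "!" | "-" ) unary | primary rule maps to a function that recurses on itself for the operator case and otherwise falls through to primary. Here is a minimal sketch, assuming an Expr::unary constructor analogous to Expr::binary:
// unary → ( "!" | "-" ) unary | primary ;
fn unary(&mut self) -> ParsedExpr {
    if self.is_match(vec![TokenType::Bang, TokenType::Minus]) {
        let operator: Token = self.previous();
        let right = self.unary()?; // recurse: handles nested prefixes like !!x
        return Ok(Expr::unary(operator, Box::new(right)));
    }
    self.primary() // no prefix operator, fall through to the next rule
}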
Some rules may need to recurse back down again. Take Token(a) Token(==) Token(1): the first leaf we reach is a Token that is a primary expression, Identifier(a); then equality recurses back down to parse the next token as another primary expression, Number(1). This gives us the struct we set out to parse, BinaryExpr { left: Identifier(a), operator: EqualEqual, right: Number(1) }, as seen in figure 1.1 below.
Figure 1.1: Recursive descent for a == 1
An illustration of 1 + 1 == 2 would be:
Figure 1.2: Recursive descent for 1 + 1 == 2
The parsed structure is:
BinaryExpr {
left: BinaryExpr {
left: Number(1),
operator: Plus,
right: Number(1)
},
operator: EqualEqual,
right: Number(2)
}
Or represented directly in an AST,
Figure 1.3: Abstract syntax tree for 1 + 1 == 2
Grammar Rules and Sets
I first learned the concept of a “grammar rule” in my Finite Automata class. A grammar is a set of rules, each of which produces another rule (a non-terminal), a terminal symbol, the empty string (epsilon, ε), or a combination of these.
For example, the grammar:
A -> aA
A -> aB
B -> b
The idea is we start with A and choose one of its rules, say A -> aB, giving aB; we then expand the B to get ab. ab is one of the strings this grammar can produce.
Some strings this grammar can produce include: ab, aab, aaab, …, or a+b in regex.
Things that can make grammars more complex include: rules with alternatives (often denoted with a regex-like |), more recursion, a larger set of rules, or a larger symbol set. Note that A is implicitly the start rule in my grammar.
More formally, a grammar is defined by four parts:
- N is the set of non-terminal symbols
- Σ is the set of terminal symbols
- P is the set of production rules
- S is the start symbol
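For the small example grammar above, these would be:
N = { A, B }
Σ = { a, b }
P = { A → aA, A → aB, B → b }
S = A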
A small collection of regex-like operators is used to define grammars more concisely in practice. These include:
- | (alternation)
- * (zero or more repetitions)
- + (one or more repetitions)
- () (grouping)
Visitor Pattern
The Visitor design pattern allows us to add new operations to our AST without modifying the node classes themselves. This separation of concerns enables a clean implementation where:
- AST nodes define the structure
- Visitors define operations on that structure (like interpretation)
We can see an example of this pattern in my implementation of Lox, in which I define a trait that I implement in a few places:
// expr.rs
pub trait ExprVisitor<R, E> {
fn visit_literal_expr(&mut self, expr: &Literal) -> Result<R, E>;
fn visit_grouping_expr(&mut self, expr: &Grouping) -> Result<R, E>;
fn visit_unary_expr(&mut self, expr: &Unary) -> Result<R, E>;
fn visit_binary_expr(&mut self, expr: &Binary) -> Result<R, E>;
fn visit_variable_expr(&mut self, expr: &Variable) -> Result<R, E>;
fn visit_assign_expr(&mut self, expr: &Assign) -> Result<R, E>;
fn visit_logical_expr(&mut self, expr: &Logical) -> Result<R, E>;
fn visit_call_expr(&mut self, expr: &Call) -> Result<R, E>;
}
impl Expr {
pub fn accept<R, E>(&self, visitor: &mut dyn ExprVisitor<R, E>) -> Result<R, E> {
match self {
Expr::Literal(l) => visitor.visit_literal_expr(l),
Expr::Grouping(g) => visitor.visit_grouping_expr(g),
Expr::Unary(u) => visitor.visit_unary_expr(u),
Expr::Binary(b) => visitor.visit_binary_expr(b),
Expr::Variable(v) => visitor.visit_variable_expr(v),
Expr::Assign(a) => visitor.visit_assign_expr(a),
Expr::Logical(l) => visitor.visit_logical_expr(l),
Expr::Call(c) => visitor.visit_call_expr(c),
}
}
}
// interpreter.rs
impl Interpreter {
fn evaluate(&mut self, expr: &Expr) -> Result<Value, RuntimeError> {
expr.accept(self)
}
// ...
}
This augments our code, providing:
- Extensibility: We can create multiple visitors for different operations (interpretation, type checking, optimization) without changing the AST nodes (see the sketch after this list)
- Type Safety: The Rust compiler ensures we’ve implemented handlers for every possible node type, preventing runtime errors
- Separation of Concerns: Each visitor focuses on a single responsibility, making the code more maintainable
- Reducing Conditional Logic: Instead of large switch/match statements scattered throughout the codebase, we centralize the dispatch logic in the accept method
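As an example of that extensibility, a second visitor could render expressions as text instead of evaluating them. The following is only a hypothetical sketch, not code from my implementation; it assumes Binary exposes left, operator, and right (as the interpreter snippet in the next section does) and that Literal can be formatted with Debug.
// A hypothetical pretty-printing visitor reusing the same ExprVisitor trait.
struct AstPrinter;

impl ExprVisitor<String, RuntimeError> for AstPrinter {
    fn visit_binary_expr(&mut self, expr: &Binary) -> Result<String, RuntimeError> {
        // Prefix form: 1 + 1 == 2 prints as (== (+ 1 1) 2).
        Ok(format!(
            "({} {} {})",
            expr.operator.lexeme,
            expr.left.accept(self)?,
            expr.right.accept(self)?
        ))
    }
    fn visit_literal_expr(&mut self, expr: &Literal) -> Result<String, RuntimeError> {
        Ok(format!("{:?}", expr)) // assumes Literal derives Debug
    }
    // ... the remaining visit_* methods follow the same pattern.
}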
The Interpreter: Bringing Code to Life
Finally, the interpreter traverses the AST and evaluates the expressions, producing the final result.
fn visit_binary_expr(&mut self, expr: &Binary) -> Result<Value, RuntimeError> {
let left = self.evaluate(&expr.left)?;
let right = self.evaluate(&expr.right)?;
match expr.operator.token_type {
TokenType::Plus => add(left, right),
TokenType::Minus => subtract(left, right),
TokenType::Slash => divide(left, right),
TokenType::Star => multiply(left, right),
TokenType::Greater => greater_than(left, right),
TokenType::GreaterEqual => greater_than_or_equal(left, right),
TokenType::Less => less_than(left, right),
TokenType::LessEqual => less_than_or_equal(left, right),
TokenType::BangEqual => Ok(!same_value(left, right)?),
TokenType::EqualEqual => same_value(left, right),
_ => Err(RuntimeError::InvalidBinaryOperator {
msg: "Expected one of: +, -, /, *, >, >=, <, <=, !=, ==".to_string(),
token: expr.operator.clone(),
}),
}
}
fn add(left: Value, right: Value) -> Result<Value, RuntimeError> {
match (left, right) {
(Value::Number(n1), Value::Number(n2)) => Ok(Value::Number(n1 + n2)),
(Value::String(s1), Value::String(s2)) => Ok(Value::String(format!("{}{}", s1, s2))),
(left, right) => Err(RuntimeError::OperandIsNotNumberOrString {
msg: "Can only add numbers or strings".to_string(),
values: vec![left, right],
}),
}
}
The snippet above is most of what we need to know to understand how the interpreter takes an expression structure and returns the value it yields. The interpreter can also evaluate statements, such as variable declaration and assignment, functions, and classes, which give it memory (state).
Given our previous example input string, 1 + 1 == 2, the interpreter would be given the BinaryExpr we saw above and return Bool(true). Following the AST in figure 1.3 above, we evaluate the whole expression, handle the outermost ==, evaluate its left expression (which recurses and yields 2), evaluate its right expression (2), and finally yield true from the original binary expression visit call.
While evaluate is the entry point for yielding a value from an expression in the interpreter, it makes recursive calls until it reaches the node we’re visiting in the AST, following the grammar and recursing further as needed.
Figure 2.1: Interpreter evaluate sequence for 1 + 1 == 2
The interpreter implements the visitor interface for expressions, providing a method to evaluate each type of expression. It walks the tree recursively, much as the parser did with recursive descent.
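For example, the grouping node’s visit method simply defers back to evaluate for its inner expression, which is where that recursion happens. A minimal sketch, assuming the node’s field is named expression:
fn visit_grouping_expr(&mut self, expr: &Grouping) -> Result<Value, RuntimeError> {
    // A grouping such as (1 + 1) just wraps an inner expression:
    // recurse into it and return whatever it evaluates to.
    self.evaluate(&expr.expression)
}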
Variable Store
A simple hash map storing identifier -> Value is enough to get started with the variable store, referred to in the context of Lox as the environment. Building on top of this, we turn it into a linked list of environments to implement lexical blocks in Lox. We later add a special variable store, which we call globals, for defining our Lox standard library.
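Here is a sketch of what that environment chain can look like. The field names and the get method are assumptions; only Environment::new(None) and define appear in the globals snippet in the next section.
// environment.rs (sketch)
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

pub type SharedEnvironment = Rc<RefCell<Environment>>;

pub struct Environment {
    values: HashMap<String, Value>,       // identifier -> value in this scope
    enclosing: Option<SharedEnvironment>, // parent scope; None for globals
}

impl Environment {
    pub fn new(enclosing: Option<SharedEnvironment>) -> Environment {
        Environment { values: HashMap::new(), enclosing }
    }

    pub fn define(&mut self, name: String, value: Value) {
        self.values.insert(name, value);
    }

    pub fn get(&self, name: &str) -> Option<Value> {
        // Look in the innermost scope first, then walk outwards through the chain.
        match self.values.get(name) {
            Some(value) => Some(value.clone()),
            None => self
                .enclosing
                .as_ref()
                .and_then(|parent| parent.borrow().get(name)),
        }
    }
}
Each block creates a new Environment whose enclosing points at the current one, which is what gives Lox its lexical scoping.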
While functions and closures are implemented, I don’t go into detail on them in this article.
Lox Standard Library and Native Functions
To provide built-in functionality, an interpreter typically implements native functions - functions implemented in the host language (Rust in our case) rather than in the interpreted language.
I’m skimming over Lox function implementation and design, but to define some common properties shared by every function, I define a Callable trait. Rust provides Fn, a trait in the Rust standard library, which lets us provide an implementation of a Lox function natively in Rust: that is, the anonymous closure (the thing starting with |_args| { // ... }). Our NativeFunc implementation of call then boils down to (self.func)(arguments).
// globals.rs
pub struct Globals {
pub env: SharedEnvironment,
}
impl Globals {
pub fn new() -> Globals {
let mut environment = Environment::new(None);
// Lox std lib
let clock = NativeFunc::new("clock", 0, |_args| {
Ok(Value::Number(
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs_f64(),
))
});
environment.define(clock.name.clone(), Value::NativeFunc(clock));
Globals {
env: Rc::new(RefCell::new(environment)),
}
}
}
// function.rs
pub trait Callable {
fn call(
&self,
interpreter: &mut Interpreter,
arguments: Vec<Value>,
) -> Result<Value, RuntimeError>;
fn arity(&self) -> usize;
fn name(&self) -> String;
}
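To show how these pieces fit together, here is a sketch of what NativeFunc itself might look like. The boxed Fn field and the exact signatures are assumptions; only new, the public name, and the (self.func)(arguments) hand-off are implied by the snippets above.
// A sketch of NativeFunc: it stores the Rust closure passed to `new`,
// and `call` simply forwards the evaluated arguments to it.
pub struct NativeFunc {
    pub name: String,
    pub arity: usize,
    func: Box<dyn Fn(Vec<Value>) -> Result<Value, RuntimeError>>,
}

impl NativeFunc {
    pub fn new<F>(name: &str, arity: usize, func: F) -> NativeFunc
    where
        F: Fn(Vec<Value>) -> Result<Value, RuntimeError> + 'static,
    {
        NativeFunc {
            name: name.to_string(),
            arity,
            func: Box::new(func),
        }
    }
}

impl Callable for NativeFunc {
    fn call(
        &self,
        _interpreter: &mut Interpreter,
        arguments: Vec<Value>,
    ) -> Result<Value, RuntimeError> {
        (self.func)(arguments) // hand the evaluated arguments to the Rust closure
    }

    fn arity(&self) -> usize {
        self.arity
    }

    fn name(&self) -> String {
        self.name.clone()
    }
}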
Native functions provide a bridge between the interpreter and the host language, enabling access to system resources and libraries. While not implemented here, a foreign function interface (FFI) extends this concept by allowing dynamic binding to external libraries or code; this is how Python programs, for example, often call compiled C/C++ libraries.
Additional Topics
These are topics covered in the book which I didn’t go over or go into depth on:
- Error handling: panics and Result::Err in Rust and their meanings in different contexts for Lox
- Lexical scope, and building upon it with blocks, control flow, and functions
- Function closure implementation
- Semantic analysis of variable assignment and usage
- Classes/Object oriented programming
- Bytecode virtual machine and all its details
Future Work
- I need to finish reading the book
- Write something cool in Lox, interpreting with my own Rust Lox interpreter
- Fork my own interpreter, writing my own language
- Language server protocol (LSP)
Conclusion
Building an interpreter is a fascinating journey that provides deep insight into how programming languages work. From scanning tokens to parsing syntax trees and interpreting code, each step in the process reveals the elegant machinery that powers our daily programming tools.
The techniques we’ve explored are foundational concepts that apply across a wide range of language implementations.
Robert Nystrom’s “Crafting Interpreters” expands on these concepts in much greater detail, offering a comprehensive guide to building robust, efficient interpreters from scratch. By understanding these principles, you’ll gain a deeper appreciation for the languages you use every day and the tools that bring them to life.
Remember that this exploration covers just the first part of building an interpreter - the front-end components that handle parsing and evaluation. A full language implementation would also include more advanced features like type systems, optimization, and compilation to bytecode or machine code.