Introduction
I was tweaking my functional language’s parser one evening when I became frustrated with the messiness of my code. It wasn’t something I intended, but a natural consequence of writing parsers this way, and I really couldn’t do much to fix it without sinking hours into a rewrite.
The most annoying issue I was facing was the constant consumption of ignored tokens such as parentheses. I would call “consume” on every single line, which added to the pile of mess the parser had already created and distracted me from reading the grammar flow of the parse function. Like any other programmer, I decided to fix this, and I didn’t take the easy route: I spent a weekend learning about parsing algorithms, ones I’d never heard of before. That’s when I came across Earley parsing, an algorithm for context-free grammars.
I read that it’s especially useful for parsing functional syntax like my own. After a few hours of understanding how it worked, I wanted to know whether there were any similar algorithms, and that’s when I discovered packrat parsing, which is more a technique than an algorithm. The key difference is that it guarantees linear time complexity, using memoization to achieve this. After reading Bryan Ford’s write-up on it, I was able to implement my own PEG packrat parser in Rust for any language, and I turned it into a parser combinator library. This allowed for a more natural way to define the logic and order of a grammar.
For example, in a generic recursive descent parser, parsing the expression {"key":"value"} may look like this:
fn parse_obj(&mut self) -> Object {
    self.consume(); // Consume the opening left brace
    let key = self.parse_primary(); // Parse a primary expression; in this case a string
    self.consume(); // Consume the colon
    let value = self.parse_primary(); // Parse a primary expression for the value
    self.consume(); // Consume the closing right brace
    Object { key, value }
}
As you can see, the logic spans many lines, and at a glance it doesn’t show what is being consumed; there is no transparency.
However, in my packrat parser this is simplified:
fn parse_obj() -> Object {
    action(
        between(ch('{'), ch('}'), quoted_string() + ignore(lit(":")) + ws() + quoted_string()),
        |vals| {
            let parts = values(vals);
            Object {
                key: parts[0].as_str(),
                value: parts[1].as_str(),
                raw: String::new(),
            }
        },
    )
}
I want to highlight the specific line in this code that makes it so special:
between(ch('{'), ch('}'), quoted_string() + ignore(lit(":")) + ws() + quoted_string())
This reads like grammar; using intuition alone, you can decode it in your head immediately. I personally read it as:
Between the characters ‘{’ and ‘}’, match the sequence: quoted string, ignore the colon, any amount of whitespace, and another quoted string
At first, this may seem like even more work than a simple recursive descent parser. Sure, but that’s because a recursive descent parser is “bound” to a certain grammar and context; it has to follow the exact structure of that grammar. This is what makes my PEG parser Iota so cool: it’s not a full context-free grammar parser, but it can handle any unambiguous grammar. That’s what makes PEG so powerful; it trades ambiguity handling for guaranteed linear time and grammars that read like the language they describe.
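To make the combinator idea concrete, here is a minimal sketch of how `ch` and `between` could be built, assuming a parser is simply a function from an input and a position to an optional value and new position. The names mirror Iota’s API, but these implementations are illustrative, not Iota’s actual code.

```rust
// A parser is a boxed function: (input, position) -> Option<(value, next position)>.
type Parser<T> = Box<dyn Fn(&str, usize) -> Option<(T, usize)>>;

// Match a single expected character at the current position.
fn ch(expected: char) -> Parser<char> {
    Box::new(move |input, pos| {
        let c = input[pos..].chars().next()?;
        if c == expected { Some((c, pos + c.len_utf8())) } else { None }
    })
}

// Run `open`, then `inner`, then `close`, keeping only `inner`'s result.
fn between<T: 'static>(open: Parser<char>, close: Parser<char>, inner: Parser<T>) -> Parser<T> {
    Box::new(move |input, pos| {
        let (_, pos) = open(input, pos)?;
        let (val, pos) = inner(input, pos)?;
        let (_, pos) = close(input, pos)?;
        Some((val, pos))
    })
}

fn main() {
    // Match a single 'a' wrapped in braces.
    let p = between(ch('{'), ch('}'), ch('a'));
    assert_eq!(p("{a}", 0), Some(('a', 3)));
    assert_eq!(p("{b}", 0), None);
    println!("ok");
}
```

Because each combinator just returns another parser, composing them preserves the grammar-like reading shown above.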
Memoization
Memoization is so important in a PEG parser because without it, we risk naively re-evaluating the same rule at the same position. In the worst case this results in O(2^n) time complexity, which is terrible; we want our parser to be efficient.
This is where packrat parsing comes into play: it caches the result of each parsing rule, keyed on the rule and input offset. In my case, Iota has a memo table keyed on the expression’s pointer and position.
For example, we can visualise the memoization like this:
A = B C
B = D E | "x"
D = "y"
Let’s break it down. Say we’re parsing the string "yx" at position 0. The parser will try A at position 0, which means it needs to try B at position 0. B first tries D E, so it tries D at position 0, which matches "y", and the position advances to 1.
The parser will then try E at position 1; for the sake of the example, let’s say D E fails. B then tries its second choice, "x", at position 0, but that fails too, so B fails at position 0. And since B fails, so does A.
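The walkthrough above can be sketched as plain recursive functions, one per rule, assuming (as in the example) that E is a rule that always fails. Each rule returns the new position on success and None on failure; this is illustrative pseudocode for the trace, not Iota’s internals.

```rust
// D = "y"
fn d(input: &str, pos: usize) -> Option<usize> {
    input[pos..].starts_with('y').then(|| pos + 1)
}

// E: for the sake of the example, always fails.
fn e(_input: &str, _pos: usize) -> Option<usize> {
    None
}

// B = D E | "x"  (PEG ordered choice: try D E first, then "x")
fn b(input: &str, pos: usize) -> Option<usize> {
    if let Some(p) = d(input, pos).and_then(|p| e(input, p)) {
        return Some(p);
    }
    input[pos..].starts_with('x').then(|| pos + 1)
}

// A = B C; since B fails on "yx" at 0, C is never reached.
fn a(input: &str, pos: usize) -> Option<usize> {
    b(input, pos)
}

fn main() {
    // D succeeds at 0 ("y"), but D E fails, and "x" doesn't match at 0,
    // so B (and therefore A) fails at position 0.
    assert_eq!(d("yx", 0), Some(1));
    assert_eq!(b("yx", 0), None);
    assert_eq!(a("yx", 0), None);
    println!("ok");
}
```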
If we didn’t have memoization, and we had a more complex grammar where B gets tried at position 0 again through a different rule path, our parser would repeat all of those same steps from scratch; with a deeply nested grammar, that can happen exponentially many times. Not good.
In my example, however, our parser has memoization, so once B fails at position 0, the result gets stored in the memo table. The next time anything tries B at position 0, it just looks it up in the table and returns the cached result immediately. That’s an O(1) lookup; no re-computation needed!
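The memo table itself can be sketched like this, assuming each rule has a stable name for the cache key (Iota keys on the expression’s pointer and position instead; the names here are illustrative). The table maps (rule, position) to the cached outcome: the position after a successful match, or None for a recorded failure.

```rust
use std::collections::HashMap;

// (rule name, position) -> cached outcome (Some(next position) or recorded failure).
type Memo = HashMap<(&'static str, usize), Option<usize>>;

// Try rule `name` at `pos`, consulting the cache before doing any work.
fn apply(
    name: &'static str,
    pos: usize,
    memo: &mut Memo,
    rule: impl Fn(usize, &mut Memo) -> Option<usize>,
) -> Option<usize> {
    if let Some(cached) = memo.get(&(name, pos)) {
        return *cached; // O(1) cache hit: no re-computation
    }
    let result = rule(pos, memo);
    memo.insert((name, pos), result);
    result
}

fn main() {
    let input = "yx";
    let mut memo = Memo::new();
    // D = "y"
    let d = |pos: usize, _m: &mut Memo| {
        if input[pos..].starts_with('y') { Some(pos + 1) } else { None }
    };
    // First call computes and caches; the second is a pure table lookup.
    assert_eq!(apply("D", 0, &mut memo, d), Some(1));
    assert_eq!(apply("D", 0, &mut memo, d), Some(1));
    assert_eq!(memo.len(), 1);
    println!("ok");
}
```

Note that failures are cached too; remembering that B failed at position 0 is exactly what prevents the exponential re-evaluation described above.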
Conclusion
The whole thought behind this post was to express my interest in the wide variety of parsing algorithms and to make people aware of them, because they’re pretty damn cool. I feel I’ve done them justice by implementing such an easy-to-use parser for everyone; hopefully this inspires others to design and write their own programming languages.
You can check out Iota on GitHub, or grab the crate if you prefer.