Parsing with PetitParser2

Simple, modular and flexible high-performance parsing framework.

Introduction
Introduction to PetitParser2
Migration from PetitParser

Parser Development
Scripting with Bounded Seas
Grammar
Context-Sensitive Grammar
Abstract-Syntax Tree
Full Parser
Comments
Optimizations
Optimization (Memoization)

PetitParser2 Internals
Star Lazy (In Progress)
Caches (In Progress)
Matching Tags (In Progress)
Context-Sensitivity (In Progress)

View the Project on GitHub kursjan/petitparser2

Writing Parsers with PetitParser2

This tutorial is based on Writing Parser with PetitParser from Lukas Renggli.

PetitParser is a parsing framework different to many other popular parser generators. For example, it is not table based such as SmaCC or ANTLR. Instead it uses a unique combination of four alternative parser methodologies: scannerless parsers, parser combinators, parsing expression grammars and packrat parsers. As such PetitParser2 is more powerful in what it can parse and it arguably fits better the dynamic nature of Smalltalk. Let’s have a quick look at these four parser methodologies:

Writing a Simple Grammar

Writing grammars with PetitParser is simple as writing Smalltalk code. For example to write a grammar that can parse identifiers that start with a letter followed by zero or more letter or digits is defined as follows. In a workspace we evaluate:

identifier := #letter asPParser , #word asPParser star.

If you inspect the object identifier you’ll notice that it is an instance of a PP2SequenceNode. This is because the #, operator created a sequence of a letter and a zero or more word character parser. If you dive further into the object you notice the following simple composition of different parser objects:

PP2SequenceNode (this parser accepts a sequence of parsers)
    PP2PredicateObjectNode (this parser accepts a single letter)
    PP2RepeatingNode (this parser accepts zero or more instances of another parser)
       PPPredicateObjectNode (this parser accepts a single word character)

Parsing Some Input

To actually parse a string (or stream) we can use the method #parse::

identifier parse: 'yeah'.          " --> #($y #($e $a $h)) "
identifier parse: 'f12'.           " --> #($f #($1 $2)) "

While it seems odd to get these nested arrays with characters as a return value, this is the default decomposition of the input into a parse tree. We’ll see in a while how that can be customized.

If we try to parse something invalid we get an instance of PP2Failure as an answer:

identifier parse: '123'.           " --> letter expected at 0 "

Instances of PP2Failure are the only objects in the system that answer with true when you send the message #isPetitFailure. Alternatively you can also use #parse:onError: to throw an exception in case of an error:

identifier
   parse: '123'
   onError: [ :msg :pos | self error: msg ].

If you are only interested if a given string (or stream) matches or not you can use the following constructs:

identifier matches: 'foo'.         " --> true "
identifier matches: '123'.         " --> false "

Different Kinds of Parsers

PetitParser2 provides a large set of ready-made parser that you can compose to consume and transform arbitrarily complex languages. The terminal parsers are the most simple ones. We’ve already seen a few of those:

Terminal ParsersDescription
$a asPParser Parses the character $a.
'abc' asPParser Parses the string 'abc'.
#any asPParserParses any character.
#digit asPParserParses the digits 0..9.
#letter asPParserParses the letters a..z and A..Z.

The PP2NodeFactory provides a lot of other factory methods that can be used to build more complex terminal parsers.

The next set of parsers are used to combine other parsers together:

Parser CombinatorsDescription
p1 , p2Parses p1 followed by p2 (sequence).
p1 / p2Parses p1, if that doesn’t work parses p2 (ordered choice).
p starParses zero or more p.
p plusParses one or more p.
p optionalParses p if possible.
p andParses p but does not consume its input.
p notParses p and succeed when p fails, but does not consume its input.
p endParses p and succeed at the end of the input.

Instead of using the #word predicated we could have written our identifier parser like this:

identifier := #letter asPParser , (#letter asPParser / #digit asPParser) star.

To attach an action or transformation to a parser we can use the following methods:

Action ParsersDescription
p ==> aBlockPerforms the transformation given in aBlock.
p flattenCreates a string from the result of p.
p tokenCreates a token from the result of p.
p trimTrims whitespaces before and after p.

To return a string of the parsed identifier, we can modify our parser like this:

identifier := (#letter asPParser , (#letter asPParser / #digit asPParser) star) flatten.

These are the basic elements to build parsers. There are a few more well documented and tested factory methods in the operations protocol of PP2Node. If you want browse that protocol.

Writing a More Complicated Grammar

Now we are able to write a more complicated grammar for evaluating simple arithmetic expressions. Within a workspace we start with the grammar for a number (actually an integer):

number :=  #digit asPParser plus token trim ==> [ :token | token value asNumber ].

Then we define the productions for addition and multiplication in order of precedence. Note that we instantiate the productions as PP2UnresolvedNode upfront, because they recursively refer to each other. The method #def: resolves this recursion using the reflective facilities of the host language:

term := PP2UnresolvedNode new.
prod := PP2UnresolvedNode new.
prim := PP2UnresolvedNode new.
 
term def: (prod , $+ asPParser trim , term ==> [ :nodes | nodes first + nodes last ])
   / prod.
prod def: (prim , $* asPParser trim , prod ==> [ :nodes | nodes first * nodes last ])
   / prim.
prim def: ($( asPParser trim , term , $) asPParser trim ==> [ :nodes | nodes second ])
   / number.

To make sure that our parser consumes all input we wrap it with the end parser into the start production:

start := term end.

That’s it, now we can test our parser and evaluator:

start parse: '1 + 2 * 3'.       " --> 7 "
start parse: '(1 + 2) * 3'.     " --> 9 "

As an exercise we could extend the parser to also accept negative numbers and floating point numbers, not only integers. Furthermore it would be useful to add support subtraction and division as well. All these features can be added with a few lines of PetitParser2 code.