Simple, modular and flexible high-performance parsing framework.
Introduction
Introduction to PetitParser2
Migration from PetitParser
Parser Development
Scripting with Bounded Seas
Grammar
Context-Sensitive Grammar
Abstract-Syntax Tree
Full Parser
Comments
Optimizations
Optimization (Memoization)
PetitParser2 Internals
Star Lazy (In Progress)
Caches (In Progress)
Matching Tags (In Progress)
Context-Sensitivity (In Progress)
In the previous chapter we created a parser to extract a list of JavaScript strings from an HTML source. Now we extend the parser to extract an HTML structure as well.
Elements of HTML has an interesting property: the name of an opening tag has to match the name of a closing tag. Natural for humans, but challenging for parsers.
PetitParser2 comes with a special syntax to express constrains of matching open and close tags.
It can store a result of a rule (e.g. opening an html tag) onto a stack using the push
operator and assert that a result of a rule (e.g. closing an html tag) matches the top of the stack using the match
operator.
First define an element name as a repetition of letters and digits:
WebGrammar>>elementName
^ #word asPParser plus flatten
Than define an element as a sequence of elOpen
, elContent
and elClose
:
WebGrammar>>element
^ (elOpen, elContent, elClose)
In elOpen
, we push the element name as well as we possible arguments as a water:
WebGrammar>>elOpen
^ $< asPParser, elementName push, any starLazy, $> asPParser
==> #second
In elClose
, we first match the element name against the top of a stack and we pop the stack:
WebGrammar>>elClose
^ '</' asPParser, elementName match pop, $> asPParser
elContent
is a zero or more repetitions of the following elements (in the given order):
Javascript is on the first position because it is kind of element and therefore must be ordered before an element rule. The same holds for the element rule, it is also kind of text.
WebGrammar>>elContent
^ (javascript / element / text nonEpsilon) star
Text can be anything.
Therefore, we define it as with the help of bounded seas, concretely using the starLazy
operator:
WebGrammar>>text
^ #any starLazy
Note, we mark the text
rule with nonEpsilon
.
The nonEpsilon
operator is an extension of PEGs that forbids epsilon parses (in other words if the underlying parser does not consume any input, it fails).
The reason for this is that #any asPParser starLazy
can consume anything, even the empty string, because the starLazy
operator allows for zero repetitions.
Without nonEpsilon
, the star repetition of elContent
would end up in an infinite loop recognizing an epsilon in each of its iterations, never failing, never stopping.
You can easily freeze your image by running the following code (we recommend saving your image now):
#any asPParser optional star parse: 'endless loop'
Testing is always a good practice, let’s start with text
:
WebGrammarTest>>testText
self parse: 'foobar' rule: #text
And element
follows:
WebGrammarTest>>testElement
self parse: '<p>lorem ipsum</p>'
rule: #element.
WebGrammarTest>>testElementEmpty
self parse: '<foo></foo>'
rule: #element.
WebGrammarTest>>testElementNested
self parse: '<p>lorem <i>ipsum</i></p>'
rule: #element.
We should be able to parse malformed elements as well.
Lets see if the push
, match
, pop
magic works, as expected:
WebGrammarTest>>testElementMalformedWrongClose
self parse: '<foo><bar>meh</baz></foo>'
rule: #element.
WebGrammarTest>>testElementMalformedExtraClose
self parse: '<foo><bar>meh</bar></fii></foo>'
rule: #element.
WebGrammarTest>>testElementMalformedUnclosed
self parse: '<head><meta content="mess"></head>'
rule: #element.
We extended our parser to match opening and closing elements of HTML.
To do so, we used context-sensitive extension of PetitParser2: push
, pop
and matches
rules.