Simple, modular and flexible high-performance parsing framework.
Context-Sensitive Grammar
In the previous chapter we created a parser to extract a list of JavaScript strings from an HTML source. Now we extend the parser to extract the HTML structure as well.
HTML elements have an interesting property: the name of an opening tag has to match the name of the closing tag. This is natural for humans but challenging for parsers.
PetitParser2 comes with a special syntax to express the constraint that opening and closing tags match.
It can store the result of a rule (e.g. an opening HTML tag) on a stack using the push operator, and assert that the result of another rule (e.g. a closing HTML tag) matches the top of the stack using the match operator.
First define an element name as a repetition of letters and digits:
WebGrammar>>elementName
^ #word asPParser plus flatten
Then define an element as a sequence of elOpen, elContent and elClose:
WebGrammar>>element
^ (elOpen, elContent, elClose)
In elOpen, we push the element name onto the stack and treat any attributes that follow it as water:
WebGrammar>>elOpen
^ $< asPParser, elementName push, any starLazy, $> asPParser
==> #second
In elClose, we first match the element name against the top of the stack and then pop the stack:
WebGrammar>>elClose
^ '</' asPParser, elementName match pop, $> asPParser
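To see push, match and pop at work in isolation, the following playground sketch builds a stripped-down element parser from the same operators. The temporary variable names and the simplified content rule (word characters only) are assumptions made for illustration, and isPetit2Failure is assumed to be available as the usual way to check a PetitParser2 result.

```smalltalk
| name open close element |
"An element name: one or more letters or digits, flattened into a string."
name := #word asPParser plus flatten.
"Opening tag: remember the element name on the context stack."
open := $< asPParser, name push, $> asPParser.
"Closing tag: the name must equal the top of the stack, which is then popped."
close := '</' asPParser, name match pop, $> asPParser.
"Simplified content, for illustration only: zero or more word characters."
element := open, #word asPParser star, close.

"Opening and closing names agree."
(element parse: '<b>bold</b>') isPetit2Failure.
"The closing name 'i' differs from the pushed name 'b'."
(element parse: '<b>bold</i>') isPetit2Failure.
```

If the operators behave as described above, the first expression should answer false (the parse succeeds) and the second should answer true (the match against the top of the stack fails).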
elContent is zero or more repetitions of the following alternatives (in the given order): javascript, element and text.
The javascript rule comes first because JavaScript is a kind of element and must therefore be tried before the element rule. The same holds for the element rule: an element is also a kind of text, so it must be tried before the text rule.
WebGrammar>>elContent
^ (javascript / element / text nonEpsilon) star
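The ordering matters because the choice operator tries its alternatives from left to right and commits to the first one that succeeds. A minimal playground sketch of this behavior follows; the variable names are hypothetical, and the end operator (which requires the whole input to be consumed) is assumed to be available in PetitParser2.

```smalltalk
| generalFirst specificFirst |
"The more general alternative 'a' is tried first and wins;
 'end' then fails on the leftover 'b' and the choice is not retried."
generalFirst := ('a' asPParser / 'ab' asPParser) end.
"The more specific alternative 'ab' is tried first and consumes the whole input."
specificFirst := ('ab' asPParser / 'a' asPParser) end.

generalFirst parse: 'ab'.
specificFirst parse: 'ab'.
```

Under these assumptions only the second parser accepts 'ab', which is why the more specific rules (javascript, then element) are placed before the catch-all text rule.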
Text can be anything.
Therefore, we define it with the help of bounded seas, concretely using the starLazy operator:
WebGrammar>>text
^ #any asPParser starLazy
Note that elContent marks the text rule with nonEpsilon.
The nonEpsilon operator is an extension of PEGs that forbids epsilon parses: if the underlying parser succeeds without consuming any input, nonEpsilon fails.
The reason for this is that #any asPParser starLazy can consume anything, even the empty string, because the starLazy operator allows for zero repetitions.
Without nonEpsilon, the star repetition of elContent would end up in an infinite loop recognizing an epsilon in each of its iterations, never failing, never stopping.
You can easily freeze your image by running the following code (we recommend saving your image now):
#any asPParser optional star parse: 'endless loop'
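For contrast, here is a sketch of the same repetition guarded by nonEpsilon. It assumes that nonEpsilon can wrap any parser in the way it wraps the text rule above; the variable name is hypothetical.

```smalltalk
| guarded |
"optional alone may succeed without consuming input; nonEpsilon turns
 such an empty match into a failure, so star can stop at the end of input."
guarded := #any asPParser optional nonEpsilon star.
guarded parse: 'no endless loop'.
```

With the guard in place, the repetition should terminate once the input is exhausted, because the final empty match of optional is rejected instead of being repeated forever.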
Testing is always good practice. Let's start with text:
WebGrammarTest>>testText
self parse: 'foobar' rule: #text
Tests for element follow:
WebGrammarTest>>testElement
self parse: '<p>lorem ipsum</p>'
rule: #element.
WebGrammarTest>>testElementEmpty
self parse: '<foo></foo>'
rule: #element.
WebGrammarTest>>testElementNested
self parse: '<p>lorem <i>ipsum</i></p>'
rule: #element.
We should be able to parse malformed elements as well.
Let's see whether the push, match and pop magic works as expected:
WebGrammarTest>>testElementMalformedWrongClose
self parse: '<foo><bar>meh</baz></foo>'
rule: #element.
WebGrammarTest>>testElementMalformedExtraClose
self parse: '<foo><bar>meh</bar></fii></foo>'
rule: #element.
WebGrammarTest>>testElementMalformedUnclosed
self parse: '<head><meta content="mess"></head>'
rule: #element.
We extended our parser to match opening and closing elements of HTML.
To do so, we used the context-sensitive extensions of PetitParser2: the push, match and pop operators.