Parsing with PetitParser2

This text is part of the Parsing With PetitParser2 series. The table of content can be found at the end of the chapter.

Extracting the Structure

In the previous chapter we created a parser to extract javascript from HTML source. Yet our parser is rather simple. It sees an input as a list of javascripts. In this part we extend the parser and we extract the HTML structure as well, thus we will obtain a tree of html elements and javascript.

1. Matching Open And Close Tags

Elements of HTML has an interesting property: the name of an opening tag has to match the name of a closing tag. Though natural for humans, this is, surprisingly, rather difficult task from the parsing theory point of view.

Unfortunately, standard solutions do not really work as we would like to. We describe problem in more details in the supplementary Matching Tags chapter.

To fix the issue, PetitParser2 comes with a special syntax to express constrains of matching open and close tags. It can store a result of a rule (e.g. opening an html tag) onto a stack using the push operator and assert that a result of a rule (e.g. closing an html tag) matches the top of the stack using the match operator.

Here is the concrete example: First we define an element name as a repetition of letters and digits:

WebGrammar>>elementName
	^ #word asPParser plus flatten

Than we define element as a sequence of elOpen, elContent and elClose:

WebGrammar>>element
	^ (elOpen, elContent, elClose)

In elOpen, we push the element name as well as we consume water in case an element contains arguments:

WebGrammar>>elOpen
	^ $< asPParser, elementName push, any starLazy, $> asPParser ==> #second

In elClose, we first match the element name against the top of a stack and we pop the stack in case of success:

WebGrammar>>elClose
	^ '</' asPParser, elementName match pop, $> asPParser

2. Element Content

Now we need to define elContent rule, which represents the content of an element. elContent is zero or more repetitions of the following components in the given order:

  1. a javascript code;
  2. another element; or
  3. an unknown text.

Javascript is on the first position because it is kind of element[1]and therefore must be ordered before an element rule. The same holds for the element rule, it is also kind of text.

WebGrammar>>elContent
	^ (javascript / element / text nonEpsilon) star

Text can be anything. Therefore, we define it as with the help of bounded seas, concretely using the starLazy operator:

WebGrammar>>text
	^ #any starLazy

Epsilon in Repetitions

Note, we mark the text rule with nonEpsilon. The nonEpsilon operator is an extension of PEGs that forbids epsilon parses (in other words if the underlying parser does not consume any input, it fails). The reason for this is that #any asPParser starLazy can consume anything, even the empty string, because the starLazy operator allows for zero repetitions.

there might be possibility to define plusLazy as well.

Without nonEpsilon, the star repetition of elContent would end up in an infinite loop recognizing an epsilon in each of its iterations, never failing, never stopping. You can easily freeze your image by running the following code (we recommend saving your image now):

"Tip: save the image before running"
#any asPParser optional star parse: 'endless loop'

3. Testing the Code

We have written a new code, let us try if it works. We start with text:

WebGrammarTest>>testText
	self parse: 'foobar' rule: #text
[PASS]

And element follows:

WebGrammarTest>>testElementEmpty
	self parse: '<foo></foo>' rule: #element
[PASS]

WebGrammarTest>>testElement
	self parse: '<p>lorem ipsum</p>' rule: #element
[PASS]

WebGrammarTest>>testElementNested
	self parse: '<p>lorem <i>ipsum</i></p>' rule: #element
[PASS]

We should be able to parse malformed elements as well. Lets see if the push, match, pop magic works, as expected:

WebGrammarTest>>testElementMalformedWrongClose
	self parse: '<foo><bar>meh</baz></foo>' rule: #element.
[PASS]

WebGrammarTest>>testElementMalformedExtraClose
	self parse: '<foo><bar>meh</bar></fii></foo>' rule: #element
[PASS]

WebGrammarTest>>testElementMalformedUnclosed
	self parse: '<head><meta content="mess"></head>' rule: #element.
[PASS]

4. Conclusion

So far it looks good. But our tests are telling us only that element can parse the given input, it does not tell us how it parses the input. We should assert that the proper element names and the right content is extracted.

It is the time to return a more convenient representation of the input: an abstract syntax tree. We will do this in the following chapter.

Table of Contents

This text is part of the Parsing With PetitParser2 series.

Part I, Developer's Workflow:

Do you have ideas, suggestions or issues? Write us an email, or contact u on github!

Go to top.


[1] any javascript code can be recognized as an html element