Parsing with PetitParser2

This text is part of the Parsing With PetitParser2 series. The table of content can be found at the end of the chapter.

Create an Abstract Syntax Tree

In the previous chapter we tested our grammar to parse the nested elements, but we have very limited (none, I would say) knowledge about how does the extracted structure look like. So let us improve the output from the parser and add more information.

The direct output of a parser is typically very detailed and not well-readable. It is called concrete syntax tree (CST), which is (in the case of PetitParser) array of arrays of arrays of ... with characters and strings as terminals.

It is the time to return a more convenient representation of the input: an abstract syntax tree (AST). AST, contrary to the very low-level and detailed CST, is conceptually closer to the target domain. In our case the domain should be some kind of a tree with html elements and javascript.

It is considered to be a good practice in PetitParser to split grammar, which returns CST, from parser, which returns AST. We do this by subclassing WebGrammar:

WebGrammar subclass: #WebParser
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

1. Testing First

Parsing is great use case for test driven development. So let us start with tests first. Anyway, tests give us a clear idea what kind of interface we expect from abstract syntax tree:

WebGrammarTest subclass: #WebParserTest
	uses: TPP2TypeAssertions
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

WebParserTest>>parserClass
	^ WebParser

WebParserTest>>testElement
	super testElement.
	
	self assert: result name equals: 'p'.
	self assert: result firstChild text equals: 'lorem ipsum'

WebParserTest>>testElementEmpty
	super testElementEmpty.
	
	self assert: result name equals: 'foo'.

WebParserTest>>testElementNested
	super testElementNested.
	
	self assert: result name equals: 'p'.
	self assert: result firstChild text trim equals: 'lorem'.
	self assert: result secondChild name equals: 'i'.
	self assert: result secondChild firstChild text equals: 'ipsum'

And for malformed elements, we expect the following results:

WebParserTest>>testElementMalformedExtraClose
	super testElementMalformedExtraClose.
	
	self assert: result name equals: 'foo'.
	self assert: result secondChild text equals: '</fii>'

WebParserTest>>testElementMalformedWrongClose
	super testElementMalformedWrongClose.
	
	self assert: result name equals: 'foo'.
	self assert: result firstChild text equals: '<bar>meh</baz>'

WebParserTest>>testElementMalformedUnclosed
	super testElementMalformedUnclosed.
	
	self assert: result name equals: 'head'.
	self assert: result firstChild text trim equals: '<meta content="mess">'

And of course, we expect javascript code to be extracted as follows:

WebParserTest>>testJavascript
	super testJavascript.
	
	self assert: result code equals: 'alert("hi there!")'

WebParserTest>>testJavascriptWithString
	super testJavascriptWithString.
	
	self assert: result code equals: 'alert(''</script>'')'

2. AST Nodes

If we want to pass these test, the result of a parse should be a tree consisting of three different nodes:

  1. a javascript code;
  2. an html element; and
  3. a text.

We define these nodes as follows, starting with its abstract predecessor WebElement with some convenience methods.

Object subclass: #WebElement
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

WebElement>>children
	^ #()

WebElement>>firstChild
	"Just for convenience"
	^ self children first

WebElement>>secondChild
	"Just for convenience"
	^ self children second

WebElement>>gtTreeViewIn: composite
	<gtInspectorPresentationOrder: 40>

	composite tree
		title: 'Tree';
		children: [:n | n children ];
		format: [:n| n displayText printStringLimitedTo: 50 ];
		shouldExpandToLevel: 6

JavascriptElement follows:

WebElement subclass: #JavascriptElement
	instanceVariableNames: 'code'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

JavascriptElement>>code
	^ code

JavascriptElement>>code: anObject
	code := anObject

JavascriptElement>>displayText
	^ self code

JavascriptElement>>gtText: composite
	<gtInspectorPresentationOrder: 40>
	
	composite text
		title: 'Text';
		display: [ :context | code ]

As well as HtmlElement:

WebElement subclass: #HtmlElement
	instanceVariableNames: 'name children'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

HtmlElement>>children
	^ children

HtmlElement>>name
	^ name

HtmlElement>>name: newName
	name := newName

HtmlElement>>displayText
	^ self name

Last but not least UnknownText:

WebElement subclass: #UnknownText
	instanceVariableNames: 'text'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'PetitParser2-Tutorial'

UnknownText>>text
	^ text

UnknownText>>text: anObject
	text := anObject

UnknownText>>gtText: composite
	<gtInspectorPresentationOrder: 40>
	
	composite text
		title: 'Text';
		display: [ :context | text ]

3. From a Grammar to a Parser

Now we override the rules of WebGrammar in WebParser so that the javascript rule returns JavascriptElement, the element rule returns HtmlElement and the text rule returns UnknownText:

WebParser>>javascript
	^ super javascript
	
	map: [ :_code | 
		(JavascriptElement new)
			code: _code;
			yourself
	]

WebParser>>element
	^ super element 
	
	map: [ :_open :_content :_close | 
	 	(HtmlElement new)
			name: _open;
			children: _content;
			yourself
	]

WebParser>>text
	^ super text flatten
	
	map: [ :_value | 
		UnknownText new
			text: _value;
			yourself	
	]

And finally, for convenience, we trim whitespaces around html elements:

WebParser>>elClose
	^ super elClose trim

WebParser>>elOpen
	^ super elOpen trimRight

There is trimRight in elOpen, which means that only the whitespace on the right is trimmed. This makes caching of PetitParser slightly more efficient, because element always starts at the first non-whitespace character. If there is a trimming from left and right (using the trim), it might start at any preceding whitespace or the first non-whitespace character and this would lead into lower cache-hit ratio (we will talk about caches later).

By this time all the tests should pass:

(WebGrammarTest buildSuiteFromMethods: #(
	#testElement
	#testElementEmpty
	#testElementNested
	#testElementMalformedUnclosed
	#testElementMalformedExtraClose
	#testElementMalformedWrongClose
	#testJavascript
	#testJavascriptWithString)) run
8 run, 8 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes

4. Structured Document

Now let us construct the whole HTML document. All the pieces we need are already ready we just need to put them together.

In the previous chapters, we defined document (represented by the document rule) in WebGrammar as a repetition of javascript seas. Now we define a structured document (structuredDocument), i.e. a document with an html structure. Of course, we write tests first and we start with WebGrammarTest:

WebGrammarTest>>testStructuredDocumentSimple
	| input |
	input := '<html>
		<body>
			<script>alert("hello world")</script>
		</body>
	</html>'.
	
	self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocumentWithDoctype
	| input |
	input := '
<!DOCTYPE html>
<!-- comment -->
<html>
	<body>
		<script>alert("hello world")</script>
	</body>
</html>'.
	
	self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocument
	| input |
	input := PP2Sources current htmlSample.
	
	self parse: input rule: #structuredDocument

From the tests we see that the root of an html file is an element that can be surrounded by some other information (e.g. doctype or comments), therefore we define structured document as an element surrounded by a sea:

WebGrammar>>structuredDocument
	^ element sea

Furthermore, we verify that we extract the correct structure of the document. This can be done in the WebParserTest:

WebParserTest>>testStructuredDocumentSimple
	| html body javascript |
	super testStructuredDocumentSimple.
	
	self assert: result name equals: 'DOCUMENT'.


	html := result secondChild.
	self assert: html name equals: 'html'.

	body := html firstChild.
	self assert: body name equals: 'body'.
	
	javascript := body firstChild.
	self assert: javascript isKindOf: JavascriptElement.
	self assert: javascript code equals: 'alert("hello world")'.

WebParserTest>>testStructuredDocumentWithDoctype
	| html body javascript |
	super testStructuredDocumentWithDoctype.
	
	self assert: result name equals: 'DOCUMENT'.


	html := result secondChild.
	self assert: html name equals: 'html'.

	body := html firstChild.
	self assert: body name equals: 'body'.
	
	javascript := body firstChild.
	self assert: javascript isKindOf: JavascriptElement.
	self assert: javascript code equals: 'alert("hello world")'.


WebParserTest>>testStructuredDocument
	| html body |
	super testStructuredDocument.
	
	self assert: result name equals: 'DOCUMENT'.

	html := result secondChild.
	self assert: html name equals: 'html'.

	self assert: html firstChild name equals: 'head'.	
	self assert: html secondChild name equals: 'body'.

In order to pass these tests, we create a root element called DOCUMENT in structuredDocument. Furthermore we add to it the html element as well a its surroundings:

WebParser>>structuredDocument
	^ super structuredDocument
	
	map: [ :_bw :_html :_aw |
		| beforeWater afterWater |
		beforeWater := UnknownText new
			text: _bw;
			yourself.
			
		afterWater := UnknownText new
			text: _aw;
			yourself.
			
		HtmlElement new
			name: 'DOCUMENT';
			children: (Array 
				with: beforeWater 
				with: _html 
				with: afterWater);
			yourself
	]

And we should remember to change the start method.

WebGrammar>>start
	^ structuredDocument 

By this time all the tests should pass:

(WebGrammarTest buildSuiteFromMethods: #(
	#testElement
	#testElementEmpty
	#testElementNested
	#testElementMalformedUnclosed
	#testElementMalformedExtraClose
	#testElementMalformedWrongClose
	#testJavascript
	#testJavascriptWithString
	#testStructuredDocumentSimple
	#testStructuredDocumentWithDoctype
	#testStructuredDocument) run
11 run, 11 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes

5. Comments

In this section we focus on HTML comments. Comments can contain scripts or other elements that are not part of the html document (they are just comments that look the same). And they may confuse the parser.

Actually they do and they do so in our tests as well! There is an error in WebParser that we have not found. It extracts one extra HTML element, which is not part of the HTML document. Inspect the following and switch to the tree view:

WebParser new parse: input.

There is a <p> element in the <body>. But <p> is a part of a comment, it should not be included in the document structure. First, we should fix the testStructuredDocument to cover this case:

WebParserTest>>testStructuredDocument
	| html body |
	super testStructuredDocument.
	
	self assert: result name equals: 'DOCUMENT'.

	html := result secondChild.
	self assert: html name equals: 'html'.

	self assert: html firstChild name equals: 'head'.	
	self assert: html secondChild name equals: 'body'.
	
	body := html secondChild.
	self assert: body children size equals: 4.

So where does the problem come from? The cause is the same as in the case of prematurely ended javascript. WebGrammar has no notion of a comment and handles a content of a comment as an ordinary html code. In order to teach it what is an html comment, we have to define it:

WebGrammar>>comment
	^ '<!--' asPParser, any starLazy, '-->' asPParser

And now we can redefine text so that it takes comments into an account:

WebGrammar>>text
	^ (comment / any) starLazy

If you are interested in the details of the starLazy operator, check out the The starLazy Operator chapter.

This is it, we can parse all of it!

6. Conclusion

In this chapter we have applied the test-driven approach to define our AST nodes. We defined WebParser, which extends the WebGrammar and builds the AST nodes. Last but not least, we had a look how to avoid false positives caused by comments.

Complete Sources

You can download the sources here:

The sources of this tutorial are part of the PetitParser2 package, you just need to install PetitParser or use Moose as described in the Introduction.

In the following text we will show how to improve performance of our parser.

Table of Contents

This text is part of the Parsing With PetitParser2 series.

Part I, Developer's Workflow:

Do you have ideas, suggestions or issues? Write us an email, or contact u on github!

Go to top.