Simple, modular and flexible high-performance parsing framework.
Introduction
Introduction to PetitParser2
Migration from PetitParser
Parser Development
Scripting with Bounded Seas
Grammar
Context-Sensitive Grammar
Abstract-Syntax Tree
Full Parser
Comments
Optimizations
Optimization (Memoization)
PetitParser2 Internals
Star Lazy (In Progress)
Caches (In Progress)
Matching Tags (In Progress)
Context-Sensitivity (In Progress)
In the previous chapter we created a grammar to parse nested HTML elements. The grammar parses the input, but we don’t know what is the result. Now, we implement structured html representation of an HTML input.
The typical output of a parser is detailed and not suitable for further processing. It is called concrete syntax tree (CST), which is (in the case of PetitParser) an array of arrays of arrays of … arrays. Characters and strings leaf nodes.
Convenient representation of an input is an abstract syntax tree (AST). AST in our case will be a tree with HTML and JavaScript elements. The AST we will use consists of three nodes:
We define these nodes as follows, starting with its abstract predecessor WebElement
with some convenience methods.
Note that WebParser
and other classes are already loaded in your image. You might want to save them for future reference, or rename them before doing next steps.
Object subclass: #WebElement
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'
WebElement>>children
^ #()
WebElement>>firstChild
"Just for convenience"
^ self children first
WebElement>>secondChild
"Just for convenience"
^ self children second
WebElement>>gtTreeViewIn: composite
<gtInspectorPresentationOrder: 40>
composite tree
title: 'Tree';
children: [:n | n children ];
format: [:n| n displayText printStringLimitedTo: 50 ];
shouldExpandToLevel: 6
JavascriptElement
follows:
WebElement subclass: #JavascriptElement
instanceVariableNames: 'code'
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'
JavascriptElement>>code
^ code
JavascriptElement>>code: anObject
code := anObject
JavascriptElement>>displayText
^ self code
JavascriptElement>>gtText: composite
<gtInspectorPresentationOrder: 40>
composite text
title: 'Text';
display: [ :context | code ]
As well as HtmlElement
:
WebElement subclass: #HtmlElement
instanceVariableNames: 'name children'
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'
HtmlElement>>children
^ children
HtmlElement>>name
^ name
HtmlElement>>name: newName
name := newName
HtmlElement>>displayText
^ self name
Last but not least UnknownText
:
WebElement subclass: #UnknownText
instanceVariableNames: 'text'
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'.
UnknownText>>text
^ text.
UnknownText>>text: anObject
text := anObject.
UnknownText>>gtText: composite
<gtInspectorPresentationOrder: 40>
composite text
title: 'Text';
display: [ :context | text ].
It is a good practice in PetitParser2 to split grammar (returns concrete syntax tree) and parser (returns abstract syntax tree).
We do this by subclassing WebGrammar
:
WebGrammar subclass: #WebParser
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'
Now we override the rules of WebGrammar
in WebParser
so that the javascript
rule returns JavascriptElement
, the element
rule returns HtmlElement
and the text
rule returns UnknownText
:
WebParser>>javascript
^ super javascript
map: [ :_code |
(JavascriptElement new)
code: _code;
yourself
]
WebParser>>element
^ super element
map: [ :_open :_content :_close |
(HtmlElement new)
name: _open;
children: _content;
yourself
]
WebParser>>text
^ super text flatten
map: [ :_value |
UnknownText new
text: _value;
yourself
]
And finally, for convenience, we trim whitespaces around html elements:
WebParser>>elClose
^ super elClose trim
WebParser>>elOpen
^ super elOpen trimRight
There is trimRight
in elOpen
, which means that only the whitespace on the right is trimmed. This makes caching of PetitParser slightly more efficient, because element always starts at the first non-whitespace character. If there is a trimming from left and right (using the trim
), it might start at any preceding whitespace or the first non-whitespace character and this would lead into lower cache-hit ratio (we will talk about caches later).
Don’t forget tests.
WebGrammarTest subclass: #WebParserTest
uses: TPP2TypeAssertions
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'PetitParser2-Tutorial'
WebParserTest>>parserClass
^ WebParser
WebParserTest>>testElement
super testElement.
self assert: result name equals: 'p'.
self assert: result firstChild text equals: 'lorem ipsum'
WebParserTest>>testElementEmpty
super testElementEmpty.
self assert: result name equals: 'foo'.
WebParserTest>>testElementNested
super testElementNested.
self assert: result name equals: 'p'.
self assert: result firstChild text trim equals: 'lorem'.
self assert: result secondChild name equals: 'i'.
self assert: result secondChild firstChild text equals: 'ipsum'
And for malformed elements, we expect the following results:
WebParserTest>>testElementMalformedExtraClose
super testElementMalformedExtraClose.
self assert: result name equals: 'foo'.
self assert: result secondChild text equals: '</fii>'
WebParserTest>>testElementMalformedWrongClose
super testElementMalformedWrongClose.
self assert: result name equals: 'foo'.
self assert: result firstChild text equals: '<bar>meh</baz>'
WebParserTest>>testElementMalformedUnclosed
super testElementMalformedUnclosed.
self assert: result name equals: 'head'.
self assert: result firstChild text trim equals: '<meta content="mess">'
And of course, we expect JavaScript code to be extracted as follows:
WebParserTest>>testJavascript
super testJavascript.
self assert: result code equals: 'alert("hi there!")'
WebParserTest>>testJavascriptWithString
super testJavascriptWithString.
self assert: result code equals: 'alert(''</script>'')'
Try if tests pass:
(WebParserTest buildSuiteFromMethods: #(
#testElement
#testElementEmpty
#testElementNested
#testElementMalformedUnclosed
#testElementMalformedExtraClose
#testElementMalformedWrongClose
#testJavascript
#testJavascriptWithString)) run
8 run, 8 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes
We extended grammar to return convenient input: an abstract syntax tree.
The AST is constructed inWebParser
, which extends the WebGrammar
.
The sources of this tutorial are part of the PetitParser2 package, you just need to install PetitParser2 or use Moose as described in the Introduction.