Create an Abstract Syntax Tree

In the previous chapter we created a grammar to parse nested HTML elements. The grammar parses the input, but we don’t know what is the result. Now, we implement structured html representation of an HTML input.

Note:

The typical output of a parser is detailed and not suitable for further processing. It is called concrete syntax tree (CST), which is (in the case of PetitParser) an array of arrays of arrays of … arrays. Characters and strings leaf nodes.

AST Nodes

Convenient representation of an input is an abstract syntax tree (AST). AST in our case will be a tree with HTML and JavaScript elements. The AST we will use consists of three nodes:

a JavaScript code;
an HTML element; and
a text.

We define these nodes as follows, starting with its abstract predecessor WebElement with some convenience methods.

Warning:

Note that WebParser and other classes are already loaded in your image. You might want to save them for future reference, or rename them before doing next steps.

Object subclass: #WebElement
  instanceVariableNames: ''
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'

WebElement>>children
  ^ #()

WebElement>>firstChild
  "Just for convenience"
  ^ self children first

WebElement>>secondChild
  "Just for convenience"
  ^ self children second

WebElement>>gtTreeViewIn: composite
  <gtInspectorPresentationOrder: 40>

  composite tree
    title: 'Tree';
    children: [:n | n children ];
    format: [:n| n displayText printStringLimitedTo: 50 ];
    shouldExpandToLevel: 6

JavascriptElement follows:

WebElement subclass: #JavascriptElement
  instanceVariableNames: 'code'
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'

JavascriptElement>>code
  ^ code

JavascriptElement>>code: anObject
  code := anObject

JavascriptElement>>displayText
  ^ self code

JavascriptElement>>gtText: composite
  <gtInspectorPresentationOrder: 40>
  
  composite text
    title: 'Text';
    display: [ :context | code ]

As well as HtmlElement:

WebElement subclass: #HtmlElement
  instanceVariableNames: 'name children'
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'

HtmlElement>>children
  ^ children

HtmlElement>>name
  ^ name

HtmlElement>>name: newName
  name := newName

HtmlElement>>displayText
  ^ self name

Last but not least UnknownText:

WebElement subclass: #UnknownText
  instanceVariableNames: 'text'
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'.

UnknownText>>text
  ^ text.

UnknownText>>text: anObject
  text := anObject.

UnknownText>>gtText: composite
  <gtInspectorPresentationOrder: 40>
  
  composite text
    title: 'Text';
    display: [ :context | text ].

From a Grammar to a Parser

It is a good practice in PetitParser2 to split grammar (returns concrete syntax tree) and parser (returns abstract syntax tree). We do this by subclassing WebGrammar:

WebGrammar subclass: #WebParser
  instanceVariableNames: ''
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'

Now we override the rules of WebGrammar in WebParser so that the javascript rule returns JavascriptElement, the element rule returns HtmlElement and the text rule returns UnknownText:

WebParser>>javascript
  ^ super javascript
  
  map: [ :_code | 
    (JavascriptElement new)
      code: _code;
      yourself
  ]

WebParser>>element
  ^ super element 
  
  map: [ :_open :_content :_close | 
     (HtmlElement new)
      name: _open;
      children: _content;
      yourself
  ]

WebParser>>text
  ^ super text flatten
  
  map: [ :_value | 
    UnknownText new
      text: _value;
      yourself  
  ]

And finally, for convenience, we trim whitespaces around html elements:

WebParser>>elClose
  ^ super elClose trim

WebParser>>elOpen
  ^ super elOpen trimRight

Note:

There is trimRight in elOpen, which means that only the whitespace on the right is trimmed. This makes caching of PetitParser slightly more efficient, because element always starts at the first non-whitespace character. If there is a trimming from left and right (using the trim), it might start at any preceding whitespace or the first non-whitespace character and this would lead into lower cache-hit ratio (we will talk about caches later).

Testing

Don’t forget tests.

WebGrammarTest subclass: #WebParserTest
  uses: TPP2TypeAssertions
  instanceVariableNames: ''
  classVariableNames: ''
  poolDictionaries: ''
  category: 'PetitParser2-Tutorial'

WebParserTest>>parserClass
  ^ WebParser

WebParserTest>>testElement
  super testElement.
  
  self assert: result name equals: 'p'.
  self assert: result firstChild text equals: 'lorem ipsum'

WebParserTest>>testElementEmpty
  super testElementEmpty.
  
  self assert: result name equals: 'foo'.

WebParserTest>>testElementNested
  super testElementNested.
  
  self assert: result name equals: 'p'.
  self assert: result firstChild text trim equals: 'lorem'.
  self assert: result secondChild name equals: 'i'.
  self assert: result secondChild firstChild text equals: 'ipsum'

And for malformed elements, we expect the following results:

WebParserTest>>testElementMalformedExtraClose
  super testElementMalformedExtraClose.
  
  self assert: result name equals: 'foo'.
  self assert: result secondChild text equals: '</fii>'

WebParserTest>>testElementMalformedWrongClose
  super testElementMalformedWrongClose.
  
  self assert: result name equals: 'foo'.
  self assert: result firstChild text equals: '<bar>meh</baz>'

WebParserTest>>testElementMalformedUnclosed
  super testElementMalformedUnclosed.
  
  self assert: result name equals: 'head'.
  self assert: result firstChild text trim equals: '<meta content="mess">'

And of course, we expect JavaScript code to be extracted as follows:

WebParserTest>>testJavascript
	super testJavascript.
	
	self assert: result code equals: 'alert("hi there!")'

WebParserTest>>testJavascriptWithString
	super testJavascriptWithString.
	
	self assert: result code equals: 'alert(''</script>'')'

Try if tests pass:

(WebParserTest buildSuiteFromMethods: #(
  #testElement
  #testElementEmpty
  #testElementNested
  #testElementMalformedUnclosed
  #testElementMalformedExtraClose
  #testElementMalformedWrongClose
  #testJavascript
  #testJavascriptWithString)) run

8 run, 8 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes

Summary

We extended grammar to return convenient input: an abstract syntax tree. The AST is constructed inWebParser, which extends the WebGrammar.

Sources

The sources of this tutorial are part of the PetitParser2 package, you just need to install PetitParser2 or use Moose as described in the Introduction.