Parsing with PetitParser2

Simple, modular and flexible high-performance parsing framework.

Full Parser

Let us now parse a whole HTML document. All the pieces are ready; we just need to put them together.

Structured Document

In the previous chapters, we defined a document (represented by the document rule) in WebGrammar as a repetition of JavaScript seas. Now we define a structured document (structuredDocument), i.e. a document with an HTML structure. Of course, we write tests first, starting with WebGrammarTest; they are listed in the Tests section below.

The root of an HTML file is an element that can be surrounded by other information (e.g. a doctype or comments). Therefore, we define a structured document as an element surrounded by a sea:

WebGrammar>>structuredDocument
  ^ element sea
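
To see why this rule yields three parts, it can help to try the sea operator on a tiny input in a playground. The following is a minimal sketch, assuming PetitParser2 is loaded; the literal 'island' and the input string are made up for illustration and are not part of WebGrammar. A sea answers a three-element collection: the water before the island, the island itself, and the water after it.

| sea result |
sea := 'island' asPParser sea.
result := sea parse: '~~~island~~~'.
result size.    "3: before water, island, after water"
result second.  "the island, here 'island'"

The structuredDocument rule produces the same shape, with the element as the island.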

As the root, return a DOCUMENT element that contains the HTML element as well as its surroundings:

WebParser>>structuredDocument
  "Wrap the root element and the water around it in a DOCUMENT element."
  ^ super structuredDocument
    map: [ :_bw :_html :_aw |
      | beforeWater afterWater |
      "Copy the water before the root element into a String."
      beforeWater := UnknownText new
        text: (String new writeStream nextPutAll: _bw; yourself) contents;
        yourself.

      "Copy the water after the root element into a String."
      afterWater := UnknownText new
        text: (String new writeStream nextPutAll: _aw; yourself) contents;
        yourself.

      HtmlElement new
        name: 'DOCUMENT';
        children: (Array
          with: beforeWater
          with: _html
          with: afterWater);
        yourself
    ]
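
The text: argument in the method above builds a String from the water by writing it to a stream. As a small sketch of that idiom in isolation, assume the water is a collection of characters (the Array below is a made-up stand-in, not actual parser output):

| water text |
water := Array with: $a with: $b with: $c.
text := (String new writeStream nextPutAll: water; yourself) contents.
text.  "'abc'"

The cascade ends with yourself so that the parenthesised expression answers the stream rather than the result of nextPutAll:, and contents then answers the accumulated String.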

Finally, remember to change the start method so that parsing begins with the structured document:

WebGrammar>>start
  ^ structuredDocument 
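
With start pointing at structuredDocument, a parser instance can be applied directly to an input. This is a minimal playground sketch, assuming WebParser, HtmlElement and the secondChild accessor exist as defined in this and the previous chapters; the input string is made up for illustration:

| document |
document := WebParser new parse: '<html><body></body></html>'.
document name.              "the synthetic root, 'DOCUMENT'"
document secondChild name.  "the name of the html element"

The html element is the second child because the first and third children hold the text before and after it.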

Tests

Verify the code works as expected:

WebGrammarTest>>testStructuredDocumentSimple
  | input |
  input := '<html>
    <body>
      <script>alert("hello world")</script>
    </body>
  </html>'.
  
  self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocumentWithDoctype
  | input |
  input := '
<!DOCTYPE html>
<!-- comment -->
<html>
  <body>
    <script>alert("hello world")</script>
  </body>
</html>'.
  
  self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocument
  | input |
  input := PP2Sources current htmlSample.
  
  self parse: input rule: #structuredDocument

Then verify in WebParserTest that the parser extracts the correct structure of the document:

WebParserTest>>testStructuredDocumentSimple
  | html body javascript |
  super testStructuredDocumentSimple.
  
  self assert: result name equals: 'DOCUMENT'.


  html := result secondChild.
  self assert: html name equals: 'html'.

  body := html firstChild.
  self assert: body name equals: 'body'.
  
  javascript := body firstChild.
  self assert: javascript isKindOf: JavascriptElement.
  self assert: javascript code equals: 'alert("hello world")'.

WebParserTest>>testStructuredDocumentWithDoctype
  | html body javascript |
  super testStructuredDocumentWithDoctype.
  
  self assert: result name equals: 'DOCUMENT'.


  html := result secondChild.
  self assert: html name equals: 'html'.

  body := html firstChild.
  self assert: body name equals: 'body'.
  
  javascript := body firstChild.
  self assert: javascript isKindOf: JavascriptElement.
  self assert: javascript code equals: 'alert("hello world")'.

WebParserTest>>testStructuredDocument
  | html body |
  super testStructuredDocument.
  
  self assert: result name equals: 'DOCUMENT'.

  html := result secondChild.
  self assert: html name equals: 'html'.

  self assert: html firstChild name equals: 'head'.  
  self assert: html secondChild name equals: 'body'.

By now, all the tests should pass:

(WebParserTest buildSuiteFromMethods: #(
  #testElement
  #testElementEmpty
  #testElementNested
  #testElementMalformedUnclosed
  #testElementMalformedExtraClose
  #testElementMalformedWrongClose
  #testJavascript
  #testJavascriptWithString
  #testStructuredDocumentSimple
  #testStructuredDocumentWithDoctype
  #testStructuredDocument)) run.
11 run, 11 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes

Summary

We completed the parser from the previous chapters so that it properly parses a whole HTML document.

Sources

The sources of this tutorial are part of the PetitParser2 package. You just need to install PetitParser2 or use Moose, as described in the Introduction.