Simple, modular and flexible high-performance parsing framework.
Let us construct a parser for the whole HTML document. All the pieces are ready; we just need to put them together.
In the previous chapters, we defined a document (represented by the document rule) in WebGrammar as a repetition of javascript seas.
Now we define a structured document (structuredDocument), i.e. a document with an HTML structure.
Of course, we write the tests first, starting with WebGrammarTest. The test methods appear below, after the grammar and parser definitions.
The root of an HTML file is an element that can be surrounded by other information (e.g. a doctype or comments). Therefore, we define a structured document as an element surrounded by a sea:
WebGrammar>>structuredDocument
    ^ element sea
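To get a feel for what a sea produces, the following playground sketch applies sea to a trivial literal parser instead of element. It assumes PetitParser2 is loaded; the 'html' literal and the sample input are made up for illustration, and the exact representation of the water parts is deliberately not shown, since it may differ between versions.

```smalltalk
"Playground sketch, assuming PetitParser2 is loaded.
 The 'html' literal stands in for the element rule; the input is illustrative."
| island result |
island := 'html' asPParser sea.
result := island parse: '<!-- before -->html<!-- after -->'.

"On success, the result should hold three parts, in this order:
 the water before the island, the island itself, and the water after it."
result isPetit2Failure
    ifTrue: [ nil ]
    ifFalse: [ result second ]
```

Inspecting result first and result last in the playground shows what the sea skipped on either side of the island.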
As the root element, return a DOCUMENT element that contains the HTML element as well as its surroundings:
WebParser>>structuredDocument
    ^ super structuredDocument
        map: [ :_bw :_html :_aw |
            | beforeWater afterWater |
            "Turn the water before and after the root element into UnknownText nodes."
            beforeWater := UnknownText new
                text: (String new writeStream nextPutAll: _bw; yourself) contents;
                yourself.
            afterWater := UnknownText new
                text: (String new writeStream nextPutAll: _aw; yourself) contents;
                yourself.
            "Wrap the surroundings and the root element in a DOCUMENT element."
            HtmlElement new
                name: 'DOCUMENT';
                children: (Array
                    with: beforeWater
                    with: _html
                    with: afterWater);
                yourself
        ]
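The expression that converts the water into a string, (String new writeStream nextPutAll: _bw; yourself) contents, deserves a closer look. The sketch below isolates it in a playground, assuming that the water is a collection of characters; the sample water value is hypothetical. The cascade ends with yourself because nextPutAll: is expected to answer its argument rather than the stream, so without yourself the contents message would be sent to the wrong object.

```smalltalk
"Pure Pharo sketch; the water value is a hypothetical stand-in for _bw."
| water stream text |
water := #($a $b $c).

"The cascade sends both nextPutAll: and yourself to the stream;
 yourself makes the whole cascade answer the stream itself."
stream := String new writeStream
    nextPutAll: water;
    yourself.

"contents answers the String accumulated by the stream."
text := stream contents.
text
```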
And we should remember to change the start method.
WebGrammar>>start
    ^ structuredDocument
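With start pointing to structuredDocument, a parser instance can be used directly, without naming the rule. The snippet below is a sketch that assumes WebParser and the node classes from this tutorial (HtmlElement, UnknownText) are defined in the image; the input string is illustrative.

```smalltalk
"Playground sketch; assumes WebParser, HtmlElement and UnknownText from this tutorial exist."
| document |
document := WebParser new parse: '<html><body></body></html>'.

"On success, the root node is the DOCUMENT element built in WebParser>>structuredDocument,
 and its second child is the html element."
document isPetit2Failure
    ifTrue: [ nil ]
    ifFalse: [ document secondChild name ]
```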
Verify the code works as expected:
WebGrammarTest>>testStructuredDocumentSimple
    | input |
    input := '<html>
<body>
<script>alert("hello world")</script>
</body>
</html>'.
    self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocumentWithDoctype
    | input |
    input := '
<!DOCTYPE html>
<!-- comment -->
<html>
<body>
<script>alert("hello world")</script>
</body>
</html>'.
    self parse: input rule: #structuredDocument

WebGrammarTest>>testStructuredDocument
    | input |
    input := PP2Sources current htmlSample.
    self parse: input rule: #structuredDocument
Verify that we extract the correct structure of the document.
WebParserTest>>testStructuredDocumentSimple
    | html body javascript |
    super testStructuredDocumentSimple.
    self assert: result name equals: 'DOCUMENT'.
    html := result secondChild.
    self assert: html name equals: 'html'.
    body := html firstChild.
    self assert: body name equals: 'body'.
    javascript := body firstChild.
    self assert: (javascript isKindOf: JavascriptElement).
    self assert: javascript code equals: 'alert("hello world")'.

WebParserTest>>testStructuredDocumentWithDoctype
    | html body javascript |
    super testStructuredDocumentWithDoctype.
    self assert: result name equals: 'DOCUMENT'.
    html := result secondChild.
    self assert: html name equals: 'html'.
    body := html firstChild.
    self assert: body name equals: 'body'.
    javascript := body firstChild.
    self assert: (javascript isKindOf: JavascriptElement).
    self assert: javascript code equals: 'alert("hello world")'.
WebParserTest>>testStructuredDocument
    | html |
    super testStructuredDocument.
    self assert: result name equals: 'DOCUMENT'.
    html := result secondChild.
    self assert: html name equals: 'html'.
    self assert: html firstChild name equals: 'head'.
    self assert: html secondChild name equals: 'body'.
By this time all the tests should pass:
(WebParserTest buildSuiteFromMethods: #(
#testElement
#testElementEmpty
#testElementNested
#testElementMalformedUnclosed
#testElementMalformedExtraClose
#testElementMalformedWrongClose
#testJavascript
#testJavascriptWithString
#testStructuredDocumentSimple
#testStructuredDocumentWithDoctype
#testStructuredDocument)) run.
11 run, 11 passes, 0 skipped, 0 expected failures, 0 failures, 0 errors, 0 unexpected passes
We have extended the parser from the previous chapter so that it properly parses a whole HTML document.
The sources of this tutorial are part of the PetitParser2 package; you just need to install PetitParser2 or use Moose, as described in the Introduction.