In the previous chapter we tested that our grammar parses nested elements,
but we have very limited (none, I would say) knowledge of what the extracted structure looks like.
So let us improve the output from the parser and add more information.
The direct output of a parser is typically very detailed and hard to read.
It is called a concrete syntax tree (CST), which in the case of PetitParser is an array of arrays of arrays of ... with characters and strings as terminals.
It is time to return a more convenient representation of the input: an abstract syntax tree (AST).
An AST, in contrast to the very low-level and detailed CST, is conceptually closer to the target domain.
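To make the difference between the two trees concrete, here is a minimal sketch in Python rather than in PetitParser. The shape of the CST below is an assumption made for illustration (the arrays PetitParser actually returns may be nested differently), and `flatten` and `to_ast` are hypothetical helper names.

```python
# Hypothetical CST for "<p>lorem ipsum</p>": nested lists with characters
# and strings as leaves (the exact shape PetitParser produces may differ).
cst = [
    ["<", ["p"], ">"],
    [list("lorem ipsum")],
    ["</", ["p"], ">"],
]


def flatten(node):
    """Join all leaf characters and strings of a nested list into one string."""
    if isinstance(node, str):
        return node
    return "".join(flatten(child) for child in node)


def to_ast(cst):
    """Turn the low-level CST into a small, domain-oriented dictionary."""
    opening, content, _closing = cst
    return {"name": flatten(opening[1]), "text": flatten(content)}


ast = to_ast(cst)
print(ast)
```

The CST keeps every bracket and every character, whereas the AST keeps only what a user of the parser cares about: the element name and its text.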
It is considered good practice in PetitParser to separate the grammar, which returns a CST, from the parser, which returns an AST.
We do this by subclassing WebGrammar:
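The same split can be illustrated outside of Pharo. The following is a minimal Python sketch, not PetitParser code: the class names `Grammar`, `Parser` and `Element` and the regular expression are assumptions made for this example. The point is only that the subclass reuses the rule and changes what the rule returns.

```python
import re
from dataclasses import dataclass


class Grammar:
    """Recognizes a flat element and returns raw, CST-like pieces."""

    # Deliberately simplistic: one element with plain text inside.
    ELEMENT = re.compile(r"<(\w+)>([^<]*)</(\w+)>")

    def element(self, text):
        match = self.ELEMENT.fullmatch(text)
        if match is None:
            raise ValueError(f"not an element: {text!r}")
        return list(match.groups())  # [open name, content, close name]


@dataclass
class Element:
    name: str
    text: str


class Parser(Grammar):
    """Reuses the grammar rule and only changes what the rule returns."""

    def element(self, text):
        open_name, content, _close_name = super().element(text)
        return Element(name=open_name, text=content)


print(Grammar().element("<p>lorem ipsum</p>"))
print(Parser().element("<p>lorem ipsum</p>"))
```

Keeping the recognition logic in the superclass means the grammar tests stay valid for the parser, and the parser tests only have to check the shape of the returned nodes.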
1. Testing First
Parsing is a great use case for test-driven development.
So let us start with the tests.
In any case, tests give us a clear idea of the interface we expect from the abstract syntax tree:
And for malformed elements, we expect the following results:
2. AST Nodes
If we want to pass these tests, the result of a parse should be a tree consisting of three different nodes:
an html element; and
We define these nodes as follows, starting with their abstract superclass WebElement, which provides some convenience methods.
As well as HtmlElement:
Last but not least, UnknownText:
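As a rough picture of such a node hierarchy, here is a minimal Python sketch. Only the class names WebElement, HtmlElement and UnknownText come from the text; the fields, the convenience methods `first_child` and `is_text`, and the `build_element` helper are assumptions made for illustration. In particular, turning an element with mismatched tags into an UnknownText is an assumed policy for malformed input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebElement:
    """Abstract base of all nodes; holds shared convenience methods."""

    children: List["WebElement"] = field(default_factory=list)

    def first_child(self):
        return self.children[0] if self.children else None

    def is_text(self):
        return False


@dataclass
class HtmlElement(WebElement):
    name: str = ""


@dataclass
class UnknownText(WebElement):
    text: str = ""

    def is_text(self):
        return True


def build_element(open_name, children, close_name):
    """Assumed rule: matching tag names give an HtmlElement, anything else
    is kept as UnknownText so that malformed input does not abort the parse."""
    if open_name == close_name:
        return HtmlElement(children=children, name=open_name)
    return UnknownText(text=f"<{open_name}>...</{close_name}>")


ok = build_element("p", [UnknownText(text="lorem ipsum")], "p")
bad = build_element("p", [], "div")
print(ok.name, ok.first_child().text)
print(type(bad).__name__)
```

Putting the shared helpers on the base class keeps the tests short: they can ask any node for its first child without caring which concrete node they got.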
3. From a Grammar to a Parser
And finally, for convenience, we trim whitespace around HTML elements:
Note that elOpen uses trimRight, which means that only the whitespace on the right is trimmed. This makes PetitParser's caching slightly more efficient, because an element always starts at the first non-whitespace character. If whitespace were trimmed on both sides (using trim), an element could start at any of the preceding whitespace characters or at the first non-whitespace character, which would lower the cache-hit ratio (we will talk about caches later).
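The effect on the cache can be sketched with a toy model in Python. This is not how PetitParser implements its caches; it is a plain dictionary keyed by start position, and the function name, the sample text and the call positions are assumptions chosen to show the effect.

```python
def count_cache_misses(text, call_positions, skip_left_whitespace):
    """Memoize 'parse a tag at position' and count how often the cache misses.

    With skip_left_whitespace=True the rule itself trims on the left, so it
    can be entered at any whitespace position before the tag.
    With False the caller is assumed to have trimmed on the right already.
    """
    cache = {}
    misses = 0
    for pos in call_positions:
        if pos in cache:
            continue  # cache hit: same rule, same start position
        misses += 1
        start = pos
        if skip_left_whitespace:
            while start < len(text) and text[start].isspace():
                start += 1
        end = text.index(">", start) + 1
        cache[pos] = text[start:end]
    return misses


text = "<a>   <b>"  # the "<b>" tag starts at index 6

# Trimming on both sides: the rule may be entered at any whitespace before <b>.
both_sides = count_cache_misses(text, [3, 4, 5, 6], skip_left_whitespace=True)
# Trimming on the right only: earlier tokens consumed the whitespace already.
right_only = count_cache_misses(text, [6, 6, 6, 6], skip_left_whitespace=False)
print(both_sides, right_only)
```

In the first case the same tag is parsed four times under four different keys; in the second case it is parsed once and found in the cache afterwards.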
By this time all the tests should pass:
4. Structured Document
Now let us construct the whole HTML document.
All the pieces we need are ready; we just need to put them together.
Now we define a structured document (structuredDocument), i.e. a document with an html structure.
Of course, we write tests first and we start with WebGrammarTest:
From the tests we see that the root of an HTML file is an element that can be surrounded by some other information (e.g. a doctype or comments). Therefore, we define a structured document as an element surrounded by a sea:
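The idea of an element surrounded by a sea can be approximated in a few lines of Python. This is only a sketch of the concept, not the PetitParser operator: a regular expression stands in for the element rule, and the function name and the sample input are assumptions.

```python
import re


def structured_document(text):
    """Return (before, element, after): an <html> island and the water around it.

    A regex stands in for the element rule here; it is only an approximation
    and would not cope with attributes on the <html> tag.
    """
    match = re.search(r"<html>.*</html>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <html> element found")
    return text[: match.start()], match.group(0), text[match.end():]


source = "<!DOCTYPE html>\n<html><body>hi</body></html>\n<!-- the end -->"
before, element, after = structured_document(source)
print(repr(before))
print(element)
print(repr(after))
```

Everything before and after the island is kept as uninterpreted water, so the doctype and the trailing comment do not need rules of their own.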
Furthermore, we verify that we extract the correct structure of the document.
This can be done in WebParserTest:
In order to pass these tests, we create a root element called DOCUMENT in structuredDocument.
Furthermore, we add to it the HTML element as well as its surroundings:
And we should remember to change the start method.
By this time all the tests should pass:
In this section we focus on HTML comments.
Comments can contain scripts or other elements that are not part of the html document (they are just comments that look the same).
And they may confuse the parser.
Actually, they do, and they do so in our tests as well!
There is an error in WebParser that we have not found.
It extracts one extra HTML element, which is not part of the HTML document.
Inspect the following and switch to the tree view:
There is a <p> element in the <body>.
But this <p> is part of a comment, so it should not be included in the document structure.
First, we should fix testStructuredDocument to cover this case:
So where does the problem come from?
WebGrammar has no notion of a comment and handles the content of a comment as ordinary HTML code.
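The same false positive is easy to reproduce with a deliberately naive scanner. The Python sketch below is not WebGrammar; the sample input and the regular expression are assumptions that mimic a grammar with no rule for comments.

```python
import re

source = "<body><!-- <p>not real</p> --><b>real</b></body>"

# Without a comment rule, every "<name>" looks like an opening tag,
# including the one that only exists inside the comment.
naive_names = re.findall(r"<(\w+)>", source)
print(naive_names)
```

The `p` in the result comes from inside the comment, which is exactly the extra element we saw in the tree view.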
To teach it what an HTML comment is, we have to define it:
And now we can redefine text so that it takes comments into account:
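In Python terms, the fix amounts to trying the comment alternative before the tag alternative. The sketch below is an analogy rather than the grammar itself: the lazy quantifier `.*?` stops at the first `-->`, which is roughly the role a lazy repetition plays in the comment rule. The names `TOKEN` and `opening_tags` and the sample input are assumptions.

```python
import re

# The comment alternative comes first; ".*?" is a lazy repetition that stops
# at the first "-->", so the comment body is consumed as one opaque chunk.
TOKEN = re.compile(r"(?P<comment><!--.*?-->)|<(?P<open>\w+)>", re.DOTALL)


def opening_tags(source):
    """Collect opening tag names, skipping anything inside a comment."""
    return [m.group("open") for m in TOKEN.finditer(source) if m.group("open")]


source = "<body><!-- <p>not real</p> --><b>real</b></body>"
print(opening_tags(source))
```

Because the whole comment is matched as a single token, the `<p>` inside it is never offered to the tag alternative.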
If you are interested in the details of the starLazy operator, check out the chapter The starLazy Operator.
This is it, we can parse all of it!
In this chapter we have applied the test-driven approach to define our AST nodes.
We defined WebParser, which extends WebGrammar and builds the AST nodes.
Last but not least, we had a look at how to avoid false positives caused by comments.