Parsing with PetitParser2

This text is part of the Parsing With PetitParser2 series. The table of contents can be found at the end of the chapter.

Manual Memoization

Please note that the WebGrammar offered for download in the Abstract Syntax Tree chapter already contains an optimized version of the element rule. In this chapter we assume that element is defined as in the Extracting the Structure chapter, i.e. it looks like this:

WebGrammar>>element
	^ (elOpen, elContent, elClose)

In the previous chapter, Optimizations, we greatly improved the performance of our parser. So let's parse some real sources. The home pages of Wikipedia, GitHub, Facebook and Google can be parsed by invoking these commands:

sources := PP2Sources current htmlSourcesAll.
parser := WebParser new optimize.
sources collect: [ :s |
	parser parse: s.
]
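
To put a number on it, you can wrap the loop in a timing block. This is a minimal sketch using a plain Pharo idiom (timeToRun on a block), not anything specific to PetitParser2:

sources := PP2Sources current htmlSourcesAll.
parser := WebParser new optimize.
"measure how long it takes to parse all the sources"
time := [ sources do: [ :s | parser parse: s ] ] timeToRun.
Transcript show: time printString; cr.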

Unfortunately, this still takes too much time, even though we applied the automated optimizations. Can we do something that the optimizations cannot? In general, it is hard to say; it depends on the nature of the grammar and of the input to be parsed. But in this case we can really improve the performance (obviously, we would not have written this chapter otherwise).

It is a good time to check the events morph of a debug result. The events morph shows a timeline of parser invocations: each dot represents an invocation at a given position (x-axis) at a given time (y-axis). Inspect the result of the following command again and switch to the Events view.

WebParser new optimize debug: input.

0.1. Visualization of invocations of the optimized parser on input

The screenshot in Figure 0.1 shows only a part of the story (the part that fits into a window). The whole story is that parsing progresses quite fast towards the end of the input (this is the good part), but later on one can notice (when scrolling down) that the parser jumps back to the beginning of the input and starts again. Over and over again (this is the bad part).

Please note that for long inputs the Events tab shows only the beginning of the input and the first few thousand invocations.

Even though the optimizations in PetitParser2 work reasonably well, not all of them can be applied automatically. Some grammars do a lot of backtracking, and the automated optimizations can remove only part of it. If we asked you why the backtracking happens, you would probably have a hard time figuring it out.[1] So does the code that applies the automated optimizations, which is why we have to do it manually.

Luckily, there are tools to help us. For convenience of visualization, we use a compact and simplified input:

compact := '
<!a>
<h>
<m foo>
<m e>
<b>
Lorem ipsum dolor sit amet
</b>
</h>
'.
WebParser new optimize debug: compact.

0.2. Visualization of invocations of the optimized parser on compact input

1. Searching for the Cause

In the events morph we can see how the parser backtracks, over and over. We have to do a bit of detective work to figure out when and why. Let us navigate through the parsing until the HTML body (represented by the b element in the compact input) is parsed:

0.3. Debug view of the element rule

Now switch to the Traces tab:

0.4. Debug view of the element rule, traces tab

The yellow rectangle in Figure 0.4 highlights 872 underlying invocations of the element rule. The dark red rectangles highlight other invocations of the same element rule that started at the same position, and we can see there are quite a few of them. In general, one does not want to see repeating dark-red rectangles; the more of them there are, and the taller they are, the worse. They simply mean redundant computations. Remember, each pixel on the y-axis is a parser invocation.

Why do these redundant invocations happen? Because of the unclosed HTML meta elements (represented by m in the compact input). The parser sees meta, starts an element, and parses the content of meta, including the body element. But it does not find the closing tag of meta, so it returns to the position before the meta element, skips the meta part as water, and continues parsing. This means it re-parses the body element again. The more meta elements, the worse.
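
To see the same effect in the small, consider the following toy example (our own illustration, not part of WebGrammar). An ordered choice first tries its first alternative; when that alternative fails after consuming input, the parser backtracks and the second alternative re-parses the same prefix:

a := 'a' asPParser.
choice := (a, 'b' asPParser) / (a, 'c' asPParser).
choice parse: 'ac'.
"the first alternative consumes 'a', fails on 'c', and backtracks;
the second alternative then parses 'a' again from position 0"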

If you are watchful, you might have noticed a line behind a dark-red bar. This line is created by the automated cache, which immediately returns the result of the body element and prevents the redundant re-parse. But because the cache can remember only the last result, it prevents only half of the executions. Twice as fast is good, but we can do even better.

2. Fixing the Cause

There is a solution to this, called memoization. Memoization is a technique that remembers all the results at all the positions (whereas the cache remembers only the last result at the last position). The extra power comes at the cost of extra overhead, therefore memoization should not be used superfluously.
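
In the toy example above, memoizing the shared sub-parser removes the redundant work: the second alternative finds the result of a at position 0 already remembered. This is only a sketch; it assumes, following this chapter, that memoize is a hint honored once the parser is optimized, and that optimize is available on plain parsers as well as on composite parser classes:

a := 'a' asPParser
	memoize;
	yourself.
choice := ((a, 'b' asPParser) / (a, 'c' asPParser)) optimize.
choice parse: 'ac'.
"the failing first alternative stores the result of a at position 0;
the second alternative reuses it instead of re-parsing"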

To suggest to PetitParser2 that it should memoize a rule, send the memoize message to the element rule:

WebGrammar>>element
	^ (elOpen, elContent, elClose)
		memoize;
		yourself

Please note that PetitParser2 currently does not support memoization of push and pop parsers. "Neutral" parsers, for example a sequence of a push and a pop as in the case of the element rule, can be memoized.

Let us check the result now:

WebParser new optimize debug: compact.

0.5. Memoized trace of a compact input

Figure 0.5 shows the result we want to see. Even though the parser still backtracks, this is due to the nature of the grammar: the grammar is defined in a way that allows it to speculate, and so it speculates. We see two big backtrackings, one for each meta in the input, but the HTML body itself is not re-parsed over and over again, and precious time is saved.

If you want to avoid the backtracking over meta elements altogether, you have to include the syntax of meta in the grammar.
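
For example, a dedicated rule could consume the whole unclosed tag, so that the parser never speculates about a matching closing tag. The following is only a sketch under our assumptions; the rule name meta and the exact tag syntax are illustrative, not part of the downloadable WebGrammar:

WebGrammar>>meta
	"consume an unclosed meta-like tag, e.g. <m foo>, as a complete element"
	^ $< asPParser, $m asPParser, ($> asPParser negate star flatten), $> asPParser

With such a rule placed before the generic element alternative, meta tags would never open a speculative element that later backtracks.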

You can try to parse some big sources as well; they should be parsed in a reasonable time:

sources := PP2Sources current htmlSourcesAll.
parser := WebParser new optimize.
sources collect: [ :s |
	parser parse: s.
]

3. Conclusion

This is the last chapter of the series on extracting JavaScript from HTML sources. We started with a prototype in the playground in the first chapter and ended up with a full-fledged and tested parser that returns the HTML structure, parses real sources, and tolerates malformed inputs.

While specifying the grammar, we wrote only the rules that describe the interesting input and ignored the rest. This way we got feedback about our parser from the early stages of development. This is in contrast with traditional parser development, where a grammar engineer has to specify most of the grammar, or even all of it, before obtaining useful results.

PetitParser allows a grammar engineer to use an agile approach. This is thanks to (i) bounded seas, which allow for tolerant parsing, and (ii) automated optimizations, which significantly improve performance while removing the burden from the grammar engineer.

Suggestions and ideas?

The parser we implemented has reasonable performance considering the simplicity of the grammar and the effort we put into specifying it. Of course, PetitParser allows you to specify an even faster parser, but at the cost of higher implementation effort. We discuss the optimizations of a high-performance grammar on the concrete example of the Smalltalk grammar, which reaches performance better than a table-driven approach and comparable to that of a hand-written parser.

Nevertheless, if you have ideas how to improve the performance even more, please contact us or contribute!

Sources

The sources of the HTML parser can be found here:

The sources used to generate this tutorial are part of the PetitParser2 package.

Table of Contents

This text is part of the Parsing With PetitParser2 series.

Part I, Developer's Workflow:

Do you have ideas, suggestions or issues? Write us an email, or contact us on GitHub!



[1] Actually, it was not easy for the author of this text to pinpoint the root cause either.