Please note that the
WebGrammar offered to download in the Abstract Syntax Tree chapter contains already optimized version of an
element rule. In this chapter we suppose that
element definition looks like defined in Extracting the Structure chapter, i.e. it looks like this:b
In the previous chapter Optimizations we have greatly improved the performance of our parser. So let's parse the real sources. The home page of we wikipedia, github, facebook and google can be parsed invoking this commands:
Unfortunately, this still takes too much time even though we applied the automated optimizations. Can we do something, that the optimizations can't? Usually, hard to say. It depends on the nature of a grammar and input to be parsed. But in this case, we can really improve the performance. (obviously, we wouldn't write this chapter otherwise).
It is a good time to check the events morph of a debug result. Events morph shows timeline of parser invocations (dot) at a given position (x axis) in a given time (y axis). Inspect the result of the following command again and switch to Events view.
On the screenshot in Figure 0.1 we see only a part of the story (the part that fits into a window). The whole story is that parsing progress pretty fast towards the end of input (this is the good part), but later one can notice (when scrolling down) that parser jumps to the beginning of input and starts again. Over and over again (this is the bad part).
Please not that for long inputs the Event tab shows only a beginning of the input and first few thousands of invocations.
Even though optimizations in PetitParser2 work reasonably well, not all of them can be applied automatically. Some grammars do a lot of backtracking and automated optimizations can reduce only part of it. If I ask you why is the backtracking happening you would probably have a hard time to figure this out. So does have the code trying to apply the optimizations, so we have to do this manually.
Luckily, there are tools to help us. For convenience of visualization, we use compact and simplified input.
In the events morph we see how does the parser backtrack, over and over.
We have to do a bit of detective work to figure out when and why.
Let us navigate through the parsing until the HTML body (represented by the
b element in the compact input) is parsed:
Now switch to the traces tab:
The yellow rectangle in Figure 0.4 highlights 872 underlying invocations of the
The dark red rectangle highlights another invocations of the same
element rule that started at the same position.
And we see there are quite a few of them.
In general, one does not want to see repeating dark-red rectangles, and the more of them or the higher these dark-red rectangles are, the worse.
It simply means redundant computations.
Remember, each pixel on the y-axis is a parser invocation.
Why do these redundant invocations happen?
Because of the unclosed HTML
meta elements (represented by
m in the compact input).
The parser sees meta, starts an element, parses the content of meta, including the body element.
But it does not find the closing meta.
So it returns before the meta element, skips the meta part as a water and continues parsing.
This means it re-parses the body element again.
meta elements, the worse.
If yo are watchful, you might have noticed a line behind a dark-red bar.
This line is created by automated cache, which immediately returns the result of a
body element and prevents the whole execution.
But because it can remember only the last result, it can prevent only half of the executions.
Twice as fast is good, but we can do even better.
There is a solution to this, called memoization. Memoization is a technique to remember all the results for all the positions (while caching remembered only the last result for the last position). The more power comes at a cost of less efficiency, therefore memoizations should not be used superfluously
To suggest PetitParser2 to add memoization, add a
memoize keyword to the
Please note currently PetitParser2 does not support memoizations of push and pop parsers. The "neutral" parsers, for example a sequence of push and pop, as in the case of the
element rule, can be memoized.
Let us check the result now:
The Figure 0.5 is the result we want to see.
Even though the parser still backtracks, it is because of the nature of the grammar.
Grammar is defined in a way it can speculate and so it speculates.
We see two big backtrackings: one for each
meta in the input, the html body itself is not re-parsed over and over again and the precious time is saved.
If you want to avoid backtracking of
meta elements, you have to include syntax of
meta into the grammar.
You can try to parse some big sources as well, they should be parsed in reasonable time:
While specifying the grammar, we wrote only rules that describe the interesting input, we ignored the rest. This way we got feedback about our parser from the early stages of development. This is in contrast with a traditional parser development, where a grammar engineer has to specify most or a complete grammar before she obtains useful results.
PetitParser allows a grammar engineer to use an agile approach. This is thanks to (i) bounded seas, which allows for the tolerant parsing, and (ii) automated optimizations, which significantly improve performance of a removing the burden from the grammar engineer.
The parser we implemented has a reasonable performance considering the simplicity of the grammar and effort we put into specifying it. Of course, PetitParser allows you to specify even faster parser, but at a cost of higher implementation effort. We discuss optimizations of a high-performance grammar on the concrete example of the Smalltalk grammar. The Smalltalk grammar reaches performance better than a table-driven approach and is comparable to a performance of a hand-written parser.
Nevertheless, If you have an ideas how to improve performance even more, please contact us orcontribute!
The sources of the html parser can be found here:
The sources used to generate this tutorial are part of the PetitParser2 package.
This text is part of the Parsing With PetitParser2 series.
Part I, Developer's Workflow:
Go to top.