Parsing with PetitParser2

This text is part of the Parsing With PetitParser2 series. The table of content can be found at the end of the chapter.

Optimizations Overview

Please note that the WebGrammar you could download in the Abstract Syntax Tree chapter contains already optimized version of an element rule. In this chapter we suppose that element definition looks like defined in Extracting the Structure chapter, i.e. it looks like this:

WebGrammar>>element
	^ (elOpen, elContent, elClose)

1. Basic Analysis

When running the testStructuredDocument, you might have noticed that the performance is not good. It is actually terribly bad. Let us inspect in in detail:

input := PP2Sources current htmlSample.
WebParser new debug: input.

Report View

This will run the parser in a debug mode and collect a lot of useful information. Unfortunately, it also takes considerable amount of time so be patient. Once you get the result, inspect the Report tab (note that numbers might slightly differ in your case):

Report view of non-optimized parse

The total stream size is 1171 characters. The parser has remembered the full context 510197 times and restored 387667 times. This means that the full copy of the context (including the stack of opened elements, which we use to match open and close tags) has been created 510197 times and the context has been restored from this full copy (by doing another copy) 387667 times. This is in average 331 backtrackings per consumed character.

Some of the parsers do not need to perform the full context copy. These are prefixed with ‘lightweight’. In the lightweight case, only the position is remembered and restored (which is much faster than the full copy of a stack). This happened in average 92 times per character.

Honestly, these numbers are brutal. In ideal case, the number of full restores should be zero. Full remember and restore is very expensive. Very often only lightweight remember and restore happens, since many grammars are context-free, i.e. they do not use push, match and pop operators.

Regarding the lightweight backtracking, the average of lightweight backtracks per character should be below 1 (in ideal case). This is also not so unusual case when dealing with deterministic grammars, which do not do any (or only a little) speculations and know exactly which alternative of a choice is the correct one.

Debug View

But our grammar is obviously not deterministic it does lots of speculations. Why are the numbers so bad in our case? Let us inspect in more detail how this was happening, using the Debug view:

0.1. Debug View of non-optimized parse.

The view in Figure 0.1 shows that 1502165 parsers were invoked to consume the input. Most of it (751559 invocations) happened when parsing first two lines of the input --- the before water of the sea.

This is too much for a short input. Such a bad performance is caused by the fact that bounded seas have to verify at every position if the next parser (i.e. the parser that will be invoked after the sea) will succeed. If the next parser happens to be another sea (as in the case of WebGrammar), the number of invocations grows really fast (exponentially in fact).

We see that the element rule took approximately one half of the invocations (750598). Majority of 751559 invocations of before water were spent by testing for element.

Another glimpse of ineffectiveness can be seen in the elOpen rule in Figure 0.2.

0.2. Debug view of the elOpen rule non-optimized parse.

The whole <html lang="html" dir="ltr" class="js-enabled"> is parsed by invoking 272 parsers. This is quite a lot for a line with 55 characters. One of the reasons is how are the identifiers (elementName) parsed. You can see that when recognizing the html text, 8 different parsers are invoked; for each letter one and some extra boilerplate parsers (flattening, possessive repeating,...).

2. Using the optimize method

Luckily, PetitParser2 comes with an automated optimizations that are able to reduce most of the unnecessary overhead. Let it give a try:

WebParser new optimize debug: input.

The report looks much better now:

0.3. Report view of the optimized parse.

Only one full remember, and only 7 lightweight backtracks per character. This is reasonably good for a grammar with bounded seas.

Because of the nature of bounded seas and nature of PetitParser2, we do not recommend to parse grammars with more complicated bounded seas without optimizations.

Optimization in Detail

Let us inspect some of the optimizations that happened. We switch to Debug view:

0.4. Debug view of optimized parse

The Figure 0.4 shows that the total number of invoked parsers is 55144, roughly 30x less. We also see that, as in the previous case, most of the parsers are invoked in the before-water of the initial sea, but it is roughly 10x better now. We can also notice that the following island and after water are parsed using only 2 and 4 parser invocations respectively. This is great, it was half of the total invocations in the case of unoptimized parse!

Caches

The main reason of reduced invocations is caching. Under the element rule in Figure 0.4, there is a mapping node. The mapping node returns the complete HTML element just in one single invocation. The fact that the result has been cached is visualized via the result reference (see DebugResultLink: reference).

The reference indicates that at the time of the invocation the result has already been computed during some previous invocation. The parser remembered the result and returned the remembered without invoking the parser again. There is another mapping node as a child of the reference with 54746 underlying invocations. But as mentioned these have happened before and mapping node shows them just for the convenience.

More about caches can be found in the Caching chapter.

Specializations

Another optimizations applied by PetitParser two are specializations. The way specializations work can be demonstrated on the opening of an html element: <html lang="mul" dir="ltr" class="js-enabled">.

0.5. Debug view of the elOpen rule optimized parse

One can see that the html text (the elementName rule) has been recognized using only three invocation contrary to the eight invocations in the unoptimized case (see Figure 0.2). In the unoptimized case (see Figure 0.2) the html text was parsed so that one parser combinator was invoked for each of the characters. In the optimized case (see Figure 0.5) the whole html text was parsed by a single parser invocation. Internally, the parser is implemented as a while loop, which is much more efficient than multiple parser invocations.

The same thing happened with the mapping node. The specializations recognized an any-star-lazy pattern, which, in this case, means to consume any character until end of the tag > appears. This can be implemented as a single while loop saving 252 parser invocations.

More details about specializations are provided in the Smalltalk Parser chapter.

3. Conclusion

In general, PEG-based parser combinator parsers (such as PetitParser) are popular because they are easy to understand, easy to extend and easy to debug. But these advantages come at the price of low efficiency. In order to target this problem, PetitParser comes with the optimizations, which can dramatically improve the efficiency while preserving all the other advantages.

Yet, some of these optimizations cannot be done automatically. There a kind of optimizations that has to be done manually and we dedicate the following Memoization chapter to it.

Table of Contents

This text is part of the Parsing With PetitParser2 series.

Part I, Developer's Workflow:

Do you have ideas, suggestions or issues? Write us an email, or contact u on github!

Go to top.