Parsing with PetitParser2

This text is part of the Parsing With PetitParser2 series. The table of contents can be found at the end of the chapter.

Extracting JavaScript

In this chapter we extract JavaScript from HTML files using a simple script. We create the real parser later.

Hands On

Open your playground and let's start coding. First of all, we define what we want to parse:

source := PP2Sources current htmlSample.

The source is a very simplified (and slightly modified) version of a Wikipedia page and contains the following text:

<!DOCTYPE html>
<!-- saved from url=(0026)https://www.wikipedia.org/ -->
<html lang="mul" dir="ltr" class="js-enabled">
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
	<title>Wikipedia</title>
	<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
	<!--[if gt IE 7]-->
	<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );</script>
	<!--[endif]-->
</head>
<body id="www-wikipedia-org">
	<h1 class="central-textlogo" style="font-variant: small-caps" alt="WikipediA" title="Wikipedia">
		<img src="./Wikipedia_files/Wikipedia_wordmark.png" srcset="portal/wikipedia.org/assets/img/Wikipedia_wordmark@1.5x.png 1.5x" width="174" height="30" alt="WikipediA" title="Wikipedia">
		<strong id="js-localized-slogan" class="localized-slogan" style="visibility: visible;">The Free Encyclopedia</strong>
	</h1>
	<div id="mydiv">
		Hi there!
	</div>
	<script>alert("All scripts ends with: '</script>'...")</script>
	<!-- <p>obsolete conentent</p> -->
</body>
</html>

Second, we define JavaScript as a js rule:

js := '<script>' asPParser, #any asPParser starLazy flatten, '</script>' asPParser 
	==> #second.

The starLazy operator is a new feature of PetitParser2[1]. It repeatedly invokes the given parser (any character in this case) until a string recognized by the following parser (</script> in this case) appears. What makes starLazy unique and useful is that you don’t need to specify the following parser: it is inferred automatically from the grammar specification. Whenever the grammar changes, starLazy updates itself accordingly. In the previous version of PetitParser, you would have to name the following parser explicitly, and the same repetition would be written as:

#any asParser starLazy: '</script>' asParser.
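
To get a feel for starLazy in isolation, the following playground sketch applies the same pattern to a much smaller input. The parens rule and its input are hypothetical and not part of the original example; the sketch relies only on asPParser, starLazy, flatten and ==> as used in the js rule above.

```smalltalk
"Hypothetical mini-example: capture everything between two parentheses."
parens := $( asPParser, #any asPParser starLazy flatten, $) asPParser
	==> #second.

"starLazy is expected to stop at the closing parenthesis,
 because $) is the parser that follows it in the sequence."
parens parse: '(hello)'.
```

If starLazy behaves as described, inspecting the result should show the text between the parentheses, without the parentheses themselves.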

To extract the JavaScript itself and throw away the opening and closing tags, we select the second element of the parsed sequence with:

==> #second

Let us check whether the js rule can parse our source:

js parse: source.

The result is a failure. If we inspect the failure object and switch to the Debug View, we notice that <script> is expected at the beginning of the input, yet our input starts with <!DOCTYPE html>. The fix is simple with PetitParser2: we create a new rule jsSea as a JavaScript island in a sea of uninteresting water, using the sea operator:

jsSea := js sea.
jsSea parse: source.

It looks like sea is doing a lot of magic, and the result is not a failure! But it is not exactly what we want. A sea returns an array of three elements:

  1. before-water
  2. island and
  3. after-water

The island is the result of the js rule. The before-water and after-water contain the rest of the input (the part we did not specify and are not interested in). We are interested only in the JavaScript, so we redefine the jsSea rule as follows:

jsSea := js sea ==> #second.
jsSea parse: source.
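
The three-element result of a plain sea, described above, is easier to see on a tiny input. The sketch below is a hypothetical example (the island rule and its input are not part of the original script) and uses only the sea operator and parse: from the text.

```smalltalk
"Hypothetical mini-example: the word 'island' surrounded by water."
island := 'island' asPParser sea.
result := island parse: '~~~island~~~'.

result size.    "expected to be 3, per the description above"
result first.   "the before-water"
result second.  "the island itself, i.e. the result of 'island' asPParser"
result third.   "the after-water"
```

Appending ==> #second to the sea, as done for jsSea, keeps only the middle element.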

Looks better, but we are missing some results! This is because we never specified that there could be multiple occurrences of jsSea. Therefore the sea rule finds only one, the first one. We can allow any number of jsSea occurrences by defining a document rule:

document := jsSea star.
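
The effect of star on a sea can also be tried on a smaller, hypothetical input. In this sketch the names xSea and xs and the input string are made up for illustration; the operators are the ones already used above.

```smalltalk
"Hypothetical mini-example: collect every 'x' island from a short input."
xSea := 'x' asPParser sea ==> #second.
xs := xSea star.

"Each repetition of the sea is expected to contribute one island,
 so the result should contain two elements here."
xs parse: 'a x b x c'.
```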

The whole script looks like:

source := PP2Sources current htmlSample.
js := '<script>' asPParser, #any asPParser starLazy flatten, '</script>' asPParser
	==> #second.
jsSea := js sea ==> #second.
document := jsSea star.

Now, by calling document parse: source, we extract both scripts from the source. The result should look like:

document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );

alert("All scripts ends with: '

Yet there is something fishy about the second result. The second script ended prematurely! It should look like this:

alert("All scripts ends with: '</script>'...")

It ends prematurely because the js rule, as we defined it, does not know about strings. It therefore treats the </script> inside the alert message string as the closing script tag. We can fix this by defining JavaScript strings and redefining the js rule:

any := #any asPParser.
jsString := $' asPParser, any starLazy, $' asPParser.
js := '<script>' asPParser, ((jsString / any) starLazy) flatten, '</script>' asPParser
	 ==> #second.

jsSea := js sea ==> #second.
document := jsSea star.

The result looks better now:

document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );

alert("All scripts ends with: '</script>'...")

Great, everything works as expected!
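
Note that the jsString rule above recognizes only single-quoted strings, which is enough for this sample because the closing tag in the alert message sits inside single quotes. One possible extension, sketched here under the assumption that escaped quotes can be ignored, also covers double-quoted strings. The names singleQuoted and doubleQuoted and the test input are hypothetical.

```smalltalk
"A sketch, not the author's definition: strings in either quote style.
 Escaped quotes inside strings are deliberately not handled."
any := #any asPParser.
singleQuoted := $' asPParser, any starLazy, $' asPParser.
doubleQuoted := $" asPParser, any starLazy, $" asPParser.
jsString := singleQuoted / doubleQuoted.

js := '<script>' asPParser, ((jsString / any) starLazy) flatten, '</script>' asPParser
	==> #second.

"The closing tag inside the double-quoted string should now be skipped."
js parse: '<script>f("</script>")</script>'.
```

With this variant, jsSea and document can stay exactly as defined above.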

Conclusion

In this part, we quickly prototyped a parser that extracts JavaScript from HTML files. In the new version of PetitParser, we have the sea and starLazy operators at hand to help us skip the uninteresting input. More details are provided in this chapter.

If you think that the task from this chapter could be done using a simple regular expression, you are absolutely right. Nevertheless, unlike with regular expressions, this is just a first step for PetitParser. PetitParser will boldly go where no regular expression has gone before.

Table of Contents

This text is part of the Parsing With PetitParser2 series.

Part I, Developer's Workflow:

Do you have ideas, suggestions or issues? Write us an email, or contact us on GitHub!


[1] The operator builds on top of bounded seas and we will talk about this in more detail later.