Silent Puma | Documentation

Selection rule sets: extract relevant information

The core objective of Lopez Crawl Directives is to allow you to extract relevant information from webpages in a structured way. Besides being able to specify regions of the Web, you can specify what to extract from each page. This is done using selection rule sets, which combine the existing syntax of CSS selectors for selecting elements from HTML with a syntax for common manipulations over these selections.

Here is a sample of such a rule set:

select in "/duckburg" h1 {
   all-h1s: collect(inner-html);
   ducks-in-h1: count(text capture "duck" get "0" is-not-null);
}

Every rule set starts with the keyword select, followed by an optional in-clause, which specifies on which pages to apply the rule set using a regular expression. Next comes a CSS selector, in this case h1, that is, “all H1 headers in the HTML”. After that follows a block of named rules. In this document, we will not discuss the CSS selector part; it is a widespread technology on the Web and there are plenty of good tutorials to get you started.
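Since the in-clause is optional, the smallest useful rule set is just a CSS selector and a single named rule. As a sketch, the following hypothetical rule set (the a selector and the rule name are made up for illustration) counts all the links on each crawled page:

select a {
   how-many-links: count;
}

Because there is no in-clause, this rule set is not restricted to any particular set of pages.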

Now, let’s dive into the syntax of the named rules, which operate on the selections generated by the CSS selector applied against the HTML page.

Aggregators and extractors

The core of a named rule is an aggregator. Aggregators are functions that map a selection, an abstract collection of HTML elements, into a single value. Extractors, on the other hand, are functions that extract useful information from a single element in the selection. Aggregators and extractors are used in tandem to extract information not from a single element, but from the whole selection.

Here is a small list of the most common aggregators to get you started:

* count: counts the number of elements in the selection.
* count(extractor): counts the number of elements for which the extractor evaluates to true.
* first(extractor): retrieves only the first element in the selection and applies the extractor to it. If the selection is empty, this aggregator evaluates to null.
* collect(extractor): retrieves all the elements in the selection, applies the extractor to each element and puts all the values in an array.
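As a sketch combining these aggregators, the hypothetical rule set below (the li selector and the rule names are made up for illustration) extracts a page’s list items in three different ways:

select li {
   item-count: count;
   first-item: first(text);
   all-items: collect(text);
}

On a page with no li elements, first-item evaluates to null, as described above.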

You can find a complete list of all supported aggregators here.

And this is a list of the most common extractors:

* name: retrieves the name of the element as a string. A ul element yields the value "ul", an a yields "a" and so on…
* text: retrieves the inner text of the element, removing all HTML markup. If the text is divided into multiple chunks, the chunks are joined by a single space.
* html: retrieves the whole element HTML as text.
* inner-html: retrieves the whole element HTML as text, not including the element itself, just its child nodes.
* attr "an-attribute": retrieves the value of an attribute of the element, if present; otherwise, returns null. For example, attr "href" retrieves the href attribute when applied to an anchor (an a element), that is, where the link points to.
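As a hypothetical illustration of these extractors (the selector and rule names are made up), this rule set gathers where each link on a page points to, along with its visible text:

select a {
   link-targets: collect(attr "href");
   link-texts: collect(text);
}

An anchor without an href attribute contributes null to the link-targets array, since attr returns null for missing attributes.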

You can find a complete list of all supported extractors here.

Transformers

Of course, sometimes aggregators and extractors alone are not able to do the job. Suppose, as in the example at the top of this document, you want to count all the occurrences of the word “duck” in the H1s of a page. Here is where transformers come in handy. Transformers are functions that transform values into values. In the “duck” example, the capture transformer takes the string given by the text extractor and applies the regular expression “duck” to it, putting all captures in a dictionary. Then, the transformer get retrieves the key "0" from the returned dictionary (or returns null if it is not present). Finally, is-not-null tests whether the returned value is null. This is the value that is passed to the count aggregator. As you can see, transformers allow you to build much more expressive rules than you could with aggregators and extractors alone.
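Transformers can be chained after other extractors as well. As a sketch, the hypothetical rule below counts the anchors on a page that carry a title attribute, assuming that is-not-null can be applied to the value of attr just as it is applied to the result of get:

select a {
   titled-links: count(attr "title" is-not-null);
}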

Although not present in the example, you can combine transformers with both extractors and aggregators, without distinction. Here is an example where you count all occurrences of “duck” only in the H2s of a page:

select h2 {
   count-ducks: collect(text all-captures "duck") flatten length;
}

Here, all-captures returns all occurrences of “duck” in the text of each h2 as a list. Then, after aggregation, the result, which is a list of lists, is flattened (transformed into a single list) and its length is taken.

You can find a complete list of all transformers here.

The !explode pseudo-transformer

When passing arrays from a transformer to an aggregator, you can use the pseudo-transformer !explode to indicate that, instead of passing the whole array to the aggregator as a single element, each element of the array should be passed individually. This has an effect that is very similar to flatten, but connects the “extractor realm” to the “aggregator realm”. A simple application of this is shown below:

select * {
   all-classes: distinct(classes !explode);
}

This rule collects all distinct classes on the webpage. If the rule were written distinct(classes), the effect would be to collect all distinct arrays of classes, which is not what we want.

Types and the type system

As you may have noticed, there are lots of values flying around while a rule is in action. All these values are actually just instances of JSON objects, a ubiquitous data format on the Web. Contrary to the general philosophy of JSON, though, rules are strongly typed. Each aggregator, extractor and transformer outputs a predetermined type for each input type. If an incompatibility is found when validating the crawl configuration, you will get a “type error” message. This is far better than discovering you fell into a type trap ten thousand pages into a crawl.

The only exception to this rule is null. Null can be of any type. It is mostly used when an operation encounters an error, for example, when a string could not be converted to a number, or when no value was actually there, such as when a key was not found in an object. From there, null has a tendency to infect all the results in your rule. Be careful: Lopez will play fast and loose with nulls.
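As a hypothetical sketch of this propagation, the rule below collects the first capture of “duck” from each H1; for any H1 whose text does not match, get "0" yields null, and that null ends up in the collected array:

select h1 {
   maybe-ducks: collect(text capture "duck" get "0");
}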

Rule names and the import system

Every rule name is declared within the namespace of the module in which it appears. Therefore, you may have rules with the same name in different modules. In fact, what is passed to the backend (the part of the program that stores the collected data) is the full name of the rule, which is the rule name prefixed by the full module path. For example, suppose we have the following setup:

main.lcd
a-module.lcd
other-module/
   module.lcd
   sub-module.lcd

If foo is declared in other-module/sub-module.lcd, it gets the full name other-module.sub-module.foo. On the other hand, if it is declared in the a-module.lcd file, it gets the full name a-module.foo. Lastly, if foo is declared in main.lcd and that is the configuration file used, its full name will be only foo, the same as the rule name.