Boundaries and seeds: controlling what to crawl
Boundary and seed directives
A crawl needs a specification of the region of the Web graph it may visit: the set of all URLs that are allowed in the crawl. Most of the time, this region is simply “all URLs of a given domain”. Lopez uses three directives to control which pages should be crawled: allow, disallow and frontier. In addition, Lopez needs to know where to start crawling, which is controlled by the seed directive.
Using allow
The directive allow specifies a regular expression matching the pages that are part of the Web region to be crawled. Here is an example of an allow directive that includes all URLs in the domain example.org, using either HTTP or HTTPS:
```plain
allow "^https?://example\.org";
```
Of course, this regular expression also matches `https://example.orgy`, which is bad. We can do a bit better:
```plain
allow "^https?://example\.org(?:/|$)";
```
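To see why the tightened pattern helps, here is a quick check using Python's `re` module (only to illustrate the regular expressions themselves; Lopez's own matching engine may differ in details). The pattern is written with a non-capturing group `(?:/|$)`, which requires the host name to be followed by a slash or the end of the URL:

```python
import re

# The (?:/|$) group requires the host to be followed by a slash or
# the end of the URL, so look-alike domains no longer match.
pattern = re.compile(r"^https?://example\.org(?:/|$)")

print(bool(pattern.search("https://example.org")))        # host only
print(bool(pattern.search("https://example.org/page")))   # a page
print(bool(pattern.search("https://example.orgy/page")))  # look-alike domain
```

The first two URLs match and the third does not, which is exactly the boundary we wanted.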
Remember that your configuration needs to include at least one allow directive for Lopez to be able to run at all.
Using disallow
The directive disallow is the exact opposite of allow: it blocks certain pages within the Web region from being crawled. For example, suppose example.org has a private area called /mailbox, which is only accessible to logged-in customers. You can disallow this specific URL using a regular expression matching it:
```plain
disallow "^https?://example\.org/mailbox$";
```
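Note that the trailing `$` anchors the pattern to the exact URL, so deeper paths under /mailbox are not covered by it. A quick check with Python's `re` module (again, just to illustrate the pattern; the `/mailbox/inbox` path is a made-up example):

```python
import re

# The $ anchor blocks only the exact /mailbox URL; deeper paths
# such as /mailbox/inbox still fall outside this pattern.
blocked = re.compile(r"^https?://example\.org/mailbox$")

print(bool(blocked.search("https://example.org/mailbox")))        # blocked
print(bool(blocked.search("https://example.org/mailbox/inbox")))  # not blocked
```

If you want to block the whole subtree, a pattern along the lines of `^https?://example\.org/mailbox(?:/|$)` would cover /mailbox and everything under it.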
Using frontier
The directive frontier specifies pages within the allowed region that act as a border for the region. That is, URLs matching the frontier regular expression will not have their links followed, even though they might be crawled. A typical use case: you want to find all profile pages in a social network that are linked from a given domain, but don’t want to crawl the whole social network.
Normally, frontier needs to be used in conjunction with allow to work:
```plain
// Allows "FaceSpace"...
allow "^https?://facespace\.com";
// ... but never follow links inside it.
frontier "^https?://facespace\.com";
```
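The interplay between the three directives can be summarized as: a URL is crawled if it matches some allow pattern and no disallow pattern, and its links are followed only if it additionally matches no frontier pattern. Here is a small Python sketch of that decision logic under those assumptions (it models only the semantics described above, not Lopez's actual implementation):

```python
import re

# The boundary from the FaceSpace example: the site is allowed,
# but it is also the frontier, so its links are never followed.
ALLOW = [re.compile(r"^https?://facespace\.com")]
DISALLOW = []
FRONTIER = [re.compile(r"^https?://facespace\.com")]

def crawl_decision(url):
    """Return (crawl, follow_links) for a URL, per the boundary rules."""
    allowed = any(p.search(url) for p in ALLOW) and not any(
        p.search(url) for p in DISALLOW
    )
    if not allowed:
        return (False, False)
    # Frontier pages may be crawled, but their links are not followed.
    at_frontier = any(p.search(url) for p in FRONTIER)
    return (True, not at_frontier)

print(crawl_decision("https://facespace.com/profile/42"))  # crawled, not followed
print(crawl_decision("https://elsewhere.example/"))        # outside the region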
Using seed
Finally, Lopez needs to know where to start crawling. A good place to start is a website’s homepage, which is full of links to the rest of the site, but you can specify any URL. In fact, you can specify any number of seeds for the crawl; all of them will be used as starting points. This is useful if you know that an area of a website is unreachable from the rest.
To declare seeds, you use the seed directive, which takes the URL of the seed (not a regular expression, mind you), like so:
```plain
seed "https://example.org";
seed "https://example.org/orphan-page";
```
Your configuration must include at least one seed to be valid, and every seed needs to be allowed and must not be in the frontier.
Interactions with robots.txt
Sites use the Robots Exclusion Protocol to control, among other things, which pages web crawlers may access. Lopez complies with this standard and there is no way to opt out of it. This means that, in addition to the boundary rules defined in your configuration, a certain page might not be included in the crawl because of the robots exclusion protocol. You can test whether a given page is allowed by this standard using the test verb in Lopez’ CLI.
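To get a feel for how these rules interact with your boundary configuration, you can experiment with the protocol using Python's standard-library `urllib.robotparser` (this is independent of Lopez; the robots.txt content below is a made-up example that disallows the /mailbox area from before):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; parse() accepts the lines directly,
# so no network access is needed for this illustration.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /mailbox",
])

print(robots.can_fetch("lopez", "https://example.org/"))         # allowed
print(robots.can_fetch("lopez", "https://example.org/mailbox"))  # disallowed
```

A page must pass both checks to be crawled: it has to fall inside your configured region and be permitted by the site's robots.txt.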