Silent Puma | Documentation

Standard Library Modules

Warning: the standard library is still evolving fast. Expect this section to change from time to time.

The standard library, or lopez-std, is a collection of handy modules that help you get started quickly without having to reinvent the wheel. If you are new to Lopez, you should first read about the import system before going further. This document is intended as a reference for when you start writing your configuration files.

This is the reference for the Standard Library version 0.3.

The seo module

The seo module contains rule-sets that aid with basic on-page SEO tasks, such as checking for the existence of a description tag on a webpage. The module declares the following rules:

  • page-hash: hashes the whole content of the page. This is useful for tracking duplicate content without having to store the whole text of the page; a simple number will do.
  • title: returns the text of the first title tag in the head section of the page.
  • count-title: returns the number of title tags in the head section of the page.
  • h1: returns the text of the first h1 in the page.
  • count-h1: returns the number of h1 tags in the whole page.
  • canonical: returns the “canonical” tag of the page (first occurrence).
  • count-canonical: returns the number of canonical tags in the page.
  • meta-description: returns the “meta-description” tag of the page (first occurrence).
  • count-meta-description: returns the number of “meta-description” tags in the page.
  • n-elements: returns the total number of HTML elements in the page.
  • missing-alt-text: returns a collection of the “src” attributes of all images missing the “alt” text.
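The idea behind page-hash can be sketched in plain Python. This is an illustration of the concept only, not Lopez's actual implementation; the whitespace normalization step in particular is an assumption:

```python
import hashlib

def page_hash(page_text: str) -> str:
    """Hash the whole text of a page so that duplicates can be
    detected by comparing short digests instead of full contents."""
    # Collapsing whitespace is an assumed normalization step;
    # the real rule may hash the raw content instead.
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two pages with the same effective content map to the same digest:
a = page_hash("<html><body>Hello, world!</body></html>")
b = page_hash("<html><body>Hello,   world!</body></html>")
c = page_hash("<html><body>Something else.</body></html>")
```

Storing and comparing these fixed-size digests is much cheaper than keeping every page's full text around.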

The og module

This module contains rule-sets on OpenGraph tags, the tags used by Facebook and other social networks to create feed posts of your site, among other things. Please note that this is not a full implementation of all tags, only of the most common ones.

  • type: the “type” of the webpage (is it a website, an article, a profile…?).
  • site-name: the (pretty) name of the website this page belongs to.
  • image: a nice image to display in peoples’ timelines.
  • url: the URL of this webpage. This works in a similar vein to the “canonical” tag in the seo module, in that it may point to another “canonical” webpage, other than itself.
  • description: a nice description to go with the image.

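As an illustration of what rules like these extract, here is a minimal Python sketch that pulls OpenGraph properties out of a page's meta tags. It approximates the behaviour described above and is not the module's actual code; note that the underlying OpenGraph property names use underscores (e.g. og:site_name):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect og:* <meta> tags, keeping the first occurrence of each."""
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        # Keep only the first occurrence, as the rules above do.
        if prop.startswith("og:") and prop not in self.tags:
            self.tags[prop] = attrs.get("content")

html = """
<html><head>
  <meta property="og:type" content="article">
  <meta property="og:site_name" content="Example Site">
  <meta property="og:description" content="A nice description.">
</head><body></body></html>
"""
parser = OpenGraphParser()
parser.feed(html)
```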
The frontiers module

This module contains frontier rules for some domains which are either not that interesting to crawl or which you should really keep away from. Note that these sites are not outright disallowed; the crawl simply will not go through them. This module is mostly intended for when you need to make a Big Crawl© around the Web, spanning multiple domains.

This module is divided into sub-modules, by category. You can selectively import each sub-module to have fine-grained control over what you are excluding. Importing frontiers amounts to importing all sub-modules.

  • internet-archives: contains frontiers on Internet archives, sites that mirror the whole structure of the Web inside them.
  • social-media: contains frontiers on the most common social networks and other tech giants (from the perspective of a Brazilian in 2020). Currently included are:
    • Google
    • YouTube
    • Facebook
    • Twitter
    • Instagram
    • WhatsApp
    • LinkedIn
    • Amazon
    • Apple
    • Pinterest
    • Medium

  • wikipedia: all the projects from the Wikimedia Foundation, as per their website footer in August 2020.

Have I missed your favourite big tech company? You are welcome to open an issue.
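Conceptually, a frontier rule of this kind boils down to a predicate on the link's host: if the host falls under one of the excluded domains, the crawler does not follow the link. A rough Python sketch of that idea (the domain set here is a small made-up sample, not the module's actual lists):

```python
from urllib.parse import urlparse

# Made-up sample of excluded domains; the real lists are larger.
EXCLUDED = {"facebook.com", "twitter.com", "web.archive.org"}

def should_follow(url: str) -> bool:
    """Return False for links whose host is (a subdomain of)
    an excluded domain; the crawl will not go through them."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED)
```

Matching subdomains as well as the bare domain is what keeps, say, www.facebook.com out of the crawl along with facebook.com itself.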

The ignore-tracking module

This module contains ignore params directives for the most common tracking parameters around the Web. People sometimes leave tracking parameters on URLs, mostly on purpose, but also by accident (mainly in user-generated content). Since these parameters do not alter the page in any conceivable way, it is a good idea to ignore them, especially if you are using use param *.

The currently supported parameters are the following:

  • All utm_* parameters, used by Analytics and other trackers.
  • The gclid parameter, a unique Google identifier for each click, mainly used within Google Ads.
  • The fbclid parameter, Facebook’s version of gclid.

Have I missed any widely used parameter (maybe one used only in your country)? You are welcome to open an issue.
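To see why dropping these parameters is safe, note that two URLs differing only in them resolve to the same page. A small Python sketch of the stripping (an illustration of the idea, not Lopez's implementation):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Known click-id parameters; utm_* is matched by prefix below.
TRACKING = {"gclid", "fbclid"}

def strip_tracking(url: str) -> str:
    """Remove utm_* and known click-id parameters from a URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not (k.startswith("utm_") or k in TRACKING)]
    return urlunparse(parts._replace(query=urlencode(kept)))

clean = strip_tracking(
    "https://example.com/post?id=42&utm_source=news&gclid=abc123")
```

After stripping, two visits to the "same" article arriving via different campaigns collapse into a single canonical URL.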

The bad-extensions module

This module contains disallow rules for common extensions which are normally put in anchors, but are not valid HTML. By default, Lopez will scrape any kind of garbage on the internet, no matter how “un-HTML” it is. It can be insightful to run such crawls, since you can still match regular expressions even inside, say, a .pdf file; the text is still there. However, if this annoys you, you can import "bad-extensions"; and (most of) your problems with extensions will go away.

Currently supported extensions are:

  • .pdf
  • .png
  • .jpg and .jpeg
  • Structured data:
    • .json
    • .xml
  • Microsoft Office formats:
    • .doc and .docx
    • .xls, .xlsx and .csv
    • .ppt and .pptx
  • Compressed data:
    • .zip
    • .rar
    • .tar, .tar.gz and .gz
  • Other binary junk:
    • .exe

Have I forgotten anything especially annoying? You are welcome to open an issue.
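The filtering itself amounts to checking a link's path against the extension list; a minimal Python sketch (the extension set here is abbreviated, and this is an illustration of the rule, not Lopez's code):

```python
from urllib.parse import urlparse

# Abbreviated sample of the extensions listed above.
BAD_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".zip", ".exe", ".tar.gz")

def is_bad_link(url: str) -> bool:
    """True if the URL's path ends in a known non-HTML extension."""
    # Lowercasing catches links like /report.PDF as well.
    path = urlparse(url).path.lower()
    return path.endswith(BAD_EXTENSIONS)

flags = [is_bad_link(u) for u in (
    "https://example.com/report.PDF",
    "https://example.com/page.html",
    "https://example.com/archive.tar.gz",
)]
```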