Silent Puma | Documentation

Set variables: configuring your crawl

Lopez Crawl Directives support set directives, which allow you to configure options controlling your crawl. The syntax is straightforward:

set foo = "a string";
set bar = 123;
set baz = 3.14;

However, you cannot set just any variable. This directive is explicitly intended for configuration knobs only. Currently, the following set-variables are supported:

  • user_agent = string: sets the value used as the User-Agent header when making HTTP requests. Defaults to lopez/<version> <lopez homepage>, so the exact value depends on the version you are using.
  • quota = integer: the maximum number of pages that Lopez will crawl. Defaults to 1000.
  • max_depth = integer: the maximum depth that Lopez will reach while crawling, that is, the maximum number of links that can be followed from a given seed.
  • max_hits_per_sec = float: the maximum number of pages that can be crawled per second per origin. Be very careful when setting this value. If you set it too high, you might overload the web server you are crawling. Defaults to 2.5.
  • request_timeout = float: the timeout for a single request, in seconds. Defaults to 60 (one minute).
  • max_body_size = float: the maximum size, in bytes, of a webpage that will be accepted. If a webpage exceeds this limit, it is truncated and only the truncated content is parsed. Defaults to 10_000_000, that is, 10 MB.
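
As a sketch, a configuration that sets several of these variables could look like the following. The values are illustrative assumptions, not recommendations, and the user agent string is hypothetical:

set user_agent = "example-crawler/0.1";
set quota = 5000;
set max_depth = 3;
set max_hits_per_sec = 1.0;
set request_timeout = 30.0;

Variables that are not set, such as max_body_size here, keep their defaults.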

Interaction with the module system

Each set declaration is global. That is, if a module foo sets a variable to a given value and you import it, you cannot set that variable again in the importing module, since each set-variable can be set only once in the whole configuration. This restriction may be relaxed in the future, if needed. If your use case requires it, please open an issue.
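
For illustration, suppose a hypothetical module foo contains this line:

set quota = 500;

A configuration that imports foo could then not contain a line such as the following, because quota has already been set by foo (the import statement itself is left out of this sketch):

set quota = 2000;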