Set variables: configuring your crawl
Lopez Crawl Directives support set directives, which allow you to configure options that control your crawl. The syntax is straightforward:
set foo = "a string";
set bar = 123;
set baz = 3.14;
However, you cannot set just any variable. This directive is explicitly intended for configuration knobs only. Currently, the following set-variables are supported:
user_agent = string: sets the value to be used as the User-Agent header when doing HTTPS requests. This defaults to lopez/<version> <lopez homepage>, which may vary depending on the version you are using.
quota = integer: the maximum number of pages that Lopez will crawl. Defaults to 1000.
max_depth = integer: the deepest that Lopez will go while crawling. This is the maximum number of links that can be followed from a given seed.
max_hits_per_sec = float: the maximum number of pages that can be crawled per second per origin. Be very careful when setting this value. If you set it too high, you might overload the web server you are crawling. Defaults to 2.5.
request_timeout = float: the timeout for a given request, in seconds. Defaults to 60, that is, one minute.
max_body_size = float: the maximum size of a webpage, in bytes, that will be accepted. If the webpage exceeds this limit, it is truncated and the truncated content is what gets parsed. Defaults to 10_000_000, that is, 10 MB.
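As a rough sketch, a configuration for a small, polite crawl might combine several of these variables as follows. The values, including the user agent string, are illustrative assumptions and not recommendations; any variable left out keeps its default:
set user_agent = "my-crawler/0.1";
set quota = 500;
set max_depth = 3;
set max_hits_per_sec = 1.0;
set request_timeout = 30.0;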
Interaction with the module system
Each set declaration is global. That is, if a module foo sets a variable to a given value and you import it, you cannot set that variable again in the importing module, since each set-variable can only be set once in the whole configuration. However, this may be relaxed in the future, if needed. If your use case requires it, please open an issue.
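To illustrate the rule, suppose a module foo (a hypothetical name and value, used here only for the example) contains this declaration:
set quota = 500;
A module that imports foo then cannot contain a declaration such as the following, because quota has already been set once in the configuration:
set quota = 2000;
The importing module can still set any other supported variable that foo leaves untouched.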