Quickstart
Let’s get you started with Lopez. First, make sure that you have Lopez installed on your system by following the easy one-step installation. To test whether your installation was successful, type this in your command line:
lopez -V
This should print the version of Lopez you installed.
Setting up Postgres
Currently, Lopez only supports storing crawl data in a PostgreSQL database. So, if you have not done so already, go get yourself a PostgreSQL instance or use a managed cloud service like Amazon RDS (if you have money for that sort of thing).
Now that you have got hold of a PostgreSQL cluster, it is time to create a user for Lopez. Well, you can use your own credentials if you want; any user will do, actually. Just be sure that it is a user with the CREATEDB privilege, i.e., a user that can create new databases in the cluster. If you want to create a dedicated user for Lopez, you can execute the following command in any database:
CREATE USER lopez WITH CREATEDB PASSWORD '**your passwd here**';
Creating a crawl configuration
Now, let’s find something to crawl… let’s see… ah! What about the Lopez repository on GitHub? There are not that many pages, and it can prove useful. In a suitable folder, create a file named main.lcd with the following content:
// Tell lopez to stay within the repository
allow "^https://github\.com/tokahuke/lopez(?:/|$)"; // this is a regular expression
// But don't let it browse every tiny commit in history:
disallow "/([a-f0-9]{40})"; // commit hashes in the URL
// Start from the repo homepage:
seed "https://github.com/tokahuke/lopez";
// set a quota of 500 pages only (you can try with more later...)
set quota = 500;
// Now, let's make some "very interesting analysis":
// In the wiki pages, get the h1 headers:
select in "lopez/wiki" .repository-content h1 {
    wiki-title: first(text);
}
// In the Issues section, get the issue title:
select in "lopez/issues" h1 .js-issue-title {
    issue-title: first(text);
}
// ... and don't forget the status.
select in "lopez/issues" span.State {
    issue-status: first(attr "title" capture "Status: (.*)" get "1");
}
Before you run…
Before going all crazy downloading pages, it is a good idea to test the configuration and see if everything is in place. It can be very frustrating to discover that you made a typo… 10,000 pages in! Don’t worry; Lopez has got your back. Here, we will introduce two handy commands for debugging configurations: validate and test.
To validate the configuration, just write
lopez validate main.lcd
If the configuration is valid, the above command prints a “valid configuration” message. Otherwise, Lopez prints a summary of the errors it found. Go back and figure out what is wrong (and rinse and repeat, if necessary).
The validate command is useful for finding simple errors, but it does not say anything about how Lopez will behave when actually crawling a webpage. This is where test comes in handy. This command lets you test your configuration against a supplied URL, like the URL for issue number 6: https://github.com/tokahuke/lopez/issues/6. Test to see what happens when Lopez crawls this page:
lopez test main.lcd "https://github.com/tokahuke/lopez/issues/6"
You will see a bunch of stats on the page, including the values of the analyses defined in the file. I encourage you to play around with other URLs, even ones outside the repository.
Running the crawl
Now, it is finally time to call Lopez into action. We need to tell Lopez three things:
1. Which file to use as the configuration file.
2. The name of the crawl. This is a short, descriptive and unique name identifying the crawl, which will be used for:
   * retrieving information from the database (remember that the same database can store multiple crawl waves);
   * resuming your crawl if you decide to stop early.
3. The keys to the server: where is it? What is the name of the database? Username? Password?
The first one is easy: main.lcd, the file we created in the last section. The second one requires a bit of creativity. Therefore, let’s be uncreative and call this crawl tutorial. The third is a bit more involved; I will give you two alternatives.
Alternative 1: use environment variables
Most knobs in Lopez can be set via environment variables. However, the number of variables can get quite out of hand (even a few become a pain to type all the time). Enter the .env file: a file that sets all environment variables so you can load them all at once. In our case, the .env file will contain only the variables that configure Lopez’s connection to Postgres. Create a .env file in the same folder as main.lcd with the following content:
export DB_HOST=localhost # or a domain name or IP address if you are using a service
export DB_DBNAME=lopez # you can choose the name of the db; it will be created if it doesn't exist
export DB_USER=lopez # the user you decided to use at the beginning of this guide
export DB_PASSWORD=**your passwd here** # the password you have chosen
To load the variables, it is as simple as running
source .env
Note that this needs to be done once per terminal session.
Now, with everything in place, just do:
lopez run main.lcd tutorial
And the crawl is on!
Alternative 2: a lot of parameters
If environment variables are not your cup of tea (maybe you should just get used to the concept), no problem! Almost every knob in Lopez has a corresponding command-line flag. In fact, you can make the whole crawl work with a single command! Putting everything on one line would be a stretch, though, so we use line continuations (\) for readability:
lopez run main.lcd tutorial \
    --host localhost \
    --dbname lopez \
    --user lopez \
    --password **your passwd here**
Just substitute your password in the command above, run it, and the crawl is on! However, you may find it tedious to write these big, redundant commands all the time.
Retrieving some actual information
Now that you have put Lopez to work, it is time to grab a cup of coffee and relax a bit. Crawls, even small ones like this, take time to finish. With a good Internet connection, though, your crawl should be done in a couple of minutes. You will know as soon as Lopez yields the command line back to you.
Now, connect to your database using any SQL client and let’s do some queries. First, let’s see if there are any bad pages: pages that got a 404, or worse, a 500 error from the server. To list these pages, you can use the following query:
select
    page_url,
    status_code
from
    named_status
where
    wave_name = 'tutorial' and status_code >= 400;
And, sure enough, you will find that there are indeed some broken pages. These correspond to the pages where you need to be logged in to view the actual contents.
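If you want a quick overview rather than a page-by-page listing, a small aggregate over the same named_status view does the trick (a sketch, assuming the same columns used above):

```sql
-- Count how many pages ended up with each error status:
select
    status_code,
    count(*) as n_pages
from
    named_status
where
    wave_name = 'tutorial' and status_code >= 400
group by
    status_code
order by
    status_code;
```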
Ok, so… on to the analyses. What about the issues? Which ones were actually crawled? You can find out using the named_analyses view, like so:
select
    page_url,
    result
from
    named_analyses
where
    wave_name = 'tutorial' and analysis_name = 'issue-title';
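If you just want the headline number, the same view can be aggregated (again, a sketch under the same assumptions):

```sql
-- How many distinct issue pages did the crawl actually reach?
select
    count(distinct page_url) as n_issues
from
    named_analyses
where
    wave_name = 'tutorial' and analysis_name = 'issue-title';
```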
With a bit more SQL, you can even pair titles and statuses:
with titles as (
    select
        page_url,
        result as title
    from
        named_analyses
    where
        wave_name = 'tutorial' and analysis_name = 'issue-title'
), status as (
    select
        page_url,
        result as status
    from
        named_analyses
    where
        wave_name = 'tutorial' and analysis_name = 'issue-status'
)
select
    page_url,
    title,
    status
from
    titles join status using (page_url);
As you can see, you have very broad control over crawl data in Postgres. Your post-processing options are virtually unlimited.
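For instance, still assuming the named_analyses view and the analysis names defined earlier, you could tally the issues by status instead of listing them one by one:

```sql
-- Count issues per status (e.g., how many are open vs. closed):
select
    result as status,
    count(*) as n_issues
from
    named_analyses
where
    wave_name = 'tutorial' and analysis_name = 'issue-status'
group by
    result
order by
    n_issues desc;
```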
Now, it’s your turn!
After you finish playing around with the data in my repository, you are ready to start crawling the wider Web. If you want to get deeper into Lopez, this wiki is the place to start. In particular, check out the documentation on Lopez Crawl Directives, the language in which crawl configurations are written, and the documentation on the database structure.