Silent Puma | Documentation

Lopez Postgres database

The data that Lopez retrieves from the Web is stored in a PostgreSQL database. Currently, this is the only supported database, although support for other kinds of storage is possible via an abstraction layer. If you would like to contribute your own backend implementation, you are more than welcome to do so. However, PostgreSQL should suffice for most needs at scales up to millions or tens of millions of pages (provided that kind of thing fits your budget).

This document is not an in-depth look at the nitty-gritty details of the database schema; it is intended as a tutorial to get you started with your analyses. You can discover most of the fine details by looking around the tables. If something is still unclear, you can open an issue to discuss it.

The high-level view: the named_* relations

These are the relations that present data as high-level information. They were so named because they abstract away the underlying ids, which means that fewer joins are necessary to get things done. These are the relations you will interact with most of the time, especially if you want some straightforward information.

The named_* views are the following:

  • named_status: this view has information on crawl data, specifically on
    • status code of every searched page.
    • the search status of the page, that is, whether the page was scheduled to be searched (open), was searched (closed or error) or is “in the process of being searched” (taken). This last status is common if the crawl is still active or if it was interrupted (e.g., using Ctrl+C).
    • the depth of each page in the search tree, that is, how many links were followed from a seed to reach that page. Note that this is not the true distance (the length of the shortest path from the nearest seed), since the search is not a breadth-first search. Because of limitations on crawling speed per domain, the search order can be quite chaotic at times.
  • named_linkage: this view lists all links that were found during the crawl and contains the following fields:
    • from_page_url: the page that has the reference to another page.
    • to_page_url: the page that was referenced by from_page_url.
    • reason: the type of reference. Currently, only ahref (HTML anchors) and canonical (canonical link tags) are supported.
  • named_analyses: this is where the analyses defined in the crawl configuration go. The organization is straightforward:
    • page_url: the URL of the page being analysed.
    • analysis_name: the full name of the rule that generated the analysis.
    • result: the result of the analysis on that page. This is a JSON (technically a JSONB) field. If you are unfamiliar with handling JSON in SQL (more specifically, PostgreSQL), you can read more about it here.
  • named_page_rank: a PageRank of your pages. This is a measure of “how popular” a given page is within the crawl. You can use this measure to order and weigh your analysis if you need a metric of “importance”. You can read more about PageRank here. There are two fields in this view:
    • page_url: the URL of the page.
    • rank: the PageRank of that page. If you sum the PageRanks over all pages of a given crawl, you should get 1.0, plus or minus some tiny error due to floating-point arithmetic.

Not included in the description above is the wave_name column, which is present on all relations and corresponds to the name of the crawl you chose when invoking Lopez.
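
As an illustration, here are two query sketches against these views. The crawl name 'my-crawl', the analysis name 'titles', the JSON key 'title', and the search_status column name in named_status are all hypothetical and may need adjusting to your actual schema:

```sql
-- How many pages ended up in each search status?
-- (the search_status column name is an assumption)
SELECT search_status, count(*)
FROM named_status
WHERE wave_name = 'my-crawl'
GROUP BY search_status;

-- Extract a key from the JSONB result of an analysis and order the
-- pages by PageRank ('titles' and 'title' are hypothetical names):
SELECT a.page_url,
       a.result ->> 'title' AS title,
       r.rank
FROM named_analyses a
JOIN named_page_rank r
  ON r.page_url = a.page_url
 AND r.wave_name = a.wave_name
WHERE a.wave_name = 'my-crawl'
  AND a.analysis_name = 'titles'
ORDER BY r.rank DESC
LIMIT 10;
```

The `->>` operator extracts a JSONB key as text; use `->` instead if you want to keep drilling into nested JSON.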

Going lower-level

The named_* tables are convenient for quick queries, but they sometimes introduce inefficiencies in the form of unnecessary joins under the hood. Depending on what you are doing, this can slow your analyses down substantially. If you want your queries to go faster, you can forgo the convenience of the views and use the raw data directly. The adaptation should be straightforward:

  • For each named_* view, there is a corresponding table, without the named prefix. For example, for named_linkage, there is the linkage table.
  • In these tables, page_urls become page_ids, wave_names become wave_ids, analysis_names become analysis_ids, etc…
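
To make the translation concrete, here is a sketch of the named_linkage listing rewritten against the raw tables. The id column names (from_page_id, to_page_id, page_id, wave_id) follow the naming pattern described above but are assumptions, not verified schema:

```sql
-- named_linkage, reconstructed by hand from the id-level tables:
SELECT src.page_url AS from_page_url,
       dst.page_url AS to_page_url,
       l.reason
FROM linkage l
JOIN pages src ON src.page_id = l.from_page_id
JOIN pages dst ON dst.page_id = l.to_page_id
JOIN waves w   ON w.wave_id  = l.wave_id
WHERE w.wave_name = 'my-crawl';
```
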

In addition to the aforementioned tables, you will most probably need these too:

  • waves: this is the table storing information on each crawl run (called a wave) and is indexed by wave_id.
  • pages: this table relates page_ids to page_urls. Contrary to all other tables, this table is indexed by neither wave_id nor wave_name. This is done so that different crawls can share the same rows in this table, keeping the database size… smaller.
  • analyses: this table contains the full names of every possible analysis in each wave. It relates analysis_name (the full name of the rule that generates that analysis) to analysis_id and to the wave_id of the crawl.
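
For example, a per-wave page count can be computed straight from the raw tables. The status table name and its wave_id column are assumed from the naming pattern above:

```sql
-- Pages crawled per wave, without going through the named_* views:
SELECT w.wave_name, count(*) AS n_pages
FROM status s
JOIN waves w ON w.wave_id = s.wave_id
GROUP BY w.wave_name
ORDER BY n_pages DESC;
```
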

Deleting a wave

Finally, if you want to get rid of all the data from a given crawl (maybe you have messed things up), you can do that by simply deleting the row in the waves table corresponding to your crawl, like so:

delete from waves where wave_name = 'name of your crawl';

All wave_id fields in all tables are foreign keys declared with ON DELETE CASCADE, meaning that deletes cascade through all tables, like a line of dominoes. In addition to that, there is a TRIGGER on this table, which will delete any page URLs in the pages table that are not referenced anywhere else in the database. Again, this helps to keep the size of the database smaller by not letting garbage pile up over time (in fact, this technique is formally known as garbage collection).