Silent Puma | Documentation

List of extractors

Extractors are functions that implicitly take an HTML element and extract data from it as JSON. They specify which raw information you want to scrape from a webpage. You can read more on extractors and how they fit into the bigger picture here. This page is only a reference list.

Extractors acting directly on the “focus element”

  • name: retrieves the name of the element as a string. A ul element yields "ul", an a element yields "a", and so on.
  • text: retrieves the inner text of the element, removing all HTML markup. If the text is split into multiple chunks, the chunks are joined by a single space.
  • html: retrieves the whole HTML of the element as text, including the element itself.
  • inner-html: retrieves the HTML of the element as text, excluding the element itself and keeping only its child nodes.
  • attr "an-attribute": retrieves the value of the named attribute of a given element if it is present; otherwise, returns null. For example, attr "href" applied to an anchor (an a element) retrieves its href attribute, that is, the address the link points to.
  • attrs: retrieves all the attributes of a given element, returning the result as a map of attribute names and attribute values.
  • classes: retrieves all the CSS classes of a given element, returning an array where each element is a string containing the class name.
  • id: retrieves the id of a given element, as specified by the id attribute.
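
To make the descriptions above concrete, here is a rough Python analogy built on the standard library's xml.etree.ElementTree. It is not Silent Puma code and does not claim to match its implementation: the sample markup, the function names (element_id in particular) and the exact whitespace handling in text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, kept well-formed so the stdlib XML parser accepts it.
SAMPLE = '<a id="home" class="nav link" href="/index.html">Go <b>home</b> now</a>'


def name(el):
    return el.tag


def text(el):
    # Drop markup, discard whitespace-only chunks, join the rest with one space.
    chunks = (chunk.strip() for chunk in el.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


def inner_html(el):
    # Leading text plus each serialized child (tostring keeps the child's trailing text).
    return (el.text or "") + "".join(ET.tostring(c, encoding="unicode") for c in el)


def attr(el, key):
    return el.get(key)  # None stands in for JSON null


def attrs(el):
    return dict(el.attrib)


def classes(el):
    return el.get("class", "").split()


def element_id(el):
    return el.get("id")


focus = ET.fromstring(SAMPLE)
print(name(focus), text(focus), classes(focus), attr(focus, "title"))
```

In this sketch the "focus element" is passed explicitly as el, whereas Silent Puma extractors receive it implicitly.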

Extractors acting on the “immediate family” of an element

  • parent(extractor): applies an extractor to the parent of a given element. This extractor returns null if the node has no parent element (e.g., it is the html element at the root of the document).
  • children(extractor): applies an extractor on all children of a given element, returning all results collected in an array.
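
The two family extractors can be pictured as higher-order functions that take another extractor as an argument. The Python sketch below is only an analogy using xml.etree.ElementTree; the sample markup, the parent_of lookup table and the explicit el parameter are assumptions, and it treats only element nodes as children.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup.
SAMPLE = "<ul><li>one</li><li>two</li></ul>"
root = ET.fromstring(SAMPLE)

# ElementTree keeps no parent pointers, so build a child -> parent lookup once.
parent_of = {child: node for node in root.iter() for child in node}


def name(el):
    return el.tag


def parent(extractor, el):
    # Apply the extractor to the parent, or return None (null) at the root.
    p = parent_of.get(el)
    return None if p is None else extractor(p)


def children(extractor, el):
    # Apply the extractor to every child element and collect the results.
    return [extractor(child) for child in el]


print(children(name, root))
print(parent(name, root[0]))
print(parent(name, root))
```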

Extractors acting on sub-selections of an element

  • select-any(extractor, a .selector): selects a descendant of the element that matches the supplied CSS selector and extracts from it. This is analogous to the querySelector function in JavaScript. If no matching descendant is found, this extractor returns null.
  • select-all(extractor, a .selector): selects all descendants of the element that match the supplied CSS selector and extracts from each one, collecting all results in an array. This is analogous to the querySelectorAll function in JavaScript. If no matching descendant is found, this extractor returns an empty array.
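
The difference between the two return shapes (null versus an empty array) can be sketched in Python as follows. This is an illustration only, not Silent Puma's implementation: the matches helper is hypothetical and understands just tag, .class and tag.class selectors, a tiny subset of CSS, and the sketch assumes the first match in document order is used, as querySelector does.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup.
SAMPLE = (
    '<div><p class="intro">Hi</p>'
    '<ul><li class="item">one</li><li class="item">two</li></ul></div>'
)
root = ET.fromstring(SAMPLE)


def text(el):
    return "".join(el.itertext())


def matches(el, selector):
    # Toy matcher: supports "tag", ".class" and "tag.class" only.
    tag, _, cls = selector.partition(".")
    if tag and el.tag != tag:
        return False
    if cls and cls not in el.get("class", "").split():
        return False
    return True


def descendants(el):
    nodes = el.iter()  # document order, starting with el itself
    next(nodes)        # skip el: only descendants are candidates
    return nodes


def select_any(extractor, selector, el):
    for node in descendants(el):
        if matches(node, selector):
            return extractor(node)
    return None  # no match: null


def select_all(extractor, selector, el):
    return [extractor(n) for n in descendants(el) if matches(n, selector)]


print(select_any(text, "li.item", root))
print(select_all(text, ".item", root))
print(select_any(text, "table", root), select_all(text, "table", root))
```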