Lopez Reference

List of transformers

Transformers are functions that transform JSON values into JSON values. They can be used in combination with aggregators and extractors to create expressive extraction rules for your crawl, avoiding much of the tedious post-processing. You can read more on transformers and how they fit into the bigger picture here. This is only a reference list.

Note: unless explicitly specified, all transformers return null on a null input.

General purpose transformers

is-null: tests if the input value is null, returning true if it is.
is-not-null: tests if the input value is not null, returning true if it is not null and false if it is.
hash: transforms a given string into a “characteristic number” representative of the whole string, but small (8 bytes long). This is not a cryptographic hash, but then again, this is not our use case, is it? The exact algorithm used is SipHash-2-4.
not: negates a Boolean value. This transformer also returns null on a null input.
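
The null-handling rules above can be sketched in Python, with None standing in for JSON null. The function names below are hypothetical stand-ins for the transformers, not Lopez's own implementation.

```python
from typing import Any, Optional


def is_null(value: Any) -> bool:
    # Exception to the general rule: a null input yields a Boolean, not null.
    return value is None


def is_not_null(value: Any) -> bool:
    # Same exception: always answers with a Boolean.
    return value is not None


def not_(value: Optional[bool]) -> Optional[bool]:
    # Follows the general rule: null in, null out.
    return None if value is None else not value
```

Under these assumptions, not_(None) stays None, while is_null(None) is True.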

Numeric transformers

as-number: transforms a string into a number, if possible. If not, returns null.
greater-than 3.14: tests if the input is greater than the supplied number.
lesser-than 3.14: tests if the input is less than the supplied number.
equals 3.14: tests if the input is equal to the supplied number.
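
As a rough illustration of the numeric transformers, here is a Python sketch with None standing in for JSON null. The helper names are hypothetical, and the parsing rules (Python's float) are an assumption that may accept or reject different strings than Lopez does.

```python
from typing import Callable, Optional, Union

Number = Union[int, float]


def as_number(value: Optional[str]) -> Optional[float]:
    # Returns None for null input and for strings that do not parse.
    if not isinstance(value, str):
        return None
    try:
        return float(value)
    except ValueError:
        return None


def greater_than(threshold: Number) -> Callable[[Optional[Number]], Optional[bool]]:
    # Builds a test against a fixed number; null in, null out.
    return lambda value: None if value is None else value > threshold


def lesser_than(threshold: Number) -> Callable[[Optional[Number]], Optional[bool]]:
    return lambda value: None if value is None else value < threshold


def equals(threshold: Number) -> Callable[[Optional[Number]], Optional[bool]]:
    return lambda value: None if value is None else value == threshold
```

Chaining works as one would expect: greater_than(3.14)(as_number("4")) is True, while a string that does not parse propagates None through the comparison.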

Transformers over collections

length: returns the size of the collection.
- for strings: returns the number of bytes occupied by the string. This might be quite different from the number of graphemes you see on the screen.
- for arrays: returns the number of elements in the array.
- for maps: returns the number of key-value pairs in the map.
is-empty: tests if the collection (either string, array or map) is empty.
get "a key": gets the corresponding value of the supplied key in a map. It returns null if the key is not present.
get 12: gets the element in the specified position in an array. It returns null if the index is out of bounds. The first position is zero.
flatten: flattens an array of arrays (a 2d array) into a single array (a 1d array) containing all the elements from all given arrays.
each(other-transformer): maps a given transformer over each element of an array. For example, classes each(length) returns an array with the length of each CSS class in each element. On a map, this operates over the values of each pair.
filter(other-transformer): tests if the supplied transformer returns true or false (or null) for each element in an array and retains only those which tested true. For example, classes filter(length lesser-than 5) returns an array containing only the CSS classes shorter than 5 bytes. On a map, this operates over the values of each pair.
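
The collection transformers above can be approximated in Python as follows. This is a sketch under assumptions: the function names are hypothetical, strings are assumed to be measured in UTF-8 bytes, and None stands in for JSON null.

```python
from typing import Any, Callable


def length(value: Any) -> Any:
    if value is None:
        return None
    if isinstance(value, str):
        # Bytes, not graphemes (UTF-8 is an assumption here).
        return len(value.encode("utf-8"))
    return len(value)  # elements of an array, or pairs of a map


def flatten(arrays: Any) -> Any:
    if arrays is None:
        return None
    return [item for inner in arrays for item in inner]


def each(transformer: Callable[[Any], Any]) -> Callable[[Any], Any]:
    def apply(value: Any) -> Any:
        if value is None:
            return None
        if isinstance(value, dict):
            # On a map, transform the values and keep the keys.
            return {key: transformer(item) for key, item in value.items()}
        return [transformer(item) for item in value]
    return apply


def filter_(predicate: Callable[[Any], Any]) -> Callable[[Any], Any]:
    def apply(value: Any) -> Any:
        if value is None:
            return None
        if isinstance(value, dict):
            return {k: v for k, v in value.items() if predicate(v) is True}
        # Elements testing false or null are dropped.
        return [item for item in value if predicate(item) is True]
    return apply
```

For instance, each(length) applied to ["nav", "caf\u00e9"] gives [3, 5], because the accented letter occupies two bytes in UTF-8.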

String manipulation

pretty: this is a best effort to remove whitespace from text. It will remove extra spaces and tabs between words as well as extra line breaks between paragraphs. In addition, it will append a trailing line break to the final paragraph (if there is one), since this is “best practice”. One more thing, Windows users: all carriage returns will be stripped out of the text.
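
To make the description of pretty more concrete, here is a loose Python approximation. The exact rules Lopez applies are not specified above, so details such as trimming each line and keeping exactly one blank line between paragraphs are assumptions, and the function name is only illustrative.

```python
import re
from typing import Optional


def pretty(text: Optional[str]) -> Optional[str]:
    if text is None:
        return None
    # Strip all carriage returns.
    text = text.replace("\r", "")
    # Collapse runs of spaces and tabs, and trim each line (assumption).
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs (assumption).
    text = re.sub(r"\n{3,}", "\n\n", text).strip("\n")
    # Append a trailing line break if there is a final paragraph.
    return text + "\n" if text else text
```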

Regular expressions

capture "[a-z]+": returns one match of the regular expression in the input text, or null if the regular expression does not match the text at all. A match is a map mapping the identification of the capture groups to the captured parts. The key “0” indicates the match of the whole regular expression. For example, suppose the regular expression is "http(s)?://(?P<domain>[^/]+)/" and the input text is "http://example.org/", then the output will be the map {"0": "http://example.org/", "domain": "example.org"}.
all-captures "[a-z]+": returns all matches (occurrences) of the regular expression in the input text. As such, this transformer returns an array of maps, with each map representing a match. In the case where no match is found, an empty array (not null) is returned.
matches "[a-z]+": returns true if the input matches the given regular expression or false if it doesn’t.
replace "([a-z])+" with "foo $1": matches all occurrences of the regular expression in the input and replaces each of them with the replacer string. You can refer to the capture groups using $n syntax or even $name where name is the name of a named capture group. For example, replace "(foo|bar)-baz" with "duck-$1" applied to "foo-baz quack bar-baz" returns "duck-foo quack duck-bar".
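
The behaviour of capture, all-captures and replace can be mimicked with Python's re module, as sketched below. This is not how Lopez implements them: the function names are hypothetical, Python writes group references as \g<1> rather than $1 (hence the small translation step), and reporting only key "0" plus the named groups that took part in the match is an assumption based on the example above.

```python
import re
from typing import Callable, Dict, List, Optional


def _as_map(match: "re.Match[str]") -> Dict[str, str]:
    # Key "0" holds the whole match; named groups are added when they matched.
    result = {"0": match.group(0)}
    result.update({k: v for k, v in match.groupdict().items() if v is not None})
    return result


def capture(pattern: str) -> Callable[[Optional[str]], Optional[Dict[str, str]]]:
    regex = re.compile(pattern)

    def apply(text: Optional[str]) -> Optional[Dict[str, str]]:
        if text is None:
            return None
        match = regex.search(text)
        return None if match is None else _as_map(match)
    return apply


def all_captures(pattern: str) -> Callable[[Optional[str]], Optional[List[Dict[str, str]]]]:
    regex = re.compile(pattern)

    def apply(text: Optional[str]) -> Optional[List[Dict[str, str]]]:
        if text is None:
            return None
        # An empty list, not None, when nothing matches.
        return [_as_map(match) for match in regex.finditer(text)]
    return apply


def replace(pattern: str, replacement: str) -> Callable[[Optional[str]], Optional[str]]:
    regex = re.compile(pattern)
    # Translate `$1` / `$name` into Python's `\g<1>` / `\g<name>` form.
    escaped = replacement.replace("\\", "\\\\")
    template = re.sub(r"\$(\w+)", r"\\g<\1>", escaped)

    def apply(text: Optional[str]) -> Optional[str]:
        return None if text is None else regex.sub(template, text)
    return apply
```

With these helpers, replace("(foo|bar)-baz", "duck-$1") applied to "foo-baz quack bar-baz" gives "duck-foo quack duck-bar", matching the example in the entry for replace.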