List of transformers
Transformers are functions that transform JSON values into JSON values. They can be used in combination with aggregators and extractors to create expressive extraction rules for your crawl, avoiding much of the tedious post-processing. You can read more on transformers and how they fit into the bigger picture here. This is only a reference list.
Note: unless explicitly specified, all transformers return null on a null input.
General purpose transformers
is-null: tests if the input value is null, returning true if it is.
is-not-null: tests if the input value is null, returning false if it is.
hash: transforms a given string into a “characteristic number” representative of the whole string, but small (8 bytes long). This is not a cryptographic hash, but then again, this is not our use case, is it? The exact algorithm used is SipHash24.
not: negates a Boolean value. This transformer also returns null on a null input.
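To make the null rules concrete, here is a minimal Python sketch of is-null, is-not-null and not. It models JSON null as Python's None. The function names are illustrative and are not part of the tool itself. hash is left out because the key used with SipHash is not described in this reference.

```python
from typing import Any, Optional


def is_null(value: Any) -> bool:
    # One of the transformers that does NOT return null on a null input.
    return value is None


def is_not_null(value: Any) -> bool:
    # The mirror image of is_null: false exactly when the input is null.
    return value is not None


def not_(value: Optional[bool]) -> Optional[bool]:
    # Follows the general rule: a null input yields a null output.
    if value is None:
        return None
    return not value
```

Note that is_null and is_not_null always produce a Boolean, which makes them convenient as the final step of a chain that may have produced null earlier.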
Numeric transformers
as-number: transforms a string into a number, if possible. If not, returns null.
greater-than 3.14: tests if the input is greater than the supplied number.
lesser-than 3.14: tests if the input is less than the supplied number.
equals 3.14: tests if the input is equal to the supplied number.
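The numeric transformers can be sketched in Python as follows. This is only an approximation of the behavior described above: the exact number formats accepted by as-number are an assumption (Python's float parser is used as a stand-in), and the function names are hypothetical.

```python
from typing import Callable, Optional


def as_number(text: Optional[str]) -> Optional[float]:
    # Null in, null out; unparseable strings also become null.
    if text is None:
        return None
    try:
        return float(text)  # stand-in parser; the real one may differ
    except ValueError:
        return None


def greater_than(threshold: float) -> Callable[[Optional[float]], Optional[bool]]:
    # Builds a transformer parameterized by the supplied number.
    return lambda value: None if value is None else value > threshold


def lesser_than(threshold: float) -> Callable[[Optional[float]], Optional[bool]]:
    return lambda value: None if value is None else value < threshold


def equals(threshold: float) -> Callable[[Optional[float]], Optional[bool]]:
    return lambda value: None if value is None else value == threshold
```

Chaining as_number into one of the comparisons mirrors how a rule would turn scraped text into a Boolean, for example greater_than(3.14)(as_number("4")).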
Transformers over collections
length: returns the size of the collection.
- for strings: returns the number of bytes occupied by the string. This might be quite different from the number of graphemes you see on the screen.
- for arrays: returns the number of elements in the array.
- for maps: returns the number of key-value pairs in the map.
is-empty: tests if the collection (either string, array or map) is empty.
get "a key": gets the corresponding value of the supplied key in a map. It returns null if the key is not present.
get 12: gets the element in the specified position in an array. It returns null if the index is out of bounds. The first position is zero.
flatten: flattens an array of arrays (a 2d array) into a single array (a 1d array) containing all the elements from all given arrays.
each(other-transformer): maps a given transformer over each element of an array. For example, classes each(length) returns an array with the length of each CSS class in each element. On a map, this operates over the values of each pair.
filter(other-transformer): tests if the supplied transformer returns true or false (or null) for each element in an array and retains only those which tested true. For example, classes filter(length lesser-than 5) returns an array containing only the CSS classes with fewer than 5 characters. On a map, this operates over the values of each pair.
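The collection transformers above can be approximated in Python as below. This is a sketch under stated assumptions, not the tool's implementation: strings are assumed to be UTF-8 encoded (which is what makes the byte length differ from the visible character count), JSON arrays and maps are modeled as Python lists and dicts, and all function names are illustrative.

```python
from typing import Any, Callable

Transformer = Callable[[Any], Any]


def length(value: Any) -> Any:
    if value is None:
        return None
    if isinstance(value, str):
        return len(value.encode("utf-8"))  # bytes, not graphemes (UTF-8 assumed)
    return len(value)  # elements of a list, or key-value pairs of a dict


def is_empty(value: Any) -> Any:
    return None if value is None else len(value) == 0


def get(key: Any) -> Transformer:
    def transformer(value: Any) -> Any:
        if isinstance(value, dict):
            return value.get(key)  # null when the key is absent
        if isinstance(value, list) and isinstance(key, int) and 0 <= key < len(value):
            return value[key]  # zero-based position
        return None  # null input, or index out of bounds
    return transformer


def flatten(value: Any) -> Any:
    if value is None:
        return None
    return [item for inner in value for item in inner]


def each(inner: Transformer) -> Transformer:
    def transformer(value: Any) -> Any:
        if value is None:
            return None
        if isinstance(value, dict):
            return {k: inner(v) for k, v in value.items()}  # acts on the values
        return [inner(item) for item in value]
    return transformer


def filter_(predicate: Transformer) -> Transformer:
    def transformer(value: Any) -> Any:
        if value is None:
            return None
        if isinstance(value, dict):
            return {k: v for k, v in value.items() if predicate(v) is True}
        # Elements that test false or null are both dropped.
        return [item for item in value if predicate(item) is True]
    return transformer


classes = ["btn", "btn-primary", "lg"]
lengths = each(length)(classes)                      # [3, 11, 2]
short = filter_(lambda c: length(c) < 5)(classes)    # ["btn", "lg"]
accented = length("\u00e9")                          # 2: one character, two bytes
```

The last three lines mirror the classes each(length) and classes filter(...) examples, with a hard-coded list standing in for the extracted CSS classes.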
String manipulation
pretty: makes a best effort to remove superfluous whitespace from text. It removes extra spaces and tabs between words as well as extra line breaks between paragraphs. In addition, it appends a trailing line break to the final paragraph (if there is one), since this is “best practice”. One more thing, Windows users: all carriage returns are stripped out of the text.
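A rough Python sketch of this kind of cleanup is shown below. The exact rules of pretty are not spelled out here, so the sketch makes two assumptions: every non-empty line counts as a paragraph, and paragraphs are separated by a single line break in the output. The function name simply mirrors the transformer.

```python
import re
from typing import Optional


def pretty(text: Optional[str]) -> Optional[str]:
    if text is None:
        return None
    text = text.replace("\r", "")  # strip all carriage returns
    paragraphs = []
    for line in text.split("\n"):
        # Collapse runs of spaces and tabs, then trim the ends.
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:  # dropping blank lines removes extra line breaks
            paragraphs.append(cleaned)
    if not paragraphs:
        return ""  # no final paragraph, so no trailing line break
    return "\n".join(paragraphs) + "\n"
```

Under these assumptions, pretty("Hello   \t world\r\n\r\n\r\nBye ") yields "Hello world\nBye\n".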
Regular expressions
capture "[a-z]+": returns one match of the regular expression in the input text, or null if the regular expression does not match the text at all. A match is a map from the identification of each capture group to the captured part. The key “0” indicates the match of the whole regular expression. For example, suppose the regular expression is "http(s)?://(?P<domain>[^/]+)/" and the input text is "http://example.org/", then the output will be the map {"0": "http://example.org/", "domain": "example.org"}.
all-captures "[a-z]+": returns all matches (occurrences) of the regular expression in the input text. As such, this transformer returns an array of maps, with each map representing a match. In the case where no match is found, an empty list (not null) is returned.
matches "[a-z]+": returns true if the input matches the given regular expression or false if it doesn’t.
replace "([a-z])+" with "foo $1": matches all occurrences of the regular expression in the input and replaces each of them with the replacer string. You can refer to the capture groups using $n syntax or even $name, where name is the name of a named capture group. For example, replace "(foo|bar)-baz" with "duck-$1" applied to "foo-baz quack bar-baz" returns "duck-foo quack duck-bar".
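The four regular expression transformers can be imitated with Python's re module, as sketched below. Several details are assumptions rather than documented behavior: Python's regex dialect stands in for the tool's own engine, unnamed groups are keyed by their index as a string, groups that did not take part in the match are omitted from the map, and matches is treated as an unanchored search. The function names are illustrative.

```python
import re
from typing import Dict, List, Optional


def _match_to_map(match: "re.Match") -> Dict[str, str]:
    # Key each group by its name if it has one, else by its index as a string.
    names = {index: name for name, index in match.re.groupindex.items()}
    result = {}
    for index in range(match.re.groups + 1):
        value = match.group(index)
        if value is not None:  # assumption: unmatched groups are left out
            result[names.get(index, str(index))] = value
    return result


def capture(pattern: str, text: Optional[str]) -> Optional[Dict[str, str]]:
    if text is None:
        return None
    match = re.search(pattern, text)
    return _match_to_map(match) if match else None


def all_captures(pattern: str, text: Optional[str]) -> Optional[List[Dict[str, str]]]:
    if text is None:
        return None
    return [_match_to_map(m) for m in re.finditer(pattern, text)]  # [] if none


def matches(pattern: str, text: Optional[str]) -> Optional[bool]:
    if text is None:
        return None
    return re.search(pattern, text) is not None  # assumption: unanchored


def replace(pattern: str, replacer: str, text: Optional[str]) -> Optional[str]:
    if text is None:
        return None

    def expand(match: "re.Match") -> str:
        def lookup(ref: "re.Match") -> str:
            name = ref.group(1)
            key = int(name) if name.isdecimal() else name
            try:
                return match.group(key) or ""
            except IndexError:  # unknown group: substitute nothing
                return ""
        # Resolve $n and $name references against the current match.
        return re.sub(r"\$(\w+)", lookup, replacer)

    return re.sub(pattern, expand, text)
```

With these definitions, capture(r"http(s)?://(?P<domain>[^/]+)/", "http://example.org/") produces a map with the keys "0" and "domain", and replace(r"(foo|bar)-baz", "duck-$1", "foo-baz quack bar-baz") reproduces the "duck-foo quack duck-bar" example.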