extract_regex¶
Type of operation: Projection, Semistructured
Description¶
Add one or more columns by matching capture names in a regular expression against a given source expression.
Regex extractions create string columns. Named capture groups are an extension to POSIX extended regular expressions.
If the column already exists, and it is of type string
, the original value is replaced with the matched text. If the regular expression does not match anything, the original value is preserved.
If the column already exists, but it is not a string column, extract_regex
returns an error.
See also: make_col
.
The flags
argument specifies optional regex flags:
c - Enables case-sensitive matching (default.)
i - Enables case-insensitive matching.
m - Enables multi-line mode (i.e. meta-characters ^ and $ match the beginning and end of any line of the input string.) By default, multi-line mode is disabled (i.e. ^ and $ match the beginning and end of the entire input string.)
s - Enables the POSIX wildcard character
.
to match\n
(newline.) By default,.
does not match\n
.
For more about syntax, see POSIX extended regular expressions.
Note that regular expressions in OPAL are surrounded by /slashes/, not ‘quotes’ like strings.
extract_regex
also supports capture group column typecasting using the following syntax: (?P<value::float64>)
. The named capture group column “value” is casted to float64
using the float64 typecast function. The currently supported typecast functions are float64
, int64
, string
, parse_isotime
(typecast to timestamp), duration
, duration_ms
, duration_sec
, duration_min
, duration_hr
and parse_json
.
Usage¶
extract_regex path, regex, [ flags ]
Argument |
Type |
Optional |
Repeatable |
Restrictions |
---|---|---|---|---|
path |
expression |
no |
no |
none |
regex |
regex |
no |
no |
none |
flags |
string |
yes |
no |
none |
Accelerable¶
extract_regex is always accelerable if the input is accelerable. A dataset that only uses accelerable verbs can be accelerated, making queries on the dataset respond faster.
Examples¶
extract_regex message, /status=(?P<statuscode>\d+)/
Create the column ‘statuscode’ by matching for status=numbers in the field ‘message’.
extract_regex inputcol, /(?P<sensor>[^|]*)\|count:(?P<counts>[^|]*)\|env:(?P<env>[^|]*)/
Given an input column value like: “studio-aqi|count:654 201 28 0 0 0|env:3 4 4a”, generate three output columns: “sensor” with the value “studio-aqi”, “counts” with the value “654 201 0 0 0”, and “env” with the value “3 4 4a”.
extract_regex message, /(?P<date::parse_isotime>[0-9:TZ.]+) (?P<name>[a-z]+)=(?P<value::float64>[0-9.]+)/
Given an input column called message, generate three output columns called date, name and value with the corresponding regex matching. The “date” column is typecasted to datatype timestamp using parse_isotime
, the “name” column remains the default type string
and the “value” column is typecasted to datatype float64 using float64
.
Aliases¶
colregex
(deprecated)