extract_regex

Type of operation: Projection, Semistructured

Description

Add one or more columns by matching capture names in a regular expression against a given source expression.

Regex extractions create string columns. Named capture groups are an extension to POSIX extended regular expressions.

If the column already exists, and it is of type string, the original value is replaced with the matched text. If the regular expression does not match anything, the original value is preserved.

If the column already exists, but it is not a string column, extract_regex returns an error.

See also: make_col.

The flags argument specifies optional regex flags:

  • c - Enables case-sensitive matching (default.)

  • i - Enables case-insensitive matching.

  • m - Enables multi-line mode (i.e. meta-characters ^ and $ match the beginning and end of any line of the input string.) By default, multi-line mode is disabled (i.e. ^ and $ match the beginning and end of the entire input string.)

  • s - Enables the POSIX wildcard character . to match \n (newline.) By default, . does not match \n.

For more about syntax, see POSIX extended regular expressions.

Note that regular expressions in OPAL are surrounded by /slashes/, not ‘quotes’ like strings.

extract_regex also supports capture group column typecasting using the following syntax: (?P<value::float64>). The named capture group column “value” is casted to float64 using the float64 typecast function. The currently supported typecast functions are float64, int64, string, parse_isotime (typecast to timestamp), duration, duration_ms, duration_sec, duration_min, duration_hr and parse_json.

Usage

extract_regex path, regex, [ flags ]

Argument

Type

Optional

Repeatable

Restrictions

path

expression

no

no

none

regex

regex

no

no

none

flags

string

yes

no

none

Accelerable

extract_regex is always accelerable if the input is accelerable. A dataset that only uses accelerable verbs can be accelerated, making queries on the dataset respond faster.

Examples

extract_regex message, /status=(?P<statuscode>\d+)/

Create the column ‘statuscode’ by matching for status=numbers in the field ‘message’.

extract_regex inputcol, /(?P<sensor>[^|]*)\|count:(?P<counts>[^|]*)\|env:(?P<env>[^|]*)/

Given an input column value like: “studio-aqi|count:654 201 28 0 0 0|env:3 4 4a”, generate three output columns: “sensor” with the value “studio-aqi”, “counts” with the value “654 201 0 0 0”, and “env” with the value “3 4 4a”.

extract_regex message, /(?P<date::parse_isotime>[0-9:TZ.]+) (?P<name>[a-z]+)=(?P<value::float64>[0-9.]+)/

Given an input column called message, generate three output columns called date, name and value with the corresponding regex matching. The “date” column is typecasted to datatype timestamp using parse_isotime, the “name” column remains the default type string and the “value” column is typecasted to datatype float64 using float64.

Aliases

  • colregex (deprecated)