Regular expressions

Learn how to use regular expressions to implement natural language processing for search queries.

The string-regexp-extract step is used to run regular expressions on pipeline inputs to transform variables such as queries into more structured information. The regular expressions use RE2 syntax.

They are mostly used to extract structured information from queries (also referred to as query scoping). Below are some examples of common use cases. More advanced NLP models are also supported, but for many cases a regular expression is sufficient.

Extracting a size

A common requirement is to extract patterns such as "size 14 shoes" from queries. In this case, no product text contains the phrase "size 14", so search implementations commonly fail on this type of query. Pipelines can handle this in a few lines of YAML.

Below is a sample query from which we will extract the size information:

{
    "q": "nike size 14 shoes"
}

The regular expression looks for the pattern size <sizeValue> and puts the match into the size variable, which is bound to the match param.

- id: string-regexp-extract
  params:
    match:
      bind: size
    matchName:
      constant: sizeValue
    outText:
      bind: q
    pattern:
      constant: size (?P<sizeValue>\d+(\.\d+)?)
    text:
      bind: q

text defines the input variable and outText defines the output variable. In this case both are bound to q, which is why the matched text is removed from the query. Alternatively, outText could be bound to a different variable to keep the query unchanged.

Let's take a closer look at the step above.

  • With bind: size we are binding the matched string to a variable called "size".

  • sizeValue specifies the name of the match group we will be using in the regular expression.

  • constant: size (?P<sizeValue>\d+(\.\d+)?) specifies the regular expression.

    (?P<sizeValue> ...) defines a named capture group; the expression inside it matches the value to extract. The dot is escaped as \. so that it matches only a literal decimal point, as in "size 10.5".

In the example above, the regular expression matches "size 14", captures the numeric value in the group, and assigns it to the size variable.

The variables after this step look like:

{
    "q": "nike shoes",
    "size": "14"
}
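
To make the behaviour concrete, the extraction can be approximated outside the pipeline. The following is a minimal Python sketch, not the pipeline's implementation: regexp_extract is a hypothetical helper, Python's re module stands in for RE2 (both accept this pattern), and collapsing the leftover whitespace is an assumption made so that the result matches the variables shown above. The dot in the pattern is escaped so that it matches only a literal decimal point.

```python
import re

# Size pattern from the step; the dot is escaped to match a literal decimal point.
SIZE_PATTERN = r"size (?P<sizeValue>\d+(\.\d+)?)"

def regexp_extract(text, pattern, match_name):
    """Hypothetical stand-in for the step: return (out_text, match)."""
    m = re.search(pattern, text)
    if m is None:
        # No match: the text is unchanged and nothing is bound.
        return text, None
    # Cut the whole matched span ("size 14") out of the text.
    remaining = text[:m.start()] + text[m.end():]
    # Assumption: leftover whitespace is collapsed to single spaces.
    return " ".join(remaining.split()), m.group(match_name)

variables = {"q": "nike size 14 shoes"}
variables["q"], size = regexp_extract(variables["q"], SIZE_PATTERN, "sizeValue")
if size is not None:
    variables["size"] = size
print(variables)
```

Passing a different key for the output text instead of overwriting "q" would mirror binding outText to another variable.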

This is useful because we can filter on size whenever a size variable exists. This is done using the add-filter step:

- id: add-filter
  params:
    filter:
      constant: option_size ~ [size]
  condition: size

The condition ensures the filter is only added if the size variable exists. The expression option_size ~ [size] filters for products where the option_size array field contains the value of the size variable, in this case "14". The query executed against the indexes is nike shoes.
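
The effect of the conditional filter can be sketched in Python. This is only an illustration of the logic described above, under stated assumptions: add_size_filter is a hypothetical helper, the product records are made-up sample data, and the "contains" check is modelled as simple list membership.

```python
def add_size_filter(products, variables):
    """Keep products whose option_size list contains the size variable."""
    size = variables.get("size")
    if not size:
        # Mirrors `condition: size`: without the variable, no filter is applied.
        return products
    return [p for p in products if size in p.get("option_size", [])]

# Made-up sample records for illustration only.
products = [
    {"title": "Nike Air shoes", "option_size": ["12", "14"]},
    {"title": "Nike Run shoes", "option_size": ["9", "10"]},
]

filtered = add_size_filter(products, {"q": "nike shoes", "size": "14"})
unfiltered = add_size_filter(products, {"q": "nike shoes"})
```

Here filtered holds only the first record, while unfiltered holds both, because the second call has no size variable.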

Extracting a serial number

When searching for specific serial numbers, you generally expect to see exact matches only. However, the spelling system might assume that one or two letters (or numbers) were simply typed incorrectly and will show matching alternatives.

To ensure that only results matching the serial number exactly are shown, we can match the serial number via a regular expression and apply a filter with the exact number.

This example will also write the serial number into a different variable and leave the query text unchanged. That means the search will be executed the way the user entered it, with an additional filter applied.

Assume the serial number is made up of a few letters and numbers, and a user searches for the following serial number:

{
    "q": "ABC12345"
}

The following steps in the pipeline will extract the serial number and bind it to the "serial" variable that is then used to add a filter that only matches if the title contains the serial number.

- id: string-regexp-extract
  params:
    match:
      bind: serial
    matchName:
      constant: serialValue
    outText:
      bind: qModified
    pattern:
      constant: (?P<serialValue>[A-Za-z0-9]*([a-zA-Z]+[0-9]+|[0-9]+[a-zA-Z]+))
    text:
      bind: q
- id: add-filter
  params:
    filter:
      constant: title ~ serial
  condition: serial

This will ensure that customers searching for an exact serial number will only get results that actually include that serial number in the title.
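
The pattern's behaviour can be checked with a short Python sketch. The helpers extract_serial and exact_title_filter are hypothetical, the titles are made-up sample data, Python's re module stands in for RE2, and the case-sensitive substring test is an assumption about how "contains" behaves.

```python
import re

# Serial pattern from the step: a run of letters and digits that contains
# letters followed by digits, or digits followed by letters.
SERIAL_PATTERN = r"(?P<serialValue>[A-Za-z0-9]*([a-zA-Z]+[0-9]+|[0-9]+[a-zA-Z]+))"

def extract_serial(query):
    """Return the serial number found in the query, or None."""
    m = re.search(SERIAL_PATTERN, query)
    return m.group("serialValue") if m else None

def exact_title_filter(titles, serial):
    """Keep only titles that contain the exact serial (assumed case-sensitive)."""
    if serial is None:
        # Mirrors `condition: serial`: no serial, no filter.
        return titles
    return [t for t in titles if serial in t]

# Made-up sample titles for illustration only.
titles = ["Pump ABC12345", "Pump ABC12346", "Pump manual"]

query = "ABC12345"
serial = extract_serial(query)  # the query text itself is left unchanged
matches = exact_title_filter(titles, serial)
```

Queries made up only of letters or only of digits, such as "nike shoes" or "12345", do not match the pattern, so no filter is added for them.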

Extracting a year

Below is an example illustrating how a year can be extracted from a free text query.

{
    "q": "2018 tax legislation"
}

For the above query, we have a year field on every document, but the year is not mentioned in other fields that are typically indexed such as the title and description.

We would ideally like to filter on the year and use the rest of the query as a normal query. The first part is to extract the year:

- id: string-regexp-extract
  params:
    match:
      bind: yearFilter
    matchTemplate:
      const: year = ${yearValue}
    outText:
      bind: q
    pattern:
      const: (?P<yearValue>\b(19|20)[0-9]{2}\b)
    text:
      bind: q

This converts the input into the following:

{
    "q": "tax legislation",
    "yearFilter": "year = 2018"
}

This version uses the matchTemplate option to insert the extracted year ${yearValue} directly into a filter expression. It could also have just extracted the year, but this case illustrates conversion into an expression (it could be any other string pattern).
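
The template expansion can be approximated in Python as follows. This is a sketch under stated assumptions rather than the step's actual implementation: extract_with_template is a hypothetical helper, Python's re module stands in for RE2, and string.Template is used only because its ${name} placeholder syntax happens to match the one in matchTemplate. Collapsing the leftover whitespace is also an assumption.

```python
import re
from string import Template

YEAR_PATTERN = r"(?P<yearValue>\b(19|20)[0-9]{2}\b)"
MATCH_TEMPLATE = "year = ${yearValue}"

def extract_with_template(text, pattern, template):
    """Hypothetical stand-in for the step: return (out_text, templated_match)."""
    m = re.search(pattern, text)
    if m is None:
        # No year found: the text is unchanged and no filter string is produced.
        return text, None
    # Remove the matched span and collapse leftover whitespace (assumption).
    remaining = " ".join((text[:m.start()] + text[m.end():]).split())
    # Fill the ${yearValue} placeholder from the named capture group.
    return remaining, Template(template).substitute(m.groupdict())

q, year_filter = extract_with_template("2018 tax legislation", YEAR_PATTERN, MATCH_TEMPLATE)
print({"q": q, "yearFilter": year_filter})
```

Because the pattern only accepts years starting with 19 or 20, a query such as "tax legislation 1850" is left unchanged and produces no filter.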

The filter can then be added simply using:

- id: add-filter
  params:
    filter:
      bind: yearFilter
  condition: yearFilter

Note that the condition checks that the yearFilter variable exists before the filter is added.
