Getting started with Regular Expressions in Data Studio
A regular expression (from here on we’ll call them “regex” for short) is a well-known and powerful syntax for describing a search pattern in a simple string of text.
You can think of regex as a way of defining advanced wildcards: In Windows file explorer you might search for *.txt
to return all txt files, or in SQL you might search WHERE SALARY LIKE '200%'
to find salary values starting with “200”. Regex lets you define much more complicated searches using a similar approach.
Data Studio has four functions that work specifically with regex, as well as global library for storing and managing them, which means you can easily harness the power of regex for your project.
Before we look at an example where regex is used in Data Studio, here’s a couple of tips:
- Use an online regex tester to check that your expression does what you expect it to do with a range of different values. There are lots of great regex testers and debuggers out there, my favourite is Regex101
- Regex is powerful but can also be complicated. Any regex search could also be built using Data Studio’s standard comparison functions Equals, Starts With, Contains, and so on. Regex can be a more elegant and rapid way to define a search but it isn’t always the right approach!
- In the examples below, the regex is defined in the function itself. That’s OK, but if you plan to re-use an expression, or to need to maintain it over time, we recommend defining it as a constant value in the Glossary (you’ll find this in the Glossary, where there are already lots of useful regular expressions).
There are several ways to use regex in Data Studio. Let’s take a look at the most common scenario with a quick worked example: Using Compare to check whether values in a column of data match our regex. This is the most common way to use a regex in Data Studio.
In this example, we have a column of IDs:
A10-E1
A11-EX
A10_33
A11-X1
A12-55
B12-E5
A11-33
C11-E2
A10-50
We want to check which values conform to our expected format, which in this case is:
- The first character must be “A” or “B” (uppercase)
- The second and third characters must be “10” or “11”
- The fourth character must be a dash (hyphen)
- The fifth character must be “E” (uppercase) or a numeric value between 1 and 5
- The sixth character must be any numeric value between 0 and 9
This set of criteria is complex, but can be defined by the single regex expression:
^(A|B)(10|11)-(E|[1-5])([0-9])$
Applying this regex in the Compare function of a workflow Transform step would be done by pasting the regex into the Compare function like this:
And the result of our new "Check Id Format" column on a sample of data would be:
Comments
-
@Henry is this correct ?
- The second and third characters must be “11” or “12”
or should it be "10" or" 11" ?
1 -
@Clinton good spot, I was just making sure everyone was paying attention. I've corrected the post now!
2 -
Some users have noticed that a regular expression that appears to work on a regex tester (I mentioned Regex101 in my post above) doesn't work as expected in Data Studio.
The most likely explanation is that Regex comes in many flavours, and the website you've used to develop the expression in doesn't use same underlying regular expression engine as Data Studio, namely Java 8's
java.util.regex
.Here's an example of a regex that is valid in one flavour (ECMAScript) but not in PCRE (PHP) or Java: [a-\d], where we want the hyphen to be treated as a literal value (
In ECMAScript \ JavaScript, this will match to a single character in the list
a, -
However, the PCRE \ PHP engine will not validate the same regex:
To be sure that your regular expression will work in Data Studio, you can test it in a site that explicitly has the Java option, for example freeformatter.com. Here, the same regex will report an error:
For (much) more detail on regular expression engine comparison, see this comparison table.
3