Fuzzy Match with Regular Expressions
Imagine you have an address field containing variation of city names, for example:
You need to determine the country based on the city name. You have an official list of countries and their cities like this:
With any exact match techniques, some of the values will not be matched since they do not appear in the official list of the countries and their cities because they contain abbreviations, diacritics (accented characters), contain postcodes (digits) or is an alias (alternate/old city) name.
Fortunately, Aperture Data Studio provides a way where you can utilize different match types for fuzzy matching.
For the city dataset given, you will need:
- A combination of match types to be able to fully match the list of cities to the official list to obtain the countries. The best way to do this is via a regular expression (regex), which is simply a way to specify a search pattern. Aperture Data Studio supports the java based regex, so you may want to use a website like the regex101.com to test your regex (Select the Java 8 Flavor).
- Instead of using the official list of countries and their cities as the lookup table, you will have to create your own list of countries and cities, where the cities would be specify as search patterns instead of a specific city name, for example:
- Upload both the source dataset containing the City and a dataset containing the Country and City (Regex).
- Create a workflow and connect the Source to a Transform step.
- Within the Transform step, create a column and apply the Contains Match function if you want to return a status indicating if a match is found. Create another column and apply the Replace Matches function to return the corresponding country for the city.
Evidently, matching with a regex is more effective than an exact match when you have variations in your data.
Do expect to spend some time in understanding the regex syntax and tuning the pattern based on the nature of your data and your expectations on how loose/tight you want the pattern match to be.
The above example is modelled after a real customer use case mentioned in this thread.
You may be interested to read about other fuzzy matching examples using functions:
- Remove common words from company name (Company names may contain common words like LTD, Limited, Incorporated, this could be useful as a step to standardize the input prior to de-duplication).
- Replacing abbreviations with expanded form (Again, it helps to standardize your input prior to de-duplication)