Extracting the country from an unstructured address using a lookup list of countries

Clinton Jones · July 2019

The problem that you may need to solve is that you have data that is effectively a contiguous string of unstructured data and you want to determine perhaps the country from that data in order to either Address Match or filter your records.

A basic approach is to simply use an extract Matches function within the Transform step and derive the country to a new column using a transform.

This function has a number of configuration options but effectively the results speak for themselves...

Clinton Jones · July 2019

https://us.v-cdn.net/6031645/uploads/602/6CNUC9PHBAVN.wfl https://us.v-cdn.net/6031645/uploads/416/303C7P5MEX6B.csv

Attached is an example that somewhat describes what you are trying to do Mitesh, with some data off the web

A couple of things to note about this workflow:

* I used the lookup twice but used a branch to use it twice

* I used a single transform to create the match flag for the lookup

* The split is an inverted split so that the workflow lays out nicely!

* The find strings produces a comma separated list but it could be anything

* The Union puts the non companies first and the candidate companies second in the unsorted list and the inferred column of lookup values is tacked on the end of the data set so now appears as the last column.

Nigel Light · October 2019

Hi @ClintonJones

I am still having problems with getting this to work eg if I have an address field containing, for example

Los Angeles

Washington DC

Seattle

Los Angeles 90210

and checking them against a table eg

United States,Los Angeles

United States,Washington

United States,Seattle

Looking up the City to obtain the Country

If I perform a space separated look-up, then I do not match Los Angeles. if I take the whole field, then I do not match Washington DC or Los Angeles 90210

(I realise that I can add this to the table but there are many other examples that need to be catered for in our dataset)

The problem is that we look for the address field in the list entries, rather than the other way round ie looking for the list entry in the address field

(in the style of a Pandora domain lookup)

Any thoughts/advice? Happy to discuss next week at the user group

Nige

Clinton Jones · October 2019

Hi Nigel

What is considered the 'best' approach is in all likelihood a standardization of the address elements ahead of any follow on analysis. @Akshay Davis, @Danny Roden @Henry Simms or @Katya Jermolina may comment further...

Generally we would use the address validation 'batch' step to do this; in some regions however we also use an explicit standardization step to do this.

I believe that in the ANZ region this is used @Ian Buckle can comment on that.

There is also a standardization activity as part of the Find Duplicates step usage

In general standardization is problematic when the addresses are from mixed countries and the country typing is not explicit.

If a prospective match on address is found by the address validation engine then you will see a set of proposed alternative address elements available.

Using a lookup is one way to find matches if you have an exhaustive list as you suggest but it is probably imperfect as you have discovered.

You'll see that the challenges that you outline are not unique to your organization and similarly try to be challenged with lookups programmatically elsewhere such as on stackoverflow - of course data studio avoids you having to do the programming bit but the principles are largely the same.

Irrespective of whether you use SSIS or products like Alteryx or Talend the most appropriate approach seems to be to use a reference data set, in our case, this would be the Address Matching via Batch or using a lookup the way you are doing it now.

Nigel Light · October 2019

Hi @Clinton Jones

Unfortunately our client address file is mixed-countries from around the world (I just simplified it in the above example to highlight the issue) so, as you say, standardisation doesn't really work for us (I know @PavanRao has tried this and we might need to revisit)

Eg Looking up Hamilton Parish with word spacing returns Canada Parish but, if we take the whole phrase, we get (correctly) Bermuda

I considered creating a kind of repetitive function, trying the first word, first 2 words, 3 words etc but again I might be foiled where something precedes the word/phrase of interest eg 2345 Hamilton Parish wouldn't work. We might also be faced with multi-worded cities eg Aix Les Bains which further complicates matters

I fear that this may be something that is beyond the capabilities of Aperture?

Nige😥

Clinton Jones · October 2019

@Nigel Light I am looking at another possibility - i will keep you posted.

Nigel Light · October 2019

Thanks @Clinton - that would be a great help

Nigel Light · October 2019

Capture.JPG

Potentially solved by inverting the lookup

File 1 (Incomplete country addresses) contains the country & city. This is deduplicated and a unique row number applied

File 2 (Addr data) contains the address field (ie Addr5). This too has a unique row number applied.

I then perform a Replace matches, using the city as input and Addr4 as lookup, returning the address unique number and filtering to remove any non-matches

This is joined to (2), by row number, to obtain the address field and a further right join, by the addr5 field, to obtain any duplicate address entries

Finally, unwanted fields are removed and the file resorted into the original order (using row number)

I guess this isn't very efficient but it seems to give correct match results

Hope this helps anybody who is faced with a similar problem

Davis, Akshay · October 2019

@Nigel Light Another option is to use a regex to extract and match the cities.

If you had a lookup file like this

It will allow for case insensitive matches and optional space for "Los Angeles".

You can then use an Extract Matches function to extract the matched city name, followed by a replace matches to replace the found city name with the corresponding country.

Which returns the results as expected

Nigel Light · October 2019

Elegant solution...nice

Extracting the country from an unstructured address using a lookup list of countries

Comments

Categories