Extracting the country from an unstructured address using a lookup list of countries

Clinton JonesClinton Jones Experian Super Contributor

The problem that you may need to solve is that you have data that is effectively a contiguous string of unstructured data and you want to determine perhaps the country from that data in order to either Address Match or filter your records.

A basic approach is to simply use an extract Matches function within the Transform step and derive the country to a new column using a transform.

This function has a number of configuration options but effectively the results speak for themselves...

Comments

  • MiteshKhatriMiteshKhatri Advocate Silver

    This article has been a fantastic find, thanks @Clinton ! I've adapted the principle here in order to do a lookup that, in SQL, would require something like like '%string%'. For example, if I want to select the words "building", "builder", and their potential plurals or slight variations, I might want to be able to use "build" as the look up. I have achieved this by concatenating .* at the end of each value in the lookup table, and then changing the Match Type to be Regular Expression.

    I wanted to put \b at the start of the lookup value (i.e. creating "\bbuild.*") to identify a word boundary at the beginning of the lookup value, but I couldn't make that work. What "version" of RegEx does Aperture use? I wonder if having something like a "Substring" option in Match Type would be useful? It could help identify matches where the keyword in the data is not surrounded by spaces, and would offer Aperture that extra bit of flexibility, if needed.

  • Clinton JonesClinton Jones Experian Super Contributor
    edited July 30

    Attached is an example that somewhat describes what you are trying to do Mitesh, with some data off the web

    A couple of things to note about this workflow:

    * I used the lookup twice but used a branch to use it twice

    * I used a single transform to create the match flag for the lookup

    * The split is an inverted split so that the workflow lays out nicely!

    * The find strings produces a comma separated list but it could be anything

    * The Union puts the non companies first and the candidate companies second in the unsorted list and the inferred column of lookup values is tacked on the end of the data set so now appears as the last column. 

  • Nigel LightNigel Light Super Learner
    edited October 28

    Hi @ClintonJones

    I am still having problems with getting this to work eg if I have an address field containing, for example

    Los Angeles

    Washington DC

    Seattle

    Los Angeles 90210

    and checking them against a table eg

    United States,Los Angeles

    United States,Washington

    United States,Seattle

    Looking up the City to obtain the Country

    If I perform a space separated look-up, then I do not match Los Angeles. if I take the whole field, then I do not match Washington DC or Los Angeles 90210

    (I realise that I can add this to the table but there are many other examples that need to be catered for in our dataset)

    The problem is that we look for the address field in the list entries, rather than the other way round ie looking for the list entry in the address field

    (in the style of a Pandora domain lookup)

    Any thoughts/advice? Happy to discuss next week at the user group

    Nige

  • Clinton JonesClinton Jones Experian Super Contributor

    Hi Nigel

    What is considered the 'best' approach is in all likelihood a standardization of the address elements ahead of any follow on analysis. @Akshay Davis, @Danny Roden @Henry Simms or @Katya Jermolina may comment further...

    Generally we would use the address validation 'batch' step to do this; in some regions however we also use an explicit standardization step to do this.

    I believe that in the ANZ region this is used @Ian Buckle can comment on that.

    There is also a standardization activity as part of the Find Duplicates step usage

    In general standardization is problematic when the addresses are from mixed countries and the country typing is not explicit.

    If a prospective match on address is found by the address validation engine then you will see a set of proposed alternative address elements available.



    Using a lookup is one way to find matches if you have an exhaustive list as you suggest but it is probably imperfect as you have discovered.

    You'll see that the challenges that you outline are not unique to your organization and similarly try to be challenged with lookups programmatically elsewhere such as on stackoverflow - of course data studio avoids you having to do the programming bit but the principles are largely the same.

    Irrespective of whether you use SSIS or products like Alteryx or Talend the most appropriate approach seems to be to use a reference data set, in our case, this would be the Address Matching via Batch or using a lookup the way you are doing it now.

  • Nigel LightNigel Light Super Learner

    Hi @Clinton Jones

    Unfortunately our client address file is mixed-countries from around the world (I just simplified it in the above example to highlight the issue) so, as you say, standardisation doesn't really work for us (I know @PavanRao has tried this and we might need to revisit)

    Eg Looking up Hamilton Parish with word spacing returns Canada Parish but, if we take the whole phrase, we get (correctly) Bermuda

    I considered creating a kind of repetitive function, trying the first word, first 2 words, 3 words etc but again I might be foiled where something precedes the word/phrase of interest eg 2345 Hamilton Parish wouldn't work. We might also be faced with multi-worded cities eg Aix Les Bains which further complicates matters

    I fear that this may be something that is beyond the capabilities of Aperture?

    Nige😥

  • Clinton JonesClinton Jones Experian Super Contributor

    @Nigel Light I am looking at another possibility - i will keep you posted.

  • Nigel LightNigel Light Super Learner

    Thanks @Clinton - that would be a great help

  • Nigel LightNigel Light Super Learner


    Potentially solved by inverting the lookup

    File 1 (Incomplete country addresses) contains the country & city. This is deduplicated and a unique row number applied

    File 2 (Addr data) contains the address field (ie Addr5). This too has a unique row number applied.

    I then perform a Replace matches, using the city as input and Addr4 as lookup, returning the address unique number and filtering to remove any non-matches

    This is joined to (2), by row number, to obtain the address field and a further right join, by the addr5 field, to obtain any duplicate address entries

    Finally, unwanted fields are removed and the file resorted into the original order (using row number)

    I guess this isn't very efficient but it seems to give correct match results

    Hope this helps anybody who is faced with a similar problem

  • Akshay DavisAkshay Davis Administrator

    @Nigel Light Another option is to use a regex to extract and match the cities.

    If you had a lookup file like this

    It will allow for case insensitive matches and optional space for "Los Angeles".

    You can then use an Extract Matches function to extract the matched city name, followed by a replace matches to replace the found city name with the corresponding country.

    Which returns the results as expected


  • Nigel LightNigel Light Super Learner

    Elegant solution...nice

Sign In or Register to comment.