Address Rationalisation

Hi all

 

I am looking for an example of splitting an address into the different parts.

I have an address that consists a single line of address in the following format.

 

Theoretically it can contain a combination of the following.

Address Line1, Address Line2, Address Line3, Address Line4, Address PostCode

And I would like to transform this into the following columns

 

Address Line1 – data before the first “,” as long as it is not a numeric

Address Line2 – data between the first “,” and second “,” as long as it is not a numeric

Address Line3 – data between the second “,” and third “,” as long as it is not a numeric

Address Line4 – data after the third “,” as long as it is not a numeric

Address PostCode – the last 4 characters of the string that is a numeric

 

Any suggestions would be extremely helpful.

Best Answer

Answers

  • Sueann SeeSueann See Administrator

    @Keith Alexander is your input a comma delimited file? if it is, then it should fall into place as 5 columns initially. You can then add further transformation for the other checks. Question is what if we found numeric data for Address Line 1,2,3,4? what do you want to do with it? Do you have a realistic sample of the address you can share?

  • Keith AlexanderKeith Alexander Experian Employee
    edited July 29

    @Sueann See , unfortunately life is never that easy ... the file I am using is a XLSX file .

    I think I should rather say the following

    Address Line1 – data before the first “,” or the complete Address if there is no "'," irrespective of the datatype

    Here is a small extract of the data that I am using


  • Clinton JonesClinton Jones Experian Elite
    edited July 30

    Looking at the data, the largest number of comma separated segments is six, the real question is whether you want to have size address lines?

    I'd argue perhaps not, however inherent to the data is also the fact that some column rows seem to have bogus or incomplete data like simply a

    So the decisions are not easy.

    This is one of the reasons why using address validation sometimes is a good approach to helping to standardise the look and feel of an address, the challenge is the GiGo.

    All that said, the suggestions earlier by @Sueann See and @Akshay Davis seem to be your best option

    add a further five or six columns to your dataset using the transform New Column

    use the split function to take the contents of each comma separated segment and parse that out to a column

    any column that contains just a number (apart from the postcode) could then be reconcatenate with the PO Box or the street or building.

    you will have to do this for each column and give it a name, you can immediately see what the next challenge is, namely that the postcodes all dont line up, you will need to use a transform to get all the postcodes lined up


  • Clinton JonesClinton Jones Experian Elite

    Just adding to this, if you then wanted to pack the postcodes (or any other column) you could do it using a compound function that uses variables, AND and IF THEN else conditions

    first set up your variables

    you will need to do this for each column, probably from three on through six

    then next you will use AND and IF THEN else

    you will need to do this for each column from six through four, clearing as you go as well, you probably dont want the postcode to appear in more than one column

  • Sueann SeeSueann See Administrator

    @Keith Alexander I've used split as Akshay has suggested to split the address into Address line 1,2,3,4 and Postcode.

    In addition to each split, i've used a custom function to check if the value is numeric. If the value is fully convertible to a number, I would assign a null value. I do this for Address Line 1,2,3,4 and not for Postcode.

    I realize we probably need more work on postcode after i arrived with results like this.

    So i edited the transformation for postcode to assign the postcode as the last 4 numbers of the street address.

    Still this isn't perfect because address with 6 segments would have some missing parts. Example: Johannesburg is missing for this address.

    Want to give HERE API a try?


  • Clinton JonesClinton Jones Experian Elite

    Another way to find exactly just the postcode is to do a search of the string, but the challenge here will be if you have a four digit number somewhere in the address line other than the postcode. I am afraid as far as working out which part is a street and which part is a locality, you'll either have to use a lookup based search or a best guess. This is oce again why using something like HERE api custom step is a best choice.

  • Keith AlexanderKeith Alexander Experian Employee

    Thanks all for all the input.

    I have all the info that I need to accomplish what I am doing.

Sign In or Register to comment.