Address Rationalisation
Hi all
I am looking for an example of splitting an address into the different parts.
I have an address that consists a single line of address in the following format.
Theoretically it can contain a combination of the following.
Address Line1, Address Line2, Address Line3, Address Line4, Address PostCode
And I would like to transform this into the following columns
Address Line1 – data before the first “,” as long as it is not a numeric
Address Line2 – data between the first “,” and second “,” as long as it is not a numeric
Address Line3 – data between the second “,” and third “,” as long as it is not a numeric
Address Line4 – data after the third “,” as long as it is not a numeric
Address PostCode – the last 4 characters of the string that is a numeric
Any suggestions would be extremely helpful.
Best Answer
-
Akshay Davis Administrator
@Keith Alexander I think what you're after is the split function.
This takes an input string, a character to split on (comma in your case) and then the item to return.
The example above shows that for Address Line 1. You would create a workflow with a transform step, which creates 5 columns (Address Line 1 -5) and for each, it's just the split step, with the original column as the input and the relevant line number to return.
5
Answers
@Keith Alexander is your input a comma delimited file? if it is, then it should fall into place as 5 columns initially. You can then add further transformation for the other checks. Question is what if we found numeric data for Address Line 1,2,3,4? what do you want to do with it? Do you have a realistic sample of the address you can share?
@Sueann See , unfortunately life is never that easy ... the file I am using is a XLSX file .
I think I should rather say the following
Address Line1 – data before the first “,” or the complete Address if there is no "'," irrespective of the datatype
Here is a small extract of the data that I am using
Looking at the data, the largest number of comma separated segments is six, the real question is whether you want to have size address lines?
I'd argue perhaps not, however inherent to the data is also the fact that some column rows seem to have bogus or incomplete data like simply a
So the decisions are not easy.
This is one of the reasons why using address validation sometimes is a good approach to helping to standardise the look and feel of an address, the challenge is the GiGo.
All that said, the suggestions earlier by @Sueann See and @Akshay Davis seem to be your best option
add a further five or six columns to your dataset using the transform New Column
use the split function to take the contents of each comma separated segment and parse that out to a column
any column that contains just a number (apart from the postcode) could then be reconcatenate with the PO Box or the street or building.
you will have to do this for each column and give it a name, you can immediately see what the next challenge is, namely that the postcodes all dont line up, you will need to use a transform to get all the postcodes lined up
Just adding to this, if you then wanted to pack the postcodes (or any other column) you could do it using a compound function that uses variables, AND and IF THEN else conditions
first set up your variables
you will need to do this for each column, probably from three on through six
then next you will use AND and IF THEN else
you will need to do this for each column from six through four, clearing as you go as well, you probably dont want the postcode to appear in more than one column
@Keith Alexander I've used split as Akshay has suggested to split the address into Address line 1,2,3,4 and Postcode.
In addition to each split, i've used a custom function to check if the value is numeric. If the value is fully convertible to a number, I would assign a null value. I do this for Address Line 1,2,3,4 and not for Postcode.
I realize we probably need more work on postcode after i arrived with results like this.
So i edited the transformation for postcode to assign the postcode as the last 4 numbers of the street address.
Still this isn't perfect because address with 6 segments would have some missing parts. Example: Johannesburg is missing for this address.
Want to give HERE API a try?
Another way to find exactly just the postcode is to do a search of the string, but the challenge here will be if you have a four digit number somewhere in the address line other than the postcode. I am afraid as far as working out which part is a street and which part is a locality, you'll either have to use a lookup based search or a best guess. This is oce again why using something like HERE api custom step is a best choice.
Thanks all for all the input.
I have all the info that I need to accomplish what I am doing.