Split Column - 2 columns
Hi
I have a file that has a column for postal address that is in this format. Using V2.0 Data Studio
3 Ballarat Street BALLARAT VIC 3000 (not a real address but format is like this)
Address validation what read this as no column. As we are automating all loads direct, l need to be able to split this to 2 columns like this
Street
3 Ballarat Street
Suburb
BALLARAT VIC 3000
I have tested this and it works. I can do this in excel using flash fill for now, but l am sure a way in Experian.
Please advise
thanks
Carolyn
Tagged:
0
Answers
Hi @Carolyn this is an interesting one, firstly, I would say that in general the address validation should handle a fully concatenated address in one column, if that's not working as expected, the examples would be useful for us to investigate and remediate. @Sueann See - this is one to look at.
I think there are a couple of ways you might want to consider doing this parse out though, you could use any one of the following approaches:
One you might consider, involves using the fairly blunt method of a delimited substring - it has limitations because it assumes that the first line is delimited from the second line by a comma - you could also use the words, street, road, avenue and all the truncations thereof but again it assumes that you are able to capture all the combinations and then you need to concatenate the delimiter
This second approach uses a combination of position calculations and string delimitation and string substripping but it requires you to do all the street, St, Ave, Avenue, Road, Rd etc combinations. It doesn't require you to have comma separated segments of the address.
Another approach is to use a regular expression to identify the first segment of the address and then use the position based substring
I think either way, you're either stripping the extraneous parts from the address or just parsing out the piece that you have the most confidence is good.
There are some other ways that I have seen, where you can use the Experian Data Quality standardize engine to standardize the address for you but that would likely require a custom step.
Just as an aside, bear in mind that the address validation with the Batch engine actually does standardisation of the address to try and identify the address pieces.
Hi Clinton
I will give these a go and see how l go
thanks
@Clinton Jones
firstly, I would say that in general the address validation should handle a fully concatenated address in one column, if that's not working as expected, the examples would be useful for us to investigate and remediate. @Sueann See - this is one to look at.
This is unfortunately not the case for the Australian datasets you require the suburb, state, postcode split out in order to get good matches with Address Validate. Splitting out address information from one field is a very common scenario in Australia as a lot of council systems (NAR) store the address in this format, although sometimes there is a format stored in more structured way.
@Carolyn
As mentioned there are multiple ways to achieve the splitting of the address in data studio, most of the approaches will use variations of pattern matching like in the case with Excel Fill and each approach will have its limitations.
Below is an example using differing techniques which ill explain. The first two columns labelled Excel are the outputs from Flash Fill in Excel as you can see it will not handle spaces in your suburb name and in some cases will drop it.
The Columns Prefixed with Split are using a technique of turning the value into a list and extracted the last three values in the list similar to Excel Pattern Matching. The Split Part 1 uses the before function to take everything before the Split Part 2 and hence keeps the NORTH part of the address.
Split Part 2
Note the use of Variables to ease the building of a function. And the use of List reverse switch the values around so its easier to select the last three items.
Split Part 1
The second approach uses REGEX and the Before function
The REGEX assumes that only the SUBURB and State will be UPPERCASE and no other words and that the postcode will be the only 4 digit number.
Split Regex Part 2
Split Regex Part 1
The split approach should be OK if you are sending the data subsequently back into Address Validate as the Suburb should get corrected by the process.
There are many other approaches as well which I am sure others will share as well.
I've also included the functions as a dmx file that you can load into your system, provided you are on 2.0.10
Thanks l will test and see what is best option. I have been using the usable function, this is great once you nail a transform which affects many columns and files.
Carolyn
The Batch Address Validation engine currently used by Data Studio has a preference for comma separated address input.
I did a simple test in Aperture Data Studio for a number of property addresses in Victoria.
The best results came from having address input components (Component Address) i.e. having address components in it's own columns, for example, i had the premises and street address in one column, suburb in another column, postcode in another column. This particular example, yielded a "Verified Correct".
When using a single column address input that has components delimited by space, the number of Unmatched result is relatively high. The following example yielded an "Unmatched"
This is true even when compared to having a single column address input simply delimited by commas. The following example, yielded a "Good Full Match". Note that in this case, the street name was simply delimited into Porter and St, and is able to be corrected by the address validation engine to Porter Street.
I can raise a request to check if there is an opportunity to improve on the standardization rules, but the behavior may be inherent in the engine's preference for comma delimited input.