Relationship Discovery Using the Find Duplicates Step in Aperture Data Studio

We are trying to find join candidates across tables and databases using the profiling metadata in Aperture Data Studio. As blocking keys in the Find Duplicates step, we chose attributes such as Most Common Format, Dominant Datatype, Standard Deviation, Average Length, Length Deviation, Frequency Deviation, and Format Frequency Deviation, as well as Aperture Tags.

However, the prediction results from Find Duplicates are not accurate enough. We think the reasons are that our Find Duplicates rules are very basic and that the blocking keys need to be reconsidered. Could you please help us improve the accuracy?

I tried to drag our dataset and rules file into this post, but it showed "Request failed with status code 403". How can I share the files with you so that you can get a better understanding of the problems?

Thanks for your time.

Best Regards,

Nate

Answers

  • Sueann See Experian Super Contributor

    @Nathaniel I have responded to you via email with my observations. Let me know if I can post the files you sent me via email on this community portal.

  • Sueann See Experian Super Contributor

    For everyone's benefit, attached are the files provided by Nathaniel with regard to his questions.

    The following are some observations based on these files:

    Blocking Keys

    Consider removing the following blocking keys:

    • ColumnName – it may not be effective to match based on column name alone. In addition, any matches found by column name would probably have been covered by the other attributes.
    • Tags – the percentage of records having a value for Aperture or SSMS Tags is very low, so any matches found by these tags would probably have been covered by other attributes.
    • Length Deviation – there is already another blocking key based on Average Length, and there seems to be little benefit in matching on length deviation again, as both are indicators of length.

    If performance is not an issue currently, you can always come back to tune the blocking keys later. The Analyze Blocking Keys feature in the Find Duplicates Workbench may help with this.

    Find Duplicates Rules

    • Currently, the rule you have set up says that as long as the column name matches to a certain extent, the rest of the attributes don't matter (because you used the OR condition for ColumnName).

    Match.Close={ (Datatype.Close & MCF.Close & Tags.Close & ApertureTags.Close & LengthDeviation.Close & AverageLength.Close & FrequencyDeviation.Close & FormatFrequencyDeviation.Close & StandardDeviation.Close) | ColumnName.Close }

    For example, these rows for rowguid have quite different values for Most Common Format, but they will still be clustered together because the column name matches exactly. If this is not your desired outcome, then perhaps you should consider removing the ColumnName condition from the rules or refining the match conditions, as sketched below.
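    If you do drop the OR condition, the rule above reduces to requiring all of the attribute-level matches (a sketch only, reusing the key names from your rules file):

    Match.Close={ Datatype.Close & MCF.Close & Tags.Close & ApertureTags.Close & LengthDeviation.Close & AverageLength.Close & FrequencyDeviation.Close & FormatFrequencyDeviation.Close & StandardDeviation.Close }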


    • When you compare the ColumnName values "LocationID" and "LoginID", you will notice that this gives you a Levenshtein distance of 4 (see the breakdown below). This suggests that there is a risk of over-matching based on column name. If this is not your desired outcome, then perhaps you should consider removing the ColumnName condition from the rules or refining the match conditions.
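    To see where the distance of 4 comes from, here is one minimal edit sequence between the two names:

    LocationID → LogationID (substitute c → g)
    LogationID → LogtionID (delete a)
    LogtionID → LogionID (delete t)
    LogionID → LoginID (delete o)

    Four single-character edits are enough, which is why a loose Levenshtein threshold on ColumnName can cluster two quite different columns together.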


    • Since averages and deviations are numerical values, you will not get accurate results with Levenshtein distance, which is meant for text comparison. You can use NumericCompare instead of Levenshtein. We introduced NumericCompare in v2.4, so you may need to upgrade to use it.

    Example:

    #StandardDeviation.Generic_String.Close={NumericCompare[1]}   

    This would mean that we match numbers that have a difference of ±1. Note that decimal values are not supported as a parameter for NumericCompare, so if you want to allow a smaller range of difference, use the % option instead.

    #StandardDeviation.Generic_String.Close={NumericCompare[1%]} 

    This would mean that we match numbers that have a difference of ±1%.
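    The same pattern should work for the other numeric blocking keys. For example, assuming Average Length is defined with the same Generic_String comparison in your rules file, it could be switched over like this (5% is only an illustrative threshold to tune against your own data):

    #AverageLength.Generic_String.Close={NumericCompare[5%]}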


    • There isn't much benefit in using Levenshtein distance for Tags and Data Types, since there is only a limited number of possible values for each of these columns. Besides exact matches, what other matches are you looking for in Tags and Data Types?
    • There is some "dirty" data in the Data Type and Most Common Format columns. Performing some data cleaning upfront would improve the matching results.

