Relationship Discovery Using the Find Duplicates Step in Aperture Data Studio
We tried to find join candidates across tables and databases using the profiling metadata in Aperture Data Studio. For the blocking keys in the Find Duplicates step, we chose attributes such as Most Common Format, Dominant Datatype, Standard Deviation, Average Length, Length Deviation, Frequency Deviation, Format Frequency Deviation, and Aperture Tags.
However, the predictions from Find Duplicates are not accurate enough. We think the reasons are that the rules for Find Duplicates are very basic and that the blocking keys need to be reconsidered. Could you please help us improve the accuracy?
I tried to drag our dataset and rules file into this post, but it showed "Request failed with status code 403". How can I share the files with you so that you can better understand the problem?
Thanks for your time.
Best Regards,
Nate
Answers
@Nathaniel I have responded to you via email with my observations. Let me know if I can post the files you sent me in this community portal.
@Sueann See Yes, you can
For everyone's benefit, attached are the files provided by Nathaniel with regards to his questions.
The following are some observations based on these files:
Blocking Keys
Consider removing the following blocking keys:
If performance is not an issue currently, you can always come back to tune the blocking keys later. The Analyze Blocking Keys feature in the Find Duplicates Workbench may help with this.
Find Duplicates Rules
Match.Close={ (Datatype.Close & MCF.Close & Tags.Close & ApertureTags.Close & LengthDeviation.Close & AverageLength.Close & FrequencyDeviation.Close & FormatFrequencyDeviation.Close & StandardDeviation.Close) | ColumnName.Close }
Because ColumnName.Close is joined to the other conditions with |, a close column-name match on its own is enough to satisfy Match.Close. For example, the rows for rowguid have quite different values for "most common format", but they will still be clustered together because the column name matches exactly. If this is not your desired outcome, consider removing the ColumnName condition from the rules or refining the match conditions.
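To make the effect of the ColumnName branch concrete, here is a minimal Python sketch of the boolean structure of the Match.Close rule above. It is only a model for reasoning about the rule, not how Aperture evaluates it; the function name match_close and the dictionary of per-attribute results are hypothetical.

```python
# Hypothetical model of the rule's boolean structure, not Aperture's engine.
PROFILE_ATTRIBUTES = [
    "Datatype", "MCF", "Tags", "ApertureTags", "LengthDeviation",
    "AverageLength", "FrequencyDeviation", "FormatFrequencyDeviation",
    "StandardDeviation",
]

def match_close(close):
    """close maps an attribute name to True if that attribute compared as Close."""
    # Left branch of the rule: every profiling attribute must be Close.
    all_profile_close = all(close.get(name, False) for name in PROFILE_ATTRIBUTES)
    # Right branch of the rule: ColumnName.Close on its own is sufficient.
    return all_profile_close or close.get("ColumnName", False)

# Two rowguid columns: same name, but no profiling attribute is Close.
rowguid_pair = {name: False for name in PROFILE_ATTRIBUTES}
rowguid_pair["ColumnName"] = True
print(match_close(rowguid_pair))  # True: the name alone satisfies the rule
```

If the name should only count as supporting evidence, one option is to combine it with at least one profiling condition using & instead of leaving it as a standalone | branch.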
Example:
#StandardDeviation.Generic_String.Close={NumericCompare[1]}
This means that numbers with a difference of up to ±1 will match. Note that NumericCompare does not accept decimal values as a parameter. If you want to allow a narrower range of difference, use the % option instead.
#StandardDeviation.Generic_String.Close={NumericCompare[1%]}
This means that numbers with a difference of up to ±1% will match.
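The difference between the two forms can be sketched in Python. This is an illustration only: the helper names are hypothetical, and the sketch assumes the percentage is measured relative to the larger of the two values, which may not be how NumericCompare defines it.

```python
def within_absolute(a, b, tolerance):
    """True if a and b differ by at most `tolerance` (like NumericCompare[1])."""
    return abs(a - b) <= tolerance

def within_percent(a, b, percent):
    """True if a and b differ by at most `percent` percent.

    Assumption: the percentage is taken relative to the larger magnitude.
    """
    largest = max(abs(a), abs(b))
    if largest == 0:
        return True  # both values are zero, so they are identical
    return abs(a - b) <= largest * percent / 100

# Small standard deviations: an absolute tolerance of 1 is too loose.
print(within_absolute(0.2, 0.9, 1))   # True
print(within_percent(0.2, 0.9, 1))    # False

# Large values: the same 1% allows a wider absolute gap.
print(within_absolute(1000, 1005, 1))  # False
print(within_percent(1000, 1005, 1))   # True
```

For profiling statistics that are often below 1, such as deviations, the percentage form is usually the stricter of the two.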