Create our own Elements for Find Duplicates Step
Hi Team,
I'd like to ask some questions about the Find Duplicates step. Is this step only for CONTACT DATA, like Name, Address, Email and Phone Number? There are some built-in elements; could I also create my own?
We have an integrated profiled table whose attributes come from different tables in different databases. We added the table name, schema name and database name to the integrated table, and we planned to find the join candidates (the attributes) using the Find Duplicates step. However, after going through the documentation, I realised that it seems to be only for CONTACT DATA. Did I miss something, or could you please give me some suggestions?
Best regards,
Nate
Answers
@Nathaniel Find Duplicates is optimized for contact data because it includes a standardization process with built-in knowledge of contact data. For example, when Find Duplicates runs, additional versions of your elements can be created to help with the comparison. These are known as modifiers. Modifiers can correct, enhance or derive many known terms that appear in the input. A DERIVED modifier, for instance, may be created when an element was not present in the input but the standardization process was able to determine its value (e.g. COUNTY can sometimes be derived from the LOCALITY and POSTCODE input). In addition, there are comparators designed specifically for names and for the components of an address.
Having said that, depending on how you want to compare those elements, you could still use Find Duplicates for your own elements, just without the benefit of the standardization process. If you look at the list of allowed elements in the documentation here, there is one for generic strings that you can use. Besides exact matching, you can use fuzzy matching algorithms like Levenshtein and Jaro-Winkler. Do note that when a Dataset contains multiple generic strings that you want to map, you will have to assign them to groups so that you can define a different ruleset for each group, even though they are all generic strings.
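To give a rough feel for what a Levenshtein comparison does with attribute names, here is a minimal standalone Python sketch. It is not how Find Duplicates implements its comparators; the function names `levenshtein`, `similarity` and `normalise` are made up for illustration, and the idea of stripping case and punctuation before comparing is an assumption about what might suit column names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def normalise(name: str) -> str:
    """Hypothetical clean-up step: lower-case and keep letters/digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())
```

For example, `similarity(normalise("customer_id"), normalise("CustomerID"))` scores the two names as identical once case and the underscore are ignored, while names that differ by a few characters get a score a little below 1. Where you would set a match threshold depends on your data.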
@Nathaniel as an alternative to Find Duplicates, you may want to consider using the Lookup step or Lookup functions. Lookup functions allow for regular expression matching. When you say join the attributes, what are the possible different values of the attributes that you are looking at? Would it be something like:
Email Address vs Email
Cell Phone vs Mobile vs Cell
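If the variants really do look like the pairs above, the regular expression idea can be sketched in plain Python as follows. This only illustrates the matching logic and is not the syntax of the Lookup step or Lookup functions; the `SYNONYM_PATTERNS` table, the canonical labels and the `canonical_name` helper are all hypothetical.

```python
import re

# Hypothetical patterns mapping attribute-name variants to one canonical label.
SYNONYM_PATTERNS = {
    "email": re.compile(r"^e[-_ ]?mail([-_ ]?address)?$", re.IGNORECASE),
    "mobile_phone": re.compile(r"^(cell|mobile)([-_ ]?phone)?$", re.IGNORECASE),
}


def canonical_name(attribute):
    """Return the canonical label for an attribute name, or None if no
    pattern matches."""
    cleaned = attribute.strip()
    for canonical, pattern in SYNONYM_PATTERNS.items():
        if pattern.match(cleaned):
            return canonical
    return None
```

With these patterns, "Email Address" and "Email" both map to `email`, and "Cell Phone", "Mobile" and "Cell" all map to `mobile_phone`. Attributes that return `None` would need another approach, such as fuzzy matching.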
Hi @Sueann See, we have thousands of attributes, so we are not yet clear about the value formats. I am going to use the generic string element and see how it goes. Thanks for your time!
Wanted to highlight this discussion, which covers an approach that might be relevant here: