Building Rules
In order to start testing the Find Duplicates step, the Find Duplicates settings will also need to have a ruleset defined in addition to the blocking keys. When building rules, we will have to think about the following:
How the Ruleset relates to Blocking Keys
- Blocking keys identifies potential matches. Rules determines if the records really match and how closely they match.
- Records are only compared with the rules if they fall within one of the blocks (groups) defined by the blocking keys.
- In other words, if the records are not blocked, the rules will not have any effect.
For example, for a dataset containing First Name, Last Name and Date of Birth, you've decided to use the following as blocking keys:
- First Name, Last Name, Date of Birth (All Columns)
- Last Name, Date of Birth
Then, at the very least the record must have the same Last Name and Date of Birth in order to be further compared with the rules.
Country
Setting a default country allows the Find Duplicates step to know what country to use when processing and standardizing data.
@default.country=GBR
This default country will only be applied if
- you're not using a country tag when mapping your data columns or
- the country field for a record is blank.
Similar to the blocking keys, the country code can be GBR, USA or AUS. If your business is not based in these regions, choose the one nearest to you or the nature of your data.
Match levels & aliases
Aperture Data Studio allows you to define up to 4 match levels, represented by L0, L1, L2, L3 within the rules. You must have at least one match level. You can override these level names with your own custom names or aliases. For example, defining them as Exact, Close, Probable and Possible match.
/* * Aliases */ define Exact as L0 define Close as L1 define Probable as L2 define Possible as L3
Note: Within the rules, you can specify comments to help keep notes about the rules. /*
represents the start of a comment, */
represents the end of a comment. If a comment spans multiple lines, each line would need to be prefixed with *
Match criteria for each level
For each of the match levels, you will need to define the rules that determines how records will be matched. The most straightforward one to start with is the Exact Match level. This also helps if you are just wanting to test or confirm your blocking keys for a start and need to have at least one match level/rule to start with. Let's take a look at an example.
Defining Exact Match
Assuming we have a dataset containing First Name, Last Name and Date of Birth. How do we decide if the records are really an exact match?
Consider the following questions:
- Do we want to accept any form of differences in the column values for an exact match? (for example, what if one of the records only have the first name entered as an initial i.e. J Doe instead of John Doe)
- Are there any circumstances where we will need to compare different sets of columns for an exact match? (for example, date of birth may be useful for matching, but what if some records do not have the date of birth populated?)
In the simplest form, exact match is usually where all the available input column values match. This can be expressed as a Match Rule and single Element Rule as follows:
/* Match Rule */ Match.Exact={Hash.Exact} /* Element Rule */ Hash.Exact={[ExactMatch]}
- The Match Rule defines defining an overall match between two records and is made up of reference to other rules. In this case, it is referencing the Hash Exact element rule. Note that at least one match rule must be defined.
- The Element Rule is the most granular rule used to specify how to compare individual elements within a record. The available elements are listed here.
- In this case, Hash is a special element that represents the hash value of all the input column values and can be used in this case since the exact match criteria involves all the input columns. The hash value is like a unique identifier for the record.
- The right hand side of the rule is always surrounded by curly brackets { }.
The Exact Match criteria we have defined here requires First Name, Last Name and Date of Birth to be the same. This confirms that Block 1 formed by the blocking keys is indeed useful to identify potential records that are eligible for close match. Since Block 2 is a subset of Block 1, it may appear that Block 2 is redundant at this point, but let's see how this proves useful for other match levels.
Defining Close, Probably, Possible Match
It is entirely up to you whether you choose to define other non-exact match levels. The benefit of doing so is to be able to potentially identify and link records where there may be some degree of differences, but the records actually represent the same entity. Let's attempt to define the Close Match criteria now.
Since we only use exact values for the Exact Match, think of what differences we can tolerate for a close match. For example, assume we can accept the following first name variations, provided the Date of Birth is the same.
- First name as initials (i.e. J Doe & John Doe)
- Inverted first name tokens (i.e. Sarah Jane Smith vs Jane Sarah Smith)
This can be expressed as follows:
/* Match Rule */ Match.Close={Name.Close & DOB.Exact} /* Element Rule */ Name.Close = {(ForenameCompare[InitialVsFullName] | ForenameCompare[InvertedNameMatch]) & Surname[ExactMatch]} DOB.Exact = {Date[ExactMatch]}
- The right hand side of the rules may contain the logical operator
&
(and) or|
(or). For example, this is used to combine multiple element rules in order to indicate that a close match is obtained when the name is a close match and date of birth is an exact match. - Expressions may be nested and logical operators combined (parentheses are required). For example, this is use in the Name element rule to group the multiple forenames match conditions and have them evaluated in addition to the surname match condition.
- ForenameCompare is one of the many comparators that are available to enable more flexible name comparisons. [InitialVsFullName] and [InvertedNameMatch] are 2 of the possible results when using ForenameCompare. There are more as follows:
- For readability and reusability purposes, it is possible to group element rules with Theme rules. Theme rules are made up of references to other rules. For example:
Instead of bundling the match conditions all in the Name element rule,
/* Element Rule */ Name.Close = {(ForenameCompare[InitialVsFullName] | ForenameCompare[InvertedNameMatch]) & Surname[ExactMatch]} DOB.Exact = {Date[ExactMatch]}
reference the Forenames and Surname element rule:
/*Theme Rule*/ Name.Close = {Forenames.Close & Surname.Exact} /* Element Rule */ Forenames.Close = {ForenameCompare[InitialVsFullName] | ForenameCompare[InvertedNameMatch]} Surname.Exact = {Surname[ExactMatch]} DOB.Exact = {Date[ExactMatch]}
The Close Match criteria we have defined here requires Last Name and Date of Birth to be the same while allowing for some flexibility on First Name. This confirms that Block 2 formed by the blocking keys is indeed useful to identify potential records that are eligible for close match.
Here is the overall Find Duplicates ruleset we have defined so far.
@default.country=GBR /* * Aliases */ define Exact as L0 define Close as L1 define Probable as L2 define Possible as L3 /* Match Rule */ Match.Exact={Hash.Exact} Match.Close={Name.Close & DOB.Exact} /*Theme Rule*/ Hash.Exact={[ExactMatch]} Name.Close = {Forenames.Close & Surname.Exact} /* Element Rule */ Forenames.Close = {ForenameCompare[InitialVsFullName] | ForenameCompare[InvertedNameMatch]} Surname.Exact = {Surname[ExactMatch]} DOB.Exact = {Date[ExactMatch]}
Defining Probable and Possible matches
We will not specifically cover the Probable and Possible matches in this post, but essentially, the way to define the other match levels is pretty much the same. Think about:
- Structure and syntax: Is it a match rule? Is it an element rule? Have you included curly braces
{}
where required? Have you included parentheses()
where required especially when using&
(and) and|
(or) operators in an expression? - What are the elements you want to match? Have you used the appropriate element name?
- What comparators and results you want to use?
- What have you covered in the other match levels and what are the remaining possibilities? How much can you tolerate for the rules to be looser?
- Can the potential matches be found based on the blocks formed by the blocking keys? (If not, you'll have to look at tuning blocking keys)