Smart Harmonization FAQ

Sueann See · September 2021

As of release 2.4.5, you are able to turn on a preview of a new feature that involves smart models utilizing machine learning for harmonization. The smart models can be applied as Column specific rules at the additional options for Harmonize duplicates.

What is harmonization?

Harmonization is used to merge, blend or reduce a group of records to a single row in the context of duplicate or similar records. The result is a single record which is commonly known as the survivor record or the golden record.

In Aperture Data Studio, harmonization can be performed using the Harmonize Duplicates step on the data resulting from the Find Duplicates step.

What does Harmonize duplicates involve?

Cluster ID - The grouping of records to be harmonized. This can be achieved using any column of data that carries common values across multiple rows. For the most optimum results, we recommend using the Cluster ID resulting from the Find Duplicates process.

Record processing - Determines how to process the records to be harmonized: Select best record, Select and complete best record or Merge best values.

Record/value priority - Determines the criteria that will be applied when selecting the best record or values.

When selecting the best record, you can opt for the Most Populated (most complete record), Total Length (longest record), or Score based approach (where a column of data is the criteria for the score, like a date, a value or unique record ID).

When merging values into a new and potentially entirely different record, this can be done by Most Populated (most number of columns populated based on a subset of columns from the record), Most Common (selecting the most common value where one exists) , Total Length (selecting the longest value), or Score based (by using a custom-written function to attempt to assign a quantitative ‘score’ to each value, allowing harmonization to select the highest scoring value).

There may not be an easy way to determine what is the best choice among the available supported record/value priority options as the results may not be suitable for certain cases. For example,

You've selected the longest value and it happens to be an outlier.
You've selected the most common value and yet it is in an abbreviated form that you are not expecting.
You want to set up a score, but not sure what criteria you can use to reliably determine the best record.

Ultimately we think there is a need is for a choice that makes use of the data in its entirety to propose a better outcome automatically, hence the concept of Smart harmonization is born.

What is Smart harmonization?

Smart Harmonization provides additional record/value priority options tuned to specific types of data (for example, contact names, data with an expected format or data with an expected list of values).

We believe this has the potential to lead to better results than any of the previously available options and reduces the need for defining custom scoring functions. In future the feature may be extended to reduce the need for custom scoring functions further, by providing a more user-friendly and accurate way for users to define complex decision rules.

How does Smart harmonization work?

Smart harmonization calculates many values to be used as input signals for decisions. A machine learning model is used to combine the signal results to generate a final recommended result.

The range of input signals is likely to grow over time as we continuously improve the model. Currently, this includes:

Various indications of value length (and therefore implied completeness), for example, number of characters, number of words.
Various similarity scores, for example, using the Soundex and edit distance algorithms. Note that one of the drawbacks of the existing “Most Common” value approach is that it relies on exact matching, i.e. given the four values “Chris”, “Christopher”, “Chris C.”, “Bob” it is just as likely to select “Bob” as any of the others.
Indication of frequency, for example, a count of distinct values in a clustered set.
Indications of distinct value frequency as well as value format frequency across the entire breadth of data. These signals are derived from profiling and are useful for selecting values that conform to the overall patterns in the data, for example, selecting “United Kingdom” as opposed to “UK” if the longer form is more prevalent and favoring postcodes that match the more frequent format. (Note: Profiling is used for formats and values model, but not for string similarity model where the values may vary too much across the dataset to be considered significant)

Each model is designed to use a subset of these signals depending on its purpose.

What Machine Learning algorithm does Smart Harmonization use?

Smart Harmonization uses a deep learning neural network to assess the strongest candidates. The configuration of the network will likely change and may become more complex over time to achieve better results. Once candidates have been assessed by machine learning, they are passed through a post processing phase to determine the overall winner.

Isn’t there a risk that Smart harmonization would generate the 'wrong' results?

Consider different ways in which a value can be 'wrong':

Combining values from the wrong records. For example, the output of harmonization contains some values relating to person A, and others from person B. This is clearly wrong, but represents an error with the previous Find Duplicates step, rather the Harmonize step itself. If you’re not confident that the find duplicates process is performing well enough, it’s not safe to use any form of harmonization; smart or otherwise.
Selecting one of many broadly similar values. For the types of data that smart harmonization supports today (for example, person names), it’ very frequent that all the values are similar, with differences being the presence or absence of name components such as the middle names, abbreviations or nicknames. In these cases some of the values may be preferable to others (for example, full name vs initial), but none are "wrong" as such. Based on our testing, smart harmonization agreed with our subjectively-assessed "best value" more often than the other methods (Longest or Most Common) did. What we mean by “best value” here is “if I was performing this harmonization manually and had to pick a value to use, I’d go with this one”.
Selecting between obviously conflicting values. For cases where a group of records contain dissimilar values (for example, due to a change of surname), smart harmonization can use fuzzy matching to select the "most common" values, even if there are variations like in terms of spelling or completeness. If this is not enough to decide (for example in the case where there are only two values present, so they’re both equally common), then it is harder for the process to select the "best value". This would also be true for the other harmonization methods, or even if harmonization was being done manually. In addition, note that if Find Duplicates is working correctly, none of the values for a column like a person name should be totally "wrong"; – at worst it’s likely to be outdated, or is a nickname.

We have ideas for improving the situation further by including more inputs when making decisions such as by allowing smart harmonization to make use of the "last update" date to consider the more recent record.

We will need feedback on the wider concept before investing in this.

Can’t I already get the correct results with score based harmonization?

We see score-based matching as complementary to the standard harmonization step. We are exploring ways the two can be used together to provide the best possible proposed results. In the immediate term, smart harmonization is probably more suitable for situations where you would previously have used “most common” or “longest”, rather than a replacement for situations that require custom scoring, especially when you can reliably define the criteria for the "best value".

The wider vision for smart harmonization would allow us to improve on what is possible with score functions today. Consider the following commonly use score-based criteria.

Prefer records from Source A over Source B This rule may not always represent the best outcome. For example, if Source A contains better data than Source B 90% of the time, a simple source-based rule like this will still be making sub-optimal decisions in 10% of cases. Additionally, this rule doesn’t help when Source A contains duplicate data, and you still need a fallback criteria to decide on the best record or best value.
Last modified date also has limitations. Do all sources contain such a column? Is there a modification-date per column, or is it typically per record? Do you risk thinking a value is more up-to-date than it actually is? Does your scoring function have different scoring tiers for a record that is 5 years older than another, vs 5 minutes?

In the medium term, it is possible to extend smart harmonization to allow users to blend a variety of inputs (such as source systems preference, relative recency, similarity of values, adherence to common formats or patterns) and use machine learning to identify the best way to combine and prioritize those signals.

Another option we can consider is to expose some of the input signals calculated by the smart harmonization module, and allow users to define exactly how they are combined, for example, being able to determine the weight of each input.

Does Smart harmonization only works for names? Can customers train their own models?

The current models we distribute are as follows:

String similarity model - This model has been trained with person names and will work reasonably well with company names, product names or other generic strings as well.
Formats model - Suggested for columns that adhere to a certain format, for example, Country (that could be in ISO2, ISO3 or the long name)
Values model - Suggested for columns that should have a list of valid values, for example, Title (should be Mr, Miss, Mrs, Dr)

As of today customers cannot train their own models, although this is something we could support in the future if there is demand.

I’ve tested this, and some of the results look wrong. What should I do?

Send us the examples! Whenever machine learning is used, there is opportunity for more data to lead to better results, so we’re keen to find examples where we’re not yet getting it perfect. We have built ‘versioning’ into smart harmonization, so that as our models improve, we can distribute improved versions (and you will be able to opt in or out to automatically using the latest and greatest version of each model). The feedback for Smart harmonization is currently tracked in REQUEST 197839.

Limit on number of score pairs

One of the reasons you may be seeing inaccurate results is due to a limit on the number of score pairs allowed that is set to 50 currently. The limit is there to prevent any unforeseen issues in terms of performance. We would like to hear your feedback on whether this is a reasonable limit.

What’s the current status of the feature? Is it ready to be used in production?

It’s ready for demonstration and discovery with selected customers or prospects. However, we want to retain flexibility to update the design based on the feedback we get during the preview period. At the moment, we would advise against building production workflows that uses smart harmonization, just in case we need to make non-backwards-compatible design changes.

Are there any demo datasets available?

We have a limited number of datasets we used for development and testing. However, it is best you test the feature with real world customer data as available so that we discover and improve any other scenarios that have not been sufficiently covered before we make this feature available to our customers.

The test data we have is somewhat contrived but does show examples where we think the smart models provide value over existing methods. These datasets have been created based on the assumption that there may be other columns that have caused the records to be matched or clustered together. The test datasets are as per attached.

TestStringSimilarity.csv

TestFormatsValues.csv

Smart Harmonize Demo Data.csv

The TestStringSimilarity file is intended to showcase the string similarity model on person names.
The TestFormatsValues file contains columns where the formats and values models are applicable such as UID, Title, Country.
The Smart Harmonize Demo Data file contains columns where you can apply the 3 different models - values model to Discount Code, string similarity model to Forename, Surname and Company Name (each as a separate column group), formats model to NINO.

Smart Harmonization FAQ

Categories