Harmonize Duplicates - Select best record based on score with multiple criteria

A customer posted this question to me:

"I want to pick a survivor using multiple score based criteria in the Harmonize step. So when all records in a cluster have the same survivor code ( a number I create) I need it then to pick the record with the newest date. 

 For example two records in a cluster might have survivor codes of 2 and 3 so harmonise will pick 2 as the survivor as I have the lowest first option checked. But if the two records both had a survivor code of 2, then I need Harmonize to look at a date field and select the newest"

A potential solution I can think of is as follows:

  • First, add a Sort step to sort the source data based on the preferred order; in this case, i sort by Cluster ID (ascending), Survivor code (ascending) , Date (descending) because i want the lowest survivor code and newest date first.
  • Add a Transform step to append a new Score column to the sorted data. Use the Current Row function to assign a value to the Score column. Since we have sorted the data, the row number will be an incremental number according to the preferred order.
  • Add the Harmonize Duplicates step, select best record, score-based, using the new Score column. Check on lowest score first.

The workflow should look something like this. Note that i appended a Report step at the end just to compare the results. You can substitute this with an Export step to export the results to a file or external system.

If you have an alternative or better solution, please share it with us.

Sign In or Register to comment.