Updating multiple keys after de-dup

Good Morning,


I have recently ran a de-duplication process and have a good set of cluster ID's. Within the application people / customers are given a unique number and this number should stay with them every time that they are entered into the database by people selecting them from a drop down list. Sadly however sometimes people don't find the person after a search and create a secondary, third or even fourth iteration - hense me de-duplicating.

What I want to be able to do however is retain one of the ID's that already exists however, either the first or last created.

example :


So I had been thinking about assessing which record was the oldest or newest, and I could potentially group all the cluster ID's together but i am having trouble working out how to get to where I want to be. In the example above cluster ID 1 should have their id's all overwritten to now be either 1 (oldest) or 24(newest).

Thanks again everyone.

S

Best Answer

Answers

  • Clinton JonesClinton Jones Experian Elite

    Hi @stevenmckinnon , are you using the harmonize step as part of your process?

  • stevenmckinnonstevenmckinnon Contributor

    Hi Clinton, yes i harmonized at the end of the process. I scored the harmonize based on the newest record.

  • Clinton JonesClinton Jones Experian Elite

    So assuming you harmonized based on cluster ID and nominated the winner as the oldeest record you should have landed up with just three records.

    Looking at your dataset though, I notice that within the clusters there are duplicates, is there a particular reason you wouldn't eliminate those before your run your find duplicates step by using a group on id step?


  • Clinton JonesClinton Jones Experian Elite

    actually looking at this again, it might be an idea to run a harmonize on the data instead of a group step, before you run duplicates and use the ID as the clusterID and then run a second harmonize after the find duplicates

  • stevenmckinnonstevenmckinnon Contributor

    Hi Clinton, the running the harmonize step prior to the duplication step makes sense and I have added that it now. You are correct that those were the rows that the final harmonize step lef tme with but it is the next part that I am struggling with.


    I now want to update all of the records in cluster id 1 for example to all have the id of 1. All of the ones, as per the example above, from cluster id 2 should have the id 15 (not just the two highlighted). This way when I export it back into the original database everyone who was in cluster id 1 all have the same id number.

  • stevenmckinnonstevenmckinnon Contributor

    Hi Clinton, not going to lie the last step confuses me no end but I replicated and its works perfectly. I really like the multi view as well, lets me see exactly what has happened across each of the steps. I am going to build in a function to export the proposed results to allow a bit of QA and testing to be carried out.


    Thanks again.


    S

  • Clinton JonesClinton Jones Experian Elite

    Hi Steve, does the join confuse ?

Sign In or Register to comment.