Remove Duplicates - Keep Most Current Instance?

I have a table of Opportunities from Salesforce that I am pulling into Domo, and Appending on a regular basis, for any Opportunities that were modified by a User or Automated Process in the last day. The purpose would be to have the most current records of all fields, for every Opportunity.

 

What I'm finding is that I most definitely need to de-duplicate in order to have accurate data. However, I'm concerned using the Duplicate Remover in MagicETL, on how to ensure that the Opportunity that is left in the Dataset, is the most current version.

 

What is the best way to approach this?

Best Answer

  • zcameron
    zcameron admin
    Accepted Answer

    The method I describe in the document I uploaded on that thread doesn't actually use the remove duplicates method for that reason. I'll post the doc again here. Take a look and see if it makes any sense for you.

     

    One note is that you can skip the SELECT COLUMNS transform if in your append transform you choose to include columns that come from the branch that contains your NEW data. It will then automatically remove all the column names with "_1" attached.

Answers

  • Echelon,

     

    I use a technique called a Recursive Dataflow to handle this type of situation. There's documentation in the Help Center about how to do this. Also, check out this thread in Dojo for a conversation about this topic and a document I posted with instructions on how to do it in Magic ETL.

     

    I hope that helps!

  • I did find that, and it has definitely got me in the right direction.

     

    One piece that I can't figure out though, is how does the de-duplicating determing which row to remove and which to leave?

     

    What I need to do, is to update the Table that I am creating, and replace it if there is an updated version of a row, with new information. I'm not sure how to remove the old row out of the Table.

     

    In my case I know I would use a unique identifier (ID in my case), but also a Modified Date. Whatever is older, I would filter/remove.

  • This did it - I'm 99% sure I'm pulling in the correct data now...only 1% not sure, because the rows don't EXACTLY match between what's directly in Salesforce, and what is being spit out through the ETL. But, I think it must be just from lag time between the systems.

     

    Thanks so much for the help!

  • Glad to hear it!

     

    Nice work!

  • Putting this out there for any future users, but I was looking at some of my datasets and it seems to have still not been working correctly. I've gone through the Recursive so many times, my head is spinning.

     

    BUT, I think I have found a much easier alternative with the recent Domo update of "Rank and Window". By appending a historical dataset with a newer data pull, I set the two tables to join in a Magic ETL. After, I ranked based off of the unique record ID, and then the SystemModDate in descending order. Then I brought in a Remove Duplicates based on the unique record ID, and I'm left with the most recent record in my dataset.

  • sdr
    sdr ⚪️

    Does this technique require starting with a dataset that's already free of duplicates? It will append rows that are not dupes, but it won't dedupe this original data. Is that right?