How does Remove Duplicates actually work?

I currently use Remove Duplicates in Magic ETL, for example to remove rows that share the same ID number.

The thing is, I am not sure which rows Domo removes when I use Remove Duplicates. Here is what I need to confirm:

1. When Remove Duplicates runs, which rows are removed? Does it keep the last row in the dataset?

2. If so, when I append datasets and there are 4 datasets to append, which one is appended last, and what is the order?

3. The Domo help page says that Remove Duplicates only works if all columns in a row have the same values as the duplicate row. However, when I experimented with two rows that share the same ID, Remove Duplicates removed one of them even though the values in the other columns were different. (This is the result I expected, though, since I want to remove rows with the same ID even when the other columns differ.)

 

++: If Domo could add a feature that lets us use append with a "primary key"-like option in the default data update, that would be even more awesome! (Currently the only update choices are replace and append.)

 

Thanks DOMO, awesome product btw

Comments

  • Hi all,

    Can anybody help out @imam_ar?

    Thanks!

  • Hello imam_ar,

    I sent an email to you about this issue but have not heard back.

    If this is still a concern please let me know by replying to that email. If this is no longer an issue please mark this issue as resolved.

    Thank you,

    -Tyler C.

    Would you please also include the response in this thread, to help people (like me) who are wondering the same thing?

     

    It would be very helpful to see the explanation in the same thread where someone has asked the same or a similar question.

     

    Thank you,

    +Spencer

     

    @imam_ar

     

    EDIT: I think I found a solution that fits my needs, here "https://dojo.domo.com/t5/Beast-Mode-ETL-Dataflow/How-can-I-add-only-unique-values-from-one-dataset-to-another/td-p/16549"

     

    The answer is the one by ilikenno with the screen shots. zcameron's response is the same concept, I think, but ilikenno's was more helpful to me with screen shots for the ETL.

     

    Hope it helps.

  • Please email me the same thing you emailed this person. I need more help on this and would like to understand more about this.

  • Can we please have the response email in this thread? I had some issues, and a detailed explanation would have been helpful.

  • This is awesome, thank you!

  • Ditto. This was very helpful, Mr. Clean! Thanks for taking the time to do this.

  • jaeW_at_Onyx

    RANK & WINDOW + Filter will initially output the same number of rows that came in.

    If Window Functions are not necessary and you just need a SELECT DISTINCT equivalent, REMOVE DUPLICATES is OK, except that it keeps all your columns. A good alternative (but equally slow to process) would be a GROUP BY with a Min / Max or a row count.
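    To make the GROUP BY alternative concrete, here is a rough sketch in plain Python (not Domo code). The sample rows and the column names `id` and `amount` are made up for illustration. It collapses the input to one row per `id` and carries a row count plus Min / Max aggregates instead of the original columns:

```python
from collections import defaultdict

# Hypothetical sample rows; "id" is the column we want to be unique.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 7},
    {"id": 1, "amount": 30},
]

def group_by_id(rows):
    """Collapse rows to one per id, with a row count and min/max of amount."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["id"]].append(row["amount"])
    # One output row per distinct id, in order of first appearance.
    return [
        {
            "id": key,
            "row_count": len(values),
            "min_amount": min(values),
            "max_amount": max(values),
        }
        for key, values in groups.items()
    ]

grouped = group_by_id(rows)
```

    A `row_count` greater than 1 tells you which ids had duplicates, which a plain de-duplication step would hide.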


    Technically, I would expect GROUP BY to be marginally slower than REMOVE DUPLICATES because it is a true blocking function (I cannot know the results until I have processed all the rows), whereas REMOVE DUPLICATES could be a pass-through function: once I know a value is unique, I can pass it onward through the ETL pipeline.
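    The pass-through idea can be sketched with a Python generator. This is only an illustration of the reasoning, not how Domo implements the tile, and it assumes the first row seen for each key is the one kept (the thread does not confirm which row Domo keeps). The names `remove_duplicates`, `source`, and `consumed` are hypothetical:

```python
def remove_duplicates(rows, key):
    """Yield the first row seen for each key value; drop later rows with that key."""
    seen = set()
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            yield row  # emitted immediately, before the rest of the input is read

consumed = []  # records which input rows have been read so far

def source():
    """Hypothetical upstream step that logs each row as it is pulled."""
    for row in [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]:
        consumed.append(row["id"])
        yield row

stream = remove_duplicates(source(), "id")
first = next(stream)                          # first unique row comes out...
rows_read_before_first_output = len(consumed) # ...after reading a single input row
rest = list(stream)                           # drain the remaining rows
```

    A GROUP BY cannot behave this way, because an aggregate such as a count or Max is only final once the last input row has been seen.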


    If performance is not as important as documenting the thought process, then choose the right tool for the job.

    RANK & WINDOW + FILTER if you're looking for the first / last value.

    GROUP BY for dimensional datasets or aggregations.

    REMOVE DUPLICATES if you're feeling lazy / need performance ;)
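    For the first / last value case, the RANK & WINDOW + FILTER pattern looks roughly like this in plain Python. It is a sketch under assumptions: the columns `id`, `updated`, and `status` are invented, and the numbering behaves like a row number (ties are not given equal rank). Note that the ranking step outputs as many rows as it received; only the filter reduces the row count:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; two rows share id 1 but differ in other columns.
rows = [
    {"id": 1, "updated": "2020-01-01", "status": "old"},
    {"id": 2, "updated": "2020-01-05", "status": "only"},
    {"id": 1, "updated": "2020-02-01", "status": "new"},
]

def rank_rows(rows, partition_key, order_key):
    """Add a 'rank' column: 1 is the newest row within each partition."""
    ranked = []
    # groupby needs its input sorted by the partition key.
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        newest_first = sorted(group, key=itemgetter(order_key), reverse=True)
        for rank, row in enumerate(newest_first, start=1):
            ranked.append({**row, "rank": rank})
    return ranked

ranked = rank_rows(rows, "id", "updated")             # same row count as the input
latest = [row for row in ranked if row["rank"] == 1]  # the filter step
```

    Unlike a plain de-duplication, this makes the choice of surviving row explicit: you decide the ordering column and direction yourself.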

    Jae Wilson