Get unique records from datasets

Hi,

 

I have two reports for last and current month which has size, folder name, date as fields. There are duplicates between last month's report and current month's report. When I combine both the reports, the output dataset should not double up - meaning if there are same folder name from previous month report, it should just show if there was a difference between last report and this report.  For example, the folder called "Test1" did not change in size so no updates are needed for that folder. Can you please suggest me on this?

 

Thanks in advance

Best Answer

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤
    Accepted Answer

    @Khan_Naziya 

     

    you're using confusing language sorry, it's not clear to me if you're asking for an UPSERT scheme which keeps only the most recent version of each row or if you're looking for a partition scheme where you only keep the block of data from the most recent month.  OR if you're looking for a way to report if the value changed from one month to the next (slowly changing dimension).

     

    I'm sprinkling in what may sound like jargon b/c they are 'terms-of-art' with 20 years of history each.

     

    If you need UPSERT or Partitioning, you can look at a recursive dataflow pattern.  https://www.youtube.com/watch?v=JNQFZCj8JcQ

     

    The difference between UPSERT or Partitioning would just be how you define your RANK partition, do you make it at the granularity of one row, or month or folder_month.

     

    if you need slowly changing dimensions, you still implement the RANK to get the row version number, but then you add a LAG function to get the value last period.