Snapshotting Data

JackMcG ⚪️

Hi all,

Any help with the following would be much appreciated.

My original dataset comes from proprietary software that is updated daily.

I have a dataset where I would like to calculate the total number of units we have month-on-month. Currently, I can calculate this as a Summary Number on a card, but I would like to be able to make a card from this data and utilise it.

The Summary Number is calculated by:

  • count of UniqueUnitID
  • filtered to remove where columnX = "REMOVED" and columnY = "NotApplicable"

columnX will hold a value until the unit is physically removed, at which point it is updated to "REMOVED" status. Currently, there is no date/time associated with the change from the previous value to "REMOVED", so it is difficult to track the Summary Number for each month.

Rows are also added on to the original dataset, but there is a DateAdded column that shows when this has happened. Each row has a unique value in the UniqueUnitID column.
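For concreteness, the Summary Number logic described above could be sketched like this, with pandas as a stand-in for the card calculation. Only the column names come from the post; the sample rows are invented:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset described above
df = pd.DataFrame({
    "UniqueUnitID": [1, 2, 3, 4],
    "columnX": ["Active", "REMOVED", "Active", "Active"],
    "columnY": ["OK", "OK", "NotApplicable", "OK"],
})

# Summary Number: count of UniqueUnitID after filtering out rows
# where columnX = "REMOVED" or columnY = "NotApplicable"
active = df[(df["columnX"] != "REMOVED") & (df["columnY"] != "NotApplicable")]
summary_number = active["UniqueUnitID"].count()
print(summary_number)  # 2
```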

I have been reading about recursive dataflows and believe this may be the way to go, but would appreciate any help in setting this up.

Thanks in advance and hope this is clear enough.

Jack

Answers

  • GrantSmith Indiana 🥷
    edited October 2021

    Looks like I need to wake up earlier to get my answers in before @amehdad :)

    I agree with @amehdad but wanted to add some clarification. The Magic ETL or MySQL recursive dataflows will allow you to filter possible duplicate records; however, the larger your dataset becomes, the slower / less performant the dataflow gets, because there's more data that needs to be imported and read in.

    Using Dataset Copy (with append) will be more performant, as it only appends the new records to your dataset, so there is less data to import and process. However, its processing is detached from the underlying dataset you're appending to. With a recursive dataflow, you can trigger it to run whenever the underlying dataset is updated; with Dataset Copy, you have to run it at a scheduled time, with the possibility that it runs before the underlying data has been updated.

    In your use case you'll need to utilize a recursive dataflow rather than Dataset Copy, as you need to process the data to update the status and add a datetime for when that status changed (i.e. when the unit was removed from the original dataset).

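A rough sketch of the recursive-dataflow pattern GrantSmith describes, using pandas as a stand-in for a Magic ETL / MySQL dataflow. The `recursive_update` helper and the `RemovedDate` column are hypothetical; `UniqueUnitID` and `columnX` come from the post:

```python
import pandas as pd

def recursive_update(history: pd.DataFrame, latest: pd.DataFrame,
                     today: str) -> pd.DataFrame:
    """One pass of a recursive-dataflow-style update (illustrative only).

    history: the dataflow's previous output (its own last run)
    latest:  today's copy of the underlying dataset
    """
    merged = history.merge(
        latest[["UniqueUnitID", "columnX"]],
        on="UniqueUnitID", how="left", suffixes=("", "_new"),
    )
    # Stamp a RemovedDate the first time a unit flips to "REMOVED"
    newly_removed = (merged["columnX_new"] == "REMOVED") & (merged["columnX"] != "REMOVED")
    merged.loc[newly_removed, "RemovedDate"] = today
    merged.loc[newly_removed, "columnX"] = "REMOVED"
    merged = merged.drop(columns=["columnX_new"])

    # Append rows that are new in the underlying dataset
    new_rows = latest[~latest["UniqueUnitID"].isin(history["UniqueUnitID"])].copy()
    new_rows["RemovedDate"] = None
    return pd.concat([merged, new_rows], ignore_index=True)
```

Each run the dataflow reads its own output back in, stamps any status changes, and appends new rows, which is what makes it "recursive".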
  • JackMcG ⚪️
    edited November 2021

    Hi both,

    Thank you for your replies on this. I have still been working through the problem.

    Currently, I have created a dataset connector that appends on update. I understand the issue with the appended dataset updating before the main dataset; however, the main dataset updates every night and I do not require a high level of granularity on the appended dataset, so I believe the performance and reliability of the dataset connector make this the best option for me.

    Currently I have the appended dataset updating every night, but the aim of the card is to track week-on-week, so I may be able to push this out.

    Next steps for me will be to remove duplicates where all fields are the same with the exception of _BATCH_ID_ and BATCH_LAST_RUN. I am looking to keep the earliest version of each row, as this will allow me to track what is moving into the "REMOVED" status week-on-week / month-on-month.

    I will update later with progress on this.

    Thanks,

    Jack
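The dedupe step Jack outlines (keep the earliest copy of each row, ignoring the batch metadata) might look like this in pandas. `_BATCH_ID_` and `BATCH_LAST_RUN` are the append-connector columns named above; the function itself is illustrative:

```python
import pandas as pd

def dedupe_keep_earliest(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates on every column except the batch metadata,
    keeping the earliest-appended copy of each row."""
    key_cols = [c for c in df.columns if c not in ("_BATCH_ID_", "BATCH_LAST_RUN")]
    # Sort so the earliest batch comes first, then keep the first duplicate
    return (df.sort_values("BATCH_LAST_RUN")
              .drop_duplicates(subset=key_cols, keep="first"))
```

A unit that flips to "REMOVED" survives as two rows (one per distinct value), so the week the earliest "REMOVED" copy arrived tells you roughly when the status changed.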

  • The recursive dataflow is super easy. The trick is to get the dataset set up and then append to it daily. I followed the help article Domo had for my Salesforce Leads object. It took a little finagling because the article doesn't explain where the original dataset comes from. Hint: you have to make it.


    It's saving snapshots of my data every night at midnight.
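A minimal sketch of that nightly-snapshot idea, again with pandas as a stand-in; the `SnapshotDate` column and the helper function are assumptions for illustration, not Domo's actual mechanism:

```python
import pandas as pd

def append_snapshot(history: pd.DataFrame, todays_data: pd.DataFrame,
                    snapshot_date: str) -> pd.DataFrame:
    """Tag today's copy of the dataset with a snapshot date and
    append it to the accumulated history (the recursive output)."""
    snap = todays_data.copy()
    snap["SnapshotDate"] = snapshot_date
    return pd.concat([history, snap], ignore_index=True)
```

Run nightly, the history grows by one dated copy of the dataset per day, so month-on-month counts can be computed by grouping on `SnapshotDate`.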