Help with recursive dataflow

Hello, Experts,

 

I am looking to create a data flow that will constantly update data from a dataset while keeping a historical dataset in play.

 

For this scenario, let's call:

  • Historical Dataset (MySQL connector) - Dataset A
  • Update Dataset (MySQL connector) - Dataset B

If you could, please let me know if I am on the right track here or if I need to pivot.

Plan:

  • Create a MySQL data connector that runs select * from table join second_table to populate the full historical dataset for Dataset A
  • Create a Magic ETL that absorbs that data into an output dataset called Dataset_A_Output
  • Create a MySQL data connector that runs select * from table join second_table where lookback_date >= 3 weeks ago to populate Dataset B with new data on an ongoing basis, to be used for recursion
  • Create a Magic ETL that brings in Dataset_A_Output as an input > brings in Dataset B > groups by lookback_date and creates two fields for the min date and max date
  • Join Dataset_A_Output with Dataset B (or should I?)
  • Create a filter to filter out rows earlier than the min date and later than the max date
  • Append the new, updating Dataset B to the rest of the data flow and output to Prod_dataset_final

The goal is to bring in the historical dataset once (since it will never change again) and then use new data going forward to update the dataset for any changes within the last 3 weeks.
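A minimal sketch of the filter-and-append logic in that plan, in plain Python standing in for the Magic ETL tiles (the sample rows are illustrative; the ids and dates are assumptions):

```python
from datetime import date

def recursive_update(historical, new_rows, date_field="lookback_date"):
    """Drop historical rows that fall inside the new data's date window
    (the min/max-date filter), then append the new rows."""
    if not new_rows:
        return list(historical)
    min_date = min(r[date_field] for r in new_rows)
    max_date = max(r[date_field] for r in new_rows)
    kept = [r for r in historical
            if r[date_field] < min_date or r[date_field] > max_date]
    return kept + list(new_rows)

# Dataset_A_Output (historical) and Dataset B (last 3 weeks), illustrative rows
dataset_a = [{"id": 1, "lookback_date": date(2021, 1, 1)},
             {"id": 2, "lookback_date": date(2021, 1, 20)}]
dataset_b = [{"id": 2, "lookback_date": date(2021, 1, 20)},
             {"id": 3, "lookback_date": date(2021, 1, 21)}]
result = recursive_update(dataset_a, dataset_b)  # id 2 is refreshed from B
```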

 

Let me know if you can help. I greatly appreciate it. Attached is a proposed data flow diagram.

 

Isaiah

 

 

Comments

  • GrantSmith
    GrantSmith Indiana 🔴

    Hi @imelendez 

     

    Will the historical dataset increase in size or is it static? (Will it pull in 3 weeks + 1 day ago or be stuck at the end date where you first pulled the historical data?)

     

    What is your lookback_date? Is that going to be the current date or are you allowing the user to filter based on that date?

     

    If it's based on the current date, then typically what I'd do is have two dataflows: a historical version and a live version.

     

    (In my example the data can continue to update / change for 28 days)

    The historical version would pull the data via SQL from 29 days ago.

    select *
    from `table`
    where DATE(`date`) = current_date - interval '29' day

    I then make sure I set the dataset to append rather than replace. This allows us to save time by not having to import the entire dataset over and over (and it should make querying faster, assuming you have the proper indexes / partitions in place).

     

    The live dataset pulls the data from the last 28 days (typically I'll exclude the current date since it isn't finished yet and it only runs once a day)

    select *
    from `table`
    where DATE(`date`) >= current_date - interval '28' day
    and DATE(`date`) < current_date

    I set this one to be replace so it'll overwrite the current dataset and keep things up to date.

     

    Once those are in place I'll do a simple dataflow to append the two datasets together.
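    A plain-Python simulation of this append-mode historical / replace-mode live pattern (the run functions and sample rows are illustrative, not Domo's API):

```python
historical = []  # append-mode dataset: each run adds rows
live = []        # replace-mode dataset: each run overwrites

def run_historical(rows_from_29_days_ago):
    # append mode: each daily run adds the day that just aged out of the live window
    historical.extend(rows_from_29_days_ago)

def run_live(rows_from_last_28_days):
    # replace mode: overwrite with the current 28-day window
    live.clear()
    live.extend(rows_from_last_28_days)

def final_dataflow():
    # the simple dataflow that appends the two datasets together
    return historical + live

run_historical([{"id": 1}])
run_live([{"id": 2}, {"id": 3}])
run_live([{"id": 2, "amount": 5}, {"id": 3}])  # next day's replace run
combined = final_dataflow()
```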

  • Hi, @GrantSmith ,

     

    Thank you for taking the time out of your day to reply and help me out. I am still pretty novice at the whole ETL process and get confused with some of the do's and don'ts regarding recursion.

     

    To answer your question in detail, Dataset A (historical dataset) will be a one-time pull - it is static and will not increase. What I anticipate doing is something like this:

     

    Dataset A MySQL Connector:

     

    select * from table

    Are you saying then that Dataset B (live version) needs to be created and it should be my more refined/filtered data connector?

     

    I was thinking something like this:

     

    Dataset B MySQL Connector

     

    select * from table
    where migrated_data = 0
      and posted_date >= current_date - interval 21 day

    Questions I have:

    • I need to make the Dataset B MySQL connector a replace setting, correct?
    • I need to make the Dataset A MySQL connector a replace as well, but set it to a manual run rather than a scheduled run? The reason I ask is that, in the event we need to repopulate the historical data for any reason, we can (I doubt this will happen, but just in case).
    • What would the ETL structure for something like this look like, in your experience?

    A million thanks!

     

    Isaiah

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🟤

    @GrantSmith gave a very nice walkthrough, so there's not too much more to add. If you haven't looked at the KB article, it gives a nice step-by-step guide that may help you.

    https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_SQL_DataFlow

    You can also do recursive dataflows in Magic ETL. Here is the KB article for Magic ETL.

    https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_ETL_DataFlow

  • GrantSmith

    @imelendez 

     

    If you're not looking to keep the historical dataset updated then what you're stating should work. You'll have the two input datasets (A and B) and then tie them together with an append dataset tile which then outputs to an Output Dataset tile for your Magic ETL dataflow.

  • @GrantSmith , thanks for the confirmation.

     

    Is this a more realistic approach, in terms of the ETL dataflow?

     

    The image is attached.

     

    A million thanks!

     

    Isaiah

  • GrantSmith

    @imelendez  I'm a little perplexed by your data. Does your historical dataset have data in the future as well? 

     

    Are you attempting to get data from set A which is within the bounds of the dates found in dataset B?

  • Yeah @GrantSmith so, dataset A should not have future dates as it is static. 

    What I am trying to do is loop in a smaller subset of data from dataset B and check whether it exists in dataset A. If it does, filter it out (to avoid dupes); if it doesn't, then append it.

     

    I did not create this data flow. 

     

    Does that provide some more clarity?

  • GrantSmith

    So if it exists in B use B otherwise use A?

  • I am sorry, I might just be confusing myself.

     

    What I'm after is appending the historical data (A) and the new data changes from live (B) into the final output.

  • I think rephrasing it as follows will give more clarity:

    if it does, filter it out from A (to avoid dupes), if it doesn't, then append.

     

    Your workflow will have one new B and two versions of dataset A.

    You will have to use a left join to mark the duplicate values (say, with a "Duplicate" flag), then filter out the duplicate values and append the rest of the values, along with the new data (B), into dataset A.
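    A sketch of this left-join dedup in plain Python (a set membership check stands in for the join tile; the key column `id` is an assumption):

```python
def dedupe_and_append(dataset_a, dataset_b, key="id"):
    """Mark rows of A whose key also appears in B as duplicates,
    filter them out, then append all of B."""
    b_keys = {row[key] for row in dataset_b}
    a_kept = [row for row in dataset_a if row[key] not in b_keys]
    return a_kept + list(dataset_b)

a_rows = [{"id": 1, "v": "old"}, {"id": 2, "v": "old"}]
b_rows = [{"id": 2, "v": "new"}, {"id": 3, "v": "new"}]
out = dedupe_and_append(a_rows, b_rows)  # id 2 keeps the B version
```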

     

    Please let me know if you have any questions.

     

     

     

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    Sorry, late to the game! But here's my ten cents: don't build a recursive dataflow.

     

    Yes, a recursive dataflow will work for your use case, but recursive dataflows don't scale with data volume. If we're talking 100M+ row datasets, you'll start waiting in excess of 20 minutes; I've seen 5, 10, and 23.999 hours (23.999 b/c Domo kills dataflows that take longer than a day).

     

    RDs are fine with small datasets, but if it's going to get huge, you'll want to do something else.

     

    Why don't RDs scale?

    B/c every time you run an RD dataflow, the ETL has to transfer your entire dataset to the ETL engine, transform the dataset, then write the entire dataset to disk (Vault), then transfer the dataset to Domo's database layer (Adrenaline).

     

    Imagine you have a 100M row historical dataset, and a 10k row new_data.  You have to transform 100M+10k rows... every day.  This can get ugly fast.

     

    TLDR:  the better solution

    Ask your CSM about Data Assembler or UPSERT.  

     

    With Data Assembler, your historical dataset gets chunked into partitions (imagine old-school encyclopedia books partitioned by date). When you get new data, Data Assembler asks, "what partitions (new dates) exist in my new data?" If a partition already exists in the historical data, it DROPs the existing partition and replaces it with the new partition.

    Domo can run this operation much faster because it only has to delete the encyclopedia volume and replace it with a new volume. Because the volumes are stored as separate files, it's effectively just a case of replacing a file.

     

    UPSERT works like a MERGE statement in SQL, if your historical_data table has a 'primary key'. When you load new data, Domo asks, "does this primary key already exist in my historical table?" If yes, it replaces the row; if no, it creates a new row.

    https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15

     

    Both UPSERT and DA are much faster than recursive dataflows because you write significantly less data as your data volumes get larger and larger.
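    A sketch of that UPSERT/MERGE behavior in plain Python (a dict keyed on the primary key; the column names are illustrative):

```python
def upsert(historical, new_rows, pk="id"):
    """Replace rows whose primary key already exists; insert the rest."""
    merged = {row[pk]: row for row in historical}
    for row in new_rows:
        merged[row[pk]] = row  # replace if the key exists, insert if not
    return list(merged.values())

hist_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
new_rows = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
merged_rows = upsert(hist_rows, new_rows)  # id 2 replaced, id 3 inserted
```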

  • jaeW_at_Onyx

    I just posted not to use a Recursive Dataflow.  But if you're hell-bent.

     

    Don't use a JOIN to combine historical and new.  Use an APPEND (b/c JOINS in MagicETL are slow especially with composite keys).

     

    Add a Row_Number() function via the RANK / WINDOW'ed tile where you number the rows based on how you define a duplicate row ordered by batch_upload_date.  Then keep the rows where RowNumber = 1.
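    That append-plus-window dedup can be sketched in plain Python: order rows newest batch first, then keep the first row seen per duplicate key, which is equivalent to keeping ROW_NUMBER() = 1 (the key and date field names are assumptions):

```python
def keep_latest(rows, key="id", order="batch_upload_date"):
    """Emulate ROW_NUMBER() OVER (PARTITION BY key ORDER BY order DESC)
    and keep only row number 1 in each partition."""
    newest_first = sorted(rows, key=lambda r: r[order], reverse=True)
    seen, out = set(), []
    for row in newest_first:
        if row[key] not in seen:  # first (newest) row for this key wins
            seen.add(row[key])
            out.append(row)
    return out

appended = [{"id": 1, "batch_upload_date": 1, "v": "old"},
            {"id": 1, "batch_upload_date": 2, "v": "new"},
            {"id": 2, "batch_upload_date": 1, "v": "old"}]
deduped = keep_latest(appended)  # one row per id, newest batch kept
```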