Help with recursive dataflow


Hello, Experts,

 

I am looking to create a dataflow that constantly updates data from one dataset while keeping a historical dataset in play.

 

For this scenario, let's call:

  • Historical Dataset (MySQL connector) - Dataset A
  • Update Dataset (MySQL connector) - Dataset B

If you could, please let me know if I am on the right track here or if I need to pivot.

Plan:

  • Create a MySQL data connector that runs select * from table join second_table to populate the full historical dataset for Dataset A
  • Create a Magic ETL that absorbs that data into an output dataset called Dataset_A_Output
  • Create a MySQL data connector that runs select * from table join second_table where lookback_date >= 3 weeks ago to constantly populate Dataset B with new data to be used for recursion
  • Create a Magic ETL that brings in Dataset_A_Output as an input > brings in Dataset B > groups by lookback_date and creates two fields for the min date and max date
  • Join Dataset_A_Output with Dataset B (or should I?)
  • Create a filter to filter out rows less than the min date and greater than the max date
  • Append the new, updating Dataset B to the rest of the dataflow and output to Prod_dataset_final

The goal is to bring in the historical dataset since it will never change again and new data going forward to update the dataset for any changes within the last 3 weeks.
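If it helps to see the min/max-date step concretely, here is a minimal Python sketch of that recursion logic, assuming the filter is meant to keep historical rows outside Dataset B's date range so that B's refreshed rows replace them (the field names like lookback_date and the sample rows are hypothetical stand-ins for the real tables):

```python
from datetime import date

# Hypothetical toy rows standing in for Dataset_A_Output (static history)
# and Dataset B (the rolling 3-week window pulled from MySQL).
dataset_a = [
    {"id": 1, "lookback_date": date(2020, 1, 5), "amount": 100},
    {"id": 2, "lookback_date": date(2020, 6, 1), "amount": 200},  # falls in B's window
]
dataset_b = [
    {"id": 2, "lookback_date": date(2020, 6, 1), "amount": 250},  # updated row
    {"id": 3, "lookback_date": date(2020, 6, 2), "amount": 300},
]

# Group Dataset B by lookback_date to find its min and max dates.
b_dates = [row["lookback_date"] for row in dataset_b]
min_date, max_date = min(b_dates), max(b_dates)

# Keep only the historical rows OUTSIDE [min_date, max_date] ...
kept_history = [
    r for r in dataset_a
    if r["lookback_date"] < min_date or r["lookback_date"] > max_date
]

# ... then append Dataset B to produce Prod_dataset_final.
prod_dataset_final = kept_history + dataset_b
```

The id 2 row illustrates why the filter matters: its stale historical copy is dropped and the refreshed copy from Dataset B takes its place, so no dupes survive the append.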

 

Let me know if you can help. I greatly appreciate it. Attached is a proposed data flow diagram.

 

Isaiah

 

 


_
Isaiah Melendez
The most efficient way to optimize a SQL query is to eliminate it.

Re: Help with recursive dataflow

Hi @imelendez 

 

Will the historical dataset increase in size or is it static? (Will it pull in 3 weeks + 1 day ago or be stuck at the end date where you first pulled the historical data?)

 

What is your lookback_date? Is that going to be the current date or are you allowing the user to filter based on that date?

 

If it's based on the current date, then typically what I'd do is have two dataflows: a historical version and a live version.

 

(In my example the data can continue to update / change for 28 days)

The historical version would pull the data via SQL from 29 days ago.

select *
from `table`
where current_date - interval '29' day = DATE(`date`)

I then make sure I set the dataset to append rather than replace. This saves time by not importing the entire dataset over and over (and should make querying faster, assuming you have the proper indexes / partitions in place).

 

The live dataset pulls the data from the last 28 days (typically I'll exclude the current date since it isn't finished yet and it only runs once a day)

select *
from `table`
where DATE(`date`) >= current_date  - interval '28' day
and DATE(`date`) < current_date

I set this one to be replace so it'll overwrite the current dataset and keep things up to date.

 

Once those are in place I'll do a simple dataflow to append the two datasets together.
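A quick way to sanity-check this scheme: the append query and the replace query are designed to touch disjoint date windows, so stitching them together never duplicates a day. A small Python sketch of the window arithmetic, assuming daily runs (the run date is hypothetical):

```python
from datetime import date, timedelta

today = date(2020, 6, 30)  # hypothetical run date

# Historical (append) pulls exactly one day: current_date - 29 days,
# i.e. the day that just aged out of the live window.
historical_day = today - timedelta(days=29)

# Live (replace) pulls the last 28 days, excluding the current date.
live_days = {today - timedelta(days=d) for d in range(1, 29)}

# The two windows sit back-to-back but never overlap, so appending
# the replace output to the accumulated history duplicates nothing.
assert historical_day not in live_days
assert max(live_days) == today - timedelta(days=1)
assert min(live_days) == today - timedelta(days=28)
assert historical_day == min(live_days) - timedelta(days=1)
```

Each daily run appends one newly-frozen day to the history and rewrites the 28 still-mutable days, which is exactly the behavior the two SQL queries above encode.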




Re: Help with recursive dataflow

Hi, @GrantSmith ,

 

Thank you for taking the time out of your day to reply and help me out. I am still a novice at the whole ETL process and get confused by some of the do's and don'ts regarding recursion.

 

To answer your question in detail, Dataset A (the historical dataset) will be a one-time pull; it is static and will not increase. What I anticipate doing is something like this:

 

Dataset A MySQL Connector:

 

select * from table

Are you saying then that Dataset B (live version) needs to be created and it should be my more refined/filtered data connector?

 

I was thinking something like this:

 

Dataset B MySQL Connector

 

select * from table where migrated_data = 0 and posted_date >= current_date - interval 3 week

Questions I have:

  • I need to set the Dataset B MySQL connector to replace, correct?
  • Should I set the Dataset A MySQL connector to replace as well, but on a manual run rather than a scheduled run? The reason I ask is that, in the event we ever need to repopulate the historical data, we can (I doubt this will happen, but just in case).
  • How would the ETL structure for something like this look in your experience?

A million thanks!

 

Isaiah



Re: Help with recursive dataflow

@GrantSmith gave a very nice walkthrough, so there's not too much more to add. If you haven't looked at the KB article, it gives a nice step-by-step guide that may help you.

https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_SQL_D...

You can also do recursive dataflows in Magic ETL. Here is the KB article for Magic ETL.

https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_ETL_D...





Re: Help with recursive dataflow

@imelendez 

 

If you're not looking to keep the historical dataset updated, then what you're stating should work. You'll have the two input datasets (A and B), tie them together with an append tile, and then feed that into an Output Dataset tile for your Magic ETL dataflow.




Re: Help with recursive dataflow

@GrantSmith , thanks for the confirmation.

 

Is this a more realistic approach in terms of the ETL dataflow?

 

The image is attached.

 

A million thanks!

 

Isaiah



Re: Help with recursive dataflow

@imelendez  I'm a little perplexed by your data. Does your historical dataset have data in the future as well? 

 

Are you attempting to get data from set A which is within the bounds of the dates found in dataset B?




Re: Help with recursive dataflow

Yeah @GrantSmith, dataset A should not have future dates, as it is static.

What I am trying to do is loop in a smaller subset of data from dataset B and check whether it already exists in dataset A. If it does, filter it out (to avoid dupes); if it doesn't, append it.
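That existence check can be sketched in Python. This is a minimal illustration assuming a hypothetical id column is the natural key; in Magic ETL itself it would be the filter-and-append pattern from the KB articles:

```python
# Hypothetical rows: Dataset A (static history) and Dataset B (last 3 weeks).
dataset_a = [{"id": 1, "val": "old"}, {"id": 2, "val": "old"}]
dataset_b = [{"id": 2, "val": "new"}, {"id": 3, "val": "new"}]

# Build the set of keys present in B, keep only the A rows whose key
# is NOT in B (so B wins on conflict), then append B.
b_keys = {row["id"] for row in dataset_b}
final = [row for row in dataset_a if row["id"] not in b_keys] + dataset_b
```

Here id 2 exists in both, so the stale copy from A is filtered out and B's refreshed copy is the one that lands in the final output.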

 

I did not create this data flow. 

 

Does that provide some more clarity?



Re: Help with recursive dataflow

So if it exists in B use B otherwise use A?




Re: Help with recursive dataflow

I am sorry, I might just be confusing myself.

 

I am after appending the new data changes from live (B) onto the historical data (A) in the final output.

