I am looking at creating a data flow that will constantly update data from a dataset while keeping a historical dataset in play.
For this scenario, let's call the one-time, unchanging data "Dataset A" (historical) and the ongoing data "Dataset B" (live).
If you could, please let me know if I am on the right track here or if I need to pivot.
The goal is to bring in the historical dataset once (since it will never change again), then bring in new data going forward so the dataset reflects any changes within the last 3 weeks.
Let me know if you can help. I greatly appreciate it. Attached is a proposed data flow diagram.
Will the historical dataset increase in size or is it static? (Will it pull in 3 weeks + 1 day ago or be stuck at the end date where you first pulled the historical data?)
What is your lookback_date? Is that going to be the current date or are you allowing the user to filter based on that date?
If it's based on the current date, then typically what I'd do is have two dataflows: a historical version and a live version.
(In my example the data can continue to update / change for 28 days)
The historical version would pull the data via SQL from 29 days ago.
select * from `table` where current_date - interval '29' day = DATE(`date`)
I then make sure I set the dataset to append rather than replace. This saves time by not re-importing the entire dataset over and over (and should make querying faster, assuming you have the proper indexes / partitions in place).
The live dataset pulls the data from the last 28 days (typically I'll exclude the current date since it isn't finished yet and it only runs once a day)
select * from `table` where DATE(`date`) >= current_date - interval '28' day and DATE(`date`) < current_date
I set this one to be replace so it'll overwrite the current dataset and keep things up to date.
Once those are in place I'll do a simple dataflow to append the two datasets together.
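The two-dataflow pattern above can be sketched end to end. This is a minimal illustration using SQLite in memory; the table name, column names, and one-row-per-day data are all assumptions, not your actual schema:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical source table; names and data are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, dt TEXT, amount REAL)")

today = date(2024, 1, 30)
# Seed 40 days of data, one row per day (i = 0 is today).
rows = [(i, (today - timedelta(days=i)).isoformat(), float(i)) for i in range(40)]
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)

# Historical dataflow: pull only the one-day slice that just aged past the
# 28-day edit window (runs daily, output dataset set to APPEND).
cutoff = (today - timedelta(days=29)).isoformat()
historical = conn.execute(
    "SELECT * FROM source WHERE dt = ?", (cutoff,)
).fetchall()

# Live dataflow: last 28 days, excluding today since it isn't finished yet
# (output dataset set to REPLACE).
start = (today - timedelta(days=28)).isoformat()
live = conn.execute(
    "SELECT * FROM source WHERE dt >= ? AND dt < ?", (start, today.isoformat())
).fetchall()

# Final dataflow: append the two datasets together.
combined = historical + live
```

Because the historical slice appends one aged-out day per run and the live slice is fully replaced each run, no day is ever counted twice in the combined output.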
Hi @GrantSmith,
Thank you for taking the time out of your day to reply and help me out. I am still pretty novice at the whole ETL process and get confused with some of the do's and don'ts regarding recursion.
To answer your question in detail, Dataset A (historical dataset) will be a one time pull - it is static and will not increase. What I anticipate doing is something like this
Dataset A MySQL Connector:
select * from table
Are you saying then that Dataset B (live version) needs to be created and it should be my more refined/filtered data connector?
I was thinking something like this:
Dataset B MySQL Connector:
select * from `table` where migrated_data = 0 and DATE(posted_date) >= current_date - interval '21' day
Questions I have:
A million thanks!
@GrantSmith gave a very nice walkthrough, so there's not much more to add. If you haven't looked at the KB article, it gives a nice step-by-step guide that may help you.
You can also do recursive dataflows in Magic ETL. Here is the KB article for Magic ETL.
If you're not looking to keep the historical dataset updated, then what you're stating should work. You'll have the two input datasets (A and B), tie them together with an append tile, and then send that to an Output Dataset tile in your Magic ETL dataflow.
@GrantSmith , thanks for the confirmation.
Is this more of a realistic approach - in terms of ETL dataflow?
The image is attached.
A million thanks!
@imelendez I'm a little perplexed by your data. Does your historical dataset have data in the future as well?
Are you attempting to get data from set A which is within the bounds of the dates found in dataset B?
Yeah @GrantSmith, dataset A should not have future dates since it is static.
What I am trying to do is pull in a smaller subset of data from dataset B and check each row against dataset A to see if it already exists. If it does, filter it out (to avoid dupes); if it doesn't, append it.
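The check-then-append logic described here amounts to an anti-join: keep only the rows from B whose key is not already in A. A minimal sketch, assuming `id` is the unique key and using placeholder data:

```python
# Hypothetical rows; key and field names are assumptions for illustration.
historical_a = [{"id": 1, "val": "old"}, {"id": 2, "val": "old"}]
live_b = [{"id": 2, "val": "new"}, {"id": 3, "val": "new"}]

# Anti-join: drop any row from B whose key already exists in A.
existing_ids = {row["id"] for row in historical_a}
new_rows = [row for row in live_b if row["id"] not in existing_ids]

# Final output: historical rows plus only the genuinely new rows.
final = historical_a + new_rows
```

Note that with this rule, a row that exists in both A and B keeps the A version; if B should win on collisions (i.e., B carries corrections), the filter needs to run the other way.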
I did not create this data flow.
Does that provide some more clarity?
So if it exists in B use B otherwise use A?
I am sorry, I might just be confusing myself.
I am trying to append the historical data (A) together with the new/changed data from live (B) into the final output.
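For completeness, the alternative rule GrantSmith asked about ("if it exists in B use B, otherwise use A") is a merge where B wins on key collisions. A minimal sketch with placeholder data (key and field names are assumptions):

```python
# Hypothetical keyed rows; B contains a correction for key 2 and a new key 3.
historical_a = {1: {"posted_date": "2024-01-01", "amount": 10.0},
                2: {"posted_date": "2024-01-02", "amount": 20.0}}
live_b = {2: {"posted_date": "2024-01-02", "amount": 25.0},
          3: {"posted_date": "2024-01-03", "amount": 30.0}}

# Dict merge: later keys win, so B overrides A on collisions
# and A fills in everything B doesn't have.
merged = {**historical_a, **live_b}
```

Which rule is correct depends on whether rows in the historical dataset can ever be superseded by live data; if they can, B should win, as here.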