Recursive dataflow not storing historical data

Hello, 

In our data team, we are trying to build a recursive SQL  dataflow to save historical data and append new data as the original dataset is updated daily, following the instructions in this post: https://knowledge.domo.com/Prepare/DataFlow_Tips_and_Tricks/Creating_a_Recursive%2F%2FSnapshot_SQL_DataFlow. However, the way it is set up right now, we are getting the data from the day when we set it up and then the last day when the dataset run, but then this data is erased and replaced with the latest data, so that we end up with only two dates worth of data, but nothing in between.

 

These are the steps we followed to build this dataflow:

 

1. Run the following query on our original table to create the dataset we need (all within Domo). Let's call this dataset sampledataset

SELECT club_id AS 'Club', COUNT(member_id) AS 'Client id', NOW() as 'Date'

FROM member_table

WHERE inactive=0 AND is_employee=0

GROUP BY club_id

 

Update setting: run the dataflow when member_table updates

 

2. Create a SQL dataflow with sampledataset as input and the following query as output and run it. Let's call this output historicaldataset

SELECT *

FROM sampledataset

No update settings selected

 

3.  Once step 2 is completed, In the same dataflow add historicaldataset as an input

 

4. Create the following transform:

SELECT *
FROM historicaldataset
WHERE `Date`NOT IN (SELECT `Date`FROM sampledataset)

Generate output table called historical_data

 

5. Add this additional transform:

SELECT * FROM sampledataset
UNION ALL
SELECT * FROM historical_data

 

Generate output table called append_new_data_to_historical

 

6. Generate output using:

SELECT *
FROM append_new_data_to_historical

Dataflow update settings: update only sampledataset.

 

Obviously we are not setting up something correctly because the data is not being appended, but replaced after the next run. Any help will be greatly appeciated.

 

Thanks!

 

Best Answers

  • AS
    AS 🔵
    Accepted Answer

    Have you changed your new input to be the last output of your dataflow (append_new_data_to_historical)?  That's key to the recursive nature of this method. The output has to be the new input.

  • Jon
    Jon
    Accepted Answer

    Well, as I learned, when you consume a datasource in a dataflow, you have to make sure you're actually consuming the one that's the output of the dataflow. I think Domo is just particular about that. 

     

    I'll try to reconstruct what I did:

    1. I created a new dataflow, using my existing data, I'll call it "original_historical_data"

     

    2. So in the first pass at this I just "select * from original_historical_data" into the new dataset I'll call "historical".

     

    3. The I go back remove "original_historical_data" from the Input DataSets and add the "historical" (which I just created).

     

    4. Add my new dataset "this_week" to the Input DataSets...

     

    5. Do my union between the "historical" and "this_week" back into "historical"

     

     

    That way it's truly recursive. 

     

    It's odd, but Domo has to use the dataset it created for this to work. I banged my head on the whiteboard for a few days on this.

     

    If that doesn't make sense, send me a message and I'll give you my number, we can chat.

     

     - Jon

Answers

  • Hi, thanks for your reply. In the output section, we called the output dataset historicaldataset, just like the input, with the following query: 

    SELECT *
    FROM append_new_data_to_historical

     

    and in the update settings, this is not selected to update, only the sampledataset

     

     

  • One thing I learned about building recursive datasets, is that you MUST first create the source dataset from within the dataflow. That way you're consuming the dataset from the workflow.

     

    Also, your step 5 may be a bit redundant, you should be able to drop the results of that union right into your final dataset.

     

     - Jon

  • Hi Jon,

     

    I agree with step 5 being redundant, but I am a bit confused about why would it be necessary to create the source dataset within the dataflow and how that would work? 

  • AS
    AS 🔵

    For context, what's the business question here?

    For the recursion to work, you create an output dataset that becomes an input dataset to the dataflow. 

     

    Something like this:

    Input: Historical transactions dataset, static

    Input: Daily transactions, replaced every day

    Process: Select everything from this historical transactions dataset, union it with the daily transactions dataset, and output that as the main dataset.  Run the dataflow. Then open/edit the dataflow to have the main dataset as an input in place of the historical transactions which we don't need anymore.  Set the dataflow to trigger whenever the daily transactions updates.  That feeds into itself and just builds over time.

     

     

     

    Do you have two different dataflows here?

  • My challenge was that I actually have values that change in the "old" data from time to time (corrections, etc), and the full dataset is too large to retrieve every time. 

     

    I have an intermediate step (which is not relevant to the question here) which removes any ID that;s in the updated dataset before the union.

     

    But, in my case, the businss question was "how do we append when we cannot use the append functionality in the data connector".

     

     - Jon

     

     

  • Well, in our case the business question is: How can we build a card to show the number of active clients over time when the original dataset (member_table) does not store this information over time?

     

    Our idea is, of course, to build a recursive dataflow to build a dataset that will update everyday and store the COUNT(member_id) per day, but what we have now is a dataflow that saved the data from the historical dataset and is now appending only the data from the latest run, but not saving anything in between.

     

    This is why we need the first step and to this query:

    SELECT club_id, COUNT(member_id), NOW()

    FROM member_table

     

    to create our source dataset first. I am thinking whether it would make a difference to use _BATCH_LAST_RUN instead of NOW()?

     

  • Hi AS,

     

    I do have two dataflows in the sense that the first one is to create the data source from one of our tables, then use the output of this to power up the recursive dataflow.

  • Hi Jon,

     

    I've just tried this approach. It makes sense! Thanks.

  • DaniBoy
    DaniBoy

    domo

    💎

    Did any of the replies help you? If so please click on "Accept as Solution" next to each one that solved the problem.

     

    Thanks!

  • Hi Jon,

     

    Just wanted to say thanks because your approach solved the issue. It is a small change, but definitely makes all the difference. Thanks!