Struggling with appending data to progressively build up a dataset

Hi there - I am using Google Sheets to create two separate reports (identical structures) of Google Analytics data. The reports are for [dynamic] yesterday and are scheduled to pull into Domo each day. I then do some transformation in Magic ETL to create an output dataset - I append the two reports together and make some modifications to the data itself. However, each day the source files change to the day before, and rather than appending the new data to the output dataset, the flow just overwrites the entire file, so the output dataset is always just yesterday, whereas I want it to build up the file by adding each day on progressively. Nowhere in any part of the end-to-end process is there an option to say append rather than replace the onward file.

I can't get my head around this in Domo generally. Another example: I have imported a large file of data, which I then needed to modify due to problems with the legacy data source. But I now need to append new records to that output dataset, and I don't want to keep both the full legacy input file and the output data, because that is screwing up my row count - but if I replace the original file with just today's data, the same thing happens: it replaces my destination file.

Any advice much appreciated - thanks

 

Comments

  • Search for Recursive ETL in the help section. That will append new data.

Other connectors have an option to append new data or replace it; the Google Sheets one unfortunately does not. Maybe put in a request to update the connector.

  • Recursive ETL is a bit funky when you first do it.  Agreed with the above comment to check out Recursive ETL in the KB; that will get you most of the way there.  Two areas to really keep an eye on:

    Priming the pump.  When you run it the first time through, use a naming convention for the output that isn't going to confuse you when it gets used as an input.  Make sure your baseline day 1 snapshot is accurate.

    Wait for the second day to let the ETL run based on the dataset refresh.  If your daily snapshot of the source dataset is running each day, then set the ETL to run when the source dataset is replaced.  You don't want it refreshing any more than that.  If you manually run the ETL a bunch of times in one day while you're troubleshooting, or you use "Save and Run" by habit, it's going to mess with your row count as you noted.  Just save the dataflow, then exit out and evaluate whether it's doing what you want the next day.

    By nature these take a little longer to build out than a simple dataflow.  I was pulling my hair out the first time I set up monthly snapshots on the 1st of each month to feed into a recursive append.
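    The recursive pattern described above - the output dataset fed back in as an input, with each day's new snapshot appended - can be sketched in Python. This is a conceptual sketch only (Magic ETL does this with tiles, not code), and all names and values here are made up:

```python
# Conceptual sketch of a recursive append dataflow.
# "history" plays the role of the output dataset fed back in as an input;
# "todays_rows" is the daily source snapshot. Names are illustrative only.

def recursive_append(history, todays_rows):
    """Return the accumulated dataset: prior output plus the new day."""
    return history + todays_rows

# Day 1: prime the pump with an accurate baseline snapshot.
history = [{"date": "2023-01-01", "sessions": 120}]

# Day 2: the source now holds only yesterday's data ...
todays_rows = [{"date": "2023-01-02", "sessions": 95}]
# ... but the output keeps growing instead of being overwritten.
history = recursive_append(history, todays_rows)
```

    This is why the trigger matters: the sketch assumes exactly one run per source refresh, so each day's snapshot is appended exactly once.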


  • Thanks for the response - I followed the instructions in the recursive ETL article exactly and saved it without running it, but the next day, after it had run, it had just replaced the whole table again and the output was only yesterday's data.

     

    The only thing I'm not sure about is the settings in the flow for updating it (i.e. "Automatically run DataFlow when any of the following DataSets are updated"). I assumed I should only check the original source file and not the output file, because that would run it again after it had run - but is that right?

     

    This is the one thing seriously putting me off Domo right now, because I have much more complex things to layer onto this, and if the basics are this hard I can't see how my data is going to be stable.

     

     

  • I had never built a recursive ETL before I suggested it, to be honest.

    I started building an appending/remove-duplicate-rows type of recursive ETL; you may want to try this alternative, simpler method. I am still in the testing phase and unfortunately I am too busy to monitor it well enough.

     

    The first and second steps are the same as the Domo instructions, then build something like the attached image.

     

    It should just append any new, non-duplicate rows to the existing DataSet.

    There will be no duplicate rows in the created DataSet, so you can run the ETL at any time without messing it up.
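    The append/remove-duplicates idea can be sketched in Python. Again conceptual only - in Magic ETL this is roughly an append step followed by a remove-duplicates step - and the field names are made up:

```python
# Conceptual sketch: append new rows, then drop exact duplicate rows,
# so re-running at any time is safe (idempotent). Names are illustrative.

def append_dedupe(existing, new_rows):
    """Append new_rows to existing, keeping only one copy of each row."""
    combined = existing + new_rows
    seen, result = set(), []
    for row in combined:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

history = [{"date": "2023-01-01", "sessions": 120}]
today = [{"date": "2023-01-02", "sessions": 95}]

history = append_dedupe(history, today)
# Running the same day's data through again adds nothing new.
history = append_dedupe(history, today)
```

    The dedup step is what makes the number of runs per day irrelevant - a second run just re-appends rows that are then dropped as duplicates.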

       

  • rado - I like your simple ETL there, it does the trick.  If the regular input balloons in size over time, the transform's run time could eventually degrade.  Nice to avoid the potential for duplicates there if possible.

     

    I attached another simplified ETL, just like yours but without the remove-duplicates step. It can be used where there shouldn't ever be duplicate rows. That said, it doesn't work well if you try to run it multiple times in a day.

     

    The connector-based input is scheduled to update 1x a day, and just includes 1 row of data for the day prior. The dataflow settings have it refresh anytime that the connector-based input is updated.  Just tacks a new row on each day.
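    A quick sketch of why the version without deduplication shouldn't be run more than once a day (conceptual Python; names and values are made up):

```python
# Without a remove-duplicates step, a plain append is not idempotent:
# running the dataflow twice on the same day doubles that day's rows.

def plain_append(history, todays_rows):
    """Append the day's snapshot with no duplicate protection."""
    return history + todays_rows

history = [{"date": "2023-01-01", "sessions": 120}]
today = [{"date": "2023-01-02", "sessions": 95}]

history = plain_append(history, today)  # scheduled daily run: fine
history = plain_append(history, today)  # accidental second run: duplicate row
```

    Hence the advice to trigger the dataflow only from the daily source refresh, never manually.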

     

     

    Regarding removing columns, I prefer the reassurance of multiple triggers to ensure no data loss (we mostly use on-site data).

     

    I had been putting off setting up a few appending ETLs because the recursive procedure seemed too tedious. I am very happy with the simple outcome.