Select what data loads from a data set when editing a data flow

I have a couple of data flows that take nearly 30 minutes to run.  This is partly due to the amount of data present in the input data sets, so it is not possible to load all of the data into the editor.  I often find it very difficult to properly QA my data because: 

  • I can't load all the data.
  • I have zero control over what part of the data set is loaded.
  • The data that does load is often incomplete and does not provide good test data.

 

It would be exceptionally helpful if I could define and save a WHERE clause that would be used to load data into the data flow editor.  With this ability I would gain a quality test data set and avoid wasting time by running the full data flow.  And because the WHERE clause would be saved, it would already be configured any time I came back; if I wanted different data, I could simply edit the WHERE clause and reload the data.
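To make this concrete, here is a minimal sketch of the kind of saved WHERE clause I have in mind (the table name, column names, and values are hypothetical):

```sql
-- Hypothetical saved preview filter: load only a recent, representative
-- slice of the input data set into the dataflow editor for QA.
SELECT *
FROM sales_input
WHERE order_date >= '2019-01-01'
  AND region = 'EMEA';
```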

 

Having this ability would literally save me hours as I work through the nuances of my data and perfect my data flow.

 

  


Comments

  • I think this is a great idea too!  I am running into the same problem when each of the input data sets is greater than the 500-thousand-row limit.  A method to select or control how the raw input set is loaded would be greatly appreciated and would make the QA process much better than waiting for the full data flow to complete.

  • This is an amazing idea. Often when working on dataflows we just have to hope our queries are right as we can't actually preview them.

    I wish my dataflows only took 30 minutes :(

  • I had a similar problem (long data flow runs) and found that indexing tables sped them up significantly.  Unfortunately, this only works in SQL dataflows, since Magic ETL does not seem to have an indexing function (or I haven't found it yet).
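    For reference, a minimal sketch of what a manual indexing transform might look like in a MySQL dataflow (the table and column names here are hypothetical):

    ```sql
    -- Hypothetical transform step in a MySQL dataflow: add an index on a
    -- column used by later joins and filters, so subsequent transforms can
    -- seek on it instead of scanning the whole table.
    ALTER TABLE orders_input ADD INDEX idx_customer_id (customer_id);
    ```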

  • Thx @SLaczny, are you using MySQL DataFlows or Redshift?  I am using Redshift and per Domo support, Redshift DataFlows automatically create indexes.  It is the MySQL DataFlows that require manual indexing.

     

    Have you had different experiences while using Redshift DataFlows?

     

    thanks

  • Has anyone tried using the column selector and removal tool available in SQL dataflows?  Quite often, maybe most of the time for some of us, we include more data than we need to in our datasets when we create and run dataflows.  We can stop unnecessary columns from loading, in order to speed up our dataflows, by reducing our data upfront.  Click on the input dataset and X-out the columns you don't need.  This should reduce the time it takes to initialize and load the temporary dataflow tables in the authoring environment (for speed of development) and you should see improved performance during actual run time as well. YMMV.
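    For those writing SQL transforms, a minimal sketch of the same idea (the table and column names here are hypothetical): select only the columns the rest of the flow actually needs instead of carrying the full width through.

    ```sql
    -- Hypothetical first transform: keep only the needed columns rather
    -- than pulling every column from a wide input table.
    SELECT order_id, customer_id, order_date, total_amount
    FROM orders_input;
    ```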

    Aaron
    MajorDomo @ Merit Medical

  • Thanks for the idea @AS.  I haven't tried that before, although I think my problem is not with wide tables, it is with the quantity of rows.  I just tried this idea with a data set that is 5.6 million rows.  Unfortunately my instance of Domo is wildly inconsistent in the time it takes to run data flows, so it is tough to judge the impact, but it seems immaterial at this point.  Here were my results.

    1. Last run time before change: 8m 22s
    2. Data Set update
      1. First run (Manual Run): 15m 45s
      2. Second run (DataSet update): 7m 32s
      3. Third Run (Manual Run): 9m 35s
      4. Fourth Run (DataSet Update): 12m 7s

    Below is a chart of the last 553 runtimes.  The chart reads newest (left) to oldest (right).  The rolling average (blue line) includes the last 20 run times; the orange line shows actual run times.  The X axis does not include date/time because Excel did not handle the graph well when I included it, but the data set updates every 15 minutes (ish).  The DataSet has not changed during this time period.

    Runtimes.jpg

     

    P.S. I also tried this with a DataFlow that averages between 30 and 45 minutes and had similar results.

     

  • That's a pretty interesting visual.  I wonder what that pattern of runtime increases relates to.  Hardware demands at certain times of the day?

    If your data isn't too wide, I'm not too surprised you haven't seen an impactful change.  We're likely to keep only the columns we want in one of our first transforms anyway, so the biggest change from initial column selection would come just once, right at the beginning.

    Aaron
    MajorDomo @ Merit Medical

  • I wish it were isolated to a specific time of the day.  I filtered for the longest run times and this is what I found.

    • 1/20 @ 7:12pm: 38 min
    • 1/21 @ 3:42pm: 40 min
    • 1/22 @ 3:42pm: 29 min
    • 1/23 @ 11:57am: 38 min

     

    When I expand the filter a bit further, it looks like there is something to the 3pm hour, but the strangest part is that all of my heaviest data flows run in the early morning and none of the long run times come from then.  They all come from midday, when the heavy data flows are done.  Anything that runs during the day runs all day long at set intervals, so it is strange that the long run times are so sporadic.

     

    When I look at the details of the long-running executions, a huge part of the time is attributed to loading one or more data sets.  In some cases it is all of the data sets, and in some cases it is just some of them.  The odd part is that it is a mix of imported DataSets and DataSets derived from DataFlows.

     

    time.jpg

    1) This is a recursive data flow, and it took 29 minutes to load itself.  The prior execution took <9 minutes, and these run 15 minutes apart.

     

    2 & 3) These are data sets that were imported 10+ hours before this execution kicked off.

     

    The point is, I know in this case that some of the data sets taking a long time to load are not slow because they are actively building.  There is something else going on that is not right.

  • Interesting. Four of those inputs took roughly 27.5 minutes, and 5 of them took 4.75 minutes.  I don't know in what order Dataflows initiate dataset loads, but maybe those were grouped together?  Or are they the same size and took the same amount of time?

    If this is a continuing problem I'd recommend consulting Support.  You shouldn't have a dataflow running for more than 15 minutes if it has inputs that refresh that fast.

    Aaron
    MajorDomo @ Merit Medical

  • I'm using the MySQL functionality embedded in Domo.
