Combine Data Issues

I have 3 datasets called FYC Historical, FYC Details-Current, and FYC Summary.  I have combined these with our VUE Production Clean Dataset and called the dataflow VUE FYC Combine (takes almost an hour to run).

 

I have then taken this output dataset, called VUE Historical and Current Test, and placed it into a copy of an overall dataflow called FYC TEST of Bridge | Course and Program Enrollment Status, which produces my 2 main datasets (FYC TEST Bridge | Course Enrollment Status and FYC TEST Bridge | Program Enrollment Status).

 

I have tried to run this dataflow, and it has gone longer than a day and ended up canceling with over 400 million rows and almost a terabyte of data.  What am I doing wrong?  What can I do to correct this to get it to run faster?

Comments

  • macrobolic
    macrobolic

    domo

    💎

    Hi DrGWright,

     

    I'm sorry to learn that dataflow is failing to successfully run. The first thing is to make sure you've optimized the dataflow using the tactics described in this Help Center article: https://knowledge.domo.com/Prepare/Magic_Transforms/SQL_DataFlows/Optimizing_an_SQL_DataFlow. The suggestions include Indexing (if in MySQL), filtering the data down to only what will actually be used, grouping data when aggregations are used in the SELECT clause, and breaking up multiple table joins into their own transforms.

     

    Another thing to look at is to ensure your joins are being made on unique keys (either a column with only unique values or on multiple columns which create a unique combination) and trying to join mostly (or only) on numeric columns; string columns take much longer to index when being joined.

  • I'm not using SQL for any of this.  I'm still not sure how to simplify.  Is there anything within the ETL or Dataflow to help correct this issue?

  • Hi DrGWright,

    Looks like you're joining 4 datasets to start. Are you using Dataflow?

    • Databases:
      FYC Historical
      FYC Details-Current
      FYC Summary
      VUE Production Clean Dataset
    • Creates: VUE Historical and Current Test dataset

    This is then joined into another data flow that's producing another 2 datasets. Lot's going on to start.

    Do you have any further information on these databases? Types of database, locally or cloud/SaaS and also what field you are using to join? Also, their size can impact run times. 
    If you can give us a bit more context it will help. For example, if I wanted to join a CRM database and a Transaction database I might use Customer_ID as it's common in both Databases.