Historical Comparisons

Hello,

 

We have completed our 2nd full year in Domo and we are starting to see performance issues with the dataflows.  We have been integrating data sources left and right and our instance is around 500M rows.  Not sure if that is typical or not but it should give you an idea of the data we are processing.

 

Some of our problems can be alleviated by creating true historical datasets that do not need to be re-processed every day and we will be working on this process this year.

 

I am looking for recommendations/best practices for staging/reporting on historical data.  Typically we are doing YoY, YTD, MTD, DTD, trailing 52 weeks, etc.  I have found the Domo PoP charts to be too inflexible for our needs.  My current approach is keeping separate "today" datasets and historical datasets.  The historical dataflows are beginning to take hours to run and cards are starting to lag, which is not making people happy.  Our business day ends at 6am, so if a historical comparison dataset takes 2-3 hours our users are waiting half the morning for Domo to update.  The real problem is when a dataflow mysteriously takes 5-8 hours to complete.  I have tilted at this windmill since year 1 with Domo engineering, and the answer has always been that the dataflows are running at acceptable times on average, regardless of spikes in processing time, and that Domo is constantly working on improving its processing/prioritization of jobs.

 

Should I create the comparison totals in beast modes?  How will the cards perform against tens of millions of rows?  Do I need to consider aggregating data in a data warehouse instead of Domo?

 

Thank you in advance for your recommendations.

Comments

  • I'm going to follow this b/c I am bringing in data and some may come in at the billion row level.

     

    But a few questions:

    1.  When you have slow dataflows, are they using Magic ETL, MySQL, or Redshift?

    2.  Have you tried thinning out your datasets and possibly structuring your data differently for the cards?  What I mean is that if you currently have one dataset, that may be denormalized so you can point lots of cards to it, and it has lots of columns and lots of metrics.  You may benefit from splitting that up to two at the output and point cards to one or the other depending on the KPI.

    3.  In your 'aggregation' question - I'd aggregate at whatever the lowest level you can in ETL that doesn't skew your beastmodes in the card.  For example, if you need to know the details of each transaction, you don't want to sum(transaction count) by day.  But maybe you can sum(cash sales) and sum(credit sales) something like that to help.
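
    As a sketch of that kind of pre-aggregation (the table and column names here are hypothetical, not from the original post):

    ```sql
    -- Hypothetical example: roll transactions up to one row per store per day,
    -- keeping the measures separate so card-level beast modes still work.
    SELECT
        store_id,
        transaction_date,
        SUM(CASE WHEN payment_type = 'cash'   THEN amount ELSE 0 END) AS cash_sales,
        SUM(CASE WHEN payment_type = 'credit' THEN amount ELSE 0 END) AS credit_sales,
        COUNT(*) AS transaction_count
    FROM transactions
    GROUP BY store_id, transaction_date;
    ```

    The point is to aggregate away row volume while preserving every measure a card might need to slice on.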

     

     

  • 1. Both MySQL and Redshift exhibit these fluctuations in run times.

    2. We have separate datasets and dataflows for different metrics that need to be calculated totally differently.

    3. The aggregations are an attempt to roll up the data for faster-performing dataflows.

     

    I am curious if it would be better to do the comparison totals in beast modes instead of dataflows or if there is a more efficient way to stage the data in a dataflow other than left joining the data back to itself multiple times based on time periods. 
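
    For context, the self-join staging pattern described above might look something like this in a MySQL dataflow (names are illustrative only):

    ```sql
    -- Hypothetical YoY staging: left join the daily aggregate back to itself,
    -- offset by one year, so each row carries its prior-year comparison value.
    SELECT
        cur.sales_date,
        cur.total_sales,
        prior.total_sales AS total_sales_last_year
    FROM daily_sales AS cur
    LEFT JOIN daily_sales AS prior
        ON prior.sales_date = DATE_SUB(cur.sales_date, INTERVAL 1 YEAR);
    ```

    Each additional comparison period (MTD, trailing 52 weeks, etc.) adds another self-join of this shape, which is why the dataflow cost grows with the number of periods.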

  • Great.

    1. First off, MySQL is slower than Magic ETL.  In my experience Redshift is too, though officially I'm not sure that is the case.

    2. Nice

    3. Makes sense.

     

    If you are using a card that does time period comparisons, does that not work for you vs writing out your own dataflow/BM comparisons?  

  • 1.  If Magic ETL offers a performance boost then I guess I will need to start dragging tiles around.  It always seemed simpler to me to type the SQL.

     

    The Domo Period Over Period cards do not allow custom summary numbers, and you are limited to a date field and only one metric.  There are other formatting limitations, but all in all they are too limited for our purposes.

     

    I have no problem creating beast modes or dataflows, but I do not want to go down the wrong path and find out I chose poorly.  I need the direction moving forward to be the correct, proven method for working within the Domo platform.

  • As for the DF, yeah, SQL is so much simpler, and as a former DBA it was my go-to, but too many slow MySQL runs forced me to make the change.  It can be a pain to set up, and it takes some time to get used to how to think about it, but performance-wise it was a much better move for me.

     

    As for the comparisons, if you go the DF route, you may run into a problem with too many sources and too many cards.  It may be trial and error: set it up in the DF, then combine what you've aggregated there with some less complex beast modes.

     

    I'd also reach out to your CSM and see if they can sync you up with a Technical Consultant and get some recommendations.  

     

    Also - you can use Workbench to do some of your ETL work too.  Not sure whether that would help you, but don't forget you may be able to do some there.

  • I started looking at the Magic ETL option and realized that it cannot do complex joins, which will prevent me from using it.  Our POS data has start and end dates, and I need to create rows for each day in between, depending on the specifics of other columns in the row.  It does not appear that Magic ETL can handle this, so I will need to continue with MySQL or Redshift, at least for the workhorse portion of the initial transforms.  The other option, as you pointed out, would be to build this logic into the Workbench side before sending the data to Domo.  The downside to using Workbench is that if the logic changes I would need to pull all of the data again from the source.  Maybe I can book an appointment at the Brilliance Bar at Domopalooza and get some solid recommendations.  Thank you for your responses.
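
    For reference, the start/end expansion described here is typically done in SQL with a calendar (date-spine) table.  A rough sketch, assuming a pre-built calendar_dates table and hypothetical POS column names:

    ```sql
    -- Hypothetical: explode each POS row into one row per day it covers
    -- by joining against a calendar table of consecutive dates.
    SELECT
        p.item_id,
        p.price,
        c.calendar_date
    FROM pos_items AS p
    JOIN calendar_dates AS c
        ON c.calendar_date BETWEEN p.start_date AND p.end_date;
    ```

    The calendar table only needs to be built once and can be reused across dataflows, which keeps the expansion itself a single join rather than a loop.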

     


  • Something I do which may or may not help you is the following:

    I create large datasets by appending historical and current data. I need the data updated throughout the day, so my timing would be different to yours.

    I would then run the historical part of the data, say all data from last year backwards, once per day.  Realistically it only needs to be run once, but I like to make sure the data comes through.

    I would run the current data every hour or so.

    I then append the historical to the current, this takes very little time.

     

    I do manually change the dataset cutoff date, which is not ideal; you could automate that somehow.
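
    The append described above could be staged roughly like this (the cutoff date and table names are placeholders):

    ```sql
    -- Hypothetical: cheap append of a frozen historical slice to a small,
    -- frequently refreshed current slice.
    SELECT * FROM sales_historical   -- refreshed once per day (or less often)
    WHERE sales_date <  '2019-01-01'
    UNION ALL
    SELECT * FROM sales_current      -- refreshed hourly
    WHERE sales_date >= '2019-01-01';
    ```

    Because only the current slice is reprocessed on each run, the expensive historical portion stays static and the union itself is fast.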

  • How do you append data to the historical dataset quickly?  It usually takes a couple of hours to load the historical data, a few minutes to append, and then another couple of hours to create and index the historical output.  In my case it takes around 3-4 hours to append 25k records to a 20M-row historical dataset.  Unfortunately this is one dataset from one source that still needs to be joined to a much larger dataset combining multiple data sources.