What are best practices for combining huge datasets in Domo?

We have been using Domo for nearly a year now and are heavily utilizing DataFusion and Dataset Views to join dimension tables to our primary fact tables. By the time we have joined everything we need, it is not uncommon to be looking at 300-800 columns and 50-100 million rows across 5-6 datasets. However, we often see performance issues with these tools, such as not being able to edit or preview the DataFusion or Dataset View. We have worked with Domo to "optimize" these to the extent they can, and we are often still running up against a scenario where everything works as long as we don't want to edit or make any changes. What are others using to transform/join massive tables like this in Domo?

Additional Info: These are mostly large ERP datasets coming in via workbench. They must be joined separately (not during ingestion) as these are dynamic rollups such as product hierarchy, item attributes, fiscal calendar offset.

Any advice or experience is much appreciated!

Answers

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @user02790,

Super important: can you please update your username handle so it's easier to @mention you? It helps guarantee you get timely notifications, and also helps respondents get an idea of who we're talking to!


This will start by sounding snide but leads to a solution I've implemented at companies with multi-billion-row datasets in Domo.


Could you imagine writing SELECT * on a query that JOINs multiple tables and outputs 100M rows spread across 800 columns, many of them text? Could you imagine then taking that query and expecting a subsecond response, or even a one-minute one (the maximum query duration Adrenaline will accept)?

Of course not, right? And even if you did, it's the kind of expensive query where your DBA would say, "just run that query once in a while," because if you were paying per query execution, it would be super wasteful.


    Solution: Build a Fusion with just the columns you need.

    I know. You need all of them ;) Or, your users haven't defined what they need so you want to give them all of them.

In Domo Governance there's a dataset called BeastModes, Cards and Fields (or a name like that ...). It has one row per field used in a dataset (as a filter, an axis, in a beast mode, in the ORDER BY clause, etc.). Build a card that lists the number of times each field was used in cards per dataset, then build an optimized fusion with the top N columns.

    Migrate most of your cards to it; then, in drilldowns, link to the gargantuan dataset if necessary.
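    The field-usage ranking described above can be sketched in pandas. This is a minimal illustration, assuming you've exported the Governance dataset to a DataFrame; all column and dataset names here are made up for the example:

    ```python
    import pandas as pd

    # Hypothetical export of the Governance "BeastModes, Cards and Fields"
    # dataset: one row per field used in a card (all names are assumptions).
    usage = pd.DataFrame({
        "dataset": ["Sales", "Sales", "Sales", "Sales", "Inventory"],
        "field":   ["order_date", "order_date", "region", "sku", "on_hand_qty"],
        "card":    ["C1", "C2", "C1", "C3", "C4"],
    })

    # Count how many distinct cards reference each field, per dataset.
    field_counts = (
        usage.groupby(["dataset", "field"])["card"]
             .nunique()
             .reset_index(name="cards_using_field")
             .sort_values(["dataset", "cards_using_field"],
                          ascending=[True, False])
    )

    # Keep the top N most-used fields per dataset as the column list
    # for the optimized fusion.
    TOP_N = 2
    top_fields = field_counts.groupby("dataset").head(TOP_N)
    print(top_fields)
    ```

    In practice you'd point the same card logic at the real Governance dataset inside Domo; the pandas version just shows the aggregation shape.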


    For Data Exploration.

    Build exploration datasets that provide a sample of the 100M rows (let's say 10k) spread across the 800 columns. As people explore, the dataset should feel more responsive. Then, when it's time to productionize, discover that they're still using the same 30 columns and profit :P

    OR build them a custom dataset if necessary.
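    A sketch of the sampling idea, again in pandas under assumed names: the exploration copy keeps every column available while shrinking the row count to something interactive.

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical stand-in for the full wide fact table
    # (column names and sizes are made up for the example).
    rng = np.random.default_rng(0)
    n_rows = 100_000
    full = pd.DataFrame({
        "order_id": range(n_rows),
        "region": rng.choice(["NA", "EMEA", "APAC"], size=n_rows),
        "amount": rng.uniform(10, 500, size=n_rows).round(2),
    })

    # Exploration copy: a fixed-size random sample keeps all columns
    # but stays small enough to feel responsive.
    SAMPLE_SIZE = 10_000
    explore = full.sample(n=SAMPLE_SIZE, random_state=42)

    print(len(explore), list(explore.columns))
    ```

    In Domo you'd produce the sample with a dataflow rather than pandas, but the principle is the same: same schema, a fraction of the rows.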


    Alternatively, find clever ways of reducing the number of Dimensions you use so you can have fewer joins.

    It may feel ugly to create a Date CROSS JOIN Customer hybrid, but if I have 365 * 50 rows ... that's still a cheap JOIN, as opposed to two JOINs.
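    The hybrid-dimension trick above can be shown with a cross join in pandas (dimension names are assumptions): two small dimensions collapse into one modest table, so the fact table needs one join instead of two.

    ```python
    import pandas as pd

    # Hypothetical small dimensions.
    dates = pd.DataFrame(
        {"date": pd.date_range("2024-01-01", periods=365, freq="D")}
    )
    customers = pd.DataFrame({"customer_id": range(50)})

    # Date CROSS JOIN Customer hybrid dimension:
    # 365 * 50 = 18,250 rows, still a cheap single join target.
    hybrid = dates.merge(customers, how="cross")
    print(len(hybrid))  # 18250
    ```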


    Hope that helps!

  • Unfortunately it seems I don't have the ability to change my username, that box is greyed out :( My name is Kyle.

Thanks for the thoughtful response. Our company is quite complex in nature (large M&A footprint), and we have many end users coming from different businesses with sometimes vastly different data and business processes. Because of this, it is a bit challenging, as our model has been to attempt to establish a single source of truth, which does provide organizational benefit. I think you are definitely right that we need to reevaluate that model given the technical limitations.

Part of our reluctance to do so has been our inability to get a straight answer on what we can expect to work in a DataFusion, given the difficulty of knowing what type of query could be created, as there are many variables. When thinking about downsizing gargantuan datasets, do you have a target or any best practice that you have seen? I recognize this is a bit of a hard one given all of the inputs and permutations between rows, columns, joins, etc.

We have begun to test an approach with one dataset that has "most things" and another that has everything. We are hopeful that by reducing the column count substantially in our lighter dataset, we can reap benefits there.

Last question: when doing this type of reduction, do you agree that it is important to reduce the columns in the input datasets, as opposed to simply dropping those columns from the fusion? In the latter case, the unused columns still have to be considered by the fusion even though they aren't in the result.

    Thanks,

    Kyle