Reusable dataflows / ETLs
We're trying to come up with a good strategy for dealing with large datasets that contain mostly historical data but need to undergo complex transformations in a dataflow or ETL.
For example, our user activity log (currently about 7M rows) needs to be joined to several other tables. We then need to segment the activity in multiple ways, group some of it together, etc.
We've been using a MySQL dataflow because of the complexity of the transformations, but it is slow: 20-30 minutes even if we only process 6 months' worth of activity (< 2M rows from the activity log). If we process the entire 7M-row activity log, the dataflow can take over an hour to run.
So one general thought we've had is to only transform very recent data (say, from the current month) and then UNION it together with already-transformed historical data in a DataFusion. But we're struggling to come up with a good process for managing this in practice. Coming from a software-development background, my mind goes to reusable dataflows -- if it were possible to configure a dataflow (DF) to use a "variable" input, this process would be pretty easy to manage.
So for example, on the first day of the month the DF would use the historical activity log as its input, and on subsequent days it would use the "month-to-date" activity log. I don't think this is literally possible in Domo without manually reconfiguring the DF twice a month, but maybe someone has come up with a technique that approximates it?
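In code terms, the "variable input" I have in mind would look roughly like the sketch below. The dataset names are made up, and this is plain Python rather than anything Domo exposes; it only shows the selection rule I would want the DF to follow.

```python
from datetime import date


def choose_input(run_date,
                 historical="activity_log_historical",   # hypothetical name
                 month_to_date="activity_log_mtd"):      # hypothetical name
    """Pick the input dataset: full history on the 1st, month-to-date otherwise."""
    return historical if run_date.day == 1 else month_to_date


def month_start(run_date):
    """First day of run_date's month, used as the cutoff between the two inputs."""
    return run_date.replace(day=1)


if __name__ == "__main__":
    today = date.today()
    print(choose_input(today), month_start(today).isoformat())
```

The same transformation logic would run in both cases; only the input (and therefore the row count) changes.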