Magic ETL/MySQL ETL debounce period
On a daily basis, I pull in a replacement update of all the relevant tables I want to sync with Domo via DataSet Jobs in the Domo workbench. For simplicity, I schedule them all at the same target time and let the workbench figure out how to make sure they all get updated. It is great at doing this, and I would really prefer not to attempt to micromanage the ordering of this import via guesstimated scheduling slots, since the duration of each job may vary and determining which jobs are "okay" to run concurrently could quickly become an intractable maintenance problem.
Enter the automatically triggered DataFlows.
Let's say I have a DataFlow that derives input from two or more of these DataSets that are scheduled at the same time every day. I decide that I want this DataFlow to re-run automatically every morning after all the input DataSets have been updated. The only available option is to pick which DataSets individually trigger a DataFlow run when they update. My choices boil down to two:
- Pick a subset of the inputs.
- Check them all.
Choice 1 is problematic because I haven't mandated in the Job configuration that any one input update job must always run last (and remember that I really don't want to get into the messy business of doing this). Any excluded DataSet might happen to finish last on a given morning, and I would silently miss its changes.
Choice 2 seems to be the only reliable option. However, it involves some additional concerns:
- How many times will the ETL job be re-run each morning? From some limited experimenting, it seems that the DataFlow will run once as long as the input jobs all announce a new run to the server before any one of them completes. However, my workbench logs indicate that there can be 10 or more minutes between a job's scheduled start and its actual start. It is unclear whether this gap introduces the possibility of multiple DataFlow executions.
- How does the ETL process handle jobs that are announced concurrently? Does it simply throttle incoming triggers once it gets the first triggering job? Will the latter jobs get their updates included in this run of the DataFlow?
- What happens if two of the three imports finish and the DataFlow kicks off before the third one does? How does Domo handle an ETL process that is reading from the third data source while a job is trying to update it? Does the update get blocked, or can the update rip data out from under the ETL process as it runs? (In other words, is the ETL input isolation analogous to READ_UNCOMMITTED or READ_COMMITTED?)
- What happens if I am appending to a DataFlow output and one of the inputs hasn't refreshed yet? It is true that a subsequent run will pick up that data eventually. However, the current deduplication node provides no control over how to select the winning cells from duplicate records, so making sure that the final result contains the updated values from the pokey input DataSet is a non-trivial concern.
The questions above illustrate a fog of uncertainty around how these processes interact, and the potential for non-obvious data-integrity issues if a user's assumptions are wrong. My proposal is a relatively simple number field on a DataFlow that configures a time delay ("debounce") between receiving an input trigger and actually running. Since I know my morning imports complete in around 15 minutes, it would be simple to configure a new DataFlow to wait that long before running, so that all the jobs have time to report their changes. It should also be relatively simple to implement, since the mechanism can be ignorant of what the DataFlow actually does.
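To make the proposal concrete, the behavior I have in mind is the classic trailing-edge debounce pattern: every trigger restarts a countdown, and the action fires only once the configured quiet period elapses with no new triggers. This is a minimal illustrative sketch in Python (the class and names are mine, not anything in Domo's implementation):

```python
import threading
import time

class Debouncer:
    """Collapse a burst of triggers into a single run.

    Each trigger (re)starts a countdown timer; the action fires only
    after `delay` seconds pass with no further triggers. In the Domo
    analogy, `delay` is the proposed number field on the DataFlow and
    `action` is the DataFlow execution itself.
    """

    def __init__(self, delay, action):
        self.delay = delay
        self.action = action
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a new trigger resets the wait
            self._timer = threading.Timer(self.delay, self.action)
            self._timer.start()

# Simulate three input DataSets reporting updates in quick succession.
runs = []
d = Debouncer(0.2, lambda: runs.append("dataflow run"))
for _ in range(3):
    d.trigger()
time.sleep(0.5)  # wait past the quiet period
print(runs)      # a single run despite three triggers
```

A fixed delay measured from the first trigger would also work for my use case; the reset-on-each-trigger variant shown here is just the more forgiving option when one import job starts late.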