Ensuring all the input datasets to an ETL data flow are consistent before rerunning the data flow

For an ETL dataflow with more than one input that is executed whenever an input is reloaded, it is sometimes necessary to ensure the inputs are consistent with one another before the dataflow is executed.


For example, consider an ETL dataflow based on two inputs, A and B, where both are triggers. Suppose A and B are scheduled to reload at the same time, but A finishes well before B. The dataflow runs as soon as A completes, potentially producing incorrect output; when B finishes, the dataflow runs again and the output changes.


One option is a dataflow-centric solution. If a dataflow is triggered by the reloading of an input, then all other trigger inputs that are in the process of reloading must complete before the dataflow is rerun. For example, if A and B are trigger inputs and A completes while B is still loading, the dataflow is not triggered until B finishes loading. This seems like a reasonable heuristic that would always do the right thing.
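The dataflow-centric option can be sketched as a small state machine: each trigger input reports when a reload starts and finishes, and the dataflow only runs once no trigger input is still mid-reload. This is a hypothetical illustration, not Domo's actual implementation; the names (`Dataflow`, `InputState`, `on_reload_started`, etc.) are invented for the example.

```python
from enum import Enum

class InputState(Enum):
    IDLE = "idle"
    RELOADING = "reloading"

class Dataflow:
    """Sketch of the 'wait for all in-flight trigger inputs' heuristic."""

    def __init__(self, trigger_inputs):
        # Track the reload state of every trigger input.
        self.states = {name: InputState.IDLE for name in trigger_inputs}
        self.runs = 0

    def on_reload_started(self, name):
        self.states[name] = InputState.RELOADING

    def on_reload_finished(self, name):
        self.states[name] = InputState.IDLE
        # Defer execution while any other trigger input is still reloading;
        # run only when every trigger input is idle again.
        if all(s is InputState.IDLE for s in self.states.values()):
            self.run()

    def run(self):
        self.runs += 1

# Scenario from the example above: A and B start together, A finishes first.
df = Dataflow(["A", "B"])
df.on_reload_started("A")
df.on_reload_started("B")
df.on_reload_finished("A")  # B still reloading: dataflow does not run yet
df.on_reload_finished("B")  # all inputs idle: dataflow runs exactly once
```

Under this sketch the dataflow executes once per coordinated reload instead of once per input, which is exactly the behavior the heuristic is meant to guarantee.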



A second option, borrowed from backup systems, is to use consistency groups. The datasets in a consistency group are scheduled to reload together, and none is considered finished, and available in Domo, until all the datasets in the group have finished reloading.


I think both options have merit, with the first being quite easy to implement as it does not require a change to the UI.


The second is a more general and robust idea, but harder to implement because it requires changes to the scheduling model, e.g., something like this:

  1. Under dataset > Scheduling add a new tab "Consistency Group"
  2. The data is either
    1. Added to an existing group from a list of groups that also shows the scheduled time and the other datasets in the group, or
    2. Added to a newly created group. The group is given a name and a schedule time that is based on the existing Basic or Advanced scheduling.
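The "all-or-nothing visibility" behavior described above could be modeled as follows. This is a sketch under assumed names (`ConsistencyGroup`, `mark_finished`, `visible`); it is not a real Domo API.

```python
class ConsistencyGroup:
    """Sketch: member datasets become available atomically, as a group."""

    def __init__(self, name, datasets):
        self.name = name
        self.pending = set(datasets)  # datasets still reloading
        self.visible = False          # nothing in the group is available yet

    def mark_finished(self, dataset):
        self.pending.discard(dataset)
        # Only when every member has finished does the group, and therefore
        # each dataset in it, become visible to downstream dataflows.
        if not self.pending:
            self.visible = True

# A and B reload together; neither is available until both have finished.
group = ConsistencyGroup("nightly-sales", ["A", "B"])
group.mark_finished("A")  # group still hidden: B has not finished
group.mark_finished("B")  # both done: group becomes visible
```

The key design point is that visibility is a property of the group, not of the individual datasets, which is what prevents a dataflow from ever seeing A's new data alongside B's stale data.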

