Dataset overall freshness

Dear MajorDomos,

I have a question / suggestion.  Just came back from Domopalooza and saw the soon-to-be-released Lineage function.  Great.  Now what I want is a status icon (similar to workbench job status icons) for the overall freshness of the dataset.  Did all predecessor datasets get refreshed on time?  Did all predecessor dataflows run correctly before this dataset was published?  Am I looking at really good data or sort of good data?  It would be nice to know.  


I have a detailed writeup with pictures attached, but I'll try to do it justice here.  



Dojo team members

Re: “Idiot light” indicator for health and status of dataflow dependencies



Dataflows are often the lifeblood of any data science result. Not surprisingly, each dataflow can be scheduled individually. They can also be piped into subsequent flows to be combined with other datasets. I would like to determine a method to create a ‘roll-up’ status light (red, green, yellow) that examines all prerequisite dataflows in any given set.


Executive Summary:

The idea originates from the main jobs screen in Workbench, shown below, where one can clearly see the status of any scheduled Workbench job.


Figure 1: Workbench job status


This is extremely valuable for assessing the status of the data loading processes. This philosophy is carried through to the dataset and dataflow area as well (see Figure 2 below).


Since dataflows and datasets can be chained together, sophisticated results can be obtained that drive a data science project. However, the status information of the chain is lacking. If one dataset in the chain is out of date, the successor sets are not provided with a status indicator of any provenance issues. Now that the Lineage function is becoming available, it may be possible to show the status of the entire flow on the screen.




Figure 2: Data set status


This suggestion is to allow MajorDomos, Data Scientists, and dataset owners to build a status card of the constituent datasets and flows. Clearly the data is available; it just needs to be accessible to those concerned with the status. Below is a chart of a typical chained dataflow in the EMC instance. Each box lists the input dataset, the name of the dataflow (in purple), and the resulting output dataset. The flows are chained together to produce multiple output datasets, used for various reporting purposes. In this scenario, if one dataset is old or stale, successor datasets are still reported as ‘Good’ or green because their individual flow has run; there is no indicator that the underlying data is in fact stale.


VxRail DataFlows - Page 1.png


A simple output card for all the constituent datasets and flows would suffice to address this, at least until the Domo team decides to act on the overall need for data provenance (à la the Lineage diagram for dataflows).
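To make the roll-up idea concrete, here is a rough sketch of the logic such a card would need. It is only an illustration under stated assumptions: the `Node` record, its field names, and the red/yellow/green rules are all hypothetical, and nothing here is Domo's actual API. It assumes the lineage information can be exported somehow as "each dataset or dataflow plus the names of its predecessors", and that the chain has no cycles. The roll-up status of a node is simply the worst status found among the node itself and everything upstream of it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical record for one dataset or dataflow in the chain.
# Field names are illustrative only; they are not taken from Domo.
@dataclass
class Node:
    name: str
    last_updated: datetime          # when this node last refreshed
    max_age: timedelta              # how old it may be before it counts as stale
    last_run_ok: bool = True        # did the last run or refresh succeed?
    upstream: list = field(default_factory=list)  # names of predecessor nodes

RANK = {"green": 0, "yellow": 1, "red": 2}

def own_status(node, now):
    """Status of a single node, ignoring its predecessors."""
    if not node.last_run_ok:
        return "red"
    if now - node.last_updated > node.max_age:
        return "yellow"
    return "green"

def rollup_status(name, nodes, now, _cache=None):
    """Worst status among this node and all of its predecessors (acyclic chain assumed)."""
    cache = {} if _cache is None else _cache
    if name in cache:
        return cache[name]
    node = nodes[name]
    worst = own_status(node, now)
    for parent in node.upstream:
        status = rollup_status(parent, nodes, now, cache)
        if RANK[status] > RANK[worst]:
            worst = status
    cache[name] = worst
    return worst

# Made-up three-step chain: the raw input is three days old, but both
# downstream flows ran recently, so on their own they look green.
now = datetime(2019, 4, 1, 12, 0)
day = timedelta(days=1)
nodes = {
    "raw_orders": Node("raw_orders", now - timedelta(days=3), day),
    "orders_clean": Node("orders_clean", now - timedelta(hours=2), day,
                         upstream=["raw_orders"]),
    "orders_report": Node("orders_report", now - timedelta(hours=1), day,
                          upstream=["orders_clean"]),
}

for name in nodes:
    print(name, own_status(nodes[name], now), rollup_status(name, nodes, now))
```

In this made-up chain, `orders_report` is green by its own status but yellow in the roll-up, because the stale `raw_orders` input is carried forward. That gap between the two columns is exactly what the summary card is meant to expose.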


In effect, this is very similar to a piping diagram in a power plant; the same principles apply.


The Ask: In the short term, identify the mechanism that can be used to create the summary card outlined above.
