Dataset overall freshness

Dear MajorDomos,

I have a question / suggestion.  Just came back from Domopalooza and saw the soon-to-be-released Lineage function.  Great.  Now what I want is a status icon (similar to workbench job status icons) for the overall freshness of the dataset.  Did all predecessor datasets get refreshed on time?  Did all predecessor dataflows run correctly before this dataset was published?  Am I looking at really good data or sort of good data?  It would be nice to know.  

 

I have a detailed writeup with pictures attached, but I'll try to do it justice here.  

 

 

Dojo team members

Re: “Idiot light” indicator for health and status of dataflow dependencies

 

Summary:

Dataflows are often the lifeblood of any data science result. Not surprisingly, each dataflow can be scheduled individually. They can also be piped into subsequent flows to be combined with other datasets. I would like to determine a method to create a ‘roll-up’ status light (red, green, yellow) that examines all prerequisite dataflows in any given set.
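To make the roll-up idea concrete, here is a minimal sketch in Python. It does not use any Domo API; the dataset names, the `upstream` mapping, and the per-item statuses are all hypothetical, made up for illustration. The rule assumed here is simply "the light shows the worst status found anywhere among the prerequisites."

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that a larger value means a worse status.
    GREEN = 0
    YELLOW = 1
    RED = 2


def rollup_status(node, upstream, own_status):
    """Return the worst status among `node` and all of its transitive prerequisites."""
    worst = own_status[node]
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                worst = max(worst, own_status[parent])
                stack.append(parent)
    return worst


# Hypothetical chain: each key lists the items that feed into it.
upstream = {
    "cleaned_orders": ["raw_orders"],
    "sales_report": ["cleaned_orders", "region_lookup"],
}

# Hypothetical status of each item's own last run.
own_status = {
    "raw_orders": Status.RED,
    "region_lookup": Status.GREEN,
    "cleaned_orders": Status.GREEN,
    "sales_report": Status.GREEN,
}

print(rollup_status("sales_report", upstream, own_status).name)
```

In this made-up example, `sales_report` ran fine on its own, but the roll-up turns red because `raw_orders`, two steps upstream, failed.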

 

Executive Summary:

The idea originates from the main jobs screen in Workbench, shown below, where one can clearly see the status of any scheduled Workbench job.

2017-03-29_09-41-04.png

Figure 1: Workbench job status

 

This is extremely valuable for assessing the status of the data loading processes. This philosophy is carried through to the dataset and dataflow area as well – see Figure 2 below.

 

Since dataflows and datasets can be chained together, sophisticated results can be obtained that drive a data science project. However, the status information of the chain is lacking. If one dataset in the chain is out of date, the successor sets are not provided with a status indicator of any provenance issues. Now that the Lineage function is becoming available, it may be possible to show the status of the entire flow on the screen.

 

2017-03-29_09-44-55.png

 

Figure 2: Dataset status

 

This suggestion is to allow MajorDomos, data scientists, and dataset owners to build a status card of the constituent datasets and flows. Clearly the data is available; it just needs to be accessible to those concerned with the status. Below is a chart of a typical chained dataflow in the EMC instance.  Each box lists the input dataset, the name of the dataflow (in purple), and the resulting output dataset. The flows are chained together to produce multiple output datasets, used for various reporting purposes. In this scenario, if one dataset is old or stale, successor datasets are still reported as ‘Good’ or green because their individual flow has run; however, there is no indicator that the data was, in fact, stale.

 

VxRail DataFlows - Page 1.png
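The gap described above (a flow reports green because it ran, even though one of its inputs was old) could be detected with a check along these lines. This is only a sketch under stated assumptions: the dataset names, timestamps, and per-dataset `max_age` thresholds are hypothetical, and nothing here calls a real Domo API.

```python
from datetime import datetime, timedelta


def stale_inputs(dataset, upstream, last_updated, max_age, now):
    """List transitive inputs of `dataset` whose last update is older than allowed."""
    stale = []
    seen = set()
    stack = list(upstream.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # An input is stale when its age exceeds its own allowed maximum.
        if now - last_updated[current] > max_age[current]:
            stale.append(current)
        stack.extend(upstream.get(current, []))
    return sorted(stale)


# Hypothetical point in time and chain: source_a, source_b -> stage_1 -> report
now = datetime(2017, 3, 29, 9, 0)
upstream = {
    "stage_1": ["source_a", "source_b"],
    "report": ["stage_1"],
}
last_updated = {
    "source_a": now - timedelta(days=3),   # missed its refresh
    "source_b": now - timedelta(hours=2),
    "stage_1": now - timedelta(hours=1),   # the flow itself ran recently
}
max_age = {name: timedelta(days=1) for name in last_updated}

print(stale_inputs("report", upstream, last_updated, max_age, now))
```

Here `stage_1` ran an hour ago, so on its own it looks fine, yet the check still flags `source_a` as a stale input to `report`.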

 

A simple output card for all the constituent datasets and flows would suffice to address this – at least until the Domo team decides to act on the overall need for data provenance (à la the Lineage diagram for dataflows).

 

In effect, this is very similar to a piping diagram in a power plant; the same principles apply.

 

The Ask: In the short term, identify the mechanism that can be used to create the summary card outlined above.

Matthew O Coblentz

Comments

  • Thank you for submitting this @mcoblentz. I am assigning to our product manager @ckwright to review and comment.

  • DaniBoy

    CC @JonSharp

    Dani aka "Mr.Dojo"

    Dojo Admin
This discussion has been closed.