conditionally start a dataflow

kjones140
kjones140 ⚪

I am aware of the three ways to start a dataflow: manually, on a schedule, or only when input datasets are updated.

I have a use case (which I suspect others may as well) where I only want a dataflow to run conditionally.

Example: We pull multiple datasets from an application that occasionally experiences data quality issues (like duplicate rows). I have no control over changes our devops team pushes to the app, and those changes can directly affect the data being pulled into Domo. So I have built audit dataflows and alerts on these datasets that notify me immediately if any duplicates are encountered. However, subsequent dataflows run whenever these datasets are updated, so by the time I see the notification, the bad data is already being processed into Domo for use by our cards. In this situation, I have to ask devops to correct the data (or I revert to a previous good version of the Domo dataset), which then reprocesses into Domo. These datasets are updated 6 times a day.

Is there any technique I can use to NOT run the subsequent dataflows if data quality issues are encountered? I would rather have stale data than bad data.

If the output dataset of an audit dataflow has any records, I know there are duplicates. It would be great if the subsequent dataflow(s) could check that audit dataset first to determine whether to proceed.

Any ideas?
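
For readers outside Domo, the audit described above boils down to counting rows per key and flagging any key seen more than once. A minimal sketch in plain Python, assuming the data is available as a list of dicts; the `order_id` column and the sample rows are hypothetical and not taken from the original post:

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows; "order_id" is an assumed key column.
rows = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 20},
    {"order_id": 2, "amount": 20},
]

duplicates = find_duplicate_keys(rows, ["order_id"])
# An empty audit result means the data is clean and processing may proceed.
should_proceed = not duplicates
```

An empty result plays the same role as an empty audit output dataset: no records, no duplicates.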

Best Answers

  • GrantSmith
    GrantSmith Indiana 🔴
    Accepted Answer

    Hey @kjones140

    I haven't tried this before, but you might be able to use a recursive dataflow that takes both the new data and its own output dataset as inputs. You could then perform some auditing on your new data as a third dataset, producing a single row with one column whose value is 1 (success) or 0 (fail).

    Your new dataset would have a constant column added with an audit value of 1, and your existing output dataset would have a constant of 0. You then append the new data to the historical dataset and join the result to the output of your audit path on that value. This way, if the audit passes (1) it selects the new data; if it fails (0) it selects the old historical data.


    Essentially you're conditionally selecting the dataset to use.
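
The append-then-join logic can be hard to picture, so here is a rough sketch of it in plain Python rather than in a Domo dataflow. The function name, the `audit_flag` column name, and the sample rows are all made up for illustration; in Domo the same steps would be tiles in the ETL.

```python
def select_rows(new_rows, historical_rows, audit_passed):
    """Mimic append-then-join: keep only the source whose flag matches the audit."""
    # Tag each source with a constant: 1 = new data, 0 = historical output.
    appended = (
        [dict(row, audit_flag=1) for row in new_rows]
        + [dict(row, audit_flag=0) for row in historical_rows]
    )
    # The audit path yields a single value: 1 (pass) or 0 (fail).
    audit_result = 1 if audit_passed else 0
    # Inner join on the flag: only rows from the matching source survive.
    return [row for row in appended if row["audit_flag"] == audit_result]


# Hypothetical data: the new load contains a duplicate, so the audit fails.
new_rows = [{"id": 1}, {"id": 1}]
historical_rows = [{"id": 1}]

kept = select_rows(new_rows, historical_rows, audit_passed=False)
```

With a failed audit only the historical rows come through, so the output stays stale but clean, which is the trade-off asked for in the question.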

  • GrantSmith
    GrantSmith Indiana 🔴
    Accepted Answer

    @jaeW_at_Onyx

    I suppose yes, you could use something like a Python script (with the pydomo package) as an orchestrator: export the data, do some validation, and then kick off the dataflow if the validation passes.

    You could set the DataFlow to be triggered by updates to the dataset that needs to be validated, so it'd always be up to date when the DataFlow runs. The idea behind all of this isn't to have the DataFlow append to itself and take a snapshot, but rather to have the DataFlow decide which input dataset to use based on the audit results.


    A square peg can fit into a round hole if you hit it hard enough (even though this isn't an ideal solution) 😂
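
The orchestrator idea (export, validate, then kick off) might look something like the sketch below. It deliberately does not call any Domo or PyDomo API: `fetch_audit_rows` and `trigger_dataflow` are hypothetical callables you would supply yourself, for example wrappers around a dataset export and whatever mechanism you use to start a dataflow.

```python
def run_if_clean(fetch_audit_rows, trigger_dataflow):
    """Trigger the downstream dataflow only when the audit dataset is empty.

    Both arguments are hypothetical callables supplied by the caller:
    fetch_audit_rows() returns the audit dataset's rows as a list, and
    trigger_dataflow() starts the downstream dataflow.
    """
    audit_rows = fetch_audit_rows()
    if audit_rows:
        # Any audit row signals a data quality problem: skip this run.
        return False
    trigger_dataflow()
    return True


# Stand-ins for real Domo calls, so the sketch runs on its own.
triggered = []
clean_run = run_if_clean(lambda: [], lambda: triggered.append("run"))
dirty_run = run_if_clean(lambda: [{"order_id": 2}], lambda: triggered.append("run"))
```

Passing the two operations in as callables keeps the decision logic separate from the Domo-specific plumbing, so the gate can be tested without credentials.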

Answers

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    there is not a way to conditionally trigger dataflows built into the UI yet. i have a feeling that DP21 they are going to fix this. but maybe that's just me being unnecessarily optimistic.


    if you know python, you could use PyDomo to orchestrate dataflows with more control

    https://www.youtube.com/watch?v=oT5NipvWK1o

  • kjones140
    kjones140 ⚪

    Thanks jaeW_at_Onyx, I hope you are right about DP21.

    I will take a look at the YouTube link you provided to see if it offers a technique I can implement to handle this (and other) situations.

  • kjones140
    kjones140 ⚪

    Hi GrantSmith,

    I have used recursive dataflows for other things before so I am familiar with that concept.

    What you propose is very creative and just might do the trick. I will need to think this through a bit and determine how to implement it in our environment.

    Thank you for responding!

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @kjones140 this is an interesting idea, but @GrantSmith the part that makes me nervous is that execution would have to be based on whether a previous ETL is up to date.


    to know the answer to that question you'd have to have up-to-date info about the state of your datasets. Domo typically only wants users to update their governance and stats datasets daily, but for your solution you'd have to be updating domoGovernance_datasets or domoStats_dataflow_history or domoGovernance_dataflows (depending on which you use to answer "are you up to date?") hourly or sub-hourly.


    secondly, remember recursive executions get slower as data volumes grow. yes, Magic 2.0 makes the issue less severe, but a recursive dataflow is a step backward in terms of some of the optimizations Magic 2.0 implemented under the covers (regarding only executing updated rows).


    for that reason, i would recommend putting the ingenuity to use by building an external orchestrator that can evaluate "is it time to execute this dataflow yet?" BEFORE it commits to running the dataflow (as opposed to your solution, which MUST run the dataflow before it answers "is it time yet?").
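
One way an external orchestrator could answer "is it time yet?" before running anything is to compare timestamps, as in the rough sketch below. It assumes "time to run" means every input dataset has been refreshed since the dataflow last succeeded; that rule, the function name, and the sample timestamps are illustrative assumptions, and where the timestamps come from is left to the reader.

```python
from datetime import datetime


def is_time_to_run(input_updated_at, last_successful_run):
    """True when every input was updated after the last successful run."""
    if last_successful_run is None:
        # The dataflow has never succeeded, so there is nothing to compare.
        return True
    return all(updated > last_successful_run for updated in input_updated_at)


# Hypothetical timestamps for two input datasets and the last dataflow run.
last_run = datetime(2021, 3, 1, 8, 0)
inputs_fresh = [datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 5)]
inputs_partly_stale = [datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 7, 30)]

ready = is_time_to_run(inputs_fresh, last_run)
not_ready = is_time_to_run(inputs_partly_stale, last_run)
```

The check is only as current as the timestamps fed into it, which is the concern raised above about how often the governance and stats datasets are refreshed.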

  • kjones140
    kjones140 ⚪

    @GrantSmith

    I just implemented the recursive dataflow concept that you proposed. Thank you for this!


    @jaeW_at_Onyx

    Thank you for your recommendation as well. I watched your YouTube link, but I will need to get up to speed on Python (and PyDomo) before I can implement this in our environment. I would definitely like to have this level of control over our dataflows.