Update Method option on dataflow

I was editing a Redshift dataflow this morning and noticed a new option on the output dataset named Update Method, with Replace and Append choices. I am not sure when this became available or how it should be used. As I understand it, every dataset currently replaces itself on each run, and to append you normally need a recursive dataflow that feeds the base data back into itself. Could someone give me an example of a use case for the Append option?

Thank you.

Best Answer

  • n8isjack-ret
    Accepted Answer

    @cwolman, yes, that is a great use case for it.

    Key things to note:

    • There must be no overlap between Dataset A and Dataset B. You cannot update records that are already loaded into Dataset A using this new method.
    • Correcting errors is trickier. If a data load must be reloaded, or was loaded twice, it is difficult to correct using the new method.
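To make the overlap point concrete, here is a minimal sketch in plain Python. It is only a model of the behavior described above, not Domo's API or Redshift SQL; the function names, the `id` key, and the sample rows are all hypothetical.

```python
def append_update(existing, new_batch):
    """Model of an append-style update: new rows are added, nothing is replaced."""
    return existing + new_batch


def recursive_upsert(existing, new_batch, key="id"):
    """Model of a recursive dataflow that keeps the newest row per key."""
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in new_batch})
    return list(merged.values())


# Hypothetical sample data: id 2 appears in both datasets (an overlap).
dataset_a = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
dataset_b = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]

appended = append_update(dataset_a, dataset_b)     # id 2 is now duplicated
upserted = recursive_upsert(dataset_a, dataset_b)  # id 2 is updated in place

# Loading the same batch a second time duplicates it again under append.
loaded_twice = append_update(appended, dataset_b)

print(len(appended), len(upserted), len(loaded_twice))
```

Under these assumptions the append model ends up with two rows for id 2, while the recursive model keeps one. That is why overlapping or repeated loads are hard to fix with append alone.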

Answers

  • Hi @cwolman, this is a pretty exciting change, but it is for specific situations. It likely will not replace your recursive dataflow.

    It lets you take data that could simply be appended and modify it first. Say that you are loading sales transactions. Dealing with hundreds of millions of rows is slow, but you can just append the new data. That works, but any data prep, cleanup, filtering, etc. then has to be done in every card that uses the data. This new method lets you transform the data before appending it to the dataset.

  • Would this new functionality work for this scenario?

     

    Basic recursive dataflow

    Dataset A - contains 25M rows (base data)

    Dataset B - contains 1M rows (new data)

    Transform Dataset B and union it to Dataset A for the final output. Dataset A now contains 26M rows. Rinse and repeat daily.

    Could I edit this existing dataflow, remove Dataset A as an input, and simply transform Dataset B and have it append to the output dataset using this new feature?

    This would eliminate the time required to load Dataset A first, which would decrease processing time.
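The two patterns being compared can be sketched in plain Python, with 25 and 1 rows standing in for 25M and 1M. This is only an illustration under stated assumptions: `transform`, `recursive_run`, and `append_run` are hypothetical names, the cleanup step is made up, and nothing here is Domo's actual implementation.

```python
def transform(rows):
    """Hypothetical cleanup step: drop rows with no amount, round the rest."""
    return [
        {**r, "amount": round(r["amount"], 2)}
        for r in rows
        if r["amount"] is not None
    ]


def recursive_run(dataset_a, dataset_b):
    """Recursive pattern: read A and B, transform B, union, replace the output."""
    rows_read = len(dataset_a) + len(dataset_b)
    return dataset_a + transform(dataset_b), rows_read


def append_run(output, dataset_b):
    """Append pattern: read only B, transform it, append to the existing output."""
    rows_read = len(dataset_b)
    return output + transform(dataset_b), rows_read


# Scaled-down stand-ins for the 25M-row base and the 1M-row daily load.
dataset_a = [{"id": i, "amount": 1.0} for i in range(25)]
dataset_b = [{"id": 100 + i, "amount": 2.004} for i in range(1)]

recursive_out, recursive_read = recursive_run(dataset_a, dataset_b)
append_out, append_read = append_run(list(dataset_a), dataset_b)

print(len(recursive_out), recursive_read)  # rows in output, rows read
print(len(append_out), append_read)
```

In this model both runs produce the same 26 rows, but the append run reads only the new batch. That only holds when Dataset B never overlaps with rows already in the output.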