Help with recursive dataflow

Yellow Belt

Re: Help with recursive dataflow

I think rephrasing it as follows will give more clarity:

If it does, filter it out from A (to avoid dupes); if it doesn't, then append.

 

Your workflow will have one new dataset B and two versions of dataset A.

You will have to use a Left Join to mark the duplicate values (flag them as, say, "Duplicate"), then filter out the duplicates and append the remaining rows, together with the new data B, back into dataset A, as sketched below.
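
If you were building this as a SQL dataflow instead of Magic ETL tiles, the same dedup-then-append logic looks roughly like this. It's only a sketch: the table names historical_a and new_b and the key column id are placeholder assumptions, not names from your actual dataset.

-- Keep the rows of A that do NOT appear in B (the left join marks the duplicates),
-- then append all of the new data B.
SELECT a.*
FROM historical_a a
LEFT JOIN new_b b
  ON a.id = b.id            -- whatever column(s) define a duplicate
WHERE b.id IS NULL          -- filter out the rows already covered by B

UNION ALL

SELECT b.*
FROM new_b b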

 

Please let me know if you have any questions.

 

 

 

Major Brown Belt

Re: Help with recursive dataflow

Sorry, late to the game! But here's my ten cents: don't build a recursive dataflow.

 

Yes, a recursive dataflow will work for your use case, but recursive dataflows don't scale with data volume.  If we're talking 100M+ row datasets, you'll start waiting anywhere from 20+ minutes up to runs I've seen take 5, 10, and even 23.999 hours.  (23.999 because Domo kills dataflows that take longer than a day.)

 

RDs are fine with small datasets, but if yours is going to get huge, you'll want to do something else.

 

Why don't RDs scale?

B/c every time you run a recursive dataflow, the ETL has to transfer your entire dataset to the ETL engine, transform the dataset, write the entire dataset back to disk (Vault), and then transfer the dataset to Domo's database layer (Adrenaline).

 

Imagine you have a 100M row historical dataset and a 10k row new_data dataset.  You have to transform 100M + 10k rows... every day.  This can get ugly fast.

 

TL;DR: the better solution

Ask your CSM about Data Assembler or UPSERT.  

 

With Data Assembler, your historical dataset gets chunked into partitions (imagine old-school encyclopedia volumes partitioned by date).  When you get new data, Data Assembler asks, "Which partitions (new dates) exist in my new data?"  If a partition already exists in the historical data, it drops the existing partition and replaces it with the new one.

Domo can run this operation much faster because it only has to delete the old encyclopedia volume and replace it with a new one.  Because the volumes are stored as separate files, it's effectively just a case of swapping a file.
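
You never write this yourself; Data Assembler handles it at the file level.  But conceptually, the partition swap is something like the following sketch, where historical_data, new_data, and partition_date are assumed placeholder names:

-- Conceptual sketch only: drop whichever date partitions show up in the new data,
-- then load the new data in their place.
DELETE FROM historical_data
WHERE partition_date IN (SELECT DISTINCT partition_date FROM new_data);

INSERT INTO historical_data
SELECT * FROM new_data;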

 

UPSERT works like a MERGE statement in SQL, provided your historical_data table has a 'primary key'.  When you load new data, Domo asks, "Does this primary key already exist in my historical table?"  If yes, it replaces the row; if no, it creates a new row.

https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
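
For reference, the T-SQL MERGE from that link looks roughly like this.  The names historical_data, new_data, id, and value are placeholder assumptions for illustration, not anything Domo exposes:

MERGE historical_data AS target
USING new_data AS source
  ON target.id = source.id                               -- the 'primary key' match
WHEN MATCHED THEN
  UPDATE SET target.value = source.value                 -- key exists: replace the row
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (source.id, source.value);   -- key is new: create a row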

 

Both UPSERT and DA are much faster than recursive dataflows because you write significantly less data as your data volumes get larger and larger.

Major Brown Belt

Re: Help with recursive dataflow

I just posted saying not to use a recursive dataflow.  But if you're hell-bent:

 

Don't use a JOIN to combine historical and new.  Use an APPEND (b/c JOINs in Magic ETL are slow, especially with composite keys).

 

Add a ROW_NUMBER() function via the Rank & Window tile, partitioning the rows by however you define a duplicate row and ordering by batch_upload_date.  Then keep only the rows where RowNumber = 1; the SQL equivalent is sketched below.
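
Expressed as SQL, the same logic is roughly the following.  The table name appended_data and the column key_col are placeholder assumptions; batch_upload_date is the ordering column mentioned above:

-- Number each group of duplicate rows, newest batch first, and keep only row 1.
SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY key_col              -- however you define a duplicate row
      ORDER BY batch_upload_date DESC   -- newest upload wins
    ) AS row_num
  FROM appended_data t
) ranked
WHERE row_num = 1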

 
