Magic ETL Efficiency

Reply
Orange Belt

Magic ETL Efficiency

Hello,

 

I have a simple Magic ETL that I use to filter a dataset from our company's source system to only show the latest version of each record. The dataflow finds the latest timestamp for each record, then uses an inner join on the record ID and timestamp to only include the most recent row. I've attached a screenshot of the dataflow configuration for reference.

 

Magic ETL.PNG

 

Currently it takes about 15-20 minutes for this dataflow to run, processing about 16M rows to produce an output of 1.7M rows. I'm trying to come up with ways to improve the efficiency of this dataflow as the input dataset continues to grow to ensure that it doesn't impact the timing of downstream dataflows. Does anyone have any suggestions or alternatives? Any help would be greatly appreciated.

 

Thanks!


Accepted Solutions
Highlighted
Black Belt

Re: Magic ETL Efficiency

Hi @MichelleH 

 

you could try the following. I think it might help improved the efficiency of your flow

Follow the instructions in the screenshot attached to

  1. Remove the Group By and add a "Rank & Window"
  2. instructions for the Rank window set up are in the screenshot
  3.  Click on Addf Function
  4.  Name the new column "Rank_Desc"
  5. Select the function to apply a "Dense Rank" from the drop dow menu selection and click Apply
  6. continue to step 2 selecting the column that identifies the lastest version
  7. continue to step 3 and choose "Descending"
  8. continue to step 4 and select the column that makes the partition, in your case "RecordID"
  9. after setting up the rank window function , use the new column "Rank_Desc" to filter your dataset where the column = 1

Capture1.PNG

 

                        

Domo Arigato!

**Say 'Thanks' by clicking the thumbs up in the post that helped you.
**Please mark the post that solves your problem as 'Accepted Solution'

View solution in original post


All Replies
Highlighted
Black Belt

Re: Magic ETL Efficiency

Hi @MichelleH 

 

you could try the following. I think it might help improved the efficiency of your flow

Follow the instructions in the screenshot attached to

  1. Remove the Group By and add a "Rank & Window"
  2. instructions for the Rank window set up are in the screenshot
  3.  Click on Addf Function
  4.  Name the new column "Rank_Desc"
  5. Select the function to apply a "Dense Rank" from the drop dow menu selection and click Apply
  6. continue to step 2 selecting the column that identifies the lastest version
  7. continue to step 3 and choose "Descending"
  8. continue to step 4 and select the column that makes the partition, in your case "RecordID"
  9. after setting up the rank window function , use the new column "Rank_Desc" to filter your dataset where the column = 1

Capture1.PNG

 

                        

Domo Arigato!

**Say 'Thanks' by clicking the thumbs up in the post that helped you.
**Please mark the post that solves your problem as 'Accepted Solution'

View solution in original post

Highlighted
Orange Belt

Re: Magic ETL Efficiency

Hi @Godiepi ,

 

Thanks for the suggestion! I created a copy of the original dataflow and updated it using your instructions, then let the two dataflows run concurrently for a couple of days to observe the results.

 

While your method was faster overall, it only reduced the average run time by 1-2 minutes. I agree that using Dense Rank with a filter is a better option than a Group By and Join, however I'm still open to additional suggestions that could possibly help get the run time to under 10 minutes if possible. 

 

Thanks again!

Highlighted
Orange Belt

Re: Magic ETL Efficiency

There's not much else going on in the dataflow, especially nothing that adds complexity. I don't know if this is an option for you, but have you thought about decreasing the size of the input set? Perhaps increase the frequency of the import and only bring in rows added since the last import? I don't know how wide your data is, but reducing the number of columns might add some speed (assuming there are columns you could eliminate).

Highlighted
Major Brown Belt

Re: Magic ETL Efficiency

Great Suggestion @Godiepi!  I often forget about using Rank!

 

I agree with @bdavis, and was about to suggest adding column selection in the front so it doesn't have to chunk through so much.  

 

You also may be able to remove the 'Remove Duplicates' function if the selection of Rank=1 does the trick.  It seems small, but it's still looking through every row across many attributes, especially if it's before your column selection.  

 

 

 

DataMaven
Breaking Down Silos - Building Bridges
Check out my video!
**Say "Thanks" by clicking the heart in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
Highlighted
Orange Belt

Re: Magic ETL Efficiency

@DataMaven I hadn't thought about the Remove Duplicates, but I'll get rid of that and see if that helps!

Announcements
Win free lodging at Domoapalooza! We want you to share you favorite Domo tips and tricks. Click here for more details!