Magic ETL Efficiency

Hello,

 

I have a simple Magic ETL that I use to filter a dataset from our company's source system to only show the latest version of each record. The dataflow finds the latest timestamp for each record, then uses an inner join on the record ID and timestamp to only include the most recent row. I've attached a screenshot of the dataflow configuration for reference.

 

Magic ETL.PNG

 

Currently it takes about 15-20 minutes for this dataflow to run, processing about 16M rows to produce an output of 1.7M rows. I'm trying to come up with ways to improve the efficiency of this dataflow as the input dataset continues to grow to ensure that it doesn't impact the timing of downstream dataflows. Does anyone have any suggestions or alternatives? Any help would be greatly appreciated.

 

Thanks!

Best Answer

  • Godiepi
    Godiepi 🟣
    Accepted Answer

    Hi @MichelleH 

     

    you could try the following. I think it might help improved the efficiency of your flow

    Follow the instructions in the screenshot attached to

    1. Remove the Group By and add a "Rank & Window"
    2. instructions for the Rank window set up are in the screenshot
    3.  Click on Addf Function
    4.  Name the new column "Rank_Desc"
    5. Select the function to apply a "Dense Rank" from the drop dow menu selection and click Apply
    6. continue to step 2 selecting the column that identifies the lastest version
    7. continue to step 3 and choose "Descending"
    8. continue to step 4 and select the column that makes the partition, in your case "RecordID"
    9. after setting up the rank window function , use the new column "Rank_Desc" to filter your dataset where the column = 1

    Capture1.PNG

     

                            

Answers

  • Hi @Godiepi ,

     

    Thanks for the suggestion! I created a copy of the original dataflow and updated it using your instructions, then let the two dataflows run concurrently for a couple of days to observe the results.

     

    While your method was faster overall, it only reduced the average run time by 1-2 minutes. I agree that using Dense Rank with a filter is a better option than a Group By and Join, however I'm still open to additional suggestions that could possibly help get the run time to under 10 minutes if possible. 

     

    Thanks again!

  • There's not much else going on in the dataflow, especially nothing that adds complexity. I don't know if this is an option for you, but have you thought about decreasing the size of the input set? Perhaps increase the frequency of the import and only bring in rows added since the last import? I don't know how wide your data is, but reducing the number of columns might add some speed (assuming there are columns you could eliminate).

  • Great Suggestion @Godiepi!  I often forget about using Rank!

     

    I agree with @bdavis, and was about to suggest adding column selection in the front so it doesn't have to chunk through so much.  

     

    You also may be able to remove the 'Remove Duplicates' function if the selection of Rank=1 does the trick.  It seems small, but it's still looking through every row across many attributes, especially if it's before your column selection.  

     

     

     

  • @DataMaven I hadn't thought about the Remove Duplicates, but I'll get rid of that and see if that helps!

  • Did it help?