ETL Slowed down by large Data Input

I'm currently using a dataset in a few dataflows that contains around 10M rows of data (this includes historical extracts from the last day of each month as well as the previous day). Any ETL process that is using this dataset as an input takes a long time to run due to all of the rows having to load in before anything can be done (the actual ETL steps run quickly). What are some best practices to isolate on the data I need in the data input process to reduce run time?

Best Answer

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤
    Accepted Answer

    use a Dataset view (beta feature, ask your csm) to isolate the subset of data that you actually need in ETL..

     

    are you running magic or SQL?  Magic can start ETL before all the data is loaded into the ETL environment, with SQL you have to wait until all the data is loaded into a table (and indexed if you're using redshift) before the dataflow can begin.  

     

    also, SQL enginges cannot leverage a DSV.  if you're using SQL your pipeline has to be:

     

    1) DSV

    2) Dataset Copy (connector) of your DSV.

    then SQL transform.

     

    If you have the Adrenaline dataflows feature, you may be able to leverage that for faster performance b/c it all happens in Adrenaline, but this is a premium feature.

     

    If you work with your support team, you may be able to 

    1) materialize your DSV (so you can avoid dataset copy)

    2) pull the materialized DSV into SQL, but make it clear that that's your intention b/c i think there's additional backend work that has to happen to us a materialized view in a SQL dataflow.

     

    SHORT VERSION OF THE STORY, you'll have a simpler data pipeline in Magic 2.0 

Answers

  • @jaeW_at_Onyx Thank you so much for the suggestions! I will be switching these ETLs over to Magic 2.0 and take a look into Adrenaline.