ETL Slowed Down by Large Data Input

Brown Belt


I'm currently using a dataset in a few dataflows that contains around 10M rows of data (this includes historical extracts from the last day of each month as well as the previous day). Any ETL process that uses this dataset as an input takes a long time to run because all of the rows have to load in before anything can be done (the actual ETL steps run quickly). What are some best practices for isolating just the data I need during the data input step to reduce run time?


Accepted Solutions
Black Belt

Use a DataSet View (a beta feature; ask your CSM) to isolate the subset of data that you actually need in the ETL.

 

Are you running Magic or SQL? Magic can start the ETL before all the data is loaded into the ETL environment; with SQL you have to wait until all the data is loaded into a table (and indexed, if you're using Redshift) before the dataflow can begin.

 

Also, the SQL engines cannot read a DSV directly. If you're using SQL, your pipeline has to be:

 

1) DSV

2) DataSet Copy (connector) of your DSV

3) SQL transform
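For the transform step itself, the row filter described in the question is simple to express. This is a sketch of a MySQL-style SQL transform, assuming a hypothetical input table `input_dataset` (the copy of your DSV) with a hypothetical `batch_date` column holding the extract date:

```sql
-- Sketch only: `input_dataset` and `batch_date` are placeholder names
-- standing in for your copied DSV and its extract-date column.
SELECT *
FROM input_dataset
WHERE batch_date = LAST_DAY(batch_date)        -- keep month-end historical extracts
   OR batch_date = CURDATE() - INTERVAL 1 DAY; -- keep the previous day's extract
```

Note that filtering inside the SQL transform only speeds up the transform; the load-time cost described in the question is avoided by pushing the same filter upstream into the DSV, so fewer rows ever reach the dataflow.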

 

If you have the Adrenaline Dataflows feature, you may be able to leverage that for faster performance because everything happens inside Adrenaline, but this is a premium feature.

 

If you work with your support team, you may be able to:

1) materialize your DSV (so you can avoid the DataSet Copy)

2) pull the materialized DSV into SQL, but make it clear that that's your intention, because I think there's additional backend work that has to happen to use a materialized view in a SQL dataflow.

 

SHORT VERSION OF THE STORY: you'll have a simpler data pipeline in Magic 2.0.


Jae Wilson
Check out my Domo Training YouTube Channel

**Say "Thanks" by clicking the heart in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"



All Replies

Brown Belt

@jaeW_at_Onyx Thank you so much for the suggestions! I will be switching these ETLs over to Magic 2.0 and taking a look at Adrenaline.
