ETL join produces an output dataset that is too big?

I am very new to using Domo, so hopefully I can explain my situation well.

I have three datasets with different column names and structures, although they do share some columns, like 'Employee Name' and 'Employee ID'.

I tried to join them (using an inner join) and ran a preview with the 10k limit, but it showed the message 'Preview took too long to respond. Try running with less data.'

 

I then ran the preview with the 1k limit and it worked.

 

So I saved and ran the dataflow, but it kept on running without showing any results, and my output ended up with 0 rows.

 

Can anyone advise what I am doing wrong here and how I can fix it? Thanks a lot.


Comments

  • Just to make myself clearer: I joined the three datasets on the Employee Name column.

  • Hi @user04285 welcome to the community, and to Domo.

     

    Magic ETL can be incredibly powerful. Here are a few Magic tips that I think may be helpful in your situation:

     

    - Start simple/small: try running this as a simple join between just two of your datasets before bringing in the third. Take care to note the differences between Inner, Outer, etc.

    - Make sure there are no duplicates in your Key Column. You can do this by putting a "Remove Duplicates" function on the data thread before plugging it into the join.

    - Employee Name is typically a tough field to use as a Join Key because it is often free-form. I always look for a system-generated identifier (optimal), or at the very least an identifier field that is numeric (so Employee ID rather than Employee Name).

    - Lastly, a bit of a nuance with Left Outer joins: sometimes you will get an error message that says "Greater than 10K duplicates, can't be joined". In those cases, you can change it to a Right Outer (and reverse the dataset positions) and it will work.
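    To see why an inner join on a duplicated key blows up the row count, here is a minimal sketch in plain Python with made-up data (Magic ETL is a visual tool, so this just mimics the join logic, not any Domo API):

    ```python
    # Hypothetical rows: "Ana" appears twice on each side of the join.
    left = [{"Employee Name": "Ana", "Dept": "Sales"},
            {"Employee Name": "Ana", "Dept": "Support"}]
    right = [{"Employee Name": "Ana", "Region": "East"},
             {"Employee Name": "Ana", "Region": "West"}]

    # Inner join on "Employee Name": every matching left row pairs with
    # every matching right row, so 2 x 2 = 4 output rows.
    joined = [{**l, **r} for l in left for r in right
              if l["Employee Name"] == r["Employee Name"]]
    print(len(joined))  # 4
    ```

    With three datasets the effect compounds at each join, which is how a few hundred thousand input rows can balloon into millions.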

     

    All that being said, if you upload a couple of screenshots of your datasets and the Magic ETL itself, I am happy to have a look!

  • Hi @Chips, thanks for replying!

    So here's my ETL flow. Last time I tried it, it worked, but the output row count came to 4.4 million rows! I have no idea why, since the largest dataset I have only has 395k rows.

    [attached screenshots: ETL flow, total rows.png]

     

    I wonder what causes this to happen? 

     

    And about your suggestion:

    "..Make sure there are no duplicates in your Key Column, you can do this by putting a 'Remove Duplicates' function on the data thread before plugging it into the join"

    "..Employee Name is typically a tough field to use as a Join Key because it is often free-form. I always look for a system-generated identifier (optimal), or at the very least an identifier field that is numeric (so Employee ID rather than Employee Name)"

    "..Lastly, a bit of a nuance with Left Outer joins. Sometimes you will get an error message that says 'Greater than 10K duplicates, can't be joined'. In those cases, you can change it to a Right Outer (and reverse the dataset positions) and it will work"

    I am currently joining the three datasets on multiple columns, using inner joins. One of the join columns is the Employee Name column.

    So do you mean that I should change the joining method?

     

    Thanks again for helping Chips!

  • You probably have repeated Employee Names.

    When joining, if the keys are repeated, extra rows are added, unlike VLOOKUPs in Excel.

    If the same key appears twice in each of two datasets, 4 rows will be generated (2 × 2). The more repeats, the more rows: the duplicate counts on each side multiply.
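    The fix suggested above, deduplicating the key on one side before joining, can be sketched in plain Python (made-up data; the `dedupe` helper stands in for a "Remove Duplicates" tile):

    ```python
    # Hypothetical data: "Ana" is duplicated on both sides.
    left = [{"Employee Name": "Ana"}, {"Employee Name": "Ana"},
            {"Employee Name": "Bo"}]
    right = [{"Employee Name": "Ana", "ID": 1},
             {"Employee Name": "Ana", "ID": 2}]

    def dedupe(rows, key):
        # Keep only the first row seen for each key value.
        seen, out = set(), []
        for r in rows:
            if r[key] not in seen:
                seen.add(r[key])
                out.append(r)
        return out

    # Dedupe the right side, then inner join on "Employee Name".
    right_unique = dedupe(right, "Employee Name")
    joined = [{**l, **r} for l in left for r in right_unique
              if l["Employee Name"] == r["Employee Name"]]
    print(len(joined))  # 2: each left "Ana" row matches exactly once
    ```

    Without the dedupe step, the same join would produce 4 rows instead of 2.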

  • Check your join column for empty strings. If you have empty strings in both the left and right datasets, each empty-key row on the left will join with each empty-key row on the right, multiplying your row counts.
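    The empty-string effect is easy to demonstrate in plain Python (made-up data): matching blank keys behave like any other duplicated key and cross-join with each other.

    ```python
    # 1,000 rows on each side, all with an empty-string join key.
    left = [{"Employee Name": ""} for _ in range(1000)]
    right = [{"Employee Name": ""} for _ in range(1000)]

    # Every "" row on the left matches every "" row on the right:
    # 1,000 x 1,000 = 1,000,000 output rows.
    joined = [(l, r) for l in left for r in right
              if l["Employee Name"] == r["Employee Name"]]
    print(len(joined))  # 1000000
    ```

    This is one way a 395k-row input can plausibly turn into millions of output rows: filter out or fill blank keys before the join.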
