Error joining data. The left input cannot include over 10,000 duplicates.

I have a Magic ETL dataflow throwing the following error message:

 

"Error joining data. The left input cannot include over 10,000 duplicates. Please switch your inputs, group your data, or remove duplicates before joining."

 

There are 3 joins in the ETL. Each joins a reference table (5-100 rows) with a larger dataset (~50K rows). My reading is Domo doesn't like reference tables. Since they're so common, though, I assume others have run into this.

 

Can anyone help?

Best Answer

  • AS (Accepted Answer)

    I don't know what the underlying problem is, but a quick fix is to reverse the inputs and the direction of the join. Essentially the same thing, but backwards.

Answers

  • Are you joining on a unique identifier? Could you provide some more information about the data involved?

  • Hi Scott, 

     

    As an example, say I want to join table1 and table2 below. Table1 has customer data from an event, and table2 has data about that event along with a unique event ID. Table2 is what I referred to as the "reference table".

     

    So I want to join table1 with table2 on the event_id column to connect event and customer data. Domo says I can't do a join with 10k+ duplicates, which leads me to suspect it looks at the results of the join and objects to the many copies of -- in the example below -- Martypalooza being brought in. I thought that odd, because if it's a feature of Magic ETL, it makes one-to-many relationships a pain.

     

    table1:

    | firstName | lastName | event_id |
    |-----------|----------|----------|
    | Jim       | Jones    | abc1     |
    | Scott     | Thompson | abc1     |
    | ...       | ...      | ...      |
    | Omega     | Johnson  | abc1     |

    table2:

    | event_id | event_venue       | event_location | ..... | event_name   |
    |----------|-------------------|----------------|-------|--------------|
    | abc1     | Convention Center | Chicago        | ..... | Martypalooza |
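    The one-to-many shape described above can be sketched in pandas (a stand-in for what Magic ETL does internally; the data below is hypothetical, mirroring the example tables):

```python
import pandas as pd

# Hypothetical data mirroring the example tables above.
table1 = pd.DataFrame({
    "firstName": ["Jim", "Scott", "Omega"],
    "lastName": ["Jones", "Thompson", "Johnson"],
    "event_id": ["abc1", "abc1", "abc1"],
})
table2 = pd.DataFrame({
    "event_id": ["abc1"],
    "event_venue": ["Convention Center"],
    "event_location": ["Chicago"],
    "event_name": ["Martypalooza"],
})

# One-to-many left join: every customer row picks up the single matching
# event row, so "Martypalooza" repeats once per customer -- exactly the
# duplication the error message guards against at scale.
joined = table1.merge(table2, on="event_id", how="left")
print(joined[["firstName", "event_name"]])
```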


  • Thanks, Aaron. Just got off the phone w/ Domo and that's their recommended solution as well.

     

    I'm curious, though, since this type of join/relationship is pretty common -- at least in my experience -- and a fundamental component of relational design. It just seems strange that Domo would make it illegal.

     

    Anyway at least now there's a documented solution in da Dojo.

     

    thanks

  • The only thought that I have is that ETL just doesn't like joining data when the left table has more than 10,000 duplicate "keys" in it. It's an interesting question as to why this is an issue, but I would just keep your larger lists on the right side of the joins.

  • Hello All, 

    I wanted to provide a little insight into why Domo gives you an error when your left table contains many duplicates.

    Magic ETL is a very easy-to-use tool and makes joins very simple. We found that many people were creating large Cartesian joins precisely because it was so quick and easy to use.

    Because a left join is the most common, we included this error message to help people realize that they might be creating a large Cartesian join. This way you don't duplicate your data by accident.

    As mentioned in the thread, the fix is to swap the order of your inputs and change the join from a Left to a Right.

    Cheers!
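    The swap described in this thread can be verified in pandas (hypothetical data; `big` stands in for the large input and `ref` for the small reference table):

```python
import pandas as pd

# Hypothetical stand-ins: "big" is the large input, "ref" the small lookup table.
big = pd.DataFrame({"event_id": ["abc1"] * 4, "attendee": list("ABCD")})
ref = pd.DataFrame({"event_id": ["abc1"], "event_name": ["Martypalooza"]})

# Original shape: large table on the left, LEFT join.
left_version = big.merge(ref, on="event_id", how="left")

# Workaround shape: small table on the left, RIGHT join.
right_version = ref.merge(big, on="event_id", how="right")

# Same rows come back either way (column order aside).
print(left_version.sort_index(axis=1).equals(right_version.sort_index(axis=1)))
```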

  • That's kind of hilarious!  It sort of makes it "idiot resistant"!  If you fully understand the logic, you will push until you get a solution because it doesn't make sense to not be able to use reference tables against large datasets.  If you shouldn't be playing with such things, someone somewhere will look at your stuff and make sure it's ok, or you'll give up.  

     

    It's caught me a couple times, but now that I know the logic...never again!  lol

  • I have a couple questions about this.

    1) Is the threshold different from the message? I have a dataflow that seemed to work fine yesterday, but broke today. I think the 10,000 duplicates must have been breached a long time ago... unless

    2) What does the 10,000 duplicates refer to? Is it when the left side joins on a column with 10,000 unique values that each have at least one duplicated row, is it when one particular value has 10,000 duplicates, or is it when the total number of non-unique rows reaches 10,000?

    3) The dataflow is automatically disabled after throwing the 10,000 duplicates error. If re-enabled and run, it seems to run for a long time (about 20 minutes) and then throws an error (Error joining data. The left input cannot include over 10,000 duplicates.) and automatically restarts. Is this expected behavior?

  • This error occurs when the left dataset of your join has over 10,000 duplicates. Say you are joining sales data to customer data, presumably on a customer ID field. If you place your sales data on the left and your customer data on the right, then join on customer ID, the join will error out if a single customer ID has more than 10,000 rows of sales data. The fix, as mentioned above, is to put your customer dataset on the left and the sales data on the right; then, when you join on customer ID, your left dataset has no duplicates.
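    Under that reading of the limit (a single key with more than 10,000 rows on the left), you can sanity-check a dataset before building the join. A pandas sketch, with a hypothetical `sales` table standing in for your actual left input:

```python
import pandas as pd

# Hypothetical sales table; in practice, load your actual left input.
sales = pd.DataFrame({
    "customer_id": ["c1"] * 12000 + ["c2"] * 5,
    "amount": range(12005),
})

# Count rows per join key on the prospective LEFT input. If any single key
# exceeds 10,000 rows, this layout would trip the guard described above --
# swap the inputs and use a right join instead.
counts = sales["customer_id"].value_counts()
print(counts[counts > 10_000])
```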