Why is there a large performance hit for mixing columns from a join statement?

Dr_Duru
Dr_Duru ⚪️

At the risk of my issue being very data specific, I want to ask for clues as to why I am seeing a large performance hit from what looks like a simple change in the SQL in a data flow.

I started with the following and got acceptable performance of 15 minutes or so:

select A.* from A inner join B on A.field1 = B.field2 and A.field3 = B.field4

However, when I added a column from the table on the right side of the join...

select A.*, B.field5 from A inner join B on A.field1 = B.field2 and A.field3 = B.field4

Suddenly, my performance soared to 14 hours.

What can explain this kind of performance hit? Is there a work-around? I tried a left join instead. I tried playing around with the indexing. Nothing worked to bring the performance back in line with the original SQL.

Comments

  • GrantSmith
    GrantSmith Indiana 🥷

    @Dr_Duru

    I'm assuming you're using a MySQL dataflow here.

    What are you data types for field 3 and 4? String comparison is typically slower than numerical comparison.

    Also have you added indexes on field3 and field 4?

    I'd also recommend trying to utilize Magic ETL 2.0 as that is typically much more performant than MySQL dataflows.

    **Was this post helpful? Click Agree or Like below**
    **Did this solve your problem? Accept it as a solution!**
  • Dr_Duru
    Dr_Duru ⚪️

    I've tried everything and nothing worked except to change my data flow so the field I wanted to add came through in the left-hand side of the join. My query now runs in 19 minutes.

    I even tried adding an index on the field5 (which is a string).

    The join is on a date field and an integer field.

    The data flows are too complex for use in Magic ETL 2.0.

    {Note I had a discussion with my customer solutions rep last week about this, and the issue left him just as perplexed as I am. I also was able to get performance down to about 3 hours before I gave up and used a different flow }

  • is the relationship a 1:1 relationship? how many rows of data do you expect to match? If it's a 1 to many relationship, you could be exponentially growing your output dataset which would cause all subsequent steps in the process to take longer as well


    ______________________________________________________________________________________________
    “There is a superhero in all of us, we just need the courage to put on the cape.” -Superman
    ______________________________________________________________________________________________
  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🔴

    indexing IS very dataset specific. and you're using MySQL 5.6 so all the updates in the platform in the last ... 10? years are not in place. so yeah... a DBA is your best bet if you insist on optimizing your query performance by hand.

    PS in MySQL dataflows you can use the 'explain sql' button (it's a drop down under "run SQL" ) to show the execution plan which will give you a hint as to whether your dataflow is even taking advantage of the index.

    also keep in mind each transform will CREATE TABLE so you have to index your intermediate steps if you're going that route.



    i have implemented dataflows in Domo for over 100 customers ... i can count on one hand the number of design patterns that absolutely cannot be done in Magic... with the advent of FORMULA tiles and integration of SQL-esque functionality in different parts of the Magic platform... most jobs can be shifted into Magic 2.0 and perform pretty quickly.

    alternatively, divide the task. do the JOIN in a dataset view and the row level transforms in Magic.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

    **Say "Thanks" by clicking the ❤️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"