Removing duplicate from group by strings separated by comma

Using the ETL tool, I've got a column that is ALMOST exactly what I need.  I'm trying to identify the sales type mix of a customer's basket with the number of units in the basket, see image for example.

 

What I'd like to do, though, is remove the duplicates from that column.  For instance, in the row that has '2' units with the Mix column showing 'Regular Price, Regular Price', I'd like to only see Regular price.

 

In instances where there are mixed sales types, it's fine to keep each.  I just want to remove duplicates from within that column.

 

Anybody have any ideas?

 

Thanks!

Best Answer

  • AS
    AS 🔵
    Accepted Answer

    I'd try this first since you want to sum price and quantity by the ordered, grouped pricecatweb string:

    SELECT
    `date`
    ,`ordernumber`
    ,SUM(`price`) as 'Total Retail'
    ,SUM(`qtyordered`) as 'Total Units'
    ,GROUP_CONCAT(DISTINCT `pricecatweb` ORDER BY `pricecatweb`) as 'Mix'
    FROM
    `df_int_sql_prep_sale_type`
    GROUP BY
    `date`
    ,`ordernumber`
    ,GROUP_CONCAT(DISTINCT `pricecatweb` ORDER BY `pricecatweb`)

      

Answers

  • Is the Mix column an output of your ETL, using the GROUP BY tile and the "Combine strings separated by ," operator?

    I don't think it's available in ETL, but SQL has a function called GROUP_CONCAT, which is an aggregator function like a SUM or AVG, and just combines string values into a single string, like that ETL operator.  The special thing about GROUP_CONCAT is that you can add a DISTINCT modifier that will filter out the duplicates like you want to.

    Your Mix column values would turn out more like "Regular Price, Sale Price" instead of "Regular Price, Regular Price, Sale Price, Regular Price, Regular Price,"  You can insert ORDER BY modifiers and even change the delimiter if you wanted.

    If you're stuck, due to skillsets available, to using ETL, I'd recommend logging an enhancement request to add a "combine DISTINCT strings" option in ETL.

     

    For more information on GROUP_CONCAT, see:
    https://www.w3resource.com/mysql/aggregate-functions-and-grouping/aggregate-functions-and-grouping-group_concat.php

  • Hi Aaron,

     

    Yes, that's correct.  I used the 'Combine Strings by ,' operator.

     

    Would I need to apply this SQL statement to my original input dataset instead of the one in ETL?

     

    I tried running a SQL transform on the ETL output, but ran into an error (my SQL is pretty limited).

     

    Thanks for your help!  I feel like this is 98% to a solution ?

  • I don't have a lot of the details, but...

    Are you using the ETL to only give you this extra column and is the input to your ETL an output of a SQL dataflow already?

    If that's the case, forget using the ETL and inject this bit into your SQL.  It will have to operate on the original column, not the already aggregated string.

  • No, I'm using an ETL to join together several datasets to reference off an order-line level dataset (very granular).  After joining those together, I use the group by function in ETL for the actions which I've attached.

  • OK, then you have a choice, you can join your datasets with ETL, and do the groupings with the GROUP_CONCAT in a secondary dataflow, or just merge it all under one SQL dataflow.  

     

    As for the syntax to bring your strings together, it would be something like this:

     

     

    SELECT
    date
    ,ordernumber
    ,SUM(`price`) as 'Total Retail'
    ,SUM(`qtyordered`) as 'Total Units'
    ,GROUP_CONCAT(DISTINCT `pricecatweb`) as 'Mix'
    FROM
    [name of ETL output dataset if you choose to go that way, or your entire SQL join conditions from your ETL inputs]
    GROUP BY
    date
    ,ordernumber

     

     

  • Yes!  Perfect!  Thank you so much for your help.


    Edit:  Sorry, was a little hasty in my response.  This is close, but now I have another issue.  It's treating the order differently; for instance: "Regular price, sale price" as different than "Sale price, regular price."

     

    Is there a way to account for this?

  • Yes, with the GROUP_CONCAT function you can insert an ORDER BY clause.

     

    SELECT...

    FROM...

    WHERE...

    GROUP BY...

    ORDER BY GROUP_CONCAT(DISTINCT...)

  • Will using that ORDER BY statement organize the contents of each value, or just organize the dataset itself?

  • That placement will do it by the dataset but you can also use ORDER BY inside the GROUP_CONCAT function to order the list itself.

  • If I do the ORDER BY before the GROUP_CONCAT, will it first organize the values then remove the duplicates?  I'm just trying to eliminate the possible combinations, as when this dataset ran it came back with some 10k+ unique values due to all of the possible ways items were purchased.

  • That would also work to have  separate query step to sort the data by the independent mix values before the aggregation.

    I think in your query you'd have something like

    ,GROUP_CONCAT(DISTINCT `pricecatweb` ORDER BY `pricecatweb`) as 'Mix'

     And the same thing in your GROUP BY at the end.

  • So the total query would look like this?

     

    SELECT
    `date`
    ,`ordernumber`
    ,SUM(`price`) as 'Total Retail'
    ,SUM(`qtyordered`) as 'Total Units'
    ,GROUP_CONCAT(DISTINCT `pricecatweb` ORDER BY `pricecatweb`) as 'Mix'
    FROM
    `df_int_sql_prep_sale_type`
    GROUP BY
    `date`
    ,`ordernumber`

    Thanks so much for your help by the way - I really appreciate it.

  • Thanks Aaron, this worked perfectly.  I didn't need the last statement in there (it actually threw an error), but it worked well enough without it.

     

    Thanks again!