Combining data sets without duplication

I have 2 datasets:

  1. Contains individual transactions with an employee's name
     Employee Name Sale PriceSale Date
    John Doe100 11/2/20
    John Doe250 11/6/20
    Jane Doe50011/8/20
    Jane Doe10011/12/20
  2. Contains each employee and their number of consultations and new clients
    Employee NameStart DateEnd DateConsultationsNew Clients
    John Doe11/1/2011/7/2031
    Jane Doe11/1/2011/7/2022
    John Doe11/8/2011/14/2043
    Jane Doe11/8/2011/14/2054

I need to be able to filter both data sets such that dataset a) contains a single row for each employee with their total sales within a given date range and dataset b) contains a single row for each employee with their total consultations and new clients.  Once I have these datasets, I will need to join them on the employee's name to create a single dataset containing total sales, consultations and new clients.

 

Employee NameConsultationsClientsSale Price
John Doe74350
Jane Doe76600

 

I was able to accomplish this, but I run into an issue where I have to do the filtering within the ETL, which forces me to edit the ETL each time I want to query a new date range.  If I try to do the filtering in a card and remove the totals, I end up with duplicated data on that final table join.

 

Employee NameStart DateEnd DateConsultationsNew ClientsSale DateSale Price
John Doe11/1/2011/7/203111/2/20100
John Doe11/1/2011/7/203111/6/20250
Jane Doe11/1/2011/7/202211/8/20500
Jane Doe11/1/2011/7/202211/12/20100
John Doe11/8/2011/14/204311/2/20100
John Doe11/8/2011/14/204311/6/20250
Jane Doe11/8/2011/14/205411/8/20500
Jane Doe11/8/2011/14/205411/12/20100

 

Is there some way to make this work such that I can query for a date range after running the ETL and my data isn't duplicated?  

Comments

  • GrantSmith
    GrantSmith Indiana 🔴

    Hi @user060355 

     

    Your data has two different time slice units (daily - sales and multiple days-consultations) which makes filtering your data difficult. What I'd recommend you do is reformat your data so that your sales and consultations are both sliced by day. This would allow you to easily join and slice your data on any date range and utilize the date slicers on the cards.

  • At a high-level @GrantSmith is right -- mixing date dimensions is always tricky (in this case you have single day vs. multi-day range). The best thing to do is normalize the dates as much as possible.

    In your case I have a slightly different suggestion that sounds like it may solve your need. Can you try joining the "transaction/sale" table with the "consultant/info" table where the `Sale Date` is between the `Start` and `End` dates?

    Something like:

    SELECT
         a."Employee Name"
         , a."Start Date"
         , a."End Date"
         , a."Consultations"
         , a."New Clients"  
         , b."Sale Price Total"
         , b."Sale Units Total"
    FROM 
         "Consultation Table" a
         LEFT JOIN
              (SELECT 
                    "Employee Name"
                    , "Sale Date"
                    , SUM("Sale Price")   as  "Sale Price Total"
                    , COUNT("Sale Price") as "Sale Units Total"
                 FROM 
                     "Sale Table"
                 GROUP BY
                     "Employee Name", "Sale Date"
                 ) b ON a."Employee Name" = b."Employee Name" AND b."Sale Date" BETWEEN a."Start Date" and b."End Date"

    You would some granular sale data (i.e. multiple sales on one day) but it seems like it would get you the sales quantity and totals per day which seems like it's what you're after. If this methodology doesn't work then you have to go down Grant's path of normalizing date range. My suggestion would be doing something like EXTRACT(WEEK FROM "Sale Date") so you can line it up with your "Consultant Date(s)".
         

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    https://www.youtube.com/watch?v=PVbOeLSae9o&t=748s

     

    don't JOIN your data in ETL in any fashion.

    Instead, STACK the data using APPEND as demonstrated in this tutorial.  If you JOIN in ETL and aggregate you limit your ability to filter.  If you go the opposite direction and duplicate rows, you have to  do 'crazy math' to deduplicate your daily activity.

     

    if you absolutely must consolidate down to a single customer table ... although i can't imagine why you'd need that since you can easily GROUP BY Customer in analyzer, then extend the customer dimension as a very last step based on the STACKED dataset.