ETL setup - duplication breaking calculations

Hi,

I'm building a dataset where the concept is to take staff login and logout times and figure out their logged-in hours per day. Then, we want to derive certain measures based on their hours logged in.

So, in ETL, I've gotten their total logged-in hours per day down to one row, grouped by the date and the staff member. Then I group the other measures (calls per day, appointments scheduled per day, etc.) by date and staff member and join on staff and day = staff and day.

This was all working great until they wanted to know the count of specific "call outcome codes" per hour. As shown in my Excel attachment, bringing in the "call outcome codes" breaks my one row per staff member per day, creating multiple rows due to the several different outcome code possibilities.
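A minimal sketch of that fan-out (column names here are made up for illustration, not from the actual dataset): joining one hours row per staff/day to a many-rows-per-day outcomes table repeats the hours value on every joined row, so any later SUM over the result over-counts logged-in time.

```python
# Illustrative only: one hours row per staff/day joined to several
# outcome rows duplicates the hours value on every output row.

hours = [  # one row per staff member per day
    {"staff": "A", "date": "2021-01-04", "hours": 8.0},
]
outcomes = [  # several rows per staff member per day
    {"staff": "A", "date": "2021-01-04", "outcome_code": "X"},
    {"staff": "A", "date": "2021-01-04", "outcome_code": "Y"},
    {"staff": "A", "date": "2021-01-04", "outcome_code": "Z"},
]

# Inner join on (staff, date) fans the single hours row out to 3 rows.
joined = [
    {**h, **o}
    for h in hours
    for o in outcomes
    if h["staff"] == o["staff"] and h["date"] == o["date"]
]

total_hours = sum(row["hours"] for row in joined)
print(len(joined))   # 3 rows instead of 1
print(total_hours)   # 24.0 -- inflated from the true 8.0
```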

I don't want to pivot the outcome codes, because there are a lot of them and I think it might also make things difficult for card builders.

I have tried using partitions in beastmodes along the lines of (outcome code) / (max hours partitioned by user, day), but partitioned beastmodes always give me issues.


I did have one thought, but I wanted to get others' input before I change my ETL.

Since hours per day will always be my denominator, I was wondering if I could just put a row count in there in the ETL for the hours. Then, in my beastmode, just say (sum of outcome code when outcome code = x) / (sum of hours when row count = '1').
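A quick sketch of that row-count idea (column names hypothetical): number the duplicated rows within each staff/day, then only sum hours from rows flagged 1, so the duplicated hours are counted once.

```python
# Sketch of the "row count flag" idea: after the join duplicates hours,
# exactly one row per staff/day carries row_count = 1, so summing hours
# only where row_count = 1 recovers the true denominator.

joined = [
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome": "X", "row_count": 1},
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome": "Y", "row_count": 2},
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome": "Z", "row_count": 3},
]

# Beastmode equivalent:
#   SUM(CASE WHEN `outcome` = 'X' THEN 1 ELSE 0 END)
#     / SUM(CASE WHEN `row_count` = 1 THEN `hours` ELSE 0 END)
numerator = sum(1 for r in joined if r["outcome"] == "X")
denominator = sum(r["hours"] for r in joined if r["row_count"] == 1)
print(numerator / denominator)  # 0.125 -- one X outcome per 8 hours
```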


Best Answer

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🟤
    Accepted Answer

    You might try appending your call outcome codes rather than joining. When appending, choose to include all columns. This should keep your data from duplicating, but still allow you to filter in various ways.
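A sketch of what that append (UNION with all columns) looks like, with hypothetical column names: each source keeps its own rows, columns the other source lacks stay null, and the hours appear exactly once so they never duplicate.

```python
# Sketch of the append approach: stack the hours rows and the outcome
# rows instead of joining. Columns the other source lacks are None.
# Hours appear exactly once, so SUM(hours) stays correct, while the
# outcome rows can still be filtered and counted.

hours_rows = [
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome_code": None},
]
outcome_rows = [
    {"staff": "A", "date": "2021-01-04", "hours": None, "outcome_code": "X"},
    {"staff": "A", "date": "2021-01-04", "hours": None, "outcome_code": "Y"},
]

appended = hours_rows + outcome_rows  # UNION ALL with conformed columns

total_hours = sum(r["hours"] or 0 for r in appended)
x_count = sum(1 for r in appended if r["outcome_code"] == "X")
print(total_hours, x_count)  # 8.0 1 -- no duplication
```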

Answers

  • GrantSmith
    GrantSmith Indiana 🔴

    Hi @Jbrorby

You may be able to get around it by using a MIN or MAX function on your appointments, and you might also need a window function.


    COUNT(DISTINCT `Outcome Code`) / MIN(MIN(`Logged in hours per day`)) OVER (PARTITION BY `Staff`, `Login Date`)
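A sketch of what the windowed MIN buys you (column names assumed for illustration): even after the join has copied the hours value onto every outcome row, taking MIN of hours per (staff, date) partition recovers the true figure instead of the inflated sum.

```python
from collections import defaultdict

# After the join, the 8.0 hours value is duplicated on every outcome row.
joined = [
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome": "X"},
    {"staff": "A", "date": "2021-01-04", "hours": 8.0, "outcome": "Y"},
]

# Equivalent of MIN(`hours`) OVER (PARTITION BY `staff`, `date`):
# the minimum within the partition, rather than the (doubled) sum.
partition_min = defaultdict(lambda: float("inf"))
for r in joined:
    key = (r["staff"], r["date"])
    partition_min[key] = min(partition_min[key], r["hours"])

print(partition_min[("A", "2021-01-04")])  # 8.0, not the inflated 16.0
```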
    


  • Jbrorby
    Jbrorby ⚪️

    Thanks guys, I'll try it out

  • ST_-Superman-_
    edited July 23

    @Jbrorby - My recommendation would be to stack your data. If you could get your data to look something like this:

    You could then use something like this for outcome - appointments per hour:

    sum(case when `METRIC`='Outcome - Appointment' then `VALUE` else 0 end) / sum(case when `METRIC`='Logged Hours per day' then `VALUE` else 0 end)
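A small sketch of the stacked ("long") layout this beastmode assumes, with illustrative values: one METRIC/VALUE pair per row, so a single CASE-based ratio works within each staff/date group.

```python
# Stacked layout: every measure is a (METRIC, VALUE) pair on its own row.
rows = [
    {"staff": "A", "date": "2021-01-04", "METRIC": "Logged Hours per day", "VALUE": 8.0},
    {"staff": "A", "date": "2021-01-04", "METRIC": "Outcome - Appointment", "VALUE": 2.0},
]

# Equivalent of:
#   sum(case when METRIC='Outcome - Appointment' then VALUE else 0 end)
#     / sum(case when METRIC='Logged Hours per day' then VALUE else 0 end)
num = sum(r["VALUE"] for r in rows if r["METRIC"] == "Outcome - Appointment")
den = sum(r["VALUE"] for r in rows if r["METRIC"] == "Logged Hours per day")
print(num / den)  # 0.25 appointments per logged hour
```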
    

    If you include Staff and Date in your visualization, that would give you the ratio you are looking for.


  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    In most of my consulting engagements, a hybrid of @MarkSnodgrass and @ST_-Superman-_ 's approach works best.

    Scott is describing the output of Enterprise Stacker, a tool built by Domo that UNIONs all the data and then aggregates it down to a VALUE column, as in Scott's screenshot. Very clean, but when you pre-aggregate data you lose visibility into individual transactions when you need drill-through capability.

    I prefer Mark's approach of just APPENDing the data using a UNION with conformed column names. You keep drill-through, and if you add a constant for each data source or metric type, you can recycle the Beast Mode approach Scott recommends:

    sum(case when Metric = '...' then <this column> end)
    /
    sum(case when Metric = '...' then <that column> end)
    
  • Jbrorby
    Jbrorby ⚪️
    edited July 27

    Thanks everyone for the input.

    In the end I decided to go with the append Mark suggested because, like jaeW says, I like the visibility, especially for future troubleshooting.

    I ended up un-aggregating all of the other tables as well, with the exception of login hours per day, and used "staff" and "date" as common columns for the append. I have not finished validating it all yet, but so far so good.

    So as of now it is looking similar to the attachment. And so I'm able to say count(distinct case when donation_id = '1' then donation_id end) / sum(login hours)


    And when I get to building beastmodes for call outcomes, I think I'll be able to do

    count(distinct case when outcome_code is not null then outcome_id end)


    and then they can filter out outcome codes and leave the nulls in for the hours, I'm hoping.


  • Jbrorby
    Jbrorby ⚪️

    Good morning


    So now I'm having an issue with one of my beastmodes that I cannot quite figure out, and looking at the data table (SS2), it seems like it should be working.

    The issue comes into play when I add a series.

    The concept of the beastmode is taking the distinct count of HC_donor_call_outcomes.id and dividing it by the sum of HC_staff_user_logins.time_logged_in_hours

    This beastmode works just fine with no series (SS1). But when I add the series, HC_donor_call_outcome_codes.description, the chart goes blank. (SS3)

    HC_donor_call_outcome_codes.description is basically just a categorization of HC_donor_call_outcomes.id (the numerator), so they're linked together.

    In SS2, you can see the furthest-right column (what I'm trying to use for the series) is linked to the numerator (third column), and that all columns are linked to the date 1/4/2021 and staff user id 491.


    So I don't know why it goes blank when I put the description column in the series.

    beastmode:

    COUNT(DISTINCT case when `HC_donor_call_outcome_codes.code` is not null
        then `HC_donor_call_outcomes.id` end)
    /
    SUM(`HC_staff_user_logins.time_logged_in_hours`)


      

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    use a table card.

    Separate your numerator and denominator into separate beast modes, then add the columns to your axis. I'm betting you don't have Description in the rows that constitute the denominator, which drives your denominator to 0, and N/0 is undefined, which Domo represents as a null or blank.
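A sketch of why the series breaks the ratio (names illustrative): once the card groups by description, the login-hours rows carry a null description and land in their own group, so every description group sums the hours to zero.

```python
# Appended data: hours rows have no description; outcome rows have no hours.
rows = [
    {"desc": None,        "hours": 8.0,  "outcome_id": None},
    {"desc": "Voicemail", "hours": None, "outcome_id": 101},
    {"desc": "Voicemail", "hours": None, "outcome_id": 102},
]

# Grouping by description (what adding a series does) separates the
# hours rows from the outcome rows.
by_desc = {}
for r in rows:
    by_desc.setdefault(r["desc"], []).append(r)

vm = by_desc["Voicemail"]
num = len({r["outcome_id"] for r in vm if r["outcome_id"] is not None})
den = sum(r["hours"] or 0 for r in vm)
print(num, den)  # 2 outcomes but 0 hours -> 2/0 is undefined, card shows blank
```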

  • Jbrorby
    Jbrorby ⚪️

    Okay, that makes sense. It works when I don't use the series because I'm not excluding the denominator from the single whole, but once I put them into buckets, the denominator becomes separated from the numerator, or something like that, I think?


  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    If the numerator and denominator are not in the same row then yeah... you can't / shouldn't use that column (description), because each login hasn't been attributed to a description.

  • Jbrorby
    Jbrorby ⚪️
    edited July 29

    Okay, I ended up finding a way to make it work. It might be similar to what jaeW was explaining to me in his first comment but I just wasn't quite grasping the whole thing

    In ETL, I took a branch off and grouped by Date, Staff, Call Hours. Then another branch grouped by Date, Staff, Description. Then I joined these two together on a concat of staff and date.

    Then, I took a third branch off and grouped by date, staff, and description with an aggregation of the distinct count of outcome.id's per description.

    Then I joined that to the first part on concats of date, staff, description.

    Then, I just appended this to the rest with the common columns of staff and date

    Now, I can say, when description equals description, then sum my aggregated column divided by hours.
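A rough sketch of the resulting shape (names hypothetical): one branch holds hours per staff/date, another holds a distinct outcome count per staff/date/description, and joining them puts numerator and denominator on the same row, so the per-description ratio survives a series split.

```python
# Branch 1: hours per (staff, date).
hours = {("491", "2021-01-04"): 8.0}

# Branch 2: COUNT(DISTINCT outcome.id) grouped by staff, date, description.
outcome_counts = {
    ("491", "2021-01-04", "Voicemail"): 2,
    ("491", "2021-01-04", "Appointment"): 1,
}

# Join on (staff, date): every description row now also carries the hours,
# so numerator and denominator sit on the same row.
joined = [
    {"staff": s, "date": d, "description": desc,
     "count": c, "hours": hours[(s, d)]}
    for (s, d, desc), c in outcome_counts.items()
]

for row in joined:
    print(row["description"], row["count"] / row["hours"])
# Voicemail 0.25
# Appointment 0.125
```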

    This is maybe not the most efficient way to go about it, but it's working and I'm happy. This was difficult for me.

    I just wanted to follow up for the sake of sharing information to the dojo community


    Thanks again guys

  • GrantSmith
    GrantSmith Indiana 🔴

    @Jbrorby - Thanks for following up and your contribution to the Dojo community!

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @Jbrorby make sure to check your results against edge cases with numbers you know the answer to. You're doing a lot of JOINs and GROUP BYs... it sounds pretty dangerous because you might be allocating rows to unrelated data or double counting in some cases.