Joining Datasets with different numbers of rows

We’re looking to analyze staff utilization data. There are two data sets:


Data Set 1: Raw data from a timekeeping system with daily time entries from each staff member, there are multiple rows per person per month.


Data Set 2: The total available hours per person per month, there is one row per person per month.

 

We’re looking to calculate utilization per month per person by summing all of their hours entries for a specific month from Data Set 1, and dividing by their corresponding available hours for that month in Data Set 2.


When we joined the two data sets, using their unique ID# and the identifier, the output includes the monthly available hours value for every row in which there is a raw data entry. We need to create a Beast Mode that is the sum of all raw data hours per month divided by just one instance of the available hours per month. 

 

Any ideas? 

Comments

  • Great question! As long as the card is reporting by month, the following will work:

     

    SUM(`Hours_Field_From_Dataset_1`) / AVG(`Available_Hours_From_Dataset_2`)

     

    You could also use MAX() or MIN() instead of average since there's just one value per rep per month. Should you wish to report by quarter or year, you will need to create quarterly and yearly availability fields in Dataset 1 before joining them and create beast modes for quarterly and yearly utilization.

     

     

  • I think I would approach this by bringing both datasets to the same date grain before the join. In Data Set 1 SUM up all the the hours in the month by employee so that you have just one row per employee per month. This would be my prefered method.

     

    Alternatively, you might try something like this:

     

    SUM(`Hours Logged by Employee`) / MAX(`Available Hours by Employee`)

     

    As long as the Available Hours is never equal to zero this should work.

  • Thanks!

     

    What if the number of rows varies? IOW, a higher Dataset 2 value included five times will weight the average differently than a lower Dataset 2 value included five times. 

     

    We tried the Beast Mode and the numbers don't look quite right yet. 

     

  • In that case, unless there is a reason you can't do it, then I would second Kurt's recommendation that you get both datasets on the same level of granularity.

     

    My answer assumes that dataset 2 has data that looks like this:

     

    Name           Date          Available_Hours  

    John Doe       01/01/2018    140

    Jane Roe       01/01/2018    119

    John Doe       02/01/2018    115

    Jane Roe       02/01/2018    137

     

    As long as the card is displaying by month and you are breaking it out by individual user, then that formula will work.

     

    Another option would be to fix the data after the join. You would need to add row numbers and partition it by person by month. You would then update the data so that their available hours got set to 0 where the row number was greater than 1. Then you could use SUM()/SUM().

  • Where would we update the Available Hours to 0 in the dataflow? Not sure which function to use.