• jaeW_at_Onyx (Budapest / Portland, OR)

    what is the shape of your data? what is the granularity? (one row per judge per year? one row per judge when they were made 'active'?)


    if it's one row per judge, then just do count(distinct judge_id) as a beast mode, and put Active Date on the axis.


    if it's one row per year, then you want to filter where year(active_date) = year(transaction_date) and then do a count distinct.
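
    The year-matching filter above can be sketched with sqlite3. This is a minimal illustration, not a beast mode: the table name `activity` and the columns `judge_id`, `active_date`, and `transaction_date` are hypothetical stand-ins for whatever the real dataset uses.

    ```python
    import sqlite3

    # Hypothetical shape: one row per judge per year of activity.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity (judge_id TEXT, active_date TEXT, transaction_date TEXT)"
    )
    rows = [
        ("J1", "2020-01-15", "2020-03-01"),  # activation year matches transaction year
        ("J2", "2020-06-01", "2021-02-10"),  # years differ -> filtered out
        ("J3", "2021-04-20", "2021-05-05"),
    ]
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

    # Keep only rows where activation year = transaction year,
    # then count distinct judges per year.
    result = conn.execute("""
        SELECT strftime('%Y', active_date) AS yr, COUNT(DISTINCT judge_id)
        FROM activity
        WHERE strftime('%Y', active_date) = strftime('%Y', transaction_date)
        GROUP BY yr
        ORDER BY yr
    """).fetchall()
    print(result)  # [('2020', 1), ('2021', 1)]
    ```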

  • jaeW_at_Onyx

    my question about how the data is structured is important to solving this problem. what is the granularity of the dataset?


    a window function would get you halfway there


    sum(sum(amount)) over (order by year(date))


    IF the granularity of the dataset was one row per judge.

    but if the granularity is one row per judge per year, or one row per judge per activity, you cannot use a window function, because you cannot count distinct across partitions the way you'd like to.


    sum(count(distinct judgeID)) over (order by year(date))

    would return the wrong result b/c you'd count distinct by year and then sum, which would count judges that appeared in multiple years more than once.


    and you can't fix it with any other aggregation type, like count(count(distinct)) or any permutation of that, b/c once you've aggregated to the year level you lose the context of the individual judges; all you have is an aggregate number like 5, 400, or 30.
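
    The overcounting described above can be demonstrated in a few lines of Python. This is a toy dataset with hypothetical judge IDs; the "wrong" result is what a running sum of per-year distinct counts would give, compared with the true cumulative distinct count.

    ```python
    from collections import defaultdict

    # One row per judge per year; J1 is active in both 2020 and 2021.
    rows = [("J1", 2020), ("J2", 2020), ("J1", 2021), ("J3", 2021)]

    per_year = defaultdict(set)
    for judge, year in rows:
        per_year[year].add(judge)

    # Wrong: count distinct within each year, then running-sum those counts.
    wrong_running, total = [], 0
    for year in sorted(per_year):
        total += len(per_year[year])  # J1 gets counted again in 2021
        wrong_running.append((year, total))

    # Right: running count of judges seen so far, distinct across all years.
    seen, right_running = set(), []
    for year in sorted(per_year):
        seen |= per_year[year]
        right_running.append((year, len(seen)))

    print(wrong_running)  # [(2020, 2), (2021, 4)]
    print(right_running)  # [(2020, 2), (2021, 3)]
    ```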


    IF you want the granularity of one row per judge per year, then reshape your data so you have one row per judge per year and a measurement of 1 or 0 to indicate whether they were active that year. then you can just take a SUM across the board.
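
    A minimal sketch of that reshape, assuming hypothetical judge IDs and a made-up set of active (judge, year) pairs: after filling the grid with 1/0 flags, a plain SUM of the flag per year replaces the count distinct.

    ```python
    judges = ["J1", "J2", "J3"]
    active_pairs = {("J1", 2020), ("J1", 2021), ("J2", 2020), ("J3", 2021)}
    years = [2020, 2021]

    # Reshape: one row per judge per year, flag 1 if active that year, else 0.
    reshaped = [
        (j, y, 1 if (j, y) in active_pairs else 0)
        for j in judges
        for y in years
    ]

    # A plain SUM of the flag per year gives the active-judge count; no distinct needed.
    per_year_active = {
        y: sum(flag for _, yy, flag in reshaped if yy == y) for y in years
    }
    print(per_year_active)  # {2020: 2, 2021: 2}
    ```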




  • I solved this myself. Thanks!