Getting a Sum of Values for X Days When There are Gaps in the Dates

RobB
edited September 28 in Dataflows

We have a dataset of transactions for a large number of SKUs; it is over 1 million rows. We want to examine the date of each transaction and go back 30 days to get a 30-day average of sales for each SKU based on the date of that record. Normally I would be thinking of the Rank and Window tile in Magic ETL. But here's the catch: not every SKU sells every day, and we don't have records for a SKU if there were no transactions on that day. So if we use Row/Window with a SUM over 30 leading rows, we will go beyond the 30 days.

The table below shows an example. For each transaction of each SKU we want to go back to the previous transactions for the past 30 days. But even if we go just 7 days back you can see the problem. Look at the last transaction of each SKU. If we go up 7 rows, we will get a history of more than 7 days because there are missing days: days when no transactions for a SKU occurred. This means that using a SUM with leading rows in Row/Window will not work, because we exceed the time frame needed.

Row/Window functions aggregate without any logic capability. There's no CASE statement capability in a Row/Window aggregate. So I'm down to a SQL-cursor-type operation or a loop where, for each SKU and date in a given record, I find the transactions for that SKU whose dates fall within the 30 days before the date of the record currently being examined.

Is this even possible in Magic ETL? Or am I looking at using R or Python scripting?
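For what it's worth, if this does end up in Python scripting, pandas can window by calendar days rather than by rows, which sidesteps the gap problem directly. A minimal sketch with made-up sample data (column names are assumptions):

```python
import pandas as pd

# Hypothetical sample: two SKUs with gap days (no row when nothing sold).
sales = pd.DataFrame({
    "sku": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-02-20",
                            "2023-01-05", "2023-01-10"]),
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# A row-count window ("30 leading rows") reaches past 30 calendar days when
# dates are missing. A time-based window keyed on the date column counts
# days, not rows, so the gaps are handled automatically.
sales = sales.sort_values(["sku", "date"])
sales["sum_30d"] = (
    sales.set_index("date")
         .groupby("sku")["amount"]
         .rolling("30D")          # 30 calendar days ending at each row's date
         .sum()
         .values
)
print(sales)
```

Note the third A row: its 30-day window no longer includes the January sales, exactly the behavior a fixed 30-row window cannot give.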

Best Answers

  • GrantSmith
    Accepted Answer

    Hi @RobB

    Are you wanting 30 days prior to the current date or from the date of the row?

    You might be able to filter your data (in Magic ETL 2.0, use a Formula tile with a CASE statement to check whether the date is less than 30 days ago; in ETL 1.0, calculate the date 30 days ago from the current date and then filter your dataset to rows where the Date field is greater than that calculated date), group your data to get the total sales, and then divide the total sales by 30 to get your 30-day average.
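    In pandas terms (purely a sketch of the same logic; the column names and as-of date are invented), this first option is a filter from a fixed point, then a group-and-divide:

```python
import pandas as pd

# Toy data; a fixed 30-day window ending at an "as of" date, not per row.
sales = pd.DataFrame({
    "sku": ["A", "A", "B"],
    "date": pd.to_datetime(["2023-03-01", "2023-03-20", "2023-03-10"]),
    "amount": [30.0, 60.0, 90.0],
})

as_of = pd.Timestamp("2023-03-25")        # stand-in for "current date"
cutoff = as_of - pd.Timedelta(days=30)    # the date 30 days ago

recent = sales[sales["date"] > cutoff]    # keep only the last 30 days
avg_30d = (recent.groupby("sku")["amount"].sum() / 30).rename("avg_30d")
print(avg_30d)
```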

    The other option is to do a Cartesian join of all your dates and SKUs so you have one record for each SKU and date, then left join your dataset to this one to fill in the missing dates, and use the window function to get the last 30 days. This will drastically increase the number of records you need to process and will affect execution time, but it should handle your issue if you're looking for a rolling 30-day average.
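    The cross-join step might look like this in pandas (a sketch only; SKUs, dates, and column names are invented):

```python
import pandas as pd

# Build the universe: one row per SKU per calendar date.
skus = pd.DataFrame({"sku": ["A", "B"]})
dates = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=4)})

# Equivalent of joining two constant-1 columns: pandas' cross merge.
universe = skus.merge(dates, how="cross")   # 2 SKUs x 4 dates = 8 rows

# Left join the real sales onto the universe, then zero-fill the gaps.
sales = pd.DataFrame({
    "sku": ["A", "B"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-04"]),
    "amount": [5.0, 8.0],
})
filled = universe.merge(sales, on=["sku", "date"], how="left")
filled["amount"] = filled["amount"].fillna(0.0)
print(filled)
```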

  • MarkSnodgrass
    Accepted Answer

    I would suggest using the Domo Dimensions calendar dataset to create a dataset that has all the dates for a given time period. You can then left join this to your SKU dataset and replace nulls with zero. Then use your Rank and Window tile to go back X days.

    If you haven't used the Domo Dimensions connector, you can find it by going to Connectors in the Data Center and then search for Dimensions and choose Calendar.

    Hope this helps.

  • MarkSnodgrass
    Accepted Answer

    @RobB If I am understanding you correctly, for a given SKU and a given date, you want the sales amount for that same SKU 30 days prior, even if there is no record for that SKU 30 days prior. You can do this as I described earlier, and this is what it would look like in Magic ETL. I will try to describe it below.

    Use the Domo Calendar from the Domo Dimensions dataset. I chose to filter it down to a range that was in my test data, but you don't necessarily have to.

    Get a distinct list of your SKUs by using the Select Columns and Remove Duplicates tiles, so that you have just one column listing your different SKUs.

    Use the Add Constants tile to add a constant value of 1 to both datasets.

    Join your calendar dimension and your SKU list on the constant value. This will give you a list of every date on the calendar for each of your SKUs.

    Use a Left Join to join this data with your main sales data, joining on date and SKU.

    Use the Formula tile to replace nulls in the sales column with 0, so your amount column does not have any nulls.

    Use the Rank & Window tile with the Lag function, an offset of 30 on the Sales Amount column, sorted ascending by the date column from the dimensions table, and partitioned by SKU.

    This should give you the result you are looking for.
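    A rough pandas analogue of these steps (names and sample data are assumptions, not from the thread) shows why the lag works once the gaps are filled: after the zero-fill, 30 rows equal 30 calendar days within each SKU:

```python
import pandas as pd

# Calendar x SKU cross join, left join sales, nulls -> 0, then lag by 30 rows.
calendar = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-02-28")})
skus = pd.DataFrame({"sku": ["A", "B"]})
sales = pd.DataFrame({
    "sku": ["A"],
    "date": pd.to_datetime(["2023-01-01"]),
    "amount": [42.0],
})

universe = calendar.merge(skus, how="cross")              # every date per SKU
filled = universe.merge(sales, on=["sku", "date"], how="left")
filled["amount"] = filled["amount"].fillna(0.0)
filled = filled.sort_values(["sku", "date"])

# Lag of 30 rows within each SKU = the amount exactly 30 days earlier.
filled["amount_30d_ago"] = filled.groupby("sku")["amount"].shift(30)
```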

Answers

  • @GrantSmith I'm looking for 30 days from the date of the row.

  • GrantSmith
    edited September 28

    Then I'd go the second option where you'd need to create a dataset with all of the possible SKU / Date combinations.

    A starting point, as @MarkSnodgrass mentioned, is the Date dataset from the Domo Dimensions connector; filter out any future dates to speed up processing, then add a constant of 1 and call it 'Join Column'.

    Get a unique list of your SKUs and add a constant of 1, also called 'Join Column'. Join these two together on the Join Column.

    Take that resulting dataset and left join it to your dataset (yours is on the right side in the Join tile), then use the Rank and Window tile to calculate the 30-day rolling sum, and use the Formula tile or calculation tile to divide that by 30 to get your average.
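    Sketched in pandas (sample data and names invented), the tail of this pipeline is a 30-row rolling sum divided by 30, which is a true 30-day average once every SKU/date combination exists:

```python
import pandas as pd

# Fill the gaps first, so a 30-row window equals a 30-day window.
calendar = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-03-01")})
skus = pd.DataFrame({"sku": ["A"]})
sales = pd.DataFrame({
    "sku": ["A", "A"],
    "date": pd.to_datetime(["2023-01-10", "2023-02-01"]),
    "amount": [30.0, 60.0],
})

filled = (calendar.merge(skus, how="cross")
                  .merge(sales, on=["sku", "date"], how="left")
                  .fillna({"amount": 0.0})
                  .sort_values(["sku", "date"]))

# 30-row rolling sum per SKU, then divide by 30 for the rolling average.
filled["sum_30d"] = (filled.groupby("sku")["amount"]
                           .rolling(30, min_periods=1).sum()
                           .reset_index(level=0, drop=True))
filled["avg_30d"] = filled["sum_30d"] / 30
```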


    PS. @MarkSnodgrass I beat you :)

  • jaeW_at_Onyx

    @GrantSmith, but @MarkSnodgrass's solution is cleaner.


    While building the universe of SKUs and dates is an option, all that facilitates is building an (expensive) window function.


    A date dimension with a lag_from_current_date column (0, 1, 2, 3, 4, 5, ...) would be a much cleaner way of reporting on a 30-day rolling average.


    Even with an offset calendar (rolling average today, rolling average 7 days ago, rolling average 2 weeks ago) you'd have a clean JOIN with a date dimension and no window function.
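    If I read the offset-calendar idea right, it can be sketched as a join against an exploded date dimension followed by a plain GROUP BY, with no window function. A toy pandas version (all names and data are assumptions):

```python
import pandas as pd

# Explode each reporting date into the 30 calendar dates its window covers.
report_dates = pd.DataFrame({"report_date": pd.to_datetime(["2023-01-31"])})
offsets = pd.DataFrame({"days_back": range(30)})

window = report_dates.merge(offsets, how="cross")
window["date"] = window["report_date"] - pd.to_timedelta(window["days_back"], unit="D")

sales = pd.DataFrame({
    "sku": ["A", "A"],
    "date": pd.to_datetime(["2023-01-05", "2023-01-20"]),
    "amount": [12.0, 18.0],
})

# Join facts onto the exploded window, then an ordinary GROUP BY replaces
# the window function: each report date sums exactly its own 30 days.
joined = window.merge(sales, on="date", how="inner")
rolling = joined.groupby(["report_date", "sku"])["amount"].sum() / 30
```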

  • @jaeW_at_Onyx I fear I may be misunderstanding this. My lag isn't from the current date. My lag needs to use the date on each record as its fixed point. For each SKU, for each date, on each transaction, I need to go back 30 days from that point to get my numbers. To me, that means a Cartesian join, as ugly as that will be. A left join on dates will return all the dates from the dimension, but it will not create a record for each date and each SKU, which I need in order to traverse the history.

    I may be missing your take on this, and if so, I'd love to know what it is I don't know about your suggested method.

  • @MarkSnodgrass ,

    Thank you for the corroboration. This was my expected direction after bringing the dimensions online. I think @GrantSmith was also going this way. It's a lot of records, but to get a 30-day history record by record, it's what we have to do. Thanks to @jaeW_at_Onyx as well; the conversation was very helpful. I'll put it in play and let you know how it goes.