"Column referenced but not found" Python error creating date-sensitive columns in ETL

Objective:

To take an existing dataset that shows event start and end dates, and a total value and parse that info into a daily value that appears in columns representing all dates between start and end.

Method:

Using Domo Dimensions Calendar, select only rows from current date to +2 years.

Use the result in a Python script that appends a new column to an existing table, one per calendar row. i.e. todays script will add columns '2022-12-21' to '2024-12-21' to the data table.

Problem:

However, the script regularly fails with "Column referenced but not found: 2022-12-09" (This is today's example) when writing the output dataframe.

I used to think that this was maybe a timezone problem (I am n GMT) and that the referenced column represented yesterdays date in my timezone, but today on the servers timezone.

However, today's error referenced a date 12 days ago!

This problem is not consistent though and tends not to occur in the afternoon GMT. Again, hinting at a timezone problem. I am now not so sure and need expert guidance as I am a Python noob


# Import the domomagic package into the script 
from domomagic import *
import pandas as pd
import numpy as np
from datetime import *

# read data from inputs into a data frame
cal = read_dataframe('CAL.WHERE')
eventdf = read_dataframe('EVENTS.APPEND')

# for each value in the dt column of the calendar
# create an array of length = number of calendar columns
cols = [None] * cal.shape[0]
# create an array of default values for each new column to be added to the dataframe
vals = [0] * cal.shape[0]
rng = range(cal.shape[0])
ind = [str(x) for x in rng]
today = date.today()

for i, row in cal.iterrows():
 dt = row['dt'].date()  
 cols[i]= str(dt)

# now we need to turn our cols (columns) and vals (rows) into a dictionary
# so that we can then in turn convert that into a new dataframe
zip_iterator = zip(cols, vals)
dictionary = dict(zip_iterator)

# create a new empty dataframe from the dictionary
new_df = pd.DataFrame(dictionary, index=ind)

# create a new dataframe from our two input dataframes: df and new_df
output_df = pd.concat([eventdf, new_df], axis=1, ignore_index=False)

# iterate through each row in the dataframe to populate date columns between
# start and end dates with the daily value
rowcount = 0
skip = 0
for i, row in output_df.iterrows():
 eventid = row["ID"]
 if type(eventid ) != "<class 'str'>":
  eventid = str(eventid)
  
 start_dt = row["Start Date"].date()
 end_dt = row["End Date"].date()
 # if either of the date values are NaT, then skip this row
 if pd.isnull(start_dt):
  print('Row ' + str(rowcount) + '. ' + eventid + " skipping as start date is null")
  skip = skip + 1
  continue
 if pd.isnull(end_dt):
  print('Row ' + str(rowcount) + '. ' + eventid + " skipping as end date is null")
  skip = skip + 1
  continue   
 # we are only interested in future values, so ignore any dates prior to today
 if start_dt < today:
  print(eventid + " begins before today. Setting start date to today")
  start_dt = today
   
 # if the event ends before today, skip it
 if end_dt < today:
  print('Row ' + str(rowcount) + '. ' + eventid + " skipping as end date before today")
  skip = skip + 1
  continue
  
 effort = row["Story Daily Effort"]  
 dates = pd.date_range(start=start_dt, end=end_dt)
 #print(story_id + ': ' + str(start_dt) + ' - ' + str(end_dt))
 col_nm = str(start_dt)
 j = output_df.columns.get_loc(col_nm)
 output_df.iloc[i, j:j+len(dates)] = effort
 rowcount = rowcount + 1 
print("Skipped " + str(skip))
print("Successfully processed " + str(rowcount))
# write a data frame so it's available to the next action
write_dataframe(output_df)


Best Answer

  • har_d_har1
    har_d_har1 ⚪️
    Answer ✓

    To workaround this issue and allow by ETL to complete, I have created a separate ETL dataflow just to produce a table of columns from the Domo Dimensions Calendar, which I then use as an Input DataSet to the above code, essentially removing rows 11-30 above.