R and Python for determing dataframe equality in ETL

willrnt
willrnt ⚪️

Greetings Domo Community,

I have currently been refactoring Magic ETL tiles and have been attempting to use Python and R scripts to test for equality before finalizing the refactored tile components. Using both approaches with Python and R I have had significant difficulties in completing this task, which I will outline here.


Python:

I began by using domomagic to load both datasets into the Python tile. Upon doing so, I used Pandas as a tool for a comparison. Then, I sorted all of the rows by their corresponding column values. I print the shapes to ensure they are the same - in each case they were the same - and then I print the column names and test their equality. Here comes the issue, I do a join to test what exists only in the left and what exists only in the right dataframes. I additionally make an assert_frame_equal call from the Pandas library and I test the equality. I consistently find that the left and right only dataframes are exactly equal upon inspection, the assert_frame_equals never triggers (And I have seen it before) and the .equals() returns false - completely contradicting each other. I have gone through every debugging technique I can think of and have not made any progress. It is worth noting, that even though the rows are sorted and the column is reset before testing equality, the left and right suggest different indexes when printed. If anyone has any suggestions or insights about the internals of Python in an ETL, I would be grateful for your feedback. The Python code is attached below. Please continue reading for Ruby issue.

# Import the domomagic package into the script

from domomagic import *

import pandas as pd

from pandas.testing import assert_frame_equal

import numpy as np


pd.set_option('display.max_columns', None)


# read data from inputs into a data frame

input1 = read_dataframe('Total Formulas')

input2 = read_dataframe('Address Nulls from SUM joins 4')


input1 = input1.sort_values(by = input1.columns.tolist()).reset_index(drop=True)

input2 = input2.sort_values(by = input2.columns.tolist()).reset_index(drop=True)


print("Data frame 1 Shape: " + str(input1.shape))

print("Data frame 2 Shape: " + str(input2.shape))


print("Dataframe 1 Columns: " + str(input1.columns))

print("Dataframe 2 Columns: " + str(input2.columns))


print(input1.columns == input2.columns)

dfLeft = input1.merge(input2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']


dfRight = input1.merge(input2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']


print("\n\n\n\n")

print("Left only shape")

print(dfLeft.shape)

print(dfLeft)

print("\n\n\n\n")

print("Right only shape")

print(dfRight.shape)

print(dfRight)

print("\n\n\n\n")


assert_frame_equal(input1, input2, check_dtype=True)


print(input1.info())

print(input2.info())

print(input1.equals(input2))


#This additionally returns False, even being exactly the same upon inspection.

print(dfLeft.equals(dfRight))


Ruby:

Now for the Ruby issue. Keep in mind that I use this script with the exact same two datasets that I attempt with Python. This script works far better. However, the anti_join call, which just tells what rows are in right and not in left (and vise versa) has for me at most printed out 1 row on any sample run with large input. However, on this specific sample run, it says that there is one column that is only in the left and one that is only on the right. However, upon manual inspection, these rows are available in both and exactly the same. Furthermore, the left only row is the first row of the left hand side dataset and right only is the second row. For the right hand side dataset, it is the exact same thing, except switched backwards, Left only on row 2 and right only on row 1. This return false to the all.equal. My debugging process began by making sure that the column names were read by R and that perhaps it was maybe reading each first row as the column header, this would make sense, right? After printing the column headers this was not the case. As far as sorting, even if the sorting was done wrong in the R code, many of the other elements that exist (and I verified in both) are not in any specific order upon input to the R tile. It is worth noting that the data was preceded by a Grouping on Project and FY Period, so there is no reason there would be any duplicates. Final remarks, there are numerous cases with different size inputs that Python says True and R says false with nothing more than their respective .equals() call. However, I have found R to be far more accurate from manual inspection.


# Import the domomagic library into the script.

library('domomagic')

library('dplyr')


# read data from inputs into a data frame

#Theirs

input1 <- read.dataframe('Total Formulas')


#Mine

input2 <- read.dataframe('Address Nulls from SUM joins 4')


input1 <- input1[order("Project", "FY Period"),]

#input1 <- arrange_all(input1)

input2 <- input2[order("Project", "FY Period"),]

#input2 <- arrange_all(input2)


print(all.equal(input1,input2))


Diff <- anti_join(input1, input2, by = c("Project", "FY Period"))


print(Diff)


Diff1 <- anti_join(input2, input1, by = c("Project", "FY Period"))


print(Diff1)


# write a data frame so it's available to the next action

write.dataframe(input1)


I appreciate any help. Thanks, Will