Data Flow help

Hello, Experts,

I am humbly reaching out for some help.

I have a vendor sending a snapshot of a database table every day and not sending me diffs. I am trying to determine the diffs. The connector type is SFTP CSV. The file they send daily is overwritten daily. I currently have the data connector appending the data from the previous day to the new file they sent over that day.

Is this the correct approach? Should I have the connector setup as Append vs Replace or vice versa?

What success should look like as an output dataset that appends updated records? Below, you will find an image of 6 records that are the same as an individual in the file.

As you can see, it's the same row the only difference is the row_id which is really when the file was run and appended per the connector. So for example, imagine that these 6 rows are John Smith, what we care to see is the original John Smith record and only any new/updated changes. So if there were three changes I would only see three new records.

Thank you so much in advance. We are happy to answer any questions.

Best Answer

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🔴
    Accepted Answer

    @imelendez if the rows are exactly the same, does it matter which one it keeps? If there is another column that is a date entered type column, then the rows would become unique and then both would be kept.

    If you have rows that are the same and the only difference is a date entered type column, you could use the group by tile and choose min or max on the date column to get the first or last record (whichever one you are wanting to keep) and then your columns that you are looking for to be unique in the select list. This would function the same as remove duplicates but allow you to decide if you want the latest or earliest record.