How to append based on new IDs only - not upsert

Is there a way to append only NEW IDs in an upload job? I understand that I can do a "Replace" job for an initial load of a table, and then a regularly scheduled "Append" job using an UPSERT key and a smaller subset of data, such as data with a change date in the past 7 days per the Domo example here:

https://domohelp.domo.com/hc/en-us/articles/360043038714-Using-Upsert-in-Workbench-5-1. "In this example, we know that our data could be updated within 7 days of the transaction, so we are pulling the last 7 days' worth of transactions to capture any updated rows as well as any new ones each day. This is only one simple example of many different ways to identify the "change data" or "delta".


What if I don't really need to UPDATE data but rather just want to append any data with an ID that doesn't exist on the Domo side? If I do an UPSERT against the whole dataset, wouldn't that unnecessarily be updating data that doesn't need to be updated?

Answers

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🔴

    @JunkDoom given that you are describing ingesting Transactions into Domo, I would strongly recommend you investigate using Partitions instead of UPSERT.


    UPSERT works well when you're uploading a massive Customer, Product (or similar) table where you have no idea when a particular Customer_ID or SKU might be updated.

    Partition is the preferred ingest method on transactional data because you could (as you say) upload all activity that was created in the last 7 days and partition on the creation_date column (or whatever's appropriate.


    While UPSERT might sound easier, the downside is processing, for each row i want to insert into DOMO i'd have to check if the transaction_id exists. Keeping in mind that the data is stored in a flat file, that search grows increasingly expensive as your data volume increases (more values to check) whereas with Partitioning i just have to answer "does this volume (date) exist? if yes, discard the existing date and replace, if no, create a new volume"


    hope that helps.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

    **Say "Thanks" by clicking the ❤️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • The lack of documentation on Partitioning in Workbench gives me pause.

    If we're not specifying an UPSERT key, wouldn't this only work for transactional data that never changes at the source?

    For example, say I create a "Replace" job for an initial load. I partition on a "DateUpdated" field. I then change the query to only to grab the past 3 days ("DateUpdated") and I run that daily. A month down the road, "John A. Smith" changes to "John B Smith" and his record gets updated. Since we're not specifying an UPSERT key, a new record would get inserted, but his old record would still be there with the old "DateUpdated", right?


    I think I have a few situations where source data never changes once written where this would work. Would I still filter the query to just the past few days, which would wipe everything for those days out and rebuild them and leave older data alone?

    Regardless of how I set these up, I feel like I'll always need to have a "replace" job at the ready to reload the entire table in case something old got updated, either expectedly or not.

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🔴

    Don't partition a table of customers. that is the perfect use case for UPSERT if you use the customer_id as the upsert key.

    "UPSERT works well when you're uploading a massive Customer, Product (or similar) table where you have no idea when a particular Customer_ID or SKU might be updated."


    Yes, the ideal case for partitioning is the assumption that history never changes (i.e. transactional records). if you have back posted activity (financial stuff) then yes, you may need to periodically reload history, or if you have two columns posting_date and last_modified_date, if you identify that you back post one day last month AND you are partitioning by month, then you'd have to re-upload the entire previous month.


    what you don't want to do is UPSERT a (large) transactional dataset because that would be a massive strain on Domo as the dataset grows and eventually cause the data to perform inconsistently as volume increases. (low millions of rows will probably be fine but 10s or 100s of millions would probably be where upload starts to suffer.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

    **Say "Thanks" by clicking the ❤️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"