DOMO internal functionality on APPEND/REPLACE

Hi Everyone,

 

How do dataset APPEND, REPLACE, and ETL (Update Method: only new rows) work internally? Does Domo create a unique key internally and update only the modified or newly entered records from the source?

 

Suppose we have trillions of rows of data in SFMC (Salesforce Marketing Cloud), and I create an App Cloud connector to SFMC. What is the best approach to make this work, and which method will be fastest without performance issues? If I schedule it to run daily, then as far as I know: APPEND will cause duplicates; REPLACE will completely truncate the dataset and reload it all over again, which is time-consuming; and an ETL with "only new rows" applies only to the dataflow output, while the input dataset still holds all the appended data. Please advise how Domo works internally and what the best option or practice would be for millions or trillions of rows from SFMC.

 

Thanks,

Neeti

Comments

  • jaeW_at_Onyx (Budapest / Portland, OR)

    literally trillions?

     

    if you're dealing with trillions of rows of data, don't use a web connector.  they just weren't designed for that kind of heavy load.  instead, consider the Java CLI or Workbench for data ingestion.  Why?  b/c both of those workflows can upload data in batches (multiple CSVs), which is way safer than trying to stream a trillion rows through the internet and hoping you don't hit a weird glitch along the way.
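    The batching idea can be sketched in Python (the file-name prefix and batch size here are hypothetical; the actual upload of each CSV would go through Workbench or the Java CLI, not this script):

    ```python
    import csv
    import itertools

    def write_in_batches(rows, header, batch_size=1_000_000, prefix="sfmc_batch"):
        """Split a large row stream into multiple CSV files so that each
        upload is small enough to retry on its own if something fails."""
        batch_paths = []
        rows = iter(rows)
        for i in itertools.count():
            chunk = list(itertools.islice(rows, batch_size))
            if not chunk:
                break
            path = f"{prefix}_{i:05d}.csv"
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(header)
                writer.writerows(chunk)
            batch_paths.append(path)
        return batch_paths
    ```

    The point of the batches is restartability: if batch 37 fails mid-upload, you re-send one file instead of the whole extract.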

     

    APPEND does what it says on the tin.  you get a bunch of rows and smack them on the end of your dataset.  given that you're talking 'a trillion rows' i'd think twice about using APPEND.  

     

    REASON: b/c you want to design and build a data pipeline that can safely fall over and be restarted without manual intervention.  if you're doing an APPEND and you accidentally start the job twice, you have double data.  You could do a recursive query to clean that dataset ... but that's awful on a trillion rows.

     

    REPLACE: a good option IF you're only moving 'the most recently modified data' and then have a workflow in Domo for appending that data to the full dataset (consider Data Assembler)

     

    UPSERT (update only new rows): Upsert works like upsert in the rest of the BI world.  You define a key column; if that key value exists in your historical data, the row is UPDATEd, else it's INSERTed.
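    A minimal sketch of that update-or-insert logic, with plain Python dicts standing in for dataset rows (the key column `customer_id` is just an example, not anything Domo-specific):

    ```python
    def upsert(historical, incoming, key="customer_id"):
        """Update-or-insert: index historical rows by key, then
        overwrite matching rows and append the non-matching ones."""
        indexed = {row[key]: row for row in historical}
        for row in incoming:
            indexed[row[key]] = row  # UPDATE if key exists, INSERT otherwise
        return list(indexed.values())
    ```

    Note that re-running the same incoming batch is harmless here, which is exactly the restartability property APPEND lacks.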

     

    UPSERT is great for dimensional data or datasets that don't have a clean partition (like customer records or product SKU tables), whereas PARTITION (Data Assembler) works best on large datasets where you can say "if this block of data exists, e.g. transactions on Jan 1, delete the entire block and replace it with a new block."
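    The partition pattern can be sketched the same way, assuming a date column marks each block (column name is illustrative):

    ```python
    def replace_partition(dataset, new_block, partition_col="transaction_date"):
        """Delete every row whose partition value appears in the new block,
        then append the new block whole -- re-runnable without duplicates."""
        incoming_keys = {row[partition_col] for row in new_block}
        kept = [row for row in dataset if row[partition_col] not in incoming_keys]
        return kept + list(new_block)
    ```

    Unlike upsert, this never scans individual key values, only whole blocks, which is why it scales past the point where per-row key lookups get slow.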

     

    UPSERT works into the low hundreds of millions of rows, but remember that for UPSERT to work, it first has to scan the list of UPSERT key values; as the number of key values grows, each row INSERT takes longer.

     

    To address your Salesforce question:

    1) if UPSERT can be enabled (check with your CSM), use it.  But you have to guarantee the uniqueness of the UPSERT_KEY; if it has a duplicate, it can break the dataset (until Support can fix it).

    2) use REPLACE and only capture modified rows + Data Assembler (with UPSERT or PARTITION depending on use case) ... DO NOT USE RECURSIVE QUERIES.  

    3) if the dataset is under a few million rows, avoid Data Assembler and opt for straight REPLACE

     

    Hope that helps.

    Jae Wilson
    Check out my 🎥 Domo Training YouTube Channel 👨‍💻

  • @jaeW_at_Onyx ,

     

    Thanks for the advice. However, I am not sure whether I can use Workbench or the UPSERT method to connect to Salesforce Marketing Cloud. I also have to think about the amount of data. I will check on this and use it wisely.

     

    Regards,

    Neeti

This discussion has been closed.