Best practice for dataset refreshing in DataFlow

Hi, I am wondering if there is a better way than mine with regard to dataset refreshing.

Please advise.

 

I am dealing with an aggregated daily order dataset on DOMO which has more than a million records in total.

Let's call it "order_dataset".

The original data source is an external database, so I push data from it via Workbench on a daily basis.

Every day, new records are generated and many past records are updated.

I push those new and updated records to a DOMO dataset with "Replace Existing Datasource" mode.

Let's call this DOMO dataset "newly_uploaded_dataset".

 

My dataflow starts as soon as "newly_uploaded_dataset" is refreshed by Workbench.

Let me explain each step in the dataflow, although it's a bit lengthy.

I am using a MySQL-based dataflow, by the way.

 

Step #1: Define a stored procedure that drops/recreates indexes for both "order_dataset" and "newly_uploaded_dataset", then execute the procedure.

The indexes will be used in a query later.

Since I have no idea how to check whether an index already exists on Domo (I got an error querying INFORMATION_SCHEMA), and I figure using a brand-new index is not a bad idea, I drop/recreate them every time.
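For reference, here is a minimal sketch of what Step #1 looks like. The key column `order_key` and the index names are placeholders for my real ones:

```sql
-- Sketch of Step #1; `order_key` and the index names are placeholders.
CREATE PROCEDURE rebuild_indexes()
BEGIN
  -- Error 1091 = "Can't DROP ...; check that column/key exists".
  -- The handler keeps a missing index from stopping the dataflow.
  DECLARE CONTINUE HANDLER FOR 1091 BEGIN END;

  ALTER TABLE order_dataset DROP INDEX idx_order_key;
  ALTER TABLE newly_uploaded_dataset DROP INDEX idx_order_key;

  ALTER TABLE order_dataset ADD INDEX idx_order_key (order_key);
  ALTER TABLE newly_uploaded_dataset ADD INDEX idx_order_key (order_key);
END;

CALL rebuild_indexes();
```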

 

Step #2: Delete the records in "order_dataset" whose keys match records in "newly_uploaded_dataset".

The indexes will be used in this query. This deletion is to avoid record duplication.
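The deletion itself is a join-based DELETE along these lines (again, `order_key` stands in for my real key columns):

```sql
-- Remove rows from order_dataset that are about to be re-inserted
-- from newly_uploaded_dataset; `order_key` is a placeholder.
DELETE o
FROM order_dataset o
INNER JOIN newly_uploaded_dataset n
        ON o.order_key = n.order_key;
```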

 

Step #3: In the Output Dataset step of the dataflow, select from "order_dataset" and "newly_uploaded_dataset" with UNION ALL and store the result set back to "order_dataset".

At first, I tried to use INSERT INTO order_dataset SELECT * FROM newly_uploaded_dataset;

However, I got stuck on a strange "java.sql.SQLException: Data truncated for column xyz" error, so I simply merged the two datasets in the Output Dataset step instead, and it worked.
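The Output Dataset query I ended up with is just the union of the two (the column lists must line up in order and type):

```sql
SELECT * FROM order_dataset
UNION ALL
SELECT * FROM newly_uploaded_dataset;
```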

 

This way, newly refreshed "order_dataset" has been created.

This dataflow has been working properly, but I haven't been maintaining the dataset for very long, so I am a bit concerned about future issues.

 

Thanks!

Answers

  • A correction.

Actually, my dataflow does not drop indexes explicitly.

It creates indexes with specified index names, but it can create indexes with the same names over and over again without getting an error.

    My stored procedure tried to drop the indexes, but it actually got an error saying these indexes do not exist.

The error-handling logic in the procedure prevents the dataflow from stopping on that error, so I was mistaken.
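To illustrate, the handler in my procedure looks roughly like this (table and index names are placeholders), which is why the failed DROP never surfaced:

```sql
CREATE PROCEDURE drop_indexes()
BEGIN
  -- Swallows any SQL error, so a failed DROP INDEX passes silently.
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;
  ALTER TABLE order_dataset DROP INDEX test_table_index01;
END;
```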

     

    Sorry for confusion.

If possible, I'd like to know why the indexes can be created this way.

  • Hi all,

     

    Can anybody help @yoshidadaisu01 with their situation?

    Thanks!

  • @yoshidadaisu01, great writeup!

     

    Your setup is actually very common from my own experience implementing new clients in Domo.

     

    With regards to indexes, hopefully this helps clarify:

     

    * No indexes are ever created automatically on datasets/columns in MySQL DataFlows

* With the "ALTER TABLE table ADD INDEX (`column`);" syntax, you are correct that an index is re-created each time the statement is called; i.e., it doesn't throw an error when an index already exists

* With the "ALTER TABLE table DROP INDEX index_name;" statement, however, it *DOES* throw an error if the index does not already exist

     

    Hopefully that helps!
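A quick way to see both behaviors (table and column names are made up):

```sql
ALTER TABLE test_table ADD INDEX (col1);  -- creates an index on col1
ALTER TABLE test_table ADD INDEX (col1);  -- succeeds again; no duplicate error

ALTER TABLE test_table DROP INDEX no_such_index;  -- error 1091: index does not exist
```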

  • Thank you for the advice, quinnj.

     

Just for clarification about what you mentioned regarding ALTER TABLE table ADD INDEX:

I know that MySQL allows it to be executed over and over again if an index name is not specified in the statement.

A new index with a systematically assigned name is created every time.

     

    I didn't want it, so I tried to specify the name like ALTER TABLE test_table ADD INDEX test_table_index01 (col1);

In that case, I thought I would get a "duplicate index" error in the next run, so I put DROP INDEX in the dataflow.

It turns out I don't have to, though, as I mentioned in the previous post and as you advised.

     

Domo seems to run on a somewhat different mechanism even when a MySQL-based dataflow is selected.

    The index is "volatile" and exists only during the dataflow execution.

I don't have to specify an index name in ALTER TABLE ADD INDEX.
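If that's right, Step #1 can be reduced to plain ADD INDEX statements with no stored procedure and no DROP (a sketch; `order_key` is a placeholder for the real key column):

```sql
ALTER TABLE order_dataset ADD INDEX (order_key);
ALTER TABLE newly_uploaded_dataset ADD INDEX (order_key);
```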

     

    This is what I learned.

    If I'm wrong, please advise me once again.