Any way to delete data automatically after a certain time?

So I'm updating a dataset of mine through a connector on a daily basis, and I foresee that after some time the dataset will be huge (having millions of rows!) I want to delete rows from this dataset that I don't need anymore. The criteria for this can be multiple e.g: Delete rows with dates before a certain time. OR Delete rows which are classified as 'Obsolete'. 

 

Is there any way in which I can do this?

 

(P.S note that with the connector I'm using (CSV import) I cannot filter on the data I'm bringing in. Also I need to perform some SQL transforms on this dataset to tell which rows are obsolete)

Best Answer

  • hamza_123
    hamza_123 🟡
    Accepted Answer

    Hello, 

     

    In the solution that you give, my initial dataset will still have all the rows. It is only in the next step that I filter, but the initial dataset remains huge. 

     

    I want to reduce the size of the initial dataset because our organization now has tens of dataflows running (soon to go into hundreds), and we've seen that dataflows with large number of rows puts a lot of load on the server. The dataflow running time hence goes up 3 or 4 times (especially during the day when multiple teams are vewing dashboards). So we want to reduce the initial number of rows to begin with. 

     

    Since you mention Domo supports hundreds of millions of rows, could you please tell how many rows are permitted in one dataset/dataflow? This would be very helpful. 

     

    A solution I think might be applicable to my case is that I use the replace method (as you highlighted in the first response) to bring in data. I will add this data to my second dataset (after SQL transformations). In the second dataset, I then run another transformation which DELETES the entries that are obsolete. In that way, my first dataset will not be very large (because replace method is being used) while the second dataset will also not be large (since obsolete rows are being deleted). 

     

    Let me know what you think, or if there is a better way to do it. 

     

     

Answers

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🟤

    If you look at this thread, you can do it with MySQL

    https://dojo.domo.com/t5/Data-Sources-and-Connectors/Deleting-records-from-a-dataset-using-MySQL/m-p/23359

    Also, you could think about temporarily using the replace method and then go back to your append method. 

  • I'm sorry I couldn't get what's mentioned in the thread you posted. So how do  we use MYSQL exactly?

     

    Here are the steps in my dataflow:

     

    Step 1: Amazon S3 connector imports data using the append method. Data goes into a dataset. It has now hundreds of thousands of rows. 

     

    Step 2: I perform SQL transforms on that dataset and generate an output dataset. This transform incloves cleaning data and labelling which rows are "obsolete"

     

    ---

     

    So in which step do I use MySQL? If I use MYSQL transformations on the origninal dataset and generate a dataset with less rows, the original dataset is still going to be there! (having millions of rows). How do I get rid of that dataset?

  • Hi @hamza_123 ,

     

    After Step 2 of what you outlined in your dataflow process, you could:

     

    1) Add another SQL transform in the same SQL Dataflow with a where clause along the lines of:

        WHERE

             isObsolete = 0 OR

             dateCreated > DATEADD(Week, -2, dateCreated) 

       This will only include rows that are not obsolete, or rows that are dated within that last 2 weeks.

     

    2) Do the same thing as option 1, but instead using MagicETL Filter tiles

     

    Also, is there a reason you want to exclude old/obsolete data at the dataflow level rather than filter it out at the card level? Domo has the capacity to support hundreds of millions of rows, so I personally find it more useful to get all data into Domo, and then filter down to what's necessary at the card level

     

    Cheers,

    Eric