How to remove rows from a dataset, with the output being the same dataset
Summary: I want to remove rows from a dataset based on certain conditions, and I want the final dataset to be the same dataset as the input (just without the deleted rows).
Step 1: I get a file with 100,000 rows every day (through the Amazon S3 connector). I use the append method, so each day's data keeps adding to the old data. Let's call this the "S3 File".
Step 2: I then use a SQL dataflow to join this "S3 File" with another table (to add additional information). I get another output file from this flow. Let's call this the "Final File". I use the "Final File" in a dashboard.
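For context, the Step 2 join looks roughly like this sketch (all table and column names here are hypothetical placeholders, and I'm assuming MySQL-style syntax as used in Domo SQL dataflows):

```sql
-- Sketch of Step 2: enrich the appended S3 data with extra columns
-- from a lookup table. Names are placeholders, not my real schema.
SELECT s.*,
       i.additional_info
FROM   s3_file s
LEFT JOIN info_table i
       ON s.record_id = i.record_id;
```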
The "S3 File", over a period of a month, now has over 3 million rows. This has 4 drawbacks:
1. The SQL dataflow (Step 2 in my explanation) takes a lot of time to run. It's now over 1 hour.
2. In the morning, when hundreds of our company's dataflows are running, the server slows down (we assume because of huge files like the one discussed being processed). Processing times are doubled, even tripled.
3. We have a limit on the number of rows we're allowed, and it will soon run out if this keeps going.
4. With a dataflow time of one hour (which could soon grow to 2 hours or even more), our dashboard users see the latest data very late in the day, and they would prefer to see it earlier.
I'm looking for a way to keep only the relevant rows in the dataset. I determine relevance in Step 2 (in my explanation above) when I run the SQL dataflow (when I join tables with the "S3 File"). For simplicity, let's assume each row has a "start date". If the start date is more than 30 days older than the current date, I want to delete that row.
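The 30-day rule above could be written as a simple filter, something like this sketch (assuming MySQL-style syntax and a column literally named `start_date` — both assumptions on my part):

```sql
-- Keep only rows whose start_date falls within the last 30 days;
-- everything older is considered outdated and dropped.
SELECT *
FROM   s3_file
WHERE  start_date >= CURDATE() - INTERVAL 30 DAY;
```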
One way proposed was to use the replace method in Step 1 (for the "S3 File"). I would then take the file to Step 2, run the SQL dataflow, and weed out outdated rows (with a DELETE statement, or just a SELECT), storing only the relevant ones. The problem with this is that the data we get daily is ALWAYS relevant that day, and stays relevant for up to 30 days. So replacing the dataset each day would throw away rows that are still needed.
The only way then is to perform SQL operations on the big dataset and delete the outdated rows. The problem with this is that in a SQL dataflow, the output dataset cannot be the same as the input dataset (I tried naming them the same, but it doesn't help). Even if I create a new dataset with the reduced rows, a big dataset still remains in the backend.
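If the platform did allow an output to be fed back as an input (a "recursive" rolling-window pattern, which my SQL dataflow apparently does not permit when the names match), the query shape would be something like this sketch. `trimmed_output` and `todays_s3_rows` are hypothetical dataset names:

```sql
-- Hypothetical rolling-window pattern: union the previously trimmed
-- dataset with today's new rows, dropping anything over 30 days old.
SELECT * FROM trimmed_output
WHERE  start_date >= CURDATE() - INTERVAL 30 DAY
UNION ALL
SELECT * FROM todays_s3_rows
WHERE  start_date >= CURDATE() - INTERVAL 30 DAY;
```

This keeps the stored dataset at roughly 30 days x 100,000 rows (about 3 million) instead of growing without bound, but it only works if the dataflow tool supports writing back to one of its own inputs.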
I would love an answer on this from the Domo experts.