Deleting intermediary Dataset values after the final Output Dataset

AnwarBham ⚪️
edited April 30 in New to Domo

Hi, I'm new to Domo. We have a lot of very large ETLs that are broken down into stages.

After each stage an intermediary dataset is created, to the point that we have billions of rows in datasets that are only used as staging tables.

My question is that these tables take up space. Once the final table is created, how can we delete all the rows from the input and intermediary tables? We would still maintain the structure of the tables, with column names, but with zero rows and no data size.

I have tried creating a query via MySQL to delete from the ETL and the source tables, but it does not work.

It seems very inefficient to keep these tables occupying space.


Best Answer

  • SpencerWilcox
    Accepted Answer

    Hey Anwar, this is Spencer Wilcox, one of the consultants on the project. Totally agree that there are a ton of staging tables currently and that we are all aligned on consolidating these data flows down to fewer, more holistic ETLs. The reason it was built this way had nothing to do with tooling limitations.

    It was built like this to make fine-grained detail visible at each step of the business logic. This detail was needed to suss out the business logic disagreements between several groups within the client, and then to help build trust as we worked with those groups to align business logic around a common standard.

    That job of alignment now being done, we agree that it is just fine (and preferable) to consolidate these data flows down to a much smaller count.

    These consolidated flows can live in either our SQL environments or our Magic ETL environment.

Answers

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @AnwarBham You probably shouldn't delete intermediate tables and dataflows. It would be impossible to reconstruct your pipeline without rebuilding the dataflows all over again.


    1) You could refactor and consolidate your intermediate steps.

    2) Do nothing. If row limits are not a problem, it doesn't really matter.


    Bigger question: what inefficiency are you worried about?

  • @jaeW_at_Onyx I don't want to delete the tables, only the rows. That way I maintain the dataflows and pipelines.

    It's inefficient because these are only temporary datasets and are not used in any Cards; they just take up space.

    In SQL I can create the same tables virtually in memory and dispose of them, and the same in Qlik.

    Let me ask another way:

    is there any way I can update a dataset (ETL or MySQL) without creating a new dataset?

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @AnwarBham I think the important thing to keep in mind is that you're not working in a SQL environment. So while it helps to understand how SQL works and know where the overlaps are, it's not a 1:1 match.

    Can you update a dataset in ETL without creating a new dataset? Yes: add a filter that's always false.

    Should you do that... well... no, because then you have to manually update the dataflow next time you want to run it again. THAT is a bigger inefficiency than a data center full of datasets you don't look at in your staging database.

    There is no performance cost to having the datasets, and you as an individual (unlike DBAs) don't have to worry about the cost implication of a database sitting around in Vault / Domo's Data Lake layer.

    To put a finer point on it: while you can do what you're asking about, unless you are running into row limits (i.e. Domo has a cap on how many rows of data you can have in your data center), your activity has zero value-add in terms of technical optimization of your data center, and probably zero value-add to the business, because, as you said, it's the staging layer, so they don't care. And you're making work, which costs time.

    We did an entire conference session about how Domo works at the IDEA Exchange:

    https://www.youtube.com/watch?v=kmC-0ncu9PA&list=PLUy_qbtzH0S6-5oDbx3BsIv2Xk-JxJxWi&index=8

  • We have a limitation on the number of rows in Domo, so these temp tables are taking up our quota.

    It's an inefficient use of tables; they hold no value to the business.

  • So I tried via the Python API.

    It only allows me to modify and update datasets created via the Python API, which I find limiting. I also think there needs to be more documentation on the Python side.

    I have seen several posts with similar requests but no solution.
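    For what it's worth, Domo's public DataSet API does expose a way to replace a dataset's contents wholesale, which in principle could push a header-only CSV to zero out a dataset while keeping its schema. A minimal sketch, assuming the `PUT /v1/datasets/{id}/data` import endpoint and placeholder credentials; as noted above, this only applies to datasets created through the API, not to Magic ETL outputs:

```python
import csv
import io

def empty_csv(columns):
    """Build a CSV payload containing only the header row, so a
    dataset keeps its column names but drops every row of data."""
    buf = io.StringIO()
    csv.writer(buf).writerow(columns)
    return buf.getvalue()

# Hypothetical usage against Domo's DataSet API; DATASET_ID and TOKEN
# are placeholders, and the column names are made up for illustration.
# import requests
# requests.put(
#     f"https://api.domo.com/v1/datasets/{DATASET_ID}/data",
#     headers={"Authorization": f"Bearer {TOKEN}",
#              "Content-Type": "text/csv"},
#     data=empty_csv(["order_id", "amount", "load_ts"]),
# )
```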

  • rahul93
    rahul93 NY 🟠

    Hey @AnwarBham,


    One of the ways you can do this is by adding a timestamp column to these tables; based on your loading times, you can write something that deletes all the historical data you don't want in your tables.

    I would also consider what @jaeW_at_Onyx is saying. Your problem looks like data is being transformed and added into datasets in a way that, in Domo, could be put into a single dataflow without creating a new dataset. If you can provide some examples that would be great, as this looks like a pretty solvable issue.


    Thanks!
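    The timestamp idea above can be sketched with a quick stand-in using Python's built-in sqlite3 (the table and column names here are hypothetical). In a Domo MySQL dataflow the same idea would be a DELETE transform keyed on the load timestamp:

```python
import sqlite3

# Stand-in for a staging table with a load-timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (order_id INTEGER, load_ts TEXT)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [(1, "2021-04-01"), (2, "2021-04-29"), (3, "2021-04-30")],
)

# Keep only the current day's load; everything older is historical
# staging data that no Card depends on.
conn.execute("DELETE FROM staging WHERE load_ts < ?", ("2021-04-30",))
remaining = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(remaining)  # 1
```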

  • AnwarBham
    AnwarBham ⚪️

    Well, this is the solution Domo has provided us; the different ETLs I refer to were built by Domo consultants. So recreating it as one big ETL is not an option: we have 1.9 billion rows in the final table.

    The timestamp is a good idea, but as Domo only lets you run the ETL once a day in the scheduler, and we do a daily load, it will not be feasible.

  • Giacomo
    Giacomo ⚪️

    If you were to set up your dataflows as recursive dataflows, then I suspect you might be able to do what you're looking for. I'm thinking you can take your existing workflow, add a branch out to a filter whose value is always false, like 1 = 0, and then output to your input dataset recursively, so that upon completion your input dataset is overwritten with a zero-row version of itself.

    Hope that all made sense
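    The always-false filter branch above can be illustrated in pandas terms (column names are hypothetical, and this is a rough sketch rather than Domo's actual Magic ETL engine): the mask drops every row, but the schema survives, so writing the result back over the input dataset leaves a zero-row copy of itself.

```python
import pandas as pd

# Stand-in for the input dataset feeding the recursive dataflow.
staging = pd.DataFrame({"order_id": [1, 2, 3],
                        "amount": [10.0, 20.0, 30.0]})

# The "1 = 0" trick: a filter condition that is false for every row.
always_false = pd.Series(False, index=staging.index)
emptied = staging[always_false]

# Zero rows remain, but the column structure is intact.
print(len(emptied), list(emptied.columns))  # 0 ['order_id', 'amount']
```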

  • AnwarBham
    AnwarBham ⚪️

    Thanks for the help, guys. But from looking into Domo extensively I have come to the personal conclusion that the ETL is very limited.

    You have to create several inefficient pipelines for a simple task that can be carried out in SQL, Qlik, or DAX. Having four types of ETL tools in one application limits your ability to improve all of them; I'm reminded of the phrase "jack of all trades, master of none."

    Personally I think Domo needs to invest more in the actual product and the technical aspects. At the moment it seems the investment has been in PR and the number of connectors.

    We have over 1,000 ETLs and datasets mixed across DataFusion, Redshift, MySQL, and Magic to do the same task we can do with one ETL from another vendor. You can see why, from my point of view, I question whether this is an agile solution, or an efficient use of time, when trawling through 100 ETLs to find the origins of one field.

    We are just going to use another ETL tool to do the heavy work and leave the simple ETL tasks for Domo. The software is just not mature enough. We have several contractors from onyxreporting who have agreed.

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    @AnwarBham please don't throw my consultancy's name around. You do not have several contractors who have agreed with anything. Thanks.

  • AnwarBham
    AnwarBham ⚪️

    My apologies; it was Domo who agreed to do the heavy ETLs outside. But Jae, I can see you have created a few datasets and dataflows in our environment; we can set up a call in future to discuss cleaning these up and keeping the house clean. Thanks.