Reusable dataflows / ETLs

We're trying to come up with a good strategy for dealing with large datasets that contain mostly historical data but need to undergo complex transformations in a dataflow or ETL.

For example, our user activity log (currently about 7M rows) needs to be joined to several other tables and then we need to segment the activity in multiple ways, group some of it together, etc.

We've been using a MySQL dataflow because of the complexity of the transformations but this takes a little while to run -- 20-30 minutes -- even if we only process 6 months' worth of activity (< 2M rows from the activity log). If we process the entire 7M-row activity log the dataflow can take over an hour to run.

So one general thought we've had is to only transform very recent data (say, from the current month) and then UNION it together with transformed historical data in a DataFusion. But we're struggling to come up with a good process for managing this in practice. Coming from a software-development background, my mind goes to reusable dataflows -- if it were possible to configure a DF to use a "variable" input this process would be pretty easy to manage.

So for example on the first day of the month the DF would use the historical activity log as its input, and then on subsequent days it would use the "month-to-date" activity log. I don't think this is literally possible in Domo, without manually reconfiguring the DF twice a month, but maybe someone has come up with a technique that approximates it?

Comments

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon πŸ₯·

    Have you tried rebuilding your MySQL dataflow in Magic ETL? The latest Magic ETL runs extremely fast, and I would expect it to handle 7M rows in under a couple of minutes. You can also use formula rules in the Filter tile, so you could filter to month-to-date at the beginning and then do the rest of your work, if you find you need to make the ETL run even faster. If you haven't watched the Domopalooza breakout session called Demystifying Data Pipelines, I would suggest watching it, as it gives a nice overview of the different types of data transformation tools and when to use them.

    Hope this helps.

    **Make sure to <3 any users posts that helped you.
    **Please mark as accepted the ones who solved your issue.
  • Ben_Dunlap
    Ben_Dunlap 🟑
    edited March 30

    Thanks, I'll check out that breakout session. We're kind of committed to using SQL dataflows for various reasons, at least for transforms of any complexity. These datasets are just going to keep getting bigger in any case so I'm really more interested in specific techniques for breaking large, mostly-static datasets down before transformation and then reassembling.

    The general approach I have in mind would refactor this naive approach:

    1) LARGE_DATASET -> DF -> LARGE_OUTPUT

    To something like this:

    1) MOST_OF_LARGE_DATASET -> DF -> LARGE_PARTIAL

    2) VERY_RECENT_DATA -> DF -> SMALL_PARTIAL

    3) LARGE_PARTIAL + SMALL_PARTIAL -> Union via Fusion or ETL -> FINAL_OUTPUT

    Where step 1 would only run, say, monthly. But I don't know how to make that happen without having two copies of the key dataflow, which then have to be maintained separately.
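
    To make the shape of step 3 concrete, here's a rough sketch of the reassembly as a SQL transform, assuming the two partial outputs are available as inputs; the table name and the `activity_date` cutoff column are illustrative, not from our actual dataflow:

    ```sql
    -- Step 3: reassemble the monthly-refreshed bulk with the month-to-date slice.
    -- Table and column names are hypothetical.
    SELECT *
    FROM large_partial
    WHERE activity_date < DATE_FORMAT(CURDATE(), '%Y-%m-01')   -- everything before this month
    UNION ALL
    SELECT *
    FROM small_partial
    WHERE activity_date >= DATE_FORMAT(CURDATE(), '%Y-%m-01'); -- month-to-date only
    ```

    Filtering both sides on the same month boundary keeps rows from double-counting if the two partials ever overlap.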

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon πŸ₯·

    @jaeW_at_Onyx is well-versed in this area. Jae, do you have some thoughts you can share to @Ben_Dunlap ?

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟀

    @Ben_Dunlap

    Flat out, with only 7M rows of data you (99% guaranteed) will get the best and fastest performance from Magic 2.0. The underlying transformation engine is more performant than the way data is piped in a MySQL dataflow (out of Domo Vault / Amazon S3, into a MySQL VM, and back to S3, vs. streaming in an Apache-esque ETL environment).


    Especially with the pending release of Magic 2.0 + Partitioning (which you're essentially describing through your incremental load process) your Dataflows are going to smoke MySQL performance.


    Domo has not invested in making MySQL dataflows faster in over 4 years, and that's reflected in the version of MySQL the dataflows run on (5.6 vs. 8+).


    I work with 1B+ and 500M+ row datasets and consulted for 100+ clients prior to joining Sony... I know how hard a pill this is to swallow, but there are VERY few use cases where SQL will outperform Magic. That said, Domo does have a premium feature (Adrenaline Dataflows) that allows you to write SQL directly against the database engine Domo uses to present cards, which will give you even faster performance than Magic 2.0 ... but if you're just transforming 7M rows I'd still point you at Magic.


    What you can do is use Dataset Views (which unfortunately are still somewhat buggy) to JOIN your datasets and perform light transforms, THEN pull the DSV into Magic 2.0 for any last processing steps and materialization. If you can do your ETL phase in a DSV, then you can just use Magic 2.0 for UNIONing and materializing (UNLOADing the dataset to disk in S3) the data. This would give you the best of all worlds (facilitating transforms of incrementally loaded data and taking advantage of the fastest ETL tools in Domo without relying too much on Magic).


    ... but seriously... for 7M rows just use Magic.

    Jae Wilson
    Check out my πŸŽ₯ Domo Training YouTube Channel πŸ‘¨β€πŸ’»

    **Say "Thanks" by clicking the ❀️ in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Ben_Dunlap
    Ben_Dunlap 🟑

    I'm sure you're right about performance but we have other reasons apart from performance for wanting to stick with SQL dataflows. We're actually just wrapping up a major project of migrating all of our ETLs to SQL dataflows. For the most part the run times aren't a problem, and even in this case we can live with the long run time since the underlying raw data is controlled by a vendor that only gives us access to daily snapshots. So the DF really only needs to run once a day. Just wondering if anyone had solved the partitioning problem using vanilla SQL DFs without the need for periodic manual reconfiguration of DFs or duplicate DFs.

  • ST_-Superman-_

    @Ben_Dunlap - I'll defer to @MarkSnodgrass and @jaeW_at_Onyx for understanding the technical workings under the hood of magic 2.0 over MySQL. It sounds to me like your hands may be tied when it comes to which type of transform you are using and the decision has already been made to use MySQL.

    In that case, I would highly encourage you to filter your input datasets as much as possible and make sure that you are adding an index before any joins. You could also reach out to Domo to see if a technical consultant could review your dataflow to help you optimize it.
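
    For anyone new to this pattern: in a MySQL dataflow the index is typically added as its own transform step that runs before the join. A minimal sketch, with purely illustrative table and column names:

    ```sql
    -- Transform step 1: index the join key on the raw input
    -- (table and column names are hypothetical)
    ALTER TABLE user_activity_log ADD INDEX idx_user_id (user_id);

    -- Transform step 2: the join can now use the index
    SELECT a.*, u.segment
    FROM user_activity_log a
    JOIN users u ON u.user_id = a.user_id;
    ```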

    Out of curiosity, what are the "various reasons" that are preventing you from switching to the magic 2.0 engine? For me, we ran into some roadblocks when it came to documentation. Some of our data flow processes needed to be documented in specific ways and pasting a screenshot of the ETL or the JSON text was not sufficient.

    If it is just a matter of needing to have precise control over "complicated" joins, there is probably a way to manage that with magic 2.


    ______________________________________________________________________________________________
    β€œThere is a superhero in all of us, we just need the courage to put on the cape.” -Superman
    ______________________________________________________________________________________________
  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟀

    @Ben_Dunlap

    Ooof... migrated away from Magic. ooook.


    Are you familiar with recursive dataflows? I cannot ... de-recommend them strongly enough, but it is an anti-pattern that works.

    You can add indexes to your MySQL dataflow which may boost performance during the JOIN action

    https://domohelp.domo.com/hc/en-us/articles/360042923014-Optimizing-a-SQL-DataFlow



    https://domohelp.domo.com/hc/en-us/articles/360043428133-Creating-a-Recursive-Snapshot-SQL-DataFlow
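
    The core of the recursive pattern (covered in more depth in the help article above) is feeding the dataflow's own previous output back in as an input alongside the new daily snapshot. A minimal sketch, with all input names hypothetical:

    ```sql
    -- previous_output: this dataflow's own last run, configured as an input
    -- daily_snapshot: today's vendor extract
    SELECT *
    FROM previous_output
    WHERE snapshot_date < (SELECT MIN(snapshot_date) FROM daily_snapshot)
    UNION ALL
    SELECT *
    FROM daily_snapshot;
    ```

    The WHERE clause keeps rows already covered by the new snapshot from being appended twice.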

  • Ben_Dunlap
    Ben_Dunlap 🟑
    edited April 1

    Thanks all, we're using indexes in our DFs and we've pushed some of our transforms down into the Postgres layer (our key raw datasets are coming out of a read-only Postgres db that our vendor provides to us). Both of these optimizations help a lot with performance.
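
    For anyone curious what "pushing a transform down" looks like here: instead of pulling raw tables into Domo and aggregating there, the connector query runs the work inside the source db. A rough sketch against the read-only Postgres instance (schema and column names are illustrative, not our vendor's actual schema):

    ```sql
    -- Runs in the vendor's read-only Postgres, so only the
    -- pre-aggregated rows are imported into Domo
    SELECT a.user_id,
           date_trunc('month', a.activity_ts) AS activity_month,
           count(*) AS events
    FROM activity_log a
    GROUP BY a.user_id, date_trunc('month', a.activity_ts);
    ```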

    @ST_-Superman-_ we had a few reasons for migrating to SQL:

    1) We were finding that it often took several non-obvious steps in Magic to do something that would be very simple in SQL, which led to bloated and difficult-to-understand ETLs. I'm sure a lot of this has been solved in Magic 2.0 but we made this decision when all of our ETLs were still pre-2.0

    2) At the time of the decision, our business relationship with Domo was at an all-time low. This has improved quite a bit since then, but back then the goal was to become less dependent on Domo's proprietary tech, not more. SQL queries can be copy/pasted as a simple export/backup.

    3) Along the same lines, we had just hired our first full-time data analyst and I was more interested in giving this person an opportunity to develop portable skills than in having them become a Domo expert

    So to summarize: We prioritized portability and "readability of code" (so to speak) over performance. Given our relatively small datasets and the way that the data is made available to us, this still looks to me like a good trade-off for at least the next few years.

    @jaeW_at_Onyx Recursive dataflows are intriguing, but in our use case one of the main performance bottlenecks is the step of loading all of the historical data as an input to the DF, which I think would still need to happen in a recursive dataflow. It's giving me some ideas though. I'll post back here if any of them come to anything.

  • DataMaven
    DataMaven 🟒

    @Ben_Dunlap - I strongly concur with the rest of the folks here that you should give Magic another shot. 7 Million rows isn't a very large dataset, so even if you just use Magic when you need performance, it would help.

    If your analysts aren't interacting much with end-users, SQL can make sense as a way to leverage existing skillsets. However, if they are working a lot with end-users, Magic sure does help build trust and understanding!

    There are some minor logic shifts which make the transformations much easier to navigate. If you have any questions about how to best achieve results, this group can help. A big advantage to Magic is that processes can run concurrently, so a well-designed dataflow will totally smoke SQL.

    If you want to know how to grab the code from the Magic ETL dataflow, the folks responding here can help with that, too. Wanting to be able to mine your code and not be beholden totally makes sense.

    DataMaven
    Breaking Down Silos - Building Bridges
    Check out my video!
    **Say "Thanks" by clicking the heart in the post that helped you.
    **Please mark the post that solves your problem by clicking on "Accept as Solution"
  • Ben_Dunlap
    Ben_Dunlap 🟑

    Thanks again all. I'm definitely seeing the appeal of Magic 2. I think we will probably end up with some sort of compromise where we still use a fair amount of SQL but do heavier-duty joins in Magic or in the underlying Postgres db. Just discovered the Java CLI thanks to a post elsewhere by @jaeW_at_Onyx, so now we have access to version-controllable JSONs of all of our stuff, which is a game-changer. I'm always amazed at how much more there is to dig into with Domo!