Row Numbers from original data

OllieSouthgateAKA
edited September 2 in Dataflows

Hi there! Huge fan of the forums and hoping you can help me with something that previous posts and many late night Google searches have eluded me on...

We have an email connector pulling in many CSV files from attachments and appending them one after the other in the order received. The order of the data within each CSV is very important as it's not "raw data" as such - the first few lines of each CSV contain important information about everything in the lines underneath them. We successfully solved for this using SQL but for that Dataflow to work (the next step after an initial MagicETL) it's very important that the order is retained so I am trying to make that stick by giving myself a RowNumber field to order on.

In testing with a few thousand rows, this was easily achieved by first adding a Constant column (with a fixed value of 1) and then using Rank & Window to calculate row numbers based on that - for whatever reason, even though they were all "1", MagicETL very happily ranked them 1 through the maximum row number, in the order that the data had arrived in. This made it super easy to always have a reference point to give both MagicETL and a subsequent Dataflow to order from and apply many different order-dependent transforms to in order to pull the "headers" in the top of the CSV down against every row, until it found the next one, and so on.

However I'm now trying to do the exact same thing on the complete dataset of about 1.3M rows and something about the increased size means it's now picking its own order to rank the constant on which is making nonsense of the rest of the steps in the MagicETL flow - it loads the data from the email connector in the right order, when I add the constant "1" column it remains in that order, but once I try and give that constant column a ranking it starts moving arbitrary rows around into a different order (is this something to do with the number of rows? As I mentioned when it was just a few thousand rows it understood how to rank them in the same order).

Unfortunately the raw data as it comes in the CSV does not have any columns I can base a logical order on (and comes from a third party so I'm not going to be able to be fix that), but since DOMO clearly knows somehow to retain that order for all the steps up to this point, I'm really confused as to why it suddenly forgets that at the Rank step! Any tips? Or any alternative ways using either type of flow or in configuring the original connector that I can add row numbers based on the original order the data was collected and appended in?

Thanks in advance - some screenshots below :)

Ollie


Raw data from email connector:


After adding the constant, the above looks exactly the same.


Then this Rank & Window function is applied:


And after that, although the RowNum column appears in the correct order (1 at the top, +1 for each row that follows), it has moved all the rows into an order I can't make sense of:


Answers

  • GrantSmith
    GrantSmith Indiana 🔴

    Hi @OllieSouthgateAKA

    My assumption here is that when Domo is processing your data the underlying architecture / platform they're using to process the data doesn't respect row ordering when performing a rank operation. This is typically the case with Big Data solutions because it's faster to read your large data set in chunks simultaneously instead of reading the dataset one row after another. If you're unable to get a sorting method implemented within the CSV file records you're likely out of luck using a Magic ETL dataflow.

    Since you mentioned you're already using a SQL dataflow have you thought about attempting to calculate the row number for the CSV when you're processing it in that SQL dataflow?

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    if all you want to do is apply header row values (in col 1) across all rows of your dataset, then


    1) split your data into header rows and transaction rows (FILTER looking for NULLs)

    2) spread the header values (organized in rows) into columns using PIVOT

    3) add the constant 1 to both header_set and transaction_set.

    4) cross apply using inner join on 1 = 1


    I have an example here

    but it 's obviously not 100% the same implementation you need.

  • OllieSouthgateAKA
    edited September 2

    Hi @GrantSmith - the SQL dataflow I referred to actually currently happens AFTER the ETL. But I am not opposed to adding one in between the raw data and the ETL if that could fix it! Any tips on what I would do in the SQL Dataflow part in that case? Googling around only returned functions that are only available in later versions of MySQL...

    Thanks @jaeW_at_Onyx . What's weird is this is exactly what I already have implemented (just on multiple layers of headers) and in a preview in MagicETL2 it's still in the right order right up to previewing the final output title, even up to a 400K row preview, but then when I actually run it on the complete dataset the ordering all goes awry again :( but thanks for the tips!

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    sure. but keep in mind MAgic 2.0 is a distributed ETL engine, if it can, it will chunk your job into smaller parts and distribute it. so b/c it's being distrbuted row sort order can't necessarily be guaranteed.


    the solution i've described doesn't require the rows to return in a specific order. for your header rows you're filtering on rows that you've identified as 'header rows' b/c the value columns are NULL.

  • Omaurya
    Omaurya ⚪️
    edited September 12

    Hi Ollie,

    I am wondering whether you tried ranking the data immediately after CSV connector, before appending the same with the larger dataset? Since the data coming from the connector is small, it might be easier to get the ranking done and it is likely to work correctly.

    Thanks! Om