Rows Gone Missing in ETL

user027926
user027926 ⚪️

I created an ETL. The final step, where the data flows to the output, suddenly causes ~20K rows to go missing. Any insights?

Thanks!

Best Answers

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🔴
    Accepted Answer

    I just created a sample dataflow to confirm what I have been saying. See image below.

    I have a dataset with 15 rows. I added a filter tile that filters out 3 rows. Notice that it still says 15 rows processed next to Filter Rows. That is how many rows came into that tile. The filtering left 12 rows, which is what you see in the output dataset.

    The number in your output dataset is the result of your Remove Duplicates tile. The "rows processed" figure is how many rows come into a tile, not how many come out of it.

    Hope that makes sense.
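To make the "rows processed" reading concrete, here is a minimal Python sketch of the 15-row example above. It is not Domo code: `run_tiles`, the tile name, and the filter condition are all hypothetical stand-ins, chosen only to show that the count next to a tile is its input count.

```python
def run_tiles(rows, tiles):
    """Run rows through each (name, function) tile, recording rows in and rows out."""
    report = []
    for name, tile in tiles:
        rows_in = len(rows)          # what a "rows processed" label would show
        rows = tile(rows)
        report.append((name, rows_in, len(rows)))
    return rows, report


# 15 sample rows, mirroring the example in the post.
sample = [{"id": i} for i in range(1, 16)]

# Hypothetical filter that drops 3 rows (ids 1 to 3).
tiles = [("Filter Rows", lambda rs: [r for r in rs if r["id"] > 3])]

output, report = run_tiles(sample, tiles)
for name, rows_in, rows_out in report:
    print(f"{name}: processed (in) = {rows_in}, out = {rows_out}")
```

The tile reports 15 even though only 12 rows reach the output, which matches the behavior described in the post.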

  • user027926
    user027926 ⚪️
    Accepted Answer

    Thank you. It makes complete sense. It means the Join Data tile is reading 700K+ rows, so I need to find those missing rows elsewhere, as the output should be the higher number. Thank you!

  • user027926
    user027926 ⚪️
    Accepted Answer

    OK, so after a lot of Domo assistance, we have finally discovered the issue. In Domo Beta, when you do a join, you can also select what to do with duplicate columns. I did a right join, but mistakenly resolved the duplicate columns by dropping the right-side ones. This created a slew of nulls, because some rows no longer had my joining key, and those rows were then lost at the next join.
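The failure mode described above can be sketched in plain Python. This is an illustration under assumptions, not Domo's implementation: the `right_join` and `inner_join` helpers, the column names, and the `keep_key_from` option are hypothetical, and the sample data is made up. The point is only that keeping the left copy of the key after a right join leaves nulls for unmatched right rows, and a later join on that key drops them.

```python
def right_join(left, right, keep_key_from):
    """Right join on "id"; keep_key_from picks which side's "id" copy survives."""
    left_by_id = {row["id"]: row for row in left}  # assumes ids are unique on the left
    joined = []
    for r in right:
        l = left_by_id.get(r["id"])                # None when there is no left match
        left_id = l["id"] if l is not None else None
        left_val = l["left_val"] if l is not None else None
        key = r["id"] if keep_key_from == "right" else left_id
        joined.append({"id": key, "left_val": left_val, "right_val": r["right_val"]})
    return joined


def inner_join(rows, lookup_ids):
    """Keep only rows whose "id" appears in lookup_ids; a null id never matches."""
    return [row for row in rows if row["id"] in lookup_ids]


left = [{"id": 1, "left_val": "a"}, {"id": 2, "left_val": "b"}]
right = [
    {"id": 1, "right_val": "x"},
    {"id": 2, "right_val": "y"},
    {"id": 3, "right_val": "z"},  # no match on the left side
]
next_ids = {1, 2, 3}              # keys available in the following join

kept_right = right_join(left, right, keep_key_from="right")
kept_left = right_join(left, right, keep_key_from="left")   # right key dropped

print(len(inner_join(kept_right, next_ids)))  # all right rows survive
print(len(inner_join(kept_left, next_ids)))   # the row with a null key is lost
```

Counting null keys right after the join (here, `sum(row["id"] is None for row in kept_left)`) is one way to spot this kind of loss before the next step hides it.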

Answers

  • jaeW_at_Onyx
    jaeW_at_Onyx Budapest / Portland, OR 🟤

    You mean other than "remove duplicates?"


    I'm really not trying to be snide; it sounds like there's a disconnect between what you expect to happen after "remove duplicates" and what is actually happening... If you don't want to lose rows, can you just remove the "remove duplicates" tile?

  • user027926
    user027926 ⚪️

    That was the previous step, when it went from 700K to 400K. This step just passes that data to the output.

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🔴

    @user027926 I think those numbers are telling you how many rows came into that tile, not how many came out of it. So, there were 484,445 coming into the Remove Duplicates tile and after it completed, there were 468,410 rows.

    It would be a nice enhancement if they showed both the incoming and outgoing numbers for each tile, but you can deduce it by following the steps.
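As a rough sketch of "following the steps": if each tile only reports its incoming rows, a tile's outgoing count is the incoming count of the tile after it. The two figures below are the ones quoted in this post; the tile name "Output" and the list layout are assumptions for illustration.

```python
# (tile name, rows processed), in dataflow order. Counts are from the post above.
processed = [("Remove Duplicates", 484_445), ("Output", 468_410)]

# A tile's outgoing rows equal the next tile's incoming rows.
deduced = []
for (name, rows_in), (_next_name, next_in) in zip(processed, processed[1:]):
    deduced.append((name, rows_in, next_in, rows_in - next_in))

for name, rows_in, rows_out, dropped in deduced:
    print(f"{name}: in = {rows_in:,}, out = {rows_out:,}, dropped = {dropped:,}")
```

With these numbers the Remove Duplicates tile accounts for 16,035 dropped rows.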

  • user027926
    user027926 ⚪️

    The other number was accurate. It was 700K going in (or it should have been) and 400K going out. That's the only way that makes sense (I did a two-sided join in the tile before and then removed the duplicates).

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🔴

    You could test with some sample data by making up a basic Excel file with 10 rows or so, running it through a few of the same steps as your current dataflow, and seeing how the numbers look. When I looked at the details of one of my dataflows, that is what I deduced: it is telling you how many rows it processed (looked at), not how many it output.

  • user027926
    user027926 ⚪️

    Interesting. So, any idea why it is losing some? Here is a screenshot of a few steps back to give you a sense: