Compare rows and delete if one of the fields do not match

Hello,

 

I have data like this.

1, red

1, blue

2, purple

2, purple

3, yellow

3,red

 

I'd like to edit the data to show the data like this:

1, red

2, purple

2, purple

3, yellow

 

So I'd only like to only keep duplicates if the first and second field each have a match in the data. Any ideas? Assuming it is ETL somehow but can't wrap my head around it.

Best Answer

  • n8isjack
    n8isjack 🟠
    Accepted Answer

    First, let me make sure I have my head wrapped around what you are trying to do.

     

    Playing with your data, with a name for the columns:

     

    Number   Color
    1 red
    1 blue
    2 purple
    2 purple
    3 yellow
    3 red
    3 blue
    3 yellow

    If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

     

    Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

     

    Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

    1. Functions, just add one:
      1. "Row"
      2. Row Number
    2. Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
    3. Ascending
    4. `Number`

     

    The data will look like this with the new column:

     

    Number   Color      Row
    1 red 1
    1 blue 2
    2 purple 1
    2 purple 2
    3 yellow 1
    3 red 2
    3 blue 3
    3 yellow 4

    Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

     

    [Filter 1]: Row = 1
    Number Color Row
    1 red 1
    2 purple 1
    3 yellow 1

    [Filter 2]: Row Not = 1
    Number Color Row
    1 blue 2
    2 purple 2
    3 red 2
    3 blue 3
    3 yellow 4

    We will reference them as [Filter 1] and [Filter 2]

     

    This gets a little harder to conceptualize, I'll try my best. 

     

    Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

     

    Find valid matches:

    To identify them we will do a [Join Data] action with these settings:

    • [Filter 2] is on the left ([Filter 1] goes on the right automatically)
    • Inner Join
    • Key Fields
      • Number
      • Color

    The preview should look like this at that point:

    Number   Color      Row  Number_1  Color_1  Row_1
    2 purple 2
    3 yellow 4

    Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

     

    That should do it. Your final preview should look like:

    Number   Color      Row
    1 red 1
    2 purple 1
    3 yellow 1
    2 purple 2
    3 yellow 4

    You can use a [Select Columns] action to remove the `Row` column if it bugs you.

     

Answers

  • Fantastic Reply!  thank you for the walkthrough and that should do it!  Will let you knowif I have any questions but looks somewhat straightforward!