Compare rows and delete if one of the fields do not match

Reply
White Belt

Compare rows and delete if one of the fields do not match

Hello,

 

I have data like this.

1, red

1, blue

2, purple

2, purple

3, yellow

3,red

 

I'd like to edit the data to show the data like this:

1, red

2, purple

2, purple

3, yellow

 

So I'd only like to only keep duplicates if the first and second field each have a match in the data. Any ideas? Assuming it is ETL somehow but can't wrap my head around it.


Accepted Solutions
Major Blue Belt

Re: Compare rows and delete if one of the fields do not match

First, let me make sure I have my head wrapped around what you are trying to do.

 

Playing with your data, with a name for the columns:

 

Number   Color
1        red
1        blue
2        purple
2        purple
3        yellow
3        red
3 blue
3 yellow

If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

 

Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

 

Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

  1. Functions, just add one:
    1. "Row"
    2. Row Number
  2. Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
  3. Ascending
  4. `Number`

 

The data will look like this with the new column:

 

Number   Color      Row
1        red        1
1        blue       2
2        purple     1
2        purple     2
3        yellow     1
3        red        2
3        blue       3
3        yellow     4

Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

 

[Filter 1]: Row = 1
Number   Color      Row
1        red        1
2        purple     1
3        yellow     1

[Filter 2]: Row Not = 1
Number   Color      Row
1        blue       2
2        purple     2
3        red        2
3        blue       3
3        yellow     4

We will reference them as [Filter 1] and [Filter 2]

 

This gets a little harder to conceptualize, I'll try my best. 

 

Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

 

Find valid matches:

To identify them we will do a [Join Data] action with these settings:

  • [Filter 2] is on the left ([Filter 1] goes on the right automatically)
  • Inner Join
  • Key Fields
    • Number
    • Color

The preview should look like this at that point:

Number   Color      Row  Number_1  Color_1  Row_1
2        purple     2 
3        yellow     4 

Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

 

That should do it. Your final preview should look like:

Number   Color      Row
1        red        1
2        purple     1
3        yellow     1
2        purple     2 
3        yellow     4 

You can use a [Select Columns] action to remove the `Row` column if it bugs you.

 


**Say "Thanks" by clicking the "heart" in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"

All Replies
Major Blue Belt

Re: Compare rows and delete if one of the fields do not match

First, let me make sure I have my head wrapped around what you are trying to do.

 

Playing with your data, with a name for the columns:

 

Number   Color
1        red
1        blue
2        purple
2        purple
3        yellow
3        red
3 blue
3 yellow

If I understand correctly, you don't want the `Number` column to be associated with more than one value for `Color`. And if there are multiple values associated with the `Number` then you only want to keep the first found value. Also, any duplicates that have the same value as the first one can also be kept?

 

Assuming that I am on the right track, you can get there. Here is the way I think I would do it using Magic ETL:

 

Use a [Rank and Window] action to add a "Row Number" column that restarts at 1 for each `Number value`. These settings should work:

  1. Functions, just add one:
    1. "Row"
    2. Row Number
  2. Sorting is required and may re-arrange your data. If you are worried about which comes first you will need an extra column that can be sorted in your data.
  3. Ascending
  4. `Number`

 

The data will look like this with the new column:

 

Number   Color      Row
1        red        1
1        blue       2
2        purple     1
2        purple     2
3        yellow     1
3        red        2
3        blue       3
3        yellow     4

Then split to two different [Filter Rows] actions. One where `Row` = 1 and the other where `Row` Not = 1 and they will look like this:

 

[Filter 1]: Row = 1
Number   Color      Row
1        red        1
2        purple     1
3        yellow     1

[Filter 2]: Row Not = 1
Number   Color      Row
1        blue       2
2        purple     2
3        red        2
3        blue       3
3        yellow     4

We will reference them as [Filter 1] and [Filter 2]

 

This gets a little harder to conceptualize, I'll try my best. 

 

Concept: Remove the rows we don't want from the [Filter 2] set and then append it back into the [Filter 1] set. Sounds simple enough.

 

Find valid matches:

To identify them we will do a [Join Data] action with these settings:

  • [Filter 2] is on the left ([Filter 1] goes on the right automatically)
  • Inner Join
  • Key Fields
    • Number
    • Color

The preview should look like this at that point:

Number   Color      Row  Number_1  Color_1  Row_1
2        purple     2 
3        yellow     4 

Now we want to add this data back into the results of [Filter 1] and we will use an [Append Rows] action. In order to ignore the new columns with an "_1" after them you want to choose the correct dataset in Step 1 of the append rows action. You want to pick the [Filter 1] input.

 

That should do it. Your final preview should look like:

Number   Color      Row
1        red        1
2        purple     1
3        yellow     1
2        purple     2 
3        yellow     4 

You can use a [Select Columns] action to remove the `Row` column if it bugs you.

 


**Say "Thanks" by clicking the "heart" in the post that helped you.
**Please mark the post that solves your problem by clicking on "Accept as Solution"
White Belt

Re: Compare rows and delete if one of the fields do not match

Fantastic Reply!  thank you for the walkthrough and that should do it!  Will let you knowif I have any questions but looks somewhat straightforward!

Announcements
Looking for the latest Community solutions? Please visit our accepted solutions board here!