Combining datasets by looking for strings in one column

I have a couple datasets... 1) Product IDs with information 2) Google Analytics data with URLs that contain the product IDs on product pages

I'd like to join the first dataset to the GA data to include product information that aligns with the product page. Is there a way to join these datasets in which I have the second dataset look for a product ID that is contained in the URL and match accordingly?

Example:

Dataset 1

Tagged:

Answers

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🥷

    Do the URLs have a consistent format to them? If so, you can use the formula tile in Magic ETL to extract the ID at the end of the URL to create a new column called Product ID. You can then use the Join tile in Magic ETL and join the two datasets together. An example to use for the formula tile based on the two samples you gave would be to use the RIGHT function and extract the last 6 characters like this:

    RIGHT(url,6)

    This would get the product ID from the URL for you.

    **Make sure to <3 any users posts that helped you.
    **Please mark as accepted the ones who solved your issue.
  • They have the product ID at the end of the URL... however the Product IDs come in different lengths. They are however preceded by a '/' ex: https://climate.emerson.com/en-us/shop/1/copeland-sku-zp20kae-pfv-130

    So, I suppose I could try and pull out the last segment after a '/'. How would I do that?

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🥷

    I like to use the SPLIT_PART() function for this. It would look something like this:

    SPLIT_PART(url,'/',7)

    This should return everything after the last slash (i.e. copeland-sku-zp20kae-pfv-130)

    SPLIT_PART(STR,SEP,N)

    Returns the Nth substring of str split by sep. If N is zero or smaller, returns null. If there are fewer than N substrings after str has been split by the given separator, returns empty string.

    **Make sure to <3 any users posts that helped you.
    **Please mark as accepted the ones who solved your issue.
  • There might be a different number of '/' though for each URL... I also noticed this has "Copeland-sku-" in front of the part number.... maybe I can use the sku- as a deliminator.

  • MarkSnodgrass
    MarkSnodgrass Portland, Oregon 🥷

    Yes, try SPLIT_PART(url,'copeland-sku',2) and see if that gets you what you want.

    **Make sure to <3 any users posts that helped you.
    **Please mark as accepted the ones who solved your issue.
  • GrantSmith
    GrantSmith Indiana 🥷

    Hi @User_32265

    Regular expressions are a great tool for this use case when you're trying to parse our text from another with variable length or positioning. You can utilize this regular expression in a formula tile:

    REGEXP_REPLACE(`url`, '^.*/([^/]*)$', '$1')
    

    ^ Says start at the beginning of the string

    .* Match any number of any character

    / Explicitly matches the forward slash

    () Captures the group so we can replace the text in the second parameter

    [^/]* Matches any non-forward slash character 0 or more times

    $ Matches the end of the string

    In layman's terms we're saying return to me the characters at the end of the string after the last forward slash.

    $1 says to replace all of the text we found in the first part with the first matching group (what we saved from the regular expression using () )



    **Was this post helpful? Click Agree or Like below**

    **Did this solve your problem? Accept it as a solution!**
  • Thanks all... I've learned that not all the pages end in the product id... going back to my original question. Is there a way to look for strings contained in a field in order to match from another dataset with a field that contains that string?

  • GrantSmith
    GrantSmith Indiana 🥷

    Magic ETL doesn't support conditional joins so you'd need to set a constant for each of your datasets called 'Join Column' and set to a value of 1, then do a join on both datasets together. After that you can use this expression to filter your results.

    `URL` LIKE CONCAT('%', `Product ID`, '%')
    

    Not the most efficient method as it'll do a Cartesian join and then filter instead of filtering on the join (removing the amount of data needed to process).



    **Was this post helpful? Click Agree or Like below**

    **Did this solve your problem? Accept it as a solution!**
  • Thanks Grant.