
Clean Up Your Catalog with Semantic Product Matching


Duplicate product listings are more than just a nuisance—they confuse shoppers, dilute search relevance, and waste marketing resources. If you’ve ever spent hours sifting through near‑identical items or worrying that a hidden duplicate is stealing clicks, you know the hidden cost of a messy catalog. This workflow is built to take that burden off your shoulders and give you confidence that every item in your storefront truly stands alone.

You describe it

Semantic Product Matching

1. Overview

This process examines a list of product descriptions and identifies any items that are likely the same product, based on how closely their descriptions match in meaning. It produces a list of suspected duplicate products together with a similarity score for each pair.

2. Business Value

  • Reduces duplicate listings that can confuse customers and dilute search results.
  • Saves time for catalog managers by automatically spotting likely duplicates.
  • Improves inventory accuracy and prevents wasted marketing spend on duplicate items.

3. Operational Context

  • When to run: Whenever a new batch of product data is added or when a periodic clean‑up of the catalog is needed.
  • Who uses it: Data engineers, catalog managers, or any staff responsible for maintaining a clean product catalog.
  • Frequency: Typically run after a major data import, or on a scheduled basis (e.g., monthly) to keep the catalog tidy.

4. Inputs

Name/Label | Type | Details Provided
---------------------------------------------
Product Catalog | List of product items | Each item contains: Product ID – a short, human‑readable identifier (e.g., “TSHIRT‑BLUE‑001”); Product Name – the name shown to customers (e.g., “Blue Cotton T‑Shirt”); Product Description – the full description used in the storefront (e.g., “A comfortable blue cotton T‑shirt for everyday wear.”)
Similarity Threshold | Number (0 – 1) | The minimum similarity score required to flag two products as duplicates. The value must be between 0 (no similarity) and 1 (exact match).
Pre‑Processing Settings (optional) | List of options | Lowercase – convert text to lower case (default: enabled); Remove punctuation – strip punctuation characters (default: enabled); Trim whitespace – remove extra spaces (default: enabled)
Run Date (for audit) | Date | The date the process is executed (e.g., “2025‑08‑11”). This value is used only for reporting purposes.

5. Outputs

Name/Label | Contents | Formatting Rules
---------------------------------------------
Duplicate Pairs List | A collection of product pairs that are considered duplicates. For each pair: Product ID #1; Product ID #2; Similarity Score (rounded to two decimals) | List each pair only once (the ID that sorts first alphabetically goes in the Product ID #1 column). Order the list from highest to lowest similarity. Use a plain‑text table format (see example).
Summary Report | A brief narrative summarizing the run, including: total number of products processed; number of duplicate pairs found; any records that were skipped because of missing descriptions; the date of the run | One‑paragraph summary followed by bullet points for each statistic. Include a “Status” line: “Success” if the process completed without errors, otherwise “Error – see details”.

6. Detailed Plan & Execution Steps

  1. Load the product catalog

    • Verify the list is not empty.
    • Ensure every item has a non‑empty Product Description; if a description is missing, add the product to a “Missing Description” list and skip it for the similarity check.
  2. Pre‑process all descriptions

    • Convert text to lower case.
    • Remove punctuation characters (e.g., commas, periods).
    • Trim extra spaces.
    • Keep a copy of the cleaned text for each product (for debugging if needed).
  3. Create semantic vectors

    • For each cleaned description, generate a semantic vector that captures its meaning (use a built‑in language model).
  4. Calculate pairwise similarity

    • For every unique pair of products (i.e., do not compare a product to itself, and do not compare the same two products twice), compute the similarity score (cosine similarity) between their vectors.
    • Round the score to two decimal places.
  5. Apply the similarity threshold

    • If the score is greater than or equal to (≥) the supplied Similarity Threshold, mark the pair as a duplicate.
  6. Build the Duplicate Pairs List

    • For each flagged pair, record Product ID #1, Product ID #2, and the Similarity Score.
    • Ensure each pair appears only once (e.g., “A‑001” & “B‑002” is the same as “B‑002” & “A‑001”). Use alphabetical order of the IDs to decide placement.
  7. Sort the list

    • Order the rows from highest to lowest similarity score.
  8. Generate the Summary Report

    • Count the total number of products processed.
    • Count the number of duplicate pairs found.
    • List the identifiers of any products that were skipped due to missing descriptions.
    • Record the Run Date and the status (“Success” unless any fatal error occurred).
  9. Return the outputs

    • Produce the Duplicate Pairs List and the Summary Report exactly as described in the “Outputs” table.
  10. Log any errors

    • If any step fails (e.g., the similarity threshold is missing or outside 0‑1), stop processing and produce a Summary Report with “Status: Error – <reason>”, and do not produce a Duplicate Pairs List.
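Steps 4 to 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: the function names `cosine` and `find_duplicates` are hypothetical, and the three‑dimensional vectors are made‑up stand‑ins for whatever the language model would return in step 3.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(vectors, threshold):
    """vectors: dict mapping product ID -> vector (hypothetical shape)."""
    pairs = []
    # Sorting the IDs first means every pair comes out with the
    # alphabetically lower ID in first position, and only once.
    for id_a, id_b in combinations(sorted(vectors), 2):
        score = round(cosine(vectors[id_a], vectors[id_b]), 2)  # step 4
        if score >= threshold:                                   # step 5
            pairs.append((id_a, id_b, score))                    # step 6
    # Step 7: highest score first; ties broken by first ID, then second.
    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
    return pairs


# Toy vectors, invented for illustration only.
vectors = {
    "TSHIRT-002": [0.9, 0.1, 0.3],
    "SHOE-001": [0.1, 0.9, 0.2],
    "TSHIRT-001": [0.9, 0.1, 0.4],
}
duplicates = find_duplicates(vectors, threshold=0.85)
for id_a, id_b, score in duplicates:
    print(f"{id_a} | {id_b} | {score:.2f}")
```

Note that the score is rounded before the threshold comparison, following the order of steps 4 and 5; comparing the unrounded score instead would be an equally defensible choice and can change borderline results.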

7. Validation & Quality Checks

  • Input Validation

    • Confirm that Similarity Threshold is a number between 0 and 1. If not, stop with an error.
    • Verify each product has a non‑empty description; missing ones must be listed in the “Missing Description” section of the Summary Report.
  • Processing Checks

    • All similarity scores must be between 0.00 and 1.00 (inclusive).
    • Ensure each pair appears only once and that IDs are ordered alphabetically within each pair.
  • Output Verification

    • The Duplicate Pairs List must be sorted descending by similarity score.
    • The Summary Report numbers must match the actual data (e.g., the number of pairs listed matches the “Number of duplicate pairs found” statement).
  • Post‑run Review

    • Spot‑check 5 random pairs to verify that the descriptions look similar to a human reviewer.
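The input checks above could look roughly like this in Python. It is a sketch only: the function names and the `"id"` / `"description"` dictionary keys are assumptions made for the example, and the low‑threshold warning text is taken from the edge‑case rules in section 8.

```python
def validate_threshold(threshold):
    """Return a list of warnings; raise ValueError if the threshold is unusable."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("Similarity Threshold must be a number")
    if not 0 <= threshold <= 1:
        raise ValueError("Similarity Threshold must be between 0 and 1")
    warnings = []
    if threshold < 0.5:
        warnings.append("Low threshold may produce many false positives.")
    return warnings


def split_missing(catalog):
    """Separate usable items from those with an empty or absent description."""
    usable, missing_ids = [], []
    for item in catalog:
        if (item.get("description") or "").strip():
            usable.append(item)
        else:
            missing_ids.append(item["id"])
    return usable, missing_ids


catalog = [
    {"id": "TSHIRT-001", "description": "A comfortable blue cotton T-shirt."},
    {"id": "MUG-001", "description": "   "},
    {"id": "CAP-001"},
]
usable, missing_ids = split_missing(catalog)
```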

8. Special Rules / Edge Cases

Situation | Handling
---------------------------------------------
Missing Description | Add the product to the “Missing Description” list in the Summary Report. Do not include it in any similarity calculation.
Exact Same Description | Treat as a duplicate with a similarity score of 1.00 and list it as a pair.
Similarity Score = Threshold | Treat as a duplicate (≥ threshold).
Threshold Too Low (e.g., < 0.5) | Add a warning note in the Summary Report: “Low threshold may produce many false positives.”
Large Catalog (> 10 000 items) | Recommend processing in smaller batches to avoid memory overload; note this in the Summary Report if the batch size is reduced.
Multiple duplicates for one product | List each pair separately. Example: if product A matches both B and C, both A‑B and A‑C appear as distinct rows.
Failure to generate vectors | If the language model fails for any description, record the product in an “Error” list in the Summary Report and omit it from further calculations.
Empty Input List | Produce a Summary Report with “No products to process.” and stop.

9. Example

Input (provided for a single run)

  • Product Catalog (3 items)

    1. Product ID: “TSHIRT‑001” Product Name: “Blue Cotton T‑Shirt” Description: “A comfortable blue cotton T‑shirt for everyday wear.”
    2. Product ID: “TSHIRT‑002” Product Name: “Blue T‑Shirt” Description: “A comfortable blue cotton T‑shirt for everyday use.”
    3. Product ID: “SHOE‑001” Product Name: “Red Running Shoes” Description: “Lightweight red running shoes with breathable mesh.”
  • Similarity Threshold: 0.85

  • Pre‑Processing Settings: (default – lower case, remove punctuation, trim whitespace)

  • Run Date: 2025‑08‑11


Expected Output

Duplicate Pairs List

Product ID #1   | Product ID #2 | Similarity Score
-------------------------------------------------
TSHIRT‑001     | TSHIRT‑002    | 0.92

Summary Report

Run Date: 2025‑08‑11
Status: Success
- Total products processed: 3
- Duplicate pairs found: 1
- Items skipped due to missing description: 0

Note: No duplicate pairs found for the product “SHOE‑001”.

Appendix A – FAQ

Q1: What is “semantic similarity”? A: It measures how close two pieces of text are in meaning, not just in wording. Two descriptions that talk about the same product will get a high score even if they use different words.

Q2: How is the similarity score calculated? A: Each description is turned into a numeric representation (a vector). The similarity is then computed as the cosine of the angle between two vectors – a standard method for comparing meanings.
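Written out, the calculation is the dot product of the two vectors divided by the product of their lengths. The symbols u and v and the numbers in the worked example below are chosen purely for illustration; real description vectors have many more dimensions.

```latex
\[
\cos\theta
  = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
  = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
\]
% Worked example with made-up vectors u = (1, 2, 0) and v = (2, 1, 1)
\[
\cos\theta
  = \frac{1\cdot 2 + 2\cdot 1 + 0\cdot 1}{\sqrt{1+4+0}\,\sqrt{4+1+1}}
  = \frac{4}{\sqrt{30}}
  \approx 0.73
\]
```

With a threshold of 0.85, a pair scoring 0.73 would not be flagged.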

Q3: How should I choose the similarity threshold? A:

  • 0.90‑0.95: Very strict; only near‑identical descriptions will be flagged.
  • 0.80‑0.90: Balanced; catches most duplicates with moderate wording differences.
  • Below 0.80: Likely to generate false positives; use only for a very clean catalog.

Q4: What if two products have the same name but different descriptions? A: The process only looks at the Description field. If the descriptions differ significantly, they will not be flagged as duplicates unless the similarity score meets the threshold.

Q5: How are missing descriptions handled? A: Any product lacking a description is placed in the “Missing Description” list in the Summary Report. These items are not included in the similarity calculation.

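Q6: What if I have more than 10,000 products? A: Split the catalog into smaller batches (e.g., 5,000 per batch). Run the process separately for each batch and combine the resulting duplicate lists. Keep in mind that a run on one batch only finds duplicates inside that batch; to catch duplicates whose two products sit in different batches, the batches also have to be compared against each other.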

Q7: Can I change the pre‑processing steps? A: Yes. The default settings work well for most English language product catalogs. If you need to preserve punctuation that carries meaning (e.g., hyphens or slashes inside model numbers), you can disable the “Remove punctuation” option.

Q8: What if the language model fails for a description? A: The product is placed in an “Error” section of the Summary Report and excluded from further comparisons.

Q9: How are ties (identical scores) ordered? A: When scores are identical, pairs are ordered alphabetically by the first product ID, then by the second.

Q10: How do I interpret the similarity score? A:

  • 0.00 – no similarity at all.
  • 0.50 – moderate similarity, likely unrelated.
  • 0.80 – fairly similar.
  • 0.95 – almost identical; high confidence duplicate.

Q11: Should I manually review all flagged pairs? A: Yes. The process identifies candidates, but a human reviewer should confirm that the flagged items truly represent the same product before taking any action in the catalog.

Q12: How is the output used? A: The list of duplicate pairs can be exported to the catalog management system for merging, or it can be handed to a catalog manager for manual reconciliation.

Appendix B – Glossary

  • Product ID: A short, human‑readable identifier that uniquely names a product in the catalog (e.g., “TSHIRT‑001”).

  • Product Description: The textual field shown to customers that explains what the product is and its features.

  • Semantic Similarity: A measure of how closely two pieces of text convey the same meaning.

  • Embedding / Vector: A numeric representation of a piece of text that captures its meaning.

  • Cosine Similarity: A number that indicates how similar two vectors are; higher numbers mean the texts are more alike. Mathematically it ranges from −1 to 1, but this process expects scores between 0 and 1.

  • Threshold: The minimum similarity score required for two products to be considered duplicates.

  • Pre‑Processing: The steps applied to text (e.g., lowercasing, removing punctuation) before creating the vector representation.

  • Duplicate Pair: Two distinct products that have been flagged as likely representing the same item.

  • Batch: A subset of the whole product list processed at one time to manage resource usage.

  • Summary Report: A short textual summary that tells the user how many items were processed, how many duplicate pairs were found, and any issues encountered.

Appendix C – Reference Materials

C.1 Recommended Pre‑Processing Steps (default)

  1. Lowercase – Convert every character to its lowercase version.
  2. Remove punctuation – Delete characters such as . , ; : ? ! " ' ( ) [ ] { } –.
  3. Trim whitespace – Replace multiple spaces with a single space; remove leading/trailing spaces.

Note: If product names or descriptions contain important symbols (e.g., “+” in “Vitamin C+”), consider disabling the “Remove punctuation” step for those specific fields.
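A minimal Python sketch of these three steps follows. The function name `preprocess` and its keyword arguments are hypothetical. Two choices are assumptions rather than part of the spec: punctuation is replaced with a space instead of deleted outright, which is what the “t shirt” result in C.4 suggests, and only Unicode punctuation categories are stripped, so symbols such as “+” survive.

```python
import unicodedata


def preprocess(text, lowercase=True, remove_punctuation=True, trim_whitespace=True):
    """Apply the default pre-processing steps; each one can be switched off."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Unicode categories starting with "P" cover ASCII punctuation as well
        # as typographic dashes and quotes. Symbols such as "+" are category
        # "Sm" and are therefore kept.
        text = "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )
    if trim_whitespace:
        # split() with no argument drops leading/trailing whitespace and
        # collapses runs of whitespace into single separators.
        text = " ".join(text.split())
    return text


cleaned = preprocess("A Blue cotton T-Shirt, perfect for everyday wear!")
print(cleaned)
```

Passing `remove_punctuation=False` keeps hyphens and other marks intact, which is the switch the note above refers to.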

C.2 Suggested Similarity Threshold Ranges

Use‑Case | Recommended Threshold | Expected False‑Positive Rate
---------------------------------------------
High confidence only (e.g., final catalog clean‑up) | 0.90 – 0.95 | Low
Balanced (regular maintenance) | 0.80 – 0.90 | Moderate
Exploratory (large catalog, initial pass) | 0.70 – 0.80 | Higher (use for further manual filtering)

C.3 Handling Very Large Catalogs

Approximate Size | Recommended Approach
---------------------------------------------
≤ 5 000 items | Run in a single batch.
5 001 – 20 000 | Split into 2‑4 equal batches.
> 20 000 | Consider a streaming approach: compute similarity in chunks, store intermediate results, and merge after each chunk.
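One way to chunk the work without losing duplicates that straddle two chunks is to compare each chunk with itself and with every later chunk. The generator below is a sketch of that idea; `chunked_pairs` is a hypothetical name, and the approach is one possible reading of "compute similarity in chunks", not a documented part of the workflow.

```python
from itertools import combinations, product


def chunked_pairs(ids, chunk_size):
    """Yield every unique pair of IDs exactly once, one chunk pairing at a time."""
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    for a, chunk_a in enumerate(chunks):
        # Pairs whose two members are both inside this chunk.
        yield from combinations(chunk_a, 2)
        # Pairs that span this chunk and each later chunk.
        for chunk_b in chunks[a + 1:]:
            yield from product(chunk_a, chunk_b)


ids = ["A", "B", "C", "D", "E"]
pairs = list(chunked_pairs(ids, chunk_size=2))
print(len(pairs))
```

Only the vectors of two chunks need to be in memory at once, but the total number of comparisons is unchanged: it still grows with the square of the catalog size.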

C.4 Example of Pre‑Processed Text

Original Description | After Pre‑Processing
---------------------------------------------
“A Blue cotton T‑Shirt, perfect for everyday wear!” | “a blue cotton t shirt perfect for everyday wear”
“Red Running Shoes – lightweight & breathable.” | “red running shoes lightweight breathable”

C.5 Error Handling Templates

Missing Description

Product ID: [ID] – Missing description.

Vector Generation Failure

Product ID: [ID] – Failed to generate semantic vector. (Reason: <error message>)
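The vector‑failure template could be filled in as sketched below. Everything here is illustrative: `embed_all` is a hypothetical helper, and `toy_embed` is a stand‑in that fails on non‑ASCII text only so the error path can be shown; it is not a language model. The message uses a plain hyphen where the template has a dash.

```python
def embed_all(cleaned, embed):
    """cleaned: dict of product ID -> cleaned text; embed: any callable."""
    vectors, errors = {}, []
    for product_id, text in cleaned.items():
        try:
            vectors[product_id] = embed(text)
        except Exception as exc:
            # Record the failure and carry on with the remaining products.
            errors.append(
                f"Product ID: {product_id} - Failed to generate semantic vector. "
                f"(Reason: {exc})"
            )
    return vectors, errors


def toy_embed(text):
    """Stand-in for a real embedding call (assumption for this example)."""
    if not text.isascii():
        raise ValueError("unsupported characters")
    return [float(len(text)), float(text.count(" "))]


vectors, errors = embed_all(
    {"A-001": "blue shirt", "B-002": "caf\u00e9 mug"}, toy_embed
)
```

Products listed in `errors` are simply absent from `vectors`, so they drop out of every later comparison.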

C.6 Reporting Template

Header

Run Date: YYYY‑MM‑DD
Status: [Success | Error – <reason>]

Body

- Total products processed: X
- Duplicate pairs found: Y
- Records with missing description: Z
- Errors generating vectors: W

Duplicate List

[Product ID #1] | [Product ID #2] | Score

Notes

  • Include any relevant observations (e.g., “Threshold set low; review recommended.”)
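A small Python sketch of how the template might be rendered as plain text. The function name `render_report` and its parameters are assumptions for illustration; the line labels follow the template above.

```python
def render_report(run_date, status, total, pairs, missing_ids, error_ids, notes=()):
    """Build the Summary Report text; pairs is a list of (id1, id2, score)."""
    lines = [
        f"Run Date: {run_date}",
        f"Status: {status}",
        "",
        f"- Total products processed: {total}",
        f"- Duplicate pairs found: {len(pairs)}",
        f"- Records with missing description: {len(missing_ids)}",
        f"- Errors generating vectors: {len(error_ids)}",
        "",
    ]
    # One row per duplicate pair, score always shown with two decimals.
    lines += [f"{id_a} | {id_b} | {score:.2f}" for id_a, id_b, score in pairs]
    lines += [f"Note: {note}" for note in notes]
    return "\n".join(lines)


report = render_report(
    run_date="2025-08-11",
    status="Success",
    total=3,
    pairs=[("TSHIRT-001", "TSHIRT-002", 0.92)],
    missing_ids=[],
    error_ids=[],
)
print(report)
```

Deriving the counts from the lists themselves, rather than passing them in separately, keeps the reported numbers consistent with the rows actually listed.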

C.7 Best Practices for Review

  1. Validate the data source – Ensure the product catalog export is current.
  2. Run a sample – Before processing a whole catalog, run the procedure on a small sample to ensure the similarity threshold works for your data.
  3. Document decisions – Record why a particular threshold was chosen and any adjustments made after reviewing the initial output.
  4. Iterate – Adjust the threshold if you notice too many false positives or false negatives.
  5. Maintain a log – Keep a log file of each run, including the threshold used, number of duplicates found, and any manual changes made after review.
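When running a sample and iterating on the threshold, it can help to see how many pairs each candidate threshold would flag. The helper below is a hypothetical sketch; `threshold_sweep` is not part of the workflow, and the scores are invented sample values.

```python
def threshold_sweep(scores, thresholds=(0.70, 0.80, 0.90, 0.95)):
    """Count how many pair scores meet or exceed each candidate threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Invented similarity scores from a small sample run.
sample_scores = [0.92, 0.85, 0.71, 0.40]
counts = threshold_sweep(sample_scores)
for threshold, count in counts.items():
    print(f"threshold {threshold:.2f}: {count} pair(s) flagged")
```

A sharp jump in the count between two neighbouring thresholds is a hint to spot‑check the pairs that fall in that range before settling on a value.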

C.8 Sample Log Entry

2025‑08‑11 10:23:05 – Run started (Threshold: 0.85)
Processed 3,452 products
Found 128 duplicate pairs
Skipped 5 records (missing description)
Run completed – Status: Success

Additional Notes

  • The process does not modify any data; it only reads the product list and produces reports.
  • The generated Duplicate Pairs List can be fed into a catalog management system for further action, such as merging duplicate entries or flagging them for review.
  • Always keep a backup of the original product catalog before applying any merge or deletion operation based on the output.
  • If you need to run this process frequently, consider saving the vector representation for each product so that the similarity calculation can be done faster in subsequent runs.
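Saving vectors between runs can be as simple as a lookup keyed by a hash of the cleaned description, so a product is only re‑embedded when its text changes. The class below is a sketch under that assumption: `VectorCache` is a hypothetical name, the store is an in‑memory dict rather than a file or database, and the `embed` callable is a placeholder for the real model call.

```python
import hashlib


class VectorCache:
    """Reuse a stored vector whenever the cleaned text has been seen before."""

    def __init__(self, embed):
        self._embed = embed   # callable: cleaned text -> vector
        self._store = {}      # sha256 hex digest -> vector
        self.misses = 0       # number of times embed was actually called

    def get(self, cleaned_text):
        key = hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed(cleaned_text)
            self.misses += 1
        return self._store[key]


# Placeholder embedding, for illustration only.
cache = VectorCache(lambda text: [float(len(text))])
first = cache.get("a blue cotton t shirt")
second = cache.get("a blue cotton t shirt")   # served from the cache
third = cache.get("red running shoes")
```

Keying on the text rather than the Product ID means an edited description is re‑embedded automatically, and two products with identical text share one stored vector.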


We build it

Find Duplicates

Identify likely duplicate products in a catalog by semantic similarity of descriptions. Upload your product list, set matching parameters, and review flagged duplicates.


The Real Cost of Duplicate Listings

When the same product appears under multiple IDs, customers may hesitate, thinking the site is unreliable. Marketing teams end up splitting budget across duplicate ads, and inventory systems can misallocate stock. Manual review of hundreds or thousands of descriptions is error‑prone and drains valuable time that could be spent on growth initiatives.

How Semantic Matching Changes the Game

Instead of relying on exact text matches, this workflow uses modern language models to understand the meaning behind each description. By converting text into semantic vectors and measuring cosine similarity, it flags pairs that read alike even if the wording differs. The result is a curated list of likely duplicates that you can review with confidence.

  • Speed: What once required days of manual comparison now happens in minutes.
  • Precision: By setting a similarity threshold, you control how strict the matching is, reducing false positives.
  • Scalability: The process works whether you have a few hundred items or tens of thousands, with optional batch handling for very large catalogs.

What You Receive

Output | Value to Your Team
---------------------------------------------
Duplicate Pairs List | A ranked table of product ID pairs with similarity scores, ready for review or import into your catalog system.
Summary Report | A concise snapshot of the run, including total items processed, duplicate count, any missing descriptions, and overall status.

These artifacts give you both the granular detail needed for precise merges and the high‑level overview to track catalog health over time.


Key Insight
Even a modest similarity threshold can surface hidden duplicates that simple keyword searches miss, turning a hidden risk into a visible opportunity for cleanup.

Immediate Benefits You’ll Notice

  • Cleaner Search Results: Customers find the right product faster, improving conversion.
  • Reduced Marketing Waste: No more splitting ad spend across identical listings.
  • Lower Operational Load: Catalog managers spend less time hunting duplicates and more time enriching product data.

A Subtle Boost to Your Workflow

By integrating this semantic matching workflow into your regular data import or monthly maintenance schedule, you embed quality assurance directly into the lifecycle of your catalog. The process respects existing data pipelines, requires only the product descriptions you already maintain, and returns results in a format that fits seamlessly with most catalog management tools.

In the long run, a consistently de‑duplicated catalog not only strengthens the shopper experience but also builds a foundation for more advanced initiatives—like personalized recommendations or automated pricing—because the underlying data is reliable.

Let the power of language models handle the heavy lifting so you can focus on strategic decisions that grow your business.

Ready to Automate?

Get started with this workflow template in minutes. No complex setup required.
