Semantic Product Matching
1. Overview
This process examines a list of product descriptions and identifies any items that are likely the same product, based on how closely their descriptions match in meaning. It produces a list of suspected duplicate products together with a similarity score for each pair.
2. Business Value
- Reduces duplicate listings that can confuse customers and dilute search results.
- Saves time for catalog managers by automatically spotting likely duplicates.
- Improves inventory accuracy and prevents wasted marketing spend on duplicate items.
3. Operational Context
- When to run: Whenever a new batch of product data is added or when a periodic clean‑up of the catalog is needed.
- Who uses it: Data engineers, catalog managers, or any staff responsible for maintaining a clean product catalog.
- Frequency: Typically run after a major data import, or on a scheduled basis (e.g., monthly) to keep the catalog tidy.
4. Inputs
| Name/Label | Type | Details Provided |
|---|---|---|
| Product Catalog | List of product items | Each item contains: Product ID – a short, human‑readable identifier (e.g., “TSHIRT‑BLUE‑001”); Product Name – the name shown to customers (e.g., “Blue Cotton T‑Shirt”); Product Description – the full description used in the storefront (e.g., “A comfortable blue cotton T‑shirt for everyday wear.”). |
| Similarity Threshold | Number (0–1) | The minimum similarity score required to flag two products as duplicates. The value must be between 0 (no similarity) and 1 (exact match). |
| Pre‑Processing Settings (optional) | List of options | Lowercase – convert text to lower case (default: enabled); Remove punctuation – strip punctuation characters (default: enabled); Trim whitespace – remove extra spaces (default: enabled). |
| Run Date (for audit) | Date | The date the process is executed (e.g., “2025‑08‑11”). Used only for reporting purposes. |
5. Outputs
| Name/Label | Contents | Formatting Rules |
|---|---|---|
| Duplicate Pairs List | One row per duplicate pair: Product ID #1, Product ID #2, Similarity Score (rounded to two decimals). | List each pair only once (always present the lower‑alphabetical ID first); order the list from highest to lowest similarity; use a plain‑text table format (see example). |
| Summary Report | A brief narrative summarizing the run: total number of products processed, number of duplicate pairs found, any records skipped because of missing descriptions, and the date of the run. | One paragraph summary followed by bullet points for each statistic; include a “Status” line: “Success” if the process completed without errors, otherwise “Error – see details”. |
6. Detailed Plan & Execution Steps
Step 1 – Load the product catalog
- Verify the list is not empty.
- Ensure every item has a non‑empty Product Description; if a description is missing, add the product to a “Missing Description” list and skip it for the similarity check.
 
Step 2 – Pre‑process all descriptions
- Convert text to lower case.
- Remove punctuation characters (e.g., commas, periods).
- Trim extra spaces.
- Keep a copy of the cleaned text for each product (for debugging if needed).
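A minimal sketch of these pre‑processing rules in Python follows. The function name is illustrative, and the choice of replacing punctuation with spaces (so that hyphenated words such as “T‑Shirt” become “t shirt”, matching the Appendix C.4 example) is an assumption rather than a requirement.

```python
import re

def preprocess(text: str) -> str:
    """Default clean-up: lowercase, replace punctuation with spaces, trim whitespace."""
    text = text.lower()                          # Lowercase
    text = re.sub(r"[^\w\s]", " ", text)         # Remove punctuation (replaced with spaces)
    return re.sub(r"\s+", " ", text).strip()     # Collapse repeated spaces and trim the ends

print(preprocess("A Blue cotton T-Shirt, perfect for everyday wear!"))
# a blue cotton t shirt perfect for everyday wear
```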
 
Step 3 – Create semantic vectors
- For each cleaned description, generate a semantic vector that captures its meaning (use a built‑in language model).
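The specification does not mandate a particular language model. As one possible sketch, the snippet below assumes the third‑party sentence-transformers package and a general‑purpose embedding model; both the package and the model name are assumptions and can be swapped for whatever model is available in your environment.

```python
from sentence_transformers import SentenceTransformer  # assumed third-party package

# Any general-purpose sentence-embedding model can be used; this name is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

cleaned_descriptions = [
    "a comfortable blue cotton t shirt for everyday wear",
    "a comfortable blue cotton t shirt for everyday use",
    "lightweight red running shoes with breathable mesh",
]

# One vector per description; normalizing makes later cosine comparisons straightforward.
vectors = model.encode(cleaned_descriptions, normalize_embeddings=True)
```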
 
Step 4 – Calculate pairwise similarity
- Compare every unique pair of products (i.e., do not compare a product to itself).
- Compute the similarity score (cosine similarity) between their vectors.
- Round the score to two decimal places.
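The following is a small self‑contained sketch of the pairwise comparison. The three‑dimensional toy vectors stand in for the real embeddings produced in the previous step, and the variable names are illustrative.

```python
from itertools import combinations
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors keyed by Product ID (in practice, the semantic vectors from the previous step).
vectors = {
    "TSHIRT-001": np.array([0.9, 0.1, 0.0]),
    "TSHIRT-002": np.array([0.8, 0.2, 0.0]),
    "SHOE-001":   np.array([0.0, 0.1, 0.9]),
}

scores = {}
for id_a, id_b in combinations(sorted(vectors), 2):   # every unique pair, never a self-comparison
    scores[(id_a, id_b)] = round(cosine(vectors[id_a], vectors[id_b]), 2)

print(scores)
```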
 
Step 5 – Apply the similarity threshold
- If the score ≥ the supplied Similarity Threshold, mark the pair as a duplicate.

Step 6 – Build the Duplicate Pairs List
- For each flagged pair, record Product ID #1, Product ID #2, and the Similarity Score.
- Ensure each pair appears only once (e.g., “A‑001” & “B‑002” is the same pair as “B‑002” & “A‑001”). Use alphabetical order of the IDs to decide placement.
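Continuing the toy example, the sketch below applies the threshold, canonicalizes the pair ordering, and also performs the sort described in Step 7 (including the tie rule from FAQ Q9). The variable names and the literal scores are illustrative.

```python
# Toy scores keyed by product-ID pair (as produced in the previous step).
scores = {
    ("TSHIRT-001", "TSHIRT-002"): 0.92,
    ("SHOE-001", "TSHIRT-001"): 0.12,
    ("SHOE-001", "TSHIRT-002"): 0.10,
}
threshold = 0.85   # the supplied Similarity Threshold

duplicate_pairs = []
for (id_a, id_b), score in scores.items():
    if score >= threshold:                     # a score equal to the threshold also counts
        first, second = sorted((id_a, id_b))   # lower-alphabetical ID is always listed first
        duplicate_pairs.append((first, second, score))

# Highest similarity first; ties fall back to alphabetical order of the IDs.
duplicate_pairs.sort(key=lambda row: (-row[2], row[0], row[1]))
print(duplicate_pairs)   # [('TSHIRT-001', 'TSHIRT-002', 0.92)]
```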
 
Step 7 – Sort the list
- Order the rows from highest to lowest similarity score.

Step 8 – Generate the Summary Report
- Count the total number of products processed.
- Count the number of duplicate pairs found.
- List the identifiers of any products that were skipped due to missing descriptions.
- Record the Run Date and the status (“Success” unless any fatal error occurred).
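A minimal sketch of assembling the report text from the counts gathered above, following the template in Appendix C.6; the `run` dictionary and its field names are illustrative assumptions.

```python
run = {
    "run_date": "2025-08-11",
    "status": "Success",
    "total_products": 3,
    "duplicate_pairs": 1,
    "missing_descriptions": [],   # product IDs skipped because the description was empty
}

report = (
    f"Run Date: {run['run_date']}\n"
    f"Status: {run['status']}\n"
    f"- Total products processed: {run['total_products']}\n"
    f"- Duplicate pairs found: {run['duplicate_pairs']}\n"
    f"- Items skipped due to missing description: {len(run['missing_descriptions'])}\n"
)
print(report)
```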
 
Step 9 – Return the outputs
- Produce the Duplicate Pairs List and the Summary Report exactly as described in the “Outputs” table.

Step 10 – Log any errors
- If any step fails (e.g., the similarity threshold is missing or outside 0–1), stop processing, produce a Summary Report with “Status: Error – <reason>”, and do not produce a Duplicate Pairs List.
7. Validation & Quality Checks
Input Validation
- Confirm that the Similarity Threshold is a number between 0 and 1. If not, stop with an error.
- Verify each product has a non‑empty description; missing ones must be listed in the “Missing Description” section of the Summary Report.
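A sketch of these two checks follows; the record field names (`id`, `description`) are assumptions about how the catalog items are keyed.

```python
def validate_threshold(value) -> float:
    """Stop with an error if the Similarity Threshold is missing or outside the 0-1 range."""
    if value is None:
        raise ValueError("Similarity Threshold is missing.")
    threshold = float(value)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"Similarity Threshold must be between 0 and 1, got {threshold}.")
    return threshold

def split_missing_descriptions(catalog: list[dict]) -> tuple[list[dict], list[str]]:
    """Separate usable products from those reported under 'Missing Description'."""
    usable, missing = [], []
    for item in catalog:
        if item.get("description", "").strip():
            usable.append(item)
        else:
            missing.append(item["id"])
    return usable, missing
```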
 
Processing Checks
- All similarity scores must be between 0.00 and 1.00 (inclusive).
- Ensure each pair appears only once and that IDs are ordered alphabetically within each pair.

Output Verification
- The Duplicate Pairs List must be sorted descending by similarity score.
- The Summary Report numbers must match the actual data (e.g., the number of pairs listed matches the “Number of duplicate pairs found” statement).

Post‑run Review
- Spot‑check 5 random pairs to verify that the descriptions look similar to a human reviewer.
 
8. Special Rules / Edge Cases
| Situation | Handling | 
|---|---|
| Missing Description | Add the product to the “Missing Description” list in the Summary Report. Do not include it in any similarity calculation. | 
| Exact Same Description | Treat as a duplicate with a similarity score of 1.00 and list it as a pair. | 
| Similarity Score = Threshold | Treat as a duplicate (≥ threshold). | 
| Threshold Too Low (e.g., < 0.5) | Add a warning note in the Summary Report: “Low threshold may produce many false positives.” | 
| Large Catalog ( > 10 000 items) | Recommend processing in smaller batches to avoid memory overload; note this in the Summary Report if the batch size is reduced. | 
| Multiple duplicates for one product | List each pair separately. Example: if product A matches both B and C, both A‑B and A‑C appear as distinct rows. | 
| Failure to generate vectors | If the language model fails for any description, record the product in an “Error” list in the Summary Report and omit it from further calculations. | 
| Empty Input List | Produce a Summary Report with “No products to process.” and stop. | 
9. Example
Input (provided for a single run)
- Product Catalog (3 items):
  - Product ID: “TSHIRT‑001”; Product Name: “Blue Cotton T‑Shirt”; Description: “A comfortable blue cotton T‑shirt for everyday wear.”
  - Product ID: “TSHIRT‑002”; Product Name: “Blue T‑Shirt”; Description: “A comfortable blue cotton T‑shirt for everyday use.”
  - Product ID: “SHOE‑001”; Product Name: “Red Running Shoes”; Description: “Lightweight red running shoes with breathable mesh.”
- Similarity Threshold: 0.85
- Pre‑Processing Settings: default (lower case, remove punctuation, trim whitespace)
- Run Date: 2025‑08‑11
Expected Output
Duplicate Pairs List
Product ID #1 | Product ID #2 | Similarity Score
------------------------------------------------
TSHIRT‑001    | TSHIRT‑002    | 0.92
Summary Report
Run Date: 2025‑08‑11
Status: Success
- Total products processed: 3
- Duplicate pairs found: 1
- Items skipped due to missing description: 0
Note: “SHOE‑001” was not flagged as a duplicate of any other product.
Appendix A – FAQ
Q1: What is “semantic similarity”?
A: It measures how close two pieces of text are in meaning, not just in wording. Two descriptions that talk about the same product will get a high score even if they use different words.
Q2: How is the similarity score calculated?
A: Each description is turned into a numeric representation (a vector). The similarity is then computed as the cosine of the angle between two vectors – a standard method for comparing meanings.
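In formula terms, for two vectors A and B the score is similarity(A, B) = (A · B) / (‖A‖ × ‖B‖); with the kind of text embeddings used here the result typically falls between 0 (unrelated) and 1 (identical meaning).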
Q3: How should I choose the similarity threshold?
A:
- 0.90–0.95: Very strict; only near‑identical descriptions will be flagged.
- 0.80–0.90: Balanced; catches most duplicates with moderate wording differences.
- Below 0.80: Likely to generate false positives; use only for a very clean catalog.
Q4: What if two products have the same name but different descriptions?
A: The process compares only the Description field; product names are not considered. Two products with the same name are flagged only if their descriptions are similar enough to meet the threshold.
Q5: How are missing descriptions handled?
A: Any product lacking a description is placed in the “Missing Description” list in the Summary Report. These items are not included in the similarity calculation.
Q6: What if I have more than 10,000 products?
A: Split the catalog into smaller batches (e.g., 5,000 per batch). Run the process separately for each batch and combine the resulting duplicate lists.
Q7: Can I change the pre‑processing steps?
A: Yes. The default settings work well for most English language product catalogs. If you need to preserve numbers or symbols (e.g., model numbers), you can disable the “Remove punctuation” option.
Q8: What if the language model fails for a description?
A: The product is placed in an “Error” section of the Summary Report and excluded from further comparisons.
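One way this could be handled in code is sketched below; the stub `encode` function stands in for the real embedding call, and the record layout is illustrative.

```python
def encode(text: str) -> list[float]:
    """Stand-in for the real embedding call; raises on empty input to illustrate a failure."""
    if not text:
        raise RuntimeError("empty input")
    return [float(len(text))]   # placeholder vector

products = [
    {"id": "TSHIRT-001", "cleaned_description": "a comfortable blue cotton t shirt"},
    {"id": "BROKEN-001", "cleaned_description": ""},
]

vectors, vector_errors = {}, []
for product in products:
    try:
        vectors[product["id"]] = encode(product["cleaned_description"])
    except Exception as exc:
        # Recorded in the "Error" section of the Summary Report; the product is then skipped.
        vector_errors.append(
            f"Product ID: {product['id']} – Failed to generate semantic vector. (Reason: {exc})"
        )
```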
Q9: How are ties (identical scores) ordered?
A: When scores are identical, pairs are ordered alphabetically by the first product ID, then by the second.
Q10: How do I interpret the similarity score?
A:
- 0.00 – no similarity at all.
- 0.50 – moderate similarity; likely unrelated.
- 0.80 – fairly similar.
- 0.95 – almost identical; high‑confidence duplicate.
Q11: Should I manually review all flagged pairs?
A: Yes. The process identifies candidates, but a human reviewer should confirm that the flagged items truly represent the same product before taking any action in the catalog.
Q12: How is the output used?
A: The list of duplicate pairs can be exported to the catalog management system for merging, or it can be handed to a catalog manager for manual reconciliation.
Appendix B – Glossary
- Product ID: A short, human‑readable identifier that uniquely names a product in the catalog (e.g., “TSHIRT‑001”).
- Product Description: The textual field shown to customers that explains what the product is and its features.
- Semantic Similarity: A measure of how closely two pieces of text convey the same meaning.
- Embedding / Vector: A numeric representation of a piece of text that captures its meaning.
- Cosine Similarity: A number between 0 and 1 that indicates how similar two vectors are; higher numbers mean the texts are more alike.
- Threshold: The minimum similarity score required for two products to be considered duplicates.
- Pre‑Processing: The steps applied to text (e.g., lowercasing, removing punctuation) before creating the vector representation.
- Duplicate Pair: Two distinct products that have been flagged as likely representing the same item.
- Batch: A subset of the whole product list processed at one time to manage resource usage.
- Summary Report: A short textual summary that tells the user how many items were processed, how many duplicate pairs were found, and any issues encountered.
Appendix C – Reference Materials
C.1 Recommended Pre‑Processing Steps (default)
- Lowercase – Convert every character to its lowercase version.
- Remove punctuation – Delete characters such as . , ; : ? ! " ' ( ) [ ] { } and dash characters.
- Trim whitespace – Replace multiple spaces with a single space; remove leading/trailing spaces.
Note: If product names or descriptions contain important symbols (e.g., “+” in “Vitamin C+”), consider disabling the “Remove punctuation” step for those specific fields.
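If certain fields need to keep their symbols, one simple approach (an illustrative variant of the earlier pre‑processing sketch, not a prescribed interface) is a flag that skips the punctuation step.

```python
import re

def preprocess(text: str, remove_punctuation: bool = True) -> str:
    """Pre-process text; pass remove_punctuation=False to keep symbols such as the "+" in Vitamin C+."""
    text = text.lower()
    if remove_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Vitamin C+ 500 mg, 60 tablets", remove_punctuation=False))
# vitamin c+ 500 mg, 60 tablets
```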
C.2 Suggested Similarity Threshold Ranges
| Use‑Case | Recommended Threshold | Expected False‑Positive Rate | 
|---|---|---|
| High confidence only (e.g., final catalog clean‑up) | 0.90 – 0.95 | Low | 
| Balanced (regular maintenance) | 0.80 – 0.90 | Moderate | 
| Exploratory (large catalog, initial pass) | 0.70 – 0.80 | Higher (use for further manual filtering) | 
C.3 Handling Very Large Catalogs
| Approximate Size | Recommended Approach | 
|---|---|
| ≤ 5 000 items | Run in a single batch. | 
| 5 001 – 20 000 | Split into 2‑4 equal batches. | 
| > 20 000 | Consider a streaming approach: compute similarity in chunks, store intermediate results, and merge after each chunk. | 
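A sketch of the chunking idea from the table above; the batch size and record layout are illustrative. Note that a simple per‑batch run only finds duplicates within a batch, so cross‑batch or overlapping comparisons are needed if duplicates may span batches.

```python
def batched(items: list, batch_size: int = 5000):
    """Yield consecutive fixed-size slices of the catalog."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder catalog of 12,000 records; in practice this is the loaded product list.
catalog = [{"id": f"ITEM-{n:05d}", "description": "..."} for n in range(12_000)]

all_duplicate_pairs = []
for batch in batched(catalog, batch_size=5000):
    # Run the matching procedure (Steps 2-7) on `batch` here and
    # extend all_duplicate_pairs with the pairs found in this batch.
    pass
```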
C.4 Example of Pre‑Processed Text
| Original Description | After Pre‑Processing | 
|---|---|
| “A Blue cotton T‑Shirt, perfect for everyday wear!” | “a blue cotton t shirt perfect for everyday wear” | 
| “Red Running Shoes – lightweight & breathable.” | “red running shoes lightweight breathable” | 
C.5 Error Handling Templates
Missing Description
Product ID: [ID] – Missing description.
Vector Generation Failure
Product ID: [ID] – Failed to generate semantic vector. (Reason: <error message>)
C.6 Reporting Template
Header
Run Date: YYYY‑MM‑DD
Status: [Success | Error – <reason>]
Body
- Total products processed: X
- Duplicate pairs found: Y
- Records with missing description: Z
- Errors generating vectors: W
Duplicate List
[Product ID #1] | [Product ID #2] | Score
Notes
- Include any relevant observations (e.g., “Threshold set low; review recommended.”)
C.7 Best Practices for Review
- Validate the data source – Ensure the product catalog export is current.
- Run a sample – Before processing a whole catalog, run the procedure on a small sample to ensure the similarity threshold works for your data.
- Document decisions – Record why a particular threshold was chosen and any adjustments made after reviewing the initial output.
- Iterate – Adjust the threshold if you notice too many false positives or false negatives.
- Maintain a log – Keep a log file of each run, including the threshold used, number of duplicates found, and any manual changes made after review.
C.8 Sample Log Entry
2025‑08‑11 10:23:05 – Run started (Threshold: 0.85)
Processed 3,452 products
Found 128 duplicate pairs
Skipped 5 records (missing description)
Run completed – Status: Success
Additional Notes
- The process does not modify any data; it only reads the product list and produces reports.
- The generated Duplicate Pairs List can be fed into a catalog management system for further action, such as merging duplicate entries or flagging them for review.
- Always keep a backup of the original product catalog before applying any merge or deletion operation based on the output.
- If you need to run this process frequently, consider saving the vector representation for each product so that the similarity calculation can be done faster in subsequent runs.
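One way to cache the vectors, sketched under the assumption that they are NumPy arrays, is to persist the product IDs and vectors together and reload them on later runs; the file name and layout are illustrative.

```python
import numpy as np

# Toy data standing in for the real product IDs and their semantic vectors.
ids = np.array(["TSHIRT-001", "TSHIRT-002", "SHOE-001"])
vectors = np.random.rand(3, 384).astype(np.float32)

# Persist IDs and vectors together after a run.
np.savez("product_vectors.npz", ids=ids, vectors=vectors)

# On a later run, reload the cache and encode only products that are new or changed.
cache = np.load("product_vectors.npz")
cached_ids = set(cache["ids"].tolist())
```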