Semantic Product Matching
1. Overview
This process examines a list of product descriptions and identifies any items that are likely the same product, based on how closely their descriptions match in meaning. It produces a list of suspected duplicate products together with a similarity score for each pair.
2. Business Value
- Reduces duplicate listings that can confuse customers and dilute search results.
- Saves time for catalog managers by automatically spotting likely duplicates.
- Improves inventory accuracy and prevents wasted marketing spend on duplicate items.
3. Operational Context
- When to run: Whenever a new batch of product data is added or when a periodic clean‑up of the catalog is needed.
- Who uses it: Data engineers, catalog managers, or any staff responsible for maintaining a clean product catalog.
- Frequency: Typically run after a major data import, or on a scheduled basis (e.g., monthly) to keep the catalog tidy.
4. Inputs
| Name/Label | Type | Details Provided |
|---|---|---|
| Product Catalog | List of product items | Each item contains: Product ID – a short, human‑readable identifier (e.g., “TSHIRT‑BLUE‑001”); Product Name – the name shown to customers (e.g., “Blue Cotton T‑Shirt”); Product Description – the full description used in the storefront (e.g., “A comfortable blue cotton T‑shirt for everyday wear.”). |
| Similarity Threshold | Number (0–1) | The minimum similarity score required to flag two products as duplicates. The value must be between 0 (no similarity) and 1 (exact match). |
| Pre‑Processing Settings (optional) | List of options | Lowercase – convert text to lower case (default: enabled); Remove punctuation – strip punctuation characters (default: enabled); Trim whitespace – remove extra spaces (default: enabled). |
| Run Date (for audit) | Date | The date the process is executed (e.g., “2025‑08‑11”). Used only for reporting purposes. |
5. Outputs
| Name/Label | Contents | Formatting Rules |
|---|---|---|
| Duplicate Pairs List | One row per duplicate pair: Product ID #1, Product ID #2, Similarity Score (rounded to two decimals). | List each pair only once (always present the lower‑alphabetical ID first); order the list from highest to lowest similarity; use a plain‑text table format (see example). |
| Summary Report | A brief narrative summarizing the run: total number of products processed, number of duplicate pairs found, any records skipped because of missing descriptions, and the date of the run. | One paragraph summary followed by bullet points for each statistic; include a “Status” line: “Success” if the process completed without errors, otherwise “Error – see details”. |
6. Detailed Plan & Execution Steps
Step 1 – Load the product catalog
- Verify the list is not empty.
- Ensure every item has a non‑empty Product Description; if a description is missing, add the product to a “Missing Description” list and skip it for the similarity check.
 
Step 2 – Pre‑process all descriptions
- Convert text to lower case.
- Remove punctuation characters (e.g., commas, periods).
- Trim extra spaces.
- Keep a copy of the cleaned text for each product (for debugging if needed).
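A minimal sketch of these pre‑processing rules in Python follows. The function name is illustrative, and the choice of replacing punctuation with spaces (so that hyphenated words such as “T‑Shirt” become “t shirt”, matching the Appendix C.4 example) is an assumption rather than a requirement.

```python
import re

def preprocess(text: str) -> str:
    """Default clean-up: lowercase, replace punctuation with spaces, trim whitespace."""
    text = text.lower()                          # Lowercase
    text = re.sub(r"[^\w\s]", " ", text)         # Remove punctuation (replaced with spaces)
    return re.sub(r"\s+", " ", text).strip()     # Collapse repeated spaces and trim the ends

print(preprocess("A Blue cotton T-Shirt, perfect for everyday wear!"))
# a blue cotton t shirt perfect for everyday wear
```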
 
Step 3 – Create semantic vectors
- For each cleaned description, generate a semantic vector that captures its meaning (use a built‑in language model).
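The specification does not mandate a particular language model. As one possible sketch, the snippet below assumes the third‑party sentence-transformers package and a general‑purpose embedding model; both the package and the model name are assumptions and can be swapped for whatever model is available in your environment.

```python
from sentence_transformers import SentenceTransformer  # assumed third-party package

# Any general-purpose sentence-embedding model can be used; this name is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

cleaned_descriptions = [
    "a comfortable blue cotton t shirt for everyday wear",
    "a comfortable blue cotton t shirt for everyday use",
    "lightweight red running shoes with breathable mesh",
]

# One vector per description; normalizing makes later cosine comparisons straightforward.
vectors = model.encode(cleaned_descriptions, normalize_embeddings=True)
```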
 
Step 4 – Calculate pairwise similarity
- Compare every unique pair of products (i.e., do not compare a product to itself).
- Compute the similarity score (cosine similarity) between their vectors.
- Round the score to two decimal places.
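The following is a small self‑contained sketch of the pairwise comparison. The three‑dimensional toy vectors stand in for the real embeddings produced in the previous step, and the variable names are illustrative.

```python
from itertools import combinations
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors keyed by Product ID (in practice, the semantic vectors from the previous step).
vectors = {
    "TSHIRT-001": np.array([0.9, 0.1, 0.0]),
    "TSHIRT-002": np.array([0.8, 0.2, 0.0]),
    "SHOE-001":   np.array([0.0, 0.1, 0.9]),
}

scores = {}
for id_a, id_b in combinations(sorted(vectors), 2):   # every unique pair, never a self-comparison
    scores[(id_a, id_b)] = round(cosine(vectors[id_a], vectors[id_b]), 2)

print(scores)
```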
 
Step 5 – Apply the similarity threshold
- If the score ≥ the supplied Similarity Threshold, mark the pair as a duplicate.

Step 6 – Build the Duplicate Pairs List
- For each flagged pair, record Product ID #1, Product ID #2, and the Similarity Score.
- Ensure each pair appears only once (e.g., “A‑001” & “B‑002” is the same pair as “B‑002” & “A‑001”). Use alphabetical order of the IDs to decide placement.
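Continuing the toy example, the sketch below applies the threshold, canonicalizes the pair ordering, and also performs the sort described in Step 7 (including the tie rule from FAQ Q9). The variable names and the literal scores are illustrative.

```python
# Toy scores keyed by product-ID pair (as produced in the previous step).
scores = {
    ("TSHIRT-001", "TSHIRT-002"): 0.92,
    ("SHOE-001", "TSHIRT-001"): 0.12,
    ("SHOE-001", "TSHIRT-002"): 0.10,
}
threshold = 0.85   # the supplied Similarity Threshold

duplicate_pairs = []
for (id_a, id_b), score in scores.items():
    if score >= threshold:                     # a score equal to the threshold also counts
        first, second = sorted((id_a, id_b))   # lower-alphabetical ID is always listed first
        duplicate_pairs.append((first, second, score))

# Highest similarity first; ties fall back to alphabetical order of the IDs.
duplicate_pairs.sort(key=lambda row: (-row[2], row[0], row[1]))
print(duplicate_pairs)   # [('TSHIRT-001', 'TSHIRT-002', 0.92)]
```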
 
Step 7 – Sort the list
- Order the rows from highest to lowest similarity score.

Step 8 – Generate the Summary Report
- Count the total number of products processed.
- Count the number of duplicate pairs found.
- List the identifiers of any products that were skipped due to missing descriptions.
- Record the Run Date and the status (“Success” unless any fatal error occurred).
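A minimal sketch of assembling the report text from the counts gathered above, following the template in Appendix C.6; the `run` dictionary and its field names are illustrative assumptions.

```python
run = {
    "run_date": "2025-08-11",
    "status": "Success",
    "total_products": 3,
    "duplicate_pairs": 1,
    "missing_descriptions": [],   # product IDs skipped because the description was empty
}

report = (
    f"Run Date: {run['run_date']}\n"
    f"Status: {run['status']}\n"
    f"- Total products processed: {run['total_products']}\n"
    f"- Duplicate pairs found: {run['duplicate_pairs']}\n"
    f"- Items skipped due to missing description: {len(run['missing_descriptions'])}\n"
)
print(report)
```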
 
Step 9 – Return the outputs
- Produce the Duplicate Pairs List and the Summary Report exactly as described in the “Outputs” table.

Step 10 – Log any errors
- If any step fails (e.g., the similarity threshold is missing or outside 0–1), stop processing, produce a Summary Report with “Status: Error – <reason>”, and do not produce a Duplicate Pairs List.
7. Validation & Quality Checks
Input Validation
- Confirm that the Similarity Threshold is a number between 0 and 1. If not, stop with an error.
- Verify each product has a non‑empty description; missing ones must be listed in the “Missing Description” section of the Summary Report.
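A sketch of these two checks follows; the record field names (`id`, `description`) are assumptions about how the catalog items are keyed.

```python
def validate_threshold(value) -> float:
    """Stop with an error if the Similarity Threshold is missing or outside the 0-1 range."""
    if value is None:
        raise ValueError("Similarity Threshold is missing.")
    threshold = float(value)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"Similarity Threshold must be between 0 and 1, got {threshold}.")
    return threshold

def split_missing_descriptions(catalog: list[dict]) -> tuple[list[dict], list[str]]:
    """Separate usable products from those reported under 'Missing Description'."""
    usable, missing = [], []
    for item in catalog:
        if item.get("description", "").strip():
            usable.append(item)
        else:
            missing.append(item["id"])
    return usable, missing
```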
 
Processing Checks
- All similarity scores must be between 0.00 and 1.00 (inclusive).
- Ensure each pair appears only once and that IDs are ordered alphabetically within each pair.

Output Verification
- The Duplicate Pairs List must be sorted descending by similarity score.
- The Summary Report numbers must match the actual data (e.g., the number of pairs listed matches the “Number of duplicate pairs found” statement).

Post‑run Review
- Spot‑check 5 random pairs to verify that the descriptions look similar to a human reviewer.
 
8. Special Rules / Edge Cases
| Situation | Handling | 
|---|---|
| Missing Description | Add the product to the “Missing Description” list in the Summary Report. Do not include it in any similarity calculation. | 
| Exact Same Description | Treat as a duplicate with a similarity score of 1.00 and list it as a pair. | 
| Similarity Score = Threshold | Treat as a duplicate (≥ threshold). | 
| Threshold Too Low (e.g., < 0.5) | Add a warning note in the Summary Report: “Low threshold may produce many false positives.” | 
| Large Catalog ( > 10 000 items) | Recommend processing in smaller batches to avoid memory overload; note this in the Summary Report if the batch size is reduced. | 
| Multiple duplicates for one product | List each pair separately. Example: if product A matches both B and C, both A‑B and A‑C appear as distinct rows. | 
| Failure to generate vectors | If the language model fails for any description, record the product in an “Error” list in the Summary Report and omit it from further calculations. | 
| Empty Input List | Produce a Summary Report with “No products to process.” and stop. | 
9. Example
Input (provided for a single run)
- Product Catalog (3 items):
  - Product ID: “TSHIRT‑001”; Product Name: “Blue Cotton T‑Shirt”; Description: “A comfortable blue cotton T‑shirt for everyday wear.”
  - Product ID: “TSHIRT‑002”; Product Name: “Blue T‑Shirt”; Description: “A comfortable blue cotton T‑shirt for everyday use.”
  - Product ID: “SHOE‑001”; Product Name: “Red Running Shoes”; Description: “Lightweight red running shoes with breathable mesh.”
- Similarity Threshold: 0.85
- Pre‑Processing Settings: default (lower case, remove punctuation, trim whitespace)
- Run Date: 2025‑08‑11
Expected Output
Duplicate Pairs List
Product ID #1 | Product ID #2 | Similarity Score
------------------------------------------------
TSHIRT‑001    | TSHIRT‑002    | 0.92
Summary Report
Run Date: 2025‑08‑11
Status: Success
- Total products processed: 3
- Duplicate pairs found: 1
- Items skipped due to missing description: 0
Note: “SHOE‑001” was not flagged as a duplicate of any other product.
Appendix A – FAQ
Q1: What is “semantic similarity”?
A: It measures how close two pieces of text are in meaning, not just in wording. Two descriptions that talk about the same product will get a high score even if they use different words.
Q2: How is the similarity score calculated?
A: Each description is turned into a numeric representation (a vector). The similarity is then computed as the cosine of the angle between two vectors – a standard method for comparing meanings.
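In formula terms, for two vectors A and B the score is similarity(A, B) = (A · B) / (‖A‖ × ‖B‖); with the kind of text embeddings used here the result typically falls between 0 (unrelated) and 1 (identical meaning).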
Q3: How should I choose the similarity threshold?
A:
- 0.90–0.95: Very strict; only near‑identical descriptions will be flagged.
- 0.80–0.90: Balanced; catches most duplicates with moderate wording differences.
- Below 0.80: Likely to generate false positives; use only for a very clean catalog.
Q4: What if two products have the same name but different descriptions?
A: The process compares only the Description field; product names are not considered. Two products with the same name are flagged only if their descriptions are similar enough to meet the threshold.
Q5: How are missing descriptions handled?
A: Any product lacking a description is placed in the “Missing Description” list in the Summary Report. These items are not included in the similarity calculation.
Q6: What if I have more than 10,000 products?
A: Split the catalog into smaller batches (e.g., 5,000 per batch). Run the process separately for each batch and combine the resulting duplicate lists.
Q7: Can I change the pre‑processing steps?
A: Yes. The default settings work well for most English language product catalogs. If you need to preserve numbers or symbols (e.g., model numbers), you can disable the “Remove punctuation” option.
Q8: What if the language model fails for a description?
A: The product is placed in an “Error” section of the Summary Report and excluded from further comparisons.
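One way this could be handled in code is sketched below; the stub `encode` function stands in for the real embedding call, and the record layout is illustrative.

```python
def encode(text: str) -> list[float]:
    """Stand-in for the real embedding call; raises on empty input to illustrate a failure."""
    if not text:
        raise RuntimeError("empty input")
    return [float(len(text))]   # placeholder vector

products = [
    {"id": "TSHIRT-001", "cleaned_description": "a comfortable blue cotton t shirt"},
    {"id": "BROKEN-001", "cleaned_description": ""},
]

vectors, vector_errors = {}, []
for product in products:
    try:
        vectors[product["id"]] = encode(product["cleaned_description"])
    except Exception as exc:
        # Recorded in the "Error" section of the Summary Report; the product is then skipped.
        vector_errors.append(
            f"Product ID: {product['id']} – Failed to generate semantic vector. (Reason: {exc})"
        )
```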
Q9: How are ties (identical scores) ordered?
A: When scores are identical, pairs are ordered alphabetically by the first product ID, then by the second.
Q10: How do I interpret the similarity score?
A:
- 0.00 – no similarity at all.
- 0.50 – moderate similarity; likely unrelated.
- 0.80 – fairly similar.
- 0.95 – almost identical; high‑confidence duplicate.
Q11: Should I manually review all flagged pairs?
A: Yes. The process identifies candidates, but a human reviewer should confirm that the flagged items truly represent the same product before taking any action in the catalog.
Q12: How is the output used?
A: The list of duplicate pairs can be exported to the catalog management system for merging, or it can be handed to a catalog manager for manual reconciliation.
Appendix B – Glossary
- Product ID: A short, human‑readable identifier that uniquely names a product in the catalog (e.g., “TSHIRT‑001”).
- Product Description: The textual field shown to customers that explains what the product is and its features.
- Semantic Similarity: A measure of how closely two pieces of text convey the same meaning.
- Embedding / Vector: A numeric representation of a piece of text that captures its meaning.
- Cosine Similarity: A number between 0 and 1 that indicates how similar two vectors are; higher numbers mean the texts are more alike.
- Threshold: The minimum similarity score required for two products to be considered duplicates.
- Pre‑Processing: The steps applied to text (e.g., lowercasing, removing punctuation) before creating the vector representation.
- Duplicate Pair: Two distinct products that have been flagged as likely representing the same item.
- Batch: A subset of the whole product list processed at one time to manage resource usage.
- Summary Report: A short textual summary that tells the user how many items were processed, how many duplicate pairs were found, and any issues encountered.
Appendix C – Reference Materials
C.1 Recommended Pre‑Processing Steps (default)
- Lowercase – Convert every character to its lowercase version.
- Remove punctuation – Delete characters such as . , ; : ? ! " ' ( ) [ ] { } and dash characters.
- Trim whitespace – Replace multiple spaces with a single space; remove leading/trailing spaces.
Note: If product names or descriptions contain important symbols (e.g., “+” in “Vitamin C+”), consider disabling the “Remove punctuation” step for those specific fields.
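If certain fields need to keep their symbols, one simple approach (an illustrative variant of the earlier pre‑processing sketch, not a prescribed interface) is a flag that skips the punctuation step.

```python
import re

def preprocess(text: str, remove_punctuation: bool = True) -> str:
    """Pre-process text; pass remove_punctuation=False to keep symbols such as the "+" in Vitamin C+."""
    text = text.lower()
    if remove_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Vitamin C+ 500 mg, 60 tablets", remove_punctuation=False))
# vitamin c+ 500 mg, 60 tablets
```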
C.2 Suggested Similarity Threshold Ranges
| Use‑Case | Recommended Threshold | Expected False‑Positive Rate | 
|---|---|---|
| High confidence only (e.g., final catalog clean‑up) | 0.90 – 0.95 | Low | 
| Balanced (regular maintenance) | 0.80 – 0.90 | Moderate | 
| Exploratory (large catalog, initial pass) | 0.70 – 0.80 | Higher (use for further manual filtering) | 
C.3 Handling Very Large Catalogs
| Approximate Size | Recommended Approach | 
|---|---|
| ≤ 5 000 items | Run in a single batch. | 
| 5 001 – 20 000 | Split into 2‑4 equal batches. | 
| > 20 000 | Consider a streaming approach: compute similarity in chunks, store intermediate results, and merge after each chunk. | 
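A sketch of the chunking idea from the table above; the batch size and record layout are illustrative. Note that a simple per‑batch run only finds duplicates within a batch, so cross‑batch or overlapping comparisons are needed if duplicates may span batches.

```python
def batched(items: list, batch_size: int = 5000):
    """Yield consecutive fixed-size slices of the catalog."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder catalog of 12,000 records; in practice this is the loaded product list.
catalog = [{"id": f"ITEM-{n:05d}", "description": "..."} for n in range(12_000)]

all_duplicate_pairs = []
for batch in batched(catalog, batch_size=5000):
    # Run the matching procedure (Steps 2-7) on `batch` here and
    # extend all_duplicate_pairs with the pairs found in this batch.
    pass
```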
C.4 Example of Pre‑Processed Text
| Original Description | After Pre‑Processing | 
|---|---|
| “A Blue cotton T‑Shirt, perfect for everyday wear!” | “a blue cotton t shirt perfect for everyday wear” | 
| “Red Running Shoes – lightweight & breathable.” | “red running shoes lightweight breathable” | 
C.5 Error Handling Templates
Missing Description
Product ID: [ID] – Missing description.
Vector Generation Failure
Product ID: [ID] – Failed to generate semantic vector. (Reason: <error message>)
C.6 Reporting Template
Header
Run Date: YYYY‑MM‑DD
Status: [Success | Error – <reason>]
Body
- Total products processed: X
- Duplicate pairs found: Y
- Records with missing description: Z
- Errors generating vectors: W
Duplicate List
[Product ID #1] | [Product ID #2] | Score
Notes
- Include any relevant observations (e.g., “Threshold set low; review recommended.”)
C.7 Best Practices for Review
- Validate the data source – Ensure the product catalog export is current.
- Run a sample – Before processing a whole catalog, run the procedure on a small sample to ensure the similarity threshold works for your data.
- Document decisions – Record why a particular threshold was chosen and any adjustments made after reviewing the initial output.
- Iterate – Adjust the threshold if you notice too many false positives or false negatives.
- Maintain a log – Keep a log file of each run, including the threshold used, number of duplicates found, and any manual changes made after review.
C.8 Sample Log Entry
2025‑08‑11 10:23:05 – Run started (Threshold: 0.85)
Processed 3,452 products
Found 128 duplicate pairs
Skipped 5 records (missing description)
Run completed – Status: Success
Additional Notes
- The process does not modify any data; it only reads the product list and produces reports.
- The generated Duplicate Pairs List can be fed into a catalog management system for further action, such as merging duplicate entries or flagging them for review.
- Always keep a backup of the original product catalog before applying any merge or deletion operation based on the output.
- If you need to run this process frequently, consider saving the vector representation for each product so that the similarity calculation can be done faster in subsequent runs.
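One way to cache the vectors, sketched under the assumption that they are NumPy arrays, is to persist the product IDs and vectors together and reload them on later runs; the file name and layout are illustrative.

```python
import numpy as np

# Toy data standing in for the real product IDs and their semantic vectors.
ids = np.array(["TSHIRT-001", "TSHIRT-002", "SHOE-001"])
vectors = np.random.rand(3, 384).astype(np.float32)

# Persist IDs and vectors together after a run.
np.savez("product_vectors.npz", ids=ids, vectors=vectors)

# On a later run, reload the cache and encode only products that are new or changed.
cache = np.load("product_vectors.npz")
cached_ids = set(cache["ids"].tolist())
```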