Product Data Normalization & Cleanup
1. Overview
This process cleans and standardizes a set of e‑commerce product records. It fixes inconsistent formatting, converts units to a common standard, fills missing attributes with default values, and removes duplicate entries. The result is a tidy catalog ready for publishing, search, and reporting.
2. Business Value
- Improves product discoverability on the website and in search engines.
- Reduces order‑fulfilment errors caused by inconsistent or missing data.
- Enables accurate reporting and analytics by ensuring data consistency.
- Saves time for the catalog team by automating repetitive clean‑up tasks.
3. Operational Context
-
When to run:
- After a bulk product import.
- Before a seasonal catalog launch.
- Periodically, e.g., weekly or monthly, to keep the catalog clean.
-
Who uses it:
- Catalog Manager (primary user).
- E‑commerce data analyst.
- Operations team when preparing data for marketplace feeds.
-
Frequency: As needed, typically after each major data load or on a scheduled basis.
4. Inputs
The only input required for a single run is a Product Data List containing individual product records. The list is provided as a collection of product entries, each with the fields listed below.
4.1 Product Data List
- Name/Label: Product Data List
- Type: List of product entries (each entry is a single product)
- Details Provided: Each entry contains the fields listed in the table at the end of this section.
| Field | Type | Description | Example |
|---|
| Product Name | Text | Full name of the product, as displayed to customers. | “Wireless Bluetooth Mouse” |
| SKU | Text | Unique stock‑keeping identifier for the product. | “WM-001‑BLU” |
| Description | Text | Full description of the product. | “Ergonomic wireless mouse with 2‑year battery life.” |
| Price | Number | Price value (numeric) | 24.99 |
| Currency | Text (3‑letter code) | Currency of the price (ISO 4217). | “USD” |
| Weight | Text | Weight with unit (e.g., “1.5 kg”, “3 lb”). | “1.5 kg” |
| Dimensions | Text | Length × Width × Height with unit (e.g., “10×5×2 cm”). | “12×8×2 cm” |
| Category | Text | Category path (e.g., “Electronics > Computers > Accessories”). | “Electronics > Computers > Accessories” |
| Brand | Text | Manufacturer or brand name. | “TechCo” |
| Image URLs | List of text | URLs for product images. | “http://example.com/img1.jpg”, “http://example.com/img2.jpg” |
| Stock Quantity | Number | Quantity on hand. | 150 |
5. Outputs
Two outputs are produced: the cleaned product data and a cleanup summary report.
5.1 Cleaned Product Data
-
Name/Label: Cleaned Product Data
-
Contents: Same fields as the input list, but with corrected formatting, standardized units, filled‑in defaults, and duplicates removed.
-
Formatting Rules:
- Text fields trimmed of leading/trailing spaces.
- Title case applied to product name and brand.
- All currency codes in upper‑case.
- Price shown with two decimal places.
- Weight converted to kilograms (kg), rounded to two decimals (e.g., “2.27 kg”).
- Dimensions converted to centimeters (cm), rounded to one decimal (e.g., “12.0 × 8.0 × 2.0 cm”).
- Missing optional fields (Brand, Category) filled with default values (see Section 8).
| Field | Type | Example (cleaned) |
|---|
| Product Name | Text | “Wireless Bluetooth Mouse” |
| SKU | Text | “WM-001‑BLU” |
| Description | Text | “Ergonomic wireless mouse with 2‑year battery life.” |
| Price | Number (2 dp) | 24.99 |
| Currency | Text (3 letter) | USD |
| Weight | Number + unit (kg) | 1.50 kg |
| Dimensions | Number + unit (cm) | 12.0 × 8.0 × 2.0 cm |
| Category | Text | “Electronics > Computers > Accessories” |
| Brand | Text | “TechCo” |
| Image URLs | List of text | http://example.com/img1.jpg, http://example.com/img2.jpg |
| Stock Quantity | Number | 150 |
5.2 Cleanup Summary Report
- Name/Label: Cleanup Summary Report
- Contents: Summary of actions taken, counts of records processed, and any issues that require manual review.
- Formatting Rules: Bullet‑point style, numeric values rounded as appropriate.
| Field | Description |
|---|
| Total Records Received | Number of product entries supplied. |
| Records Cleaned | Number of records successfully cleaned. |
| Records Removed (duplicates) | Count of duplicate entries removed. |
| Records Flagged for Review | Number of records that require manual attention (e.g., missing required fields). |
| Units Converted – Weight | Count of weight conversions performed. |
| Units Converted – Dimensions | Count of dimension conversions performed. |
| Missing Attributes Filled | Count of missing fields that were auto‑filled. |
| Errors Encountered | List of critical errors (e.g., unknown unit) that prevented processing of a specific record. |
| Timestamp | Date‑time when the process completed. |
| Comments | Additional notes for the catalog manager. |
6. Detailed Plan & Execution Steps
-
Receive Product Data List.
- Load the list of product records supplied for this run.
-
Validate Required Fields.
- Ensure each record contains a non‑empty Product Name, SKU, Price, and Currency.
- If any required field is missing, add the record to the Records Flagged for Review list and skip further processing for that record.
-
Deduplicate by SKU.
- Identify duplicate SKUs.
- Keep the record with the most non‑empty fields; discard the rest.
- Record the number of duplicates removed.
-
Normalize Text.
- Trim leading/trailing spaces in all text fields.
- Convert Product Name and Brand to title case (first letter of each word capitalized).
- Convert Category to title case and standardize the “>” separator with a single space on each side.
-
Standardize Currency.
- Convert all currency codes to upper‑case (e.g., “usd” → “USD”).
- If Currency is missing, default to “USD” and note in the report.
-
Convert Weight to Kilograms.
- Identify the unit (kg, g, lb, oz).
- Convert as follows:
- 1 lb = 0.453592 kg
- 1 oz = 0.0283495 kg
- 1 g = 0.001 kg
- Round result to two decimal places.
- If the unit is not recognized, flag the record for review.
-
Convert Dimensions to Centimeters.
- Parse the three‑dimensional string (e.g., “12x8x2 cm”, “4.7 x 3.2 x 1.5 in”).
- Recognize units: cm, mm, in (inch).
- Convert:
- 1 in = 2.54 cm
- 1 mm = 0.1 cm
- Round each dimension to one decimal place.
- If any dimension is missing or the unit is unknown, flag the record.
-
Fill Missing Optional Attributes.
- Brand: If missing, set to “Unknown Brand”.
- Category: If missing, set to “Uncategorized”.
- Weight: If missing, set to “0 kg” and note in the report (as missing).
- Dimensions: If missing, set to “0 × 0 × 0 cm” and note in the report.
-
Validate Numeric Values.
- Ensure Price > 0; if not, flag for review.
- Ensure Stock Quantity ≥ 0; if negative, set to 0 and note.
-
Validate Image URLs.
- Ensure each URL starts with “http://” or “https://”.
- Remove any URL that fails this test and add its position to the Flags list.
-
Compose Cleaned Product Data.
- Assemble each cleaned record using the field definitions from Section 5.1.
-
Generate Cleanup Summary Report.
- Populate all fields listed in the report table (Section 5.2).
- Include counts, timestamps, and any error messages.
-
Review & Finalize.
- Verify that no Critical Errors (missing required fields) remain unaddressed.
- If critical errors remain, stop and return only the report, indicating that manual intervention is required.
- Otherwise, deliver the Cleaned Product Data and the Summary Report.
7. Validation & Quality Checks
- Mandatory Field Check: Confirm that every retained record has a non‑empty Product Name, SKU, Price, and Currency.
- Duplicate Check: Confirm that the number of records in the final data equals the original record count minus the duplicates removed.
- Unit Standardization Check: Verify that all weight values are in kilograms (kg) with two decimal places and all dimensions are in centimeters (cm) with one decimal place.
- Range Check:
- Price > 0.
- Weight ≥ 0.
- Stock Quantity ≥ 0.
- Data Consistency: Ensure that the same SKU does not appear more than once.
- Report Consistency: Verify that totals in the summary report (e.g., total records, records cleaned) match the actual data processed.
- URL Validation: Confirm that each image URL starts with “http://” or “https://”.
If any check fails, mark the offending record in the Records Flagged for Review list and note the issue in the Summary Report.
8. Special Rules / Edge Cases
| Scenario | Action |
|---|
| Missing required field (Product Name, SKU, Price, Currency) | Add the record to Records Flagged for Review. Do not include the record in the Cleaned Product Data. Document the missing field in the Summary Report. |
| Unknown weight unit | Flag the record for review. Record the original value in the report and do not convert. |
| Unrecognizable dimension unit | Flag the record for review. Record the original value and add a note in the report. |
| Duplicate SKUs | Keep the entry with the most non‑empty fields. If both have equal completeness, keep the first occurring entry. Document which record was removed. |
| Price ≤ 0 | Flag the record for review. Do not include in Cleaned Data. Record reason in the summary. |
| Empty Image URL list | Keep the record but note in the report that no images are associated. |
| Multiple spaces or inconsistent punctuation | Trim spaces and replace multiple spaces with a single space. Normalize punctuation to standard ASCII characters (e.g., “‑” to “-”). |
| Currency not in ISO 4217 | Flag for review. Leave the original value in the report, and do not change it. |
If the process cannot complete because all records are flagged for review, return only the Summary Report with an “Error – No valid records processed” note.
9. Example
Input – Product Data List (2 Records)
-
Record A
- Product Name: “ wireless mouse ”
- SKU: “wm-001”
- Description: “Ergonomic mouse 2-year battery life.”
- Price: 24.5
- Currency: “usd”
- Weight: “2 lb”
- Dimensions: “5 x 3 x 1.5 in”
- Category: “Electronics>Computers>Accessories”
- Brand: (missing)
- Image URLs: “http://example.com/mouse1.png”
- Stock Quantity: 120
-
Record B
- Product Name: “Bluetooth Headset”
- SKU: “bh-002”
- Description: “Bluetooth headset with noise‑cancelling.”
- Price: -12.99 (negative price)
- Currency: (missing)
- Weight: “500 g”
- Dimensions: “10x5x2 cm”
- Category: (missing)
- Brand: “SoundCo”
- Image URLs: “image1.jpg” (invalid URL)
- Stock Quantity: -5 (negative)
Expected Output
Cleaned Product Data (1 record)
| Field | Cleaned Value |
|---|
| Product Name | Wireless Mouse |
| SKU | wm-001 |
| Description | Ergonomic mouse 2‑year battery life. |
| Price | 24.50 |
| Currency | USD |
| Weight | 0.91 kg |
| Dimensions | 12.7 × 7.6 × 3.8 cm |
| Category | Electronics > Computers > Accessories |
| Brand | Unknown Brand |
| Image URLs | http://example.com/mouse1.png |
| Stock Quantity | 120 |
Cleanup Summary Report
- Total Records Received: 2
- Records Cleaned: 1
- Records Removed (duplicates): 0
- Records Flagged for Review: 2
- Units Converted – Weight: 1 (2 lb → 0.91 kg)
- Units Converted – Dimensions: 1 (5×3×1.5 in → 12.7×7.6×3.8 cm)
- Missing Attributes Filled: 2 (Brand, Category)
- Errors Encountered:
- Record B: Invalid price (‑12.99).
- Record B: Missing currency (defaulted to USD).
- Record B: Invalid image URL (excluded).
- Record B: Negative stock (set to 0).
Comments:
- Record A required trimming of spaces and title‑case conversion.
- Record B was flagged for multiple issues and excluded from the cleaned dataset.
Appendix A – FAQ
Q1. What if the weight unit is not listed (e.g., “3.5” with no unit)?
A: Treat the value as missing. Flag the record for review and record “Missing weight unit” in the Summary Report. Do not perform any conversion.
Q2. How are duplicate SKUs determined when the SKU formats differ (e.g., “WM‑001” vs “wm-001”)?
A: Comparison is case‑insensitive and ignores leading/trailing spaces. Treat “WM‑001” and “wm-001” as the same SKU.
Q3. What is the default currency if it’s missing?
A: “USD” is used as the default currency, and the missing‑currency event is noted in the report.
Q4. How are non‑ASCII characters handled?
A: Convert to their closest ASCII equivalents (e.g., “‑” to “-”). Keep accented letters if they are standard in the language (e.g., “café”).
Q5. What if the dimensions field contains only two numbers?
A: Flag the record for review. Do not attempt to guess the missing dimension. Include a note in the report.
Q6. What is the preferred format for dimensions after conversion?
A: Length × Width × Height, each rounded to one decimal place, followed by the unit “cm”. Example: “12.0 × 8.0 × 2.0 cm”.
Q7. How are image URLs validated?
A: The URL must start with “http://” or “https://”. If an URL does not meet this pattern, it is removed from the product record and listed in the “Images removed” section of the Summary Report.
Q8. What does “Title case” mean?
A: The first letter of each word is capitalized, except for short pre‑positions and conjunctions (e.g., “and”, “or”, “the”) unless they are the first word. Example: “Bluetooth Headset”, not “Bluetooth headset”.
Q9. How are negative stock values handled?
A: They are set to “0” and the adjustment is logged in the Summary Report.
Q10. If a record has multiple errors (e.g., missing product name and unknown unit), what is reported?
A: All errors are listed for the record under “Errors Encountered”. The record is flagged for review, and no cleaned data is produced for that record.
Appendix B – Glossary
| Term | Definition |
|---|
| Product Name | The title of the product as presented to customers. |
| SKU | Stock‑Keeping Unit – a unique identifier used to track inventory. |
| Description | Textual description that explains product features, benefits, and specifications. |
| Price | Monetary value of the product, expressed in a specified currency. |
| Currency | Three‑letter ISO 4217 code that indicates the currency of the price (e.g., USD, EUR). |
| Weight | Physical weight of the product, expressed with a unit (e.g., kg, lb). |
| Dimensions | Physical size of the product expressed as length × width × height with a unit (e.g., cm, in). |
| Category | Hierarchical classification that groups similar products (e.g., “Electronics > Computers > Accessories”). |
| Brand | Manufacturer or brand name. |
| Image URLs | Web addresses pointing to product images. |
| Stock Quantity | The number of units available for sale. |
| Title case | Capitalizing the first letter of each significant word in a phrase (e.g., “Wireless Bluetooth Mouse”). |
| Duplicate | Two or more records sharing the same SKU. |
| Normalization | Process of making data consistent in format, units, and style across all records. |
| Cleaned Data | Data that has been processed to meet standard formatting and completeness rules. |
| Summary Report | Document that records the results and any issues discovered during the cleanup process. |
Appendix C – Standardization Reference Materials
C1. Unit Conversion Tables
Weight Conversion
| Original Unit | Conversion Factor to kg | Example |
|---|
| 1 kg | 1.0 kg | 1 kg → 1.0 kg |
| 1 g | 0.001 kg | 500 g → 0.50 kg |
| 1 lb | 0.453592 kg | 2 lb → 0.91 kg |
| 1 oz | 0.0283495 kg | 4 oz → 0.11 kg |
Dimensions Conversion
| Original Unit | Conversion Factor to cm | Example |
|---|
| 1 cm | 1.0 cm | 12 cm → 12.0 cm |
| 1 mm | 0.1 cm | 50 mm → 5.0 cm |
| 1 in (inch) | 2.54 cm | 4 in → 10.2 cm |
| 1 ft | 30.48 cm | 1 ft → 30.5 cm |
C2. Currency Code List (ISO 4217)
| Code | Currency | Example Symbol |
|---|
| USD | United States Dollar | $ |
| EUR | Euro | € |
| GBP | British Pound | £ |
| CAD | Canadian Dollar | $ |
| AUD | Australian Dollar | $ |
| JPY | Japanese Yen | ¥ |
| CHF | Swiss Franc | CHF |
| CNY | Chinese Yuan | ¥ |
| INR | Indian Rupee | ₹ |
Default Currency: USD (United States Dollar) is used when the currency field is missing.
C3. Text Formatting Guide
-
Whitespace
- Remove leading and trailing spaces.
- Replace multiple internal spaces with a single space.
-
Capitalization
- Title case for Product Name, Category, Brand.
- Do not capitalize “and”, “or”, “the”, “of”, “for” unless they start the phrase.
-
Punctuation
- Use straight quotes (
", ') instead of smart quotes.
- Replace “–” (en‑dash) or “—” (em‑dash) with a simple hyphen “-”.
- Ensure a period at the end of sentences in the description.
-
Numerical Formatting
- Price: Two decimal places (e.g., “24.99”).
- Weight: Two decimal places (e.g., “1.50 kg”).
- Dimensions: One decimal place per measurement (e.g., “12.0 × 8.0 × 2.0 cm”).
-
Currency
- Upper‑case ISO code (e.g., “USD”).
C4. Default Values for Missing Optional Attributes
| Attribute | Default Value | Reason |
|---|
| Brand | “Unknown Brand” | Ensures every product has a brand for reporting. |
| Category | “Uncategorized” | Allows categorization in the system. |
| Weight | “0 kg” | Indicates missing weight; flagged in report. |
| Dimensions | “0 × 0 × 0 cm” | Indicates missing dimensions; flagged in report. |
| Currency | “USD” | Default currency for the region. |
| Image URLs | None | If no images, leave list empty; note in report. |
C5. Example of Fully Cleaned Record
- Product Name: “Wireless Bluetooth Mouse”
- SKU: “WM‑001‑BLU”
- Description: “Ergonomic wireless mouse with 2‑year battery life.”
- Price: 24.99
- Currency: USD
- Weight: 0.91 kg
- Dimensions: 12.0 × 8.0 × 2.0 cm
- Category: “Electronics > Computers > Accessories”
- Brand: “TechCo”
- Image URLs: http://example.com/img1.jpg, http://example.com/img2.jpg
- Stock Quantity: 150
C6. Reporting Template
Summary Report – {Date}
----------------------------------------
Total Records Received: {Number}
Records Cleaned: {Number}
Records Removed (Duplicates): {Number}
Records Flagged for Review: {Number}
Units Converted – Weight: {Number}
Units Converted – Dimensions: {Number}
Missing Attributes Filled: {Number}
Errors Encountered: {List of error descriptions}
Timestamp: {YYYY‑MM‑DD HH:MM:SS}
Comments: {Any additional notes}
C7. Troubleshooting Checklist
-
Missing Required Fields
- Verify that every record includes Product Name, SKU, Price, and Currency.
-
Unit Mismatches
- Check for unexpected units (e.g., “kgf”, “lb.”). Convert to accepted units or flag.
-
Duplicate SKU Conflicts
- Identify duplicate SKUs and compare completeness. Choose the most complete record.
-
Negative Numbers
- Price or Stock Quantity less than zero are flagged. Adjust as per rules.
-
Image URL Validation
- Ensure each URL starts with “http://” or “https://”. Remove invalid URLs and record the action.
-
Unexpected Formatting
- Look for unusual punctuation, extra spaces, or non‑standard characters. Clean as per Text Formatting Guide.
C8. Revision History
| Version | Date | Author | Changes |
|---|
| 1.0 | 2025‑08‑11 | Alex Martinez | Initial SOP creation |
| 1.1 | 2025‑09‑03 | Alex Martinez | Added detailed edge‑case handling for duplicate SKUs |
| 1.2 | 2025‑10‑15 | Alex Martinez | Added unit conversion tables and extended FAQs |
Tips for Execution
- Use a spreadsheet or simple database to load the product list, apply the transformation steps, and generate the report.
- Keep a backup of the original data before applying any changes.
- Review the Summary Report before publishing the cleaned data.
- Document any manual changes made after the process completes for auditability.
(Note: this line is for internal reference only; it should not appear in the final SOP.)