Attribute Extraction
1. Overview
This process reads a product’s textual description and any accompanying specification sheet, then pulls out three key product attributes—brand, size, and colour—into a clear, structured list.
2. Business Value
- Consistent product data – Uniform attribute data improves search, filtering, and recommendation engines.
- Faster catalog building – Automates a manual step, letting data analysts focus on higher‑value tasks.
- Reduced errors – Standardised extraction limits inconsistencies caused by free‑form text.
3. Operational Context
- When to run: Whenever a new product is introduced or an existing product’s description is updated.
- Who uses it: Data analysts responsible for populating or maintaining the e‑commerce catalog.
- Frequency: Typically run per product entry (i.e., once for each new or updated product record).
4. Inputs
Below are the items that must be supplied for a single run of this process.
- Product Name – Text – The product’s commonly used name (e.g., “Nike Air Max 90 Men’s Shoes”).
- Product Description – Text – The full marketing description or copy, typically a paragraph or a few bullet points, containing product details.
- Specification Sheet – PDF Document – Optional PDF file that may contain additional details such as measurements, material, or colour codes.
| Input Name | Type | Details Provided |
|---|---|---|
| Product Name | Text | The official product name as it appears on the retailer’s site. |
| Product Description | Text | Full text description of the product (plain text). |
| Specification Sheet | PDF Document | Optional PDF containing specifications, measurements, and colour details (if available). |
5. Outputs
The process yields a structured list of the three required attributes.
- Extracted Attributes – A table containing the three attributes and their values.
- Extraction Summary – A short paragraph summarising the extracted data (e.g., “The product is a Nike shoe, size 9.5, in black.”).
| Output Name | Contents | Formatting Rules |
|---|---|---|
| Extracted Attributes | A two‑column table (Attribute, Value) with one row each for Brand, Size, and Colour. | Use the exact attribute names shown above. Capitalise the first letter of each value (e.g., “Nike”, “9.5”, “Black”). |
| Extraction Summary | One‑sentence summary of the extracted attributes. | Sentence case, ending with a period. Use a neutral, professional tone. |
6. Detailed Plan & Execution Steps
- Open the inputs – Retrieve the Product Name, Product Description, and, if supplied, the Specification Sheet.
- Identify the brand
a. Scan the Product Name for a known brand name (see Appendix C – Brand List).
b. If the brand is not obvious, search the internet for the product name and note the most prominent brand that appears.
- Identify the size
a. Look for numeric patterns or standard size codes (e.g., “9.5”, “XL”, “12‑inch”) in the description and spec sheet.
b. If more than one size is mentioned, list each size separated by commas.
- Identify the colour
a. Scan the description and spec sheet for colour words from the colour list (Appendix C – Colour List).
b. If multiple colours are present, list all colours in the order they appear, separated by commas.
- Record the values – Write the identified values into the “Extracted Attributes” table, keeping the source spelling and applying the capitalisation rules in Appendix C.4.
- Compose the summary – Produce a one‑sentence summary that combines the three attributes in the order: brand, size, colour. Example: “Nike Air Max 90, size 9.5, black.”
- Validate the extraction (see Section 7).
- Save the outputs – Store the “Extracted Attributes” table and the “Extraction Summary” as the final deliverables.
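The identification steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the process's actual implementation: `BRANDS` and `COLOURS` are short excerpts of the Appendix C lists, the function names are hypothetical, and the size pattern covers only a few of the formats in C.3. The web-search fallback for brands is not modelled, and the spec sheet is assumed to have already been converted to plain text.

```python
import re

# Excerpts of the Appendix C lists (assumed subsets, for illustration only).
BRANDS = ["Adidas", "Nike", "Puma", "New Balance", "Under Armour"]
COLOURS = ["Black", "White", "Red", "Blue", "Green"]

# Hypothetical pattern: region-prefixed number, bare number, or letter size.
SIZE_RE = re.compile(
    r"\b(?:US|EU|UK)\s\d+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b|\b(?:XXL|XL|XS|S|M|L)\b"
)


def find_brand(product_name):
    """Return the first brand from BRANDS found in the name, else 'Not Found'."""
    for brand in BRANDS:
        if re.search(r"\b" + re.escape(brand) + r"\b", product_name, re.IGNORECASE):
            return brand
    return "Not Found"


def find_sizes(text):
    """Return size tokens in order of appearance, without repeats."""
    sizes = []
    for token in SIZE_RE.findall(text):
        if token not in sizes:
            sizes.append(token)
    return sizes


def find_colours(text):
    """Return listed colours in the order they first appear in the text."""
    hits = []
    for colour in COLOURS:
        match = re.search(r"\b" + re.escape(colour) + r"\b", text, re.IGNORECASE)
        if match:
            hits.append((match.start(), colour))
    return [colour for _, colour in sorted(hits)]


def extract(product_name, description, spec_text=""):
    """Build the Extracted Attributes table as a dict."""
    source = f"{description} {spec_text}"
    return {
        "Brand": find_brand(product_name),
        "Size": ", ".join(find_sizes(source)) or "Not Specified",
        "Colour": ", ".join(find_colours(source)) or "Not Specified",
    }
```

Sizes are searched only in the description and spec text, as the steps specify. Scanning the product name as well would pick up model numbers such as the “90” in “Nike Air Max 90” as sizes.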
7. Validation & Quality Checks
| Check | Description |
|---|---|
| Brand Presence | Confirm a brand value is present; if not, flag for manual review. |
| Size Format | Verify size follows a recognised pattern (numeric, “XS‑XL”, or measurement). |
| Colour Validity | Ensure each colour matches an entry from the Colour List (Appendix C). |
| Consistency | Compare values found in the description and the spec sheet – they must not conflict. |
| Completion | All three attributes (brand, size, colour) must be filled; missing values are flagged. |
If any check fails, mark the record with an Error status and note the missing or mismatched attribute for manual review.
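The checks in the table above could be expressed as a single function that returns the names of the failed checks. This is a sketch only: `COLOUR_LIST` is an excerpt of C.2, the size pattern is a hypothetical simplification of C.3, and the source does not define how description and spec-sheet values are compared. Here two value lists are assumed to conflict only when they share no entry.

```python
import re

COLOUR_LIST = {"Black", "White", "Red", "Blue", "Green"}  # excerpt of C.2 (assumed)
MISSING = {"", "Not Found", "Not Specified"}

# Hypothetical simplification of the C.3 size formats.
SIZE_RE = re.compile(
    r"(?:(?:US|EU|UK) )?\d+(?:\.\d+)?"              # numeric, optional region prefix
    r"|XS|S|M|L|XL|XXL"                              # letter sizes
    r"|\d+(?:\.\d+)?(?:[-\u2011 ]inch| ?cm| ?kg)"    # measurements
)


def split_values(value):
    """Split a comma-separated attribute value into trimmed entries."""
    return [v.strip() for v in value.split(",") if v.strip()]


def validate(attrs, spec_attrs=None):
    """Return the names of failed checks; an empty list means the record passes.

    attrs holds the values taken from the description; spec_attrs holds the
    values read from the spec sheet, if one was supplied.
    """
    failures = []
    if attrs.get("Brand", "") in MISSING:
        failures.append("Brand Presence")

    sizes = split_values(attrs.get("Size", ""))
    if not sizes or not all(SIZE_RE.fullmatch(s) for s in sizes):
        failures.append("Size Format")

    colours = split_values(attrs.get("Colour", ""))
    if not colours or not all(c in COLOUR_LIST for c in colours):
        failures.append("Colour Validity")

    # Assumption: a conflict means the two sources share no value at all.
    for key in ("Size", "Colour"):
        ours = set(split_values(attrs.get(key, "")))
        theirs = set(split_values((spec_attrs or {}).get(key, "")))
        if ours and theirs and ours.isdisjoint(theirs):
            failures.append("Consistency")
            break

    if any(attrs.get(k, "") in MISSING for k in ("Brand", "Size", "Colour")):
        failures.append("Completion")
    return failures
```

A non-empty result corresponds to marking the record with an Error status; the returned names indicate which attribute needs manual review.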
8. Special Rules / Edge Cases
- Multiple Colours – List all colours in the order they appear, separated by commas only.
- Multiple Sizes – List all sizes; separate with commas (e.g., “9, 9.5”).
- Unknown Brand – If a brand cannot be identified from the description, product name, or a quick web search, record “Not Found” and flag for manual verification.
- Unrecognised Colour – If a colour word is not in the colour list, add “(Unconfirmed)” after the colour (e.g., “Gainsboro (Unconfirmed)”).
- Conflicting Information – When the description and spec sheet disagree on size or colour, prefer the spec sheet; note the conflict in a comment field in the output table.
- Missing Spec Sheet – If no PDF is supplied, rely solely on the product description.
- No Size or Colour Mentioned – Record “Not Specified” for the missing attribute.
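A few of these rules can be sketched as small helpers. The function names are hypothetical, and the source does not say what counts as a disagreement between the description and the spec sheet. This sketch assumes the two conflict only when they share no value, which matches the example in Section 9, where the description lists several sizes and the spec sheet lists one of them.

```python
def resolve(description_values, spec_values):
    """Return (final_values, comment); the spec sheet wins on a conflict."""
    if not spec_values:            # no spec sheet, or it is silent on this attribute
        return description_values, ""
    if not description_values:
        return spec_values, ""
    # Assumption: disjoint value lists are a conflict; overlapping lists are not.
    if set(description_values).isdisjoint(spec_values):
        comment = (
            f"Conflict: description={description_values}, "
            f"spec sheet={spec_values}"
        )
        return spec_values, comment
    return description_values, ""


def mark_unconfirmed(colours, colour_list):
    """Append '(Unconfirmed)' to any colour missing from the colour list."""
    return [c if c in colour_list else f"{c} (Unconfirmed)" for c in colours]


def render(values, missing_label="Not Specified"):
    """Join values with ', ' or fall back to the label for a missing attribute."""
    return ", ".join(values) if values else missing_label
```

For the brand field, `render(values, "Not Found")` gives the fallback described under Unknown Brand.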
9. Example
Input
- Product Name: “Adidas UltraBoost Running Shoes”
- Product Description: “Experience the new Adidas UltraBoost. These shoes feature a breathable knit upper in black and white. Available in sizes 8, 9, 10, and 11. Made with Primeknit technology.”
- Specification Sheet: PDF file (contains a table showing “Colour: Black/White”, “Size: 10”, “Material: Primeknit”).
Output
Extracted Attributes
| Attribute | Value |
|---|---|
| Brand | Adidas |
| Size | 8, 9, 10, 11 |
| Colour | Black, White |
Extraction Summary
“Adidas UltraBoost running shoes, available in sizes 8, 9, 10, 11, in black and white.”
Appendix A – FAQ
Q1: What if the product description does not contain a colour?
A: Record “Not Specified” for the colour field, then flag for manual review.
Q2: The product name includes multiple brand names (e.g., “Nike x Adidas”).
A: List both brands separated by a slash (e.g., “Nike/Adidas”).
Q3: I find a colour word that isn’t in the colour list.
A: Include the word as‑is, add “(Unconfirmed)”, and note the situation in the “Comments” column of the output table.
Q4: How do I handle a size given as “EU 42”?
A: Record the size exactly as shown (“EU 42”) and ensure it is placed under the Size column.
Q5: The spec sheet lists a colour as “#000000”.
A: Translate the hex code to its colour name (e.g., “Black”) using the Colour List. If unknown, write “#000000 (Unconfirmed)”.
Q6: When should I flag a record for manual review?
A: When any of the three attributes are missing or invalid after the validation checks.
Q7: The product has a “size” that is a dimension (e.g., “12‑inch”).
A: Record the dimension exactly (e.g., “12‑inch”) under the Size column.
Appendix B – Glossary
- Brand – The manufacturer or label associated with the product (e.g., “Nike”).
- Size – Numerical or alphanumeric representation of the product’s dimension, measurement, or standard size code.
- Colour (or Color) – The visual colour of the product (e.g., “Black”, “Red”).
- Product Description – Textual copy describing the product’s features and specifications.
- Specification Sheet – A PDF document that contains detailed technical information about the product.
- Extraction Summary – A short sentence summarising the extracted attributes.
Appendix C – Reference Materials
C.1 – Brand List (E‑Commerce)
| Brand |
|---|
| Adidas |
| Nike |
| Puma |
| Under Armour |
| Reebok |
| New Balance |
| Skechers |
| Asics |
| Vans |
| Converse |
| Timberland |
| Columbia |
| North Face |
| Patagonia |
| Levi’s |
| Calvin Klein |
| Gucci |
| Prada |
| Louis Vuitton |
| Chanel |
| Dior |
| ... (add as needed) |
C.2 – Colour List
| Colour |
|---|
| Black |
| White |
| Red |
| Blue |
| Green |
| Yellow |
| Orange |
| Purple |
| Pink |
| Brown |
| Gray |
| Navy |
| Beige |
| Maroon |
| Teal |
| Cyan |
| Magenta |
| Gold |
| Silver |
| Bronze |
| Navy Blue |
| Light Gray |
| Dark Gray |
| (Add any industry‑specific colours here, e.g., “Saffron”, “Olive”, “Burgundy”) |
C.3 – Size Formats
| Format | Example |
|---|---|
| Numeric | “10”, “9.5”, “12” |
| US Size | “US 10”, “US 9.5” |
| EU Size | “EU 42”, “EU 44” |
| UK Size | “UK 9”, “UK 9.5” |
| Letter Size | “S”, “M”, “L”, “XL”, “XXL” |
| Measurement | “12‑inch”, “30 cm”, “1.5 kg” |
| Combined | “US 9.5 / UK 8” |
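One way to test a value against the formats in this table is a list of regular expressions tried in order. The patterns below are assumptions derived from the examples only: the measurement units are limited to inch, cm, and kg, and “XS” is included among the letter sizes because it appears elsewhere in this document. `classify_size` is a hypothetical helper name.

```python
import re

NUM = r"\d+(?:\.\d+)?"  # integer or decimal number

# (format name, pattern) pairs; the first full match wins.
SIZE_FORMATS = [
    ("Combined", re.compile(rf"(?:US|EU|UK) {NUM}(?: / (?:US|EU|UK) {NUM})+")),
    ("US Size", re.compile(rf"US {NUM}")),
    ("EU Size", re.compile(rf"EU {NUM}")),
    ("UK Size", re.compile(rf"UK {NUM}")),
    # Accepts an ASCII hyphen or a non-breaking hyphen before "inch".
    ("Measurement", re.compile(rf"{NUM}(?:[-\u2011]inch| cm| kg)")),
    ("Letter Size", re.compile(r"XS|S|M|L|XL|XXL")),
    ("Numeric", re.compile(NUM)),
]


def classify_size(value):
    """Return the name of the matching size format, or None if none matches."""
    candidate = value.strip()
    for name, pattern in SIZE_FORMATS:
        if pattern.fullmatch(candidate):
            return name
    return None
```

A return value of `None` corresponds to a failed Size Format check in Section 7.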
C.4 – Style Guide for Outputs
- Capitalisation: Capitalise the first letter of each word in the Brand and Colour fields; size values keep their original case (e.g., “XS”, “9.5”).
- Punctuation: The Extraction Summary ends with a single period; no extra punctuation inside.
- Lists: Separate multiple values with a comma followed by a space (e.g., “Black, White”).
- Spacing: No space before a comma and exactly one space after it.
- Ordering: For multiple colours or sizes, preserve the order they appear in the source material; do not reorder alphabetically.
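The style rules above can be illustrated with three small formatting helpers. The names are hypothetical. The capitalisation helper upper-cases only the first letter of each word and leaves the rest untouched, so that a value such as “Levi’s” is not altered the way Python's built-in `str.title()` would alter it.

```python
def format_names(values):
    """Brand/Colour fields: capitalise each word, join with ', ', keep order."""
    def cap_words(text):
        return " ".join(word[:1].upper() + word[1:] for word in text.split())
    return ", ".join(cap_words(v) for v in values)


def format_sizes(values):
    """Size field: keep the original case, join with ', ', keep order."""
    return ", ".join(v.strip() for v in values)


def format_summary(sentence):
    """Extraction Summary: sentence case, ending with a single period."""
    body = sentence.strip().rstrip(".!? ")
    return body[:1].upper() + body[1:] + "."
```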
C.5 – Validation Checklist (for quick reference)
- Brand – Present? (Yes/No) – If No → “Not Found”.
- Size – Valid format? (Yes/No) – If No → “Not Specified”.
- Colour – In colour list? (Yes/No) – If No → “Unconfirmed”.
- All Three Present? – Yes → Proceed; No → Flag for manual review.
C.6 – Example of Multiple Colours
| Colour Code | Description |
|---|---|
| “Black/White” | Two colours listed with a slash – record as “Black, White”. |
| “Black & White” | Two colours separated by “&” – record as “Black, White”. |
| “Black/White/Red” | Three colours – record as “Black, White, Red”. |
Tip: When in doubt, use the first colour mentioned as the primary colour and list any additional colours after it.
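The conversions in the table above amount to splitting on “/” or “&” and re-joining with commas. A minimal sketch, with `split_colours` as a hypothetical helper name:

```python
import re


def split_colours(raw):
    """Turn 'Black/White' or 'Black & White' into 'Black, White'."""
    parts = re.split(r"\s*[/&]\s*", raw.strip())
    return ", ".join(part for part in parts if part)
```

The source order is preserved, so the first colour mentioned stays first.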
Additional Notes
- Always keep the original source wording in mind when extracting values; do not re‑interpret or re‑write unless the attribute format requires it (e.g., converting “12‑inches” to “12‑inch”).
- If the product name contains a brand and the description also mentions the brand, confirm they match. If they differ, note the discrepancy in a comment column (optional).
- Maintain a log of all records that are flagged for manual review to ensure follow‑up.