1. Overview
This agent compares two records and determines whether they represent the same real-world entity. It accepts free-text input for each record, automatically extracts and normalizes relevant fields, calculates a weighted confidence score, and returns a match decision with a field-by-field comparison.
When extracted information is ambiguous or incomplete, the agent researches publicly available sources to fill gaps before making its determination.
2. Business value
- Deduplication at scale: catches duplicates that simple string matching misses (e.g., "IBM" vs. "International Business Machines").
- Audit trail: the field-by-field comparison table shows exactly what matched, conflicted, or was missing.
- Reduced manual review: only "Possible Match" results need human attention.
- Zero formatting burden: users paste whatever they have (a CRM export, an email signature, a LinkedIn snippet, a business card photo's text) and the agent figures it out.
3. Inputs
| Field | Type | Details |
|---|---|---|
| Record A | Free text | Any unstructured text describing the entity: company name, contact info, address, ID numbers, or any combination. |
| Record B | Free text | Same as above for the second entity. |
The agent parses each text block and extracts whatever fields it can identify: name, address, phone, email, website, ID numbers (tax ID, DUNS, registration number, etc.). Unrecognized text is retained as context for research and fuzzy matching.
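
The document does not specify how the agent performs extraction. As a rough illustration only, a first pass over the easily recognizable fields could look like the sketch below. The regular expressions and the `extract_fields` helper are assumptions made for this example; names, addresses, and ID numbers would need more than pattern matching.

```python
import re

# Illustrative patterns only (assumptions, not the agent's actual rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,;]+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def extract_fields(text: str) -> dict:
    """Find emails, websites, and phone numbers; keep the rest as context."""
    emails = EMAIL_RE.findall(text)
    websites = URL_RE.findall(text)
    phones = PHONE_RE.findall(text)

    # Blank out recognized values so the leftover text can serve as context
    # for research and fuzzy matching.
    context = text
    for value in emails + websites + phones:
        context = context.replace(value, " ")

    return {
        "email": emails,
        "website": websites,
        "phone": phones,
        "context": " ".join(context.split()),
    }


record_a = "Acme Corp, jane.doe@acme.com, +1 (415) 555-0100, https://www.acme.com/"
fields = extract_fields(record_a)
```

Here the unrecognized remainder (which still contains "Acme Corp") is kept under `context` rather than discarded, mirroring the rule that unrecognized text is retained.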
4. Outputs
| Field | Contents |
|---|---|
| Decision | Match, No Match, or Possible Match |
| Confidence Score | 0 to 100 |
| Extracted Fields | Table showing what the agent parsed from each input, so the user can verify extraction was correct |
| Field Comparison | Table showing each extracted field, both values, normalized forms, and whether they matched |
| Conflicts | List of fields where the two records directly contradict each other |
| Research Notes | Any publicly available information the agent found to fill gaps or resolve ambiguity |
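
One possible way to hold these outputs in code is sketched below. The class and attribute names (`MatchResult`, `FieldComparison`, and so on) are hypothetical; only the six output fields themselves come from the table above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FieldComparison:
    """One row of the Field Comparison table."""
    field_name: str
    value_a: Optional[str]
    value_b: Optional[str]
    normalized_a: Optional[str]
    normalized_b: Optional[str]
    similarity: Optional[float]  # None when the field is empty on one side


@dataclass
class MatchResult:
    decision: str      # "Match", "No Match", or "Possible Match"
    confidence: float  # 0 to 100
    extracted_fields: Dict[str, Dict[str, str]]  # e.g. {"A": {...}, "B": {...}}
    field_comparison: List[FieldComparison] = field(default_factory=list)
    conflicts: List[str] = field(default_factory=list)
    research_notes: List[str] = field(default_factory=list)


result = MatchResult(
    decision="Possible Match",
    confidence=70.0,
    extracted_fields={"A": {"name": "Acme Corp"}, "B": {"name": "Acme Corporation"}},
)
result.field_comparison.append(
    FieldComparison("name", "Acme Corp", "Acme Corporation",
                    "acme", "acme corporation", 0.8)
)
```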
5. Execution steps
- Extract fields from free text. Parse each input and identify: entity name, address, phone number(s), email address(es), website, and any ID numbers. If the text is ambiguous (e.g., two names appear), use surrounding context to determine which is the primary entity. Present the extracted fields table so the user can see what was parsed.
- Normalize extracted fields. Apply the normalization rules in Appendix B: strip whitespace, standardize phone to E.164, expand address abbreviations, lowercase email domains, remove protocol/www from URLs, strip legal suffixes from names.
- Compare each field pair. For each field present in both records, calculate a similarity score from 0.0 to 1.0:
  - Exact match after normalization: 1.0
  - ID numbers match: 1.0 (definitive)
  - Fuzzy name match (e.g., "Acme Corp" vs. "Acme Corporation"): 0.7 to 0.9 depending on edit distance
  - Partial address match (same street, different suite): 0.5 to 0.8
  - Same email domain but different local part: 0.3
  - No meaningful similarity: 0.0
- Research gaps. If a field is present in one record but missing in the other, and the running confidence is in the 40-75 range, attempt to fill the gap using publicly available information on the web. Note findings and sources in Research Notes.
- Calculate weighted confidence score.

  Confidence = Sum(field_similarity × field_weight) / Sum(active_field_weights) × 100

  Only include fields where at least one record has a value. Use weights from Appendix A.
- Determine decision:
  - 80 to 100: Match
  - 40 to 79: Possible Match
  - 0 to 39: No Match

  Exception: if any ID number matches exactly, the decision is Match regardless of other fields. If ID numbers are present in both records and don't match, the decision is No Match regardless of other fields.
- Flag conflicts. Any field where both records have values and the similarity score is below 0.3 is listed as a conflict.
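
The comparison and conflict-flagging steps above can be sketched roughly as follows. This is not the agent's actual scoring code: it substitutes `difflib.SequenceMatcher` for a true edit-distance measure, the 0.6 cutoff and the linear mapping onto the 0.7 to 0.9 band are assumptions, and partial address matching is left out for brevity.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

CONFLICT_THRESHOLD = 0.3  # from the "Flag conflicts" step


def name_similarity(a: str, b: str) -> float:
    """Map a fuzzy ratio onto the 0.7 to 0.9 band (assumed mapping)."""
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio < 0.6:  # assumed cutoff for "no meaningful similarity"
        return 0.0
    return 0.7 + (ratio - 0.6) / 0.4 * 0.2


def email_similarity(a: str, b: str) -> float:
    """Same domain with a different local part scores 0.3."""
    domain_a = a.rpartition("@")[2]
    domain_b = b.rpartition("@")[2]
    return 0.3 if domain_a == domain_b else 0.0


def field_similarity(name: str, a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Score one normalized field pair; None means the pair is not comparable."""
    if a is None or b is None:
        return None
    if a == b:
        return 1.0  # exact match after normalization
    if name == "name":
        return name_similarity(a, b)
    if name == "email":
        return email_similarity(a, b)
    return 0.0


def find_conflicts(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> List[str]:
    """List fields present in both records whose similarity is below 0.3."""
    conflicts = []
    for name in sorted(rec_a.keys() & rec_b.keys()):
        score = field_similarity(name, rec_a[name], rec_b[name])
        if score is not None and score < CONFLICT_THRESHOLD:
            conflicts.append(name)
    return conflicts


a = {"name": "acme corp", "email": "jane@acme.com", "phone": "+14155550100"}
b = {"name": "acme corporation", "email": "bob@acme.com", "phone": "+12125550199"}
conflicts = find_conflicts(a, b)
```

In this example only the phone numbers are flagged: the names fall in the fuzzy band, and the shared email domain scores exactly 0.3, which is not below the threshold.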
6. Validation checks
- The confidence score must be mathematically consistent with the field-by-field scores and weights. Don't round intermediate calculations.
- Every field extracted from either record must appear in the comparison table, even if it's empty on one side.
- Research notes must cite specific sources (e.g., "LinkedIn company page," "State business registry").
- If extraction is uncertain (e.g., it's unclear whether a string is a company name or a person's name), note the ambiguity in the Extracted Fields table.
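
The first two checks lend themselves to automation. A minimal sketch, assuming each comparison row is a dict with hypothetical keys `field`, `similarity`, and `weight`, and that rows with no similarity on one side carry 0.0:

```python
import math
from typing import Dict, List


def validate_result(extracted_a: Dict[str, str],
                    extracted_b: Dict[str, str],
                    rows: List[dict],
                    reported_confidence: float) -> List[str]:
    """Return a list of validation problems; an empty list means both checks pass."""
    problems = []

    # Check: every field extracted from either record appears in the table.
    expected = set(extracted_a) | set(extracted_b)
    missing = expected - {row["field"] for row in rows}
    if missing:
        problems.append(f"fields missing from comparison table: {sorted(missing)}")

    # Check: the reported score agrees with the unrounded recomputation.
    total_weight = sum(row["weight"] for row in rows)
    if total_weight > 0:
        weighted = sum(row["similarity"] * row["weight"] for row in rows)
        recomputed = 100.0 * weighted / total_weight
        if not math.isclose(recomputed, reported_confidence, abs_tol=1e-9):
            problems.append(
                f"confidence {reported_confidence} does not match recomputed {recomputed}"
            )
    return problems


rows = [
    {"field": "name", "similarity": 0.8, "weight": 25},
    {"field": "website", "similarity": 1.0, "weight": 15},
    {"field": "phone", "similarity": 0.0, "weight": 10},
]
ok = validate_result({"name": "acme", "website": "acme.com", "phone": "+14155550100"},
                     {"name": "acme corporation", "website": "acme.com"},
                     rows, 70.0)
```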
7. Edge cases
- Minimal or unstructured input: If the agent can only extract one identifiable field from each input, return Possible Match with a confidence of 50 and a note that there isn't enough data for a reliable determination.
- Subsidiary or DBA: If research reveals one entity is a subsidiary or DBA of the other, return Possible Match with an explanation rather than a flat Match.
- Conflicting ID numbers: Overrides everything else. If both records contain an ID number and they don't match, the decision is No Match even if every other field is identical.
- Multiple entities in one input: If the text appears to describe more than one entity (e.g., a partnership listing), use the most prominent entity and note the ambiguity.
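
Taken together with the score bands, the ID override, minimal-input, and subsidiary rules could be ordered as in the sketch below. Several details are assumptions because the document does not settle them: IDs are compared only when both records carry the same ID type, the ID rules are applied before the other edge cases, and fractional scores between 79 and 80 fall into Possible Match. The `decide` function and its parameters are hypothetical.

```python
from typing import Dict, Tuple


def decide(confidence: float,
           ids_a: Dict[str, str],
           ids_b: Dict[str, str],
           field_count_a: int,
           field_count_b: int,
           related_entity: bool = False) -> Tuple[str, float]:
    """Return (decision, confidence) after applying overrides and edge cases."""
    # ID rules override everything else (assumption: compare like ID types only).
    shared_types = ids_a.keys() & ids_b.keys()
    if shared_types:
        if any(ids_a[t] == ids_b[t] for t in shared_types):
            return "Match", confidence
        return "No Match", confidence

    # Minimal input: only one identifiable field from each record.
    if field_count_a <= 1 and field_count_b <= 1:
        return "Possible Match", 50.0

    # Subsidiary or DBA relationship found during research.
    if related_entity:
        return "Possible Match", confidence

    # Score bands.
    if confidence >= 80:
        return "Match", confidence
    if confidence >= 40:
        return "Possible Match", confidence
    return "No Match", confidence


conflicting_ids = decide(95.0, {"duns": "123456789"}, {"duns": "987654321"}, 5, 5)
```

With conflicting DUNS numbers the result is No Match even at a score of 95, which reflects the rule that conflicting IDs override every other field.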
Appendix A: Field weights
| Field | Weight | Rationale |
|---|---|---|
| ID Numbers | 50 | Definitive identifier when present |
| Name | 25 | Primary identifier but subject to variation |
| Website | 15 | Strong signal, low ambiguity |
| Email Domain | 10 | Moderately strong if domains match |
| Phone | 10 | Useful but often changes |
| Address | 10 | Useful but companies relocate |
Weights are relative. The agent normalizes to 100 based on which fields are actually present.
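A short worked example may make the normalization concrete. The sketch below applies the confidence formula with the weights from the table; the dictionary keys and the `confidence_score` helper are names chosen for this illustration, and the caller is assumed to pass a similarity for every active field.

```python
from typing import Dict

# Weights from Appendix A (they sum to 120 when every field is present).
FIELD_WEIGHTS = {
    "id_numbers": 50,
    "name": 25,
    "website": 15,
    "email_domain": 10,
    "phone": 10,
    "address": 10,
}


def confidence_score(similarities: Dict[str, float]) -> float:
    """Weighted average of similarities over the active fields, scaled to 0-100."""
    if not similarities:
        raise ValueError("at least one active field is required")
    weighted = sum(sim * FIELD_WEIGHTS[name] for name, sim in similarities.items())
    active = sum(FIELD_WEIGHTS[name] for name in similarities)
    # No intermediate rounding, per the validation checks.
    return 100.0 * weighted / active


score = confidence_score({"name": 0.8, "website": 1.0, "phone": 0.0})
```

With only name, website, and phone active, the denominator is 25 + 15 + 10 = 50, so the score is (0.8 × 25 + 1.0 × 15 + 0.0 × 10) / 50 × 100 = 70, which falls in the Possible Match band.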
Appendix B: Normalization rules
| Field | Normalization |
|---|---|
| Name | Strip legal suffixes (Inc., LLC, Ltd., Corp.), lowercase, trim whitespace |
| Phone | Convert to E.164 (+1XXXXXXXXXX), strip extensions |
| Email | Lowercase entire address, compare domains separately from local parts |
| Address | Expand abbreviations (St to Street, Ave to Avenue, Blvd to Boulevard, Ste to Suite), standardize directionals (N to North), remove punctuation |
| Website | Remove protocol, "www.", and trailing slashes, lowercase |
| ID Numbers | Strip all non-alphanumeric characters, uppercase any letters |
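
The rules in this table might translate into helpers along the following lines. This is a sketch under stated assumptions: phone handling assumes 10-digit North American numbers when no country code is given, the directional map extends the single "N to North" example to S, E, and W, and addresses are lowercased for comparison even though the table does not require it. All function names are hypothetical.

```python
import re

ADDRESS_WORDS = {
    "st": "street", "ave": "avenue", "blvd": "boulevard", "ste": "suite",
    "n": "north", "s": "south", "e": "east", "w": "west",  # S/E/W are assumed
}


def normalize_name(name: str) -> str:
    """Lowercase, trim, and strip one trailing legal suffix."""
    cleaned = name.strip().lower()
    return re.sub(r"[,\s]+(?:inc|llc|ltd|corp)\.?$", "", cleaned).strip()


def normalize_phone(phone: str, default_country_code: str = "1") -> str:
    """Strip extensions and format as E.164 (assumes NANP if 10 digits)."""
    main = re.split(r"(?:ext\.?|x)\s*\d+\s*$", phone.strip(), flags=re.IGNORECASE)[0]
    digits = re.sub(r"\D", "", main)
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits


def normalize_email(email: str) -> str:
    return email.strip().lower()


def normalize_address(address: str) -> str:
    """Remove punctuation, then expand abbreviations and directionals."""
    no_punct = re.sub(r"[^\w\s]", "", address.lower())
    return " ".join(ADDRESS_WORDS.get(tok, tok) for tok in no_punct.split())


def normalize_website(url: str) -> str:
    """Drop protocol, leading www., and trailing slashes; lowercase."""
    cleaned = url.strip().lower()
    cleaned = re.sub(r"^[a-z][a-z0-9+.-]*://", "", cleaned)
    cleaned = re.sub(r"^www\.", "", cleaned)
    return cleaned.rstrip("/")


def normalize_id(value: str) -> str:
    """Keep only letters and digits, uppercased."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()
```

For instance, `normalize_phone("(415) 555-0100 ext. 204")` and `normalize_phone("+1 415-555-0100")` both yield `+14155550100`, so the two forms compare as an exact match.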

