1. Overview

This agent compares two records (companies, contacts, or products) and determines whether they represent the same real-world entity. It normalizes fields, calculates a weighted confidence score, and returns a match decision with a field-by-field comparison showing exactly what matched, what conflicted, and what was missing.

When a field is ambiguous or incomplete, the agent researches publicly available information (company websites, business registries, etc.) to fill gaps before making its determination.

2. Business value

  • Deduplication at scale: catches duplicates that simple string matching misses (e.g., "IBM" vs. "International Business Machines," "123 Main St" vs. "123 Main Street, Suite 200").

  • Audit trail: the field-by-field comparison table makes it easy to verify why a match was or wasn't made, which matters for compliance and data governance.

  • Reduced manual review: only records in the "Possible Match" range need human attention. Clear matches and clear non-matches are resolved automatically.

3. Inputs

Two records with the following fields. Not every field needs to be filled; the agent works with whatever is available.

| Field | Type | Details |
| --- | --- | --- |
| Record A: Name | Text | Company name, person name, or product name |
| Record A: Address | Text | Full mailing address |
| Record A: Phone | Text | Any format |
| Record A: Email | Text | Email address |
| Record A: Website | Text | URL |
| Record A: ID Numbers | Text (optional) | Tax ID, DUNS number, registration number, etc. |
| Record B: Name | Text | Same as Record A |
| Record B: Address | Text | Same as Record A |
| Record B: Phone | Text | Same as Record A |
| Record B: Email | Text | Same as Record A |
| Record B: Website | Text | Same as Record A |
| Record B: ID Numbers | Text (optional) | Same as Record A |
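
As an illustration of the input shape, the sketch below models one record as a Python dataclass in which every field is optional. The class name `Record`, the attribute names, and the `present_fields` helper are hypothetical; the spec only defines the field list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One input record. Every field may be missing (assumed modeling choice)."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None
    website: Optional[str] = None
    id_numbers: Optional[str] = None  # Tax ID, DUNS number, registration number, etc.

    def present_fields(self) -> set[str]:
        """Names of the fields that hold a non-empty value."""
        return {key for key, value in vars(self).items() if value}


record_a = Record(name="Acme Corp", phone="(555) 123-4567")
record_b = Record(name="Acme Corporation", website="https://www.acme.example/")

# Fields that can be compared directly are those present on both sides.
comparable = record_a.present_fields() & record_b.present_fields()
```

Intersecting the two `present_fields()` sets gives the fields available for direct comparison; the union gives the fields that must appear in the comparison table.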

4. Outputs

| Field | Contents |
| --- | --- |
| Decision | Match, No Match, or Possible Match |
| Confidence Score | 0 to 100 |
| Field Comparison | Table showing each field, both values, the normalized forms, and whether they matched |
| Conflicts | List of fields where the two records directly contradict each other |
| Research Notes | Any publicly available information the agent found to fill gaps or resolve ambiguity |
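
One possible way to represent this output is sketched below. The type names (`FieldComparison`, `MatchResult`) and attribute names are assumptions made for illustration; only the five output fields and their allowed values come from the table above.

```python
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_DECISIONS = ("Match", "No Match", "Possible Match")


@dataclass
class FieldComparison:
    """One row of the field-by-field comparison table."""
    field_name: str
    value_a: Optional[str]
    value_b: Optional[str]
    normalized_a: Optional[str]
    normalized_b: Optional[str]
    matched: Optional[bool]  # None when one side is empty (assumed convention)


@dataclass
class MatchResult:
    decision: str
    confidence: float
    field_comparison: list[FieldComparison] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    research_notes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the ranges given in the output table.
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


result = MatchResult(
    decision="Possible Match",
    confidence=62.5,
    field_comparison=[
        FieldComparison("name", "Acme Corp", "Acme Corporation", "acme", "acme corporation", False)
    ],
)
```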

5. Execution steps

  1. Normalize both records. Apply the normalization rules in Appendix B to each field: strip whitespace, standardize phone numbers to E.164, expand address abbreviations (St → Street, Ave → Avenue), lowercase email addresses, and remove the protocol and "www." prefix from URLs.

  2. Compare each field pair. For each field present in both records, calculate a similarity score from 0.0 to 1.0:

    • Exact match after normalization → 1.0

    • ID numbers match → 1.0 (these are definitive)

    • Fuzzy name match (e.g., "Acme Corp" vs. "Acme Corporation") → 0.7 to 0.9 depending on edit distance

    • Partial address match (same street, different suite) → 0.5 to 0.8

    • Same email domain but different local part → 0.3

    • No meaningful similarity → 0.0

  3. Research gaps. If a field is present in one record but missing in the other, and a preliminary confidence score (calculated as in step 4 from the fields compared so far) falls in the 40-75 range, attempt to fill the gap using publicly available information. Record what was found and its source in Research Notes.

  4. Calculate weighted confidence score. Multiply each field's similarity score by its weight from Appendix A, sum the products, divide by the total weight of the active fields, and scale to 0-100:

    Confidence = Σ (field_similarity × field_weight) / Σ (active_field_weights) × 100

    Only include fields where at least one record has a value. If neither record has a value for a field, exclude it from the calculation entirely.

  5. Determine decision:

    • 80 to 100: Match

    • 40 to 79: Possible Match

    • 0 to 39: No Match

    Exception: if any ID number matches exactly, the decision is Match regardless of other fields. If ID numbers are present in both records and don't match, the decision is No Match regardless of other fields.

  6. Flag conflicts. Any field where both records have values and the similarity score is below 0.3 should be listed as a conflict.
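
Steps 5 and 6 above can be sketched as two small functions. This is a minimal illustration, not the agent's actual implementation: the function names and the `id_similarity` convention are hypothetical, and treating a fractional score such as 79.5 as Possible Match is an assumption, since the thresholds above are stated as whole numbers.

```python
from typing import Optional


def decide(confidence: float, id_similarity: Optional[float] = None) -> str:
    """Map a 0-100 confidence score to a decision.

    id_similarity is None when ID numbers are missing from either record,
    1.0 when they match exactly, and lower when both are present but differ.
    The ID exception is checked first because it overrides the thresholds.
    """
    if id_similarity is not None:
        return "Match" if id_similarity == 1.0 else "No Match"
    if confidence >= 80:
        return "Match"
    if confidence >= 40:  # assumption: anything from 40 up to (not including) 80
        return "Possible Match"
    return "No Match"


def find_conflicts(similarities: dict[str, Optional[float]]) -> list[str]:
    """Return fields compared on both sides whose similarity is below 0.3.

    A value of None marks a field that is missing from at least one record,
    so it cannot be a conflict.
    """
    return [name for name, score in similarities.items()
            if score is not None and score < 0.3]


conflicts = find_conflicts({"name": 0.8, "phone": 0.0, "email": 0.3, "address": None})
```

Note that a similarity of exactly 0.3 (same email domain, different local part) is not below the conflict threshold, so `email` is not flagged in the example.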

6. Validation checks

  • The confidence score must be mathematically consistent with the field-by-field similarity scores and weights. Don't round intermediate calculations.

  • Every field present in either record must appear in the comparison table, even if it's empty on one side.

  • Research notes must cite specific sources (e.g., "LinkedIn company page" or "State business registry") rather than vague references.

7. Edge cases

  • Both records nearly empty: if fewer than two comparable fields exist, return Possible Match with a confidence score of 50 and a note that there isn't enough data for a reliable determination.

  • One record is clearly a subsidiary: if research reveals that one entity is a subsidiary or DBA of the other, return Possible Match with an explanation rather than a flat Match.

  • Conflicting ID numbers: this overrides everything else. If both records contain an ID number and they don't match, the decision is No Match even if every other field is identical.
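
The edge cases above could be applied as a final pass over the scored result, as in the sketch below. The function name, its parameters, and the note wording are hypothetical. The order of the checks is also an assumption: conflicting IDs come first because that rule is stated to override everything else, and the source does not say how the other two rules rank against each other.

```python
def apply_edge_cases(
    decision: str,
    confidence: float,
    comparable_fields: int,
    ids_conflict: bool = False,
    related_entity: bool = False,
) -> tuple[str, float, str]:
    """Return (decision, confidence, note) after applying the edge-case rules.

    comparable_fields: number of fields with a value in both records.
    ids_conflict: both records contain an ID number and the IDs differ.
    related_entity: research found a subsidiary or DBA relationship.
    """
    if ids_conflict:
        return "No Match", confidence, "ID numbers are present in both records and differ."
    if comparable_fields < 2:
        return "Possible Match", 50.0, "Not enough data for a reliable determination."
    if related_entity:
        return "Possible Match", confidence, "Research indicates a subsidiary or DBA relationship."
    return decision, confidence, ""
```

For example, a pair that scored 92 but was found to be parent and subsidiary would come back as Possible Match with the score unchanged and an explanatory note.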


Appendix A: Field weights

| Field | Weight | Rationale |
| --- | --- | --- |
| ID Numbers | 50 | Definitive identifier when present |
| Name | 25 | Primary identifier but subject to variation |
| Website | 15 | Strong signal, low ambiguity |
| Email Domain | 10 | Moderately strong if domains match |
| Phone | 10 | Useful but often changes or has multiple numbers |
| Address | 10 | Useful but companies relocate and have multiple offices |

Weights are relative, not absolute. The agent normalizes to 100 based on which fields are actually present.
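
To make the normalization concrete, here is a minimal sketch of the weighted score using the weights from the table. The dictionary keys and the function name are hypothetical, and the caller is assumed to pass only active fields (those where at least one record has a value).

```python
# Weights copied from Appendix A; the dictionary keys are hypothetical names.
FIELD_WEIGHTS = {
    "id_numbers": 50,
    "name": 25,
    "website": 15,
    "email_domain": 10,
    "phone": 10,
    "address": 10,
}


def weighted_confidence(similarities: dict[str, float]) -> float:
    """Return a 0-100 confidence score from per-field similarities (0.0 to 1.0).

    `similarities` should contain only active fields. Fields that are empty
    in both records are left out, so their weight never enters the total.
    """
    active_weight = sum(FIELD_WEIGHTS[name] for name in similarities)
    if active_weight == 0:
        raise ValueError("no active fields to score")
    weighted_sum = sum(FIELD_WEIGHTS[name] * score for name, score in similarities.items())
    return weighted_sum / active_weight * 100


# Name, website, and phone are active, so the total active weight is 50:
# (25 * 0.8 + 15 * 1.0 + 10 * 0.0) / 50 * 100, which is about 70.
example = weighted_confidence({"name": 0.8, "website": 1.0, "phone": 0.0})
```

Because the divisor is the sum of the active weights rather than a fixed 100 or 120, the same formula works whether two fields or all six are present.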

Appendix B: Normalization rules

| Field | Normalization |
| --- | --- |
| Name | Strip legal suffixes (Inc., LLC, Ltd., Corp.), lowercase, trim whitespace |
| Phone | Convert to E.164 format (+1XXXXXXXXXX). Strip extensions. |
| Email | Lowercase the entire address. Compare domains separately from local parts. |
| Address | Expand abbreviations (St→Street, Ave→Avenue, Blvd→Boulevard, Ste→Suite, Apt→Apartment). Standardize directionals (N→North, etc.). Remove punctuation. |
| Website | Remove protocol (http/https), "www.", and trailing slashes. Lowercase. |
| ID Numbers | Strip all non-alphanumeric characters. Uppercase any letters. |
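
The rules in this table might be implemented along the following lines. This is a rough sketch under several assumptions that are not in the table: all function names are hypothetical, 10-digit phone numbers are assumed to be US numbers (country code 1), addresses are lowercased for comparison, and the abbreviation lists are limited to the examples given above.

```python
import re

LEGAL_SUFFIX = re.compile(r"[,\s]+(inc|llc|ltd|corp)\.?$", re.IGNORECASE)
ADDRESS_TERMS = {
    "st": "street", "ave": "avenue", "blvd": "boulevard",
    "ste": "suite", "apt": "apartment",
    "n": "north", "s": "south", "e": "east", "w": "west",
}


def normalize_name(name: str) -> str:
    """Strip a trailing legal suffix, lowercase, and trim whitespace."""
    return LEGAL_SUFFIX.sub("", name.strip()).strip().lower()


def normalize_phone(phone: str, default_country_code: str = "1") -> str:
    """Strip extensions and return +<digits>; 10-digit input gets the default code."""
    phone = re.sub(r"\s*(?:ext\.?|x)\s*\d+\s*$", "", phone, flags=re.IGNORECASE)
    digits = re.sub(r"\D", "", phone)
    if not digits:
        return ""
    if len(digits) == 10:  # assumption: a bare 10-digit number is a US number
        digits = default_country_code + digits
    return "+" + digits


def normalize_email(email: str) -> tuple[str, str]:
    """Lowercase the address and return (local part, domain) for separate comparison."""
    local, _, domain = email.strip().lower().partition("@")
    return local, domain


def normalize_address(address: str) -> str:
    """Remove punctuation, then expand abbreviations and directionals token by token.

    Caveat: a token such as "st" is always read as "street", never "saint".
    """
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    return " ".join(ADDRESS_TERMS.get(token, token) for token in cleaned.split())


def normalize_website(url: str) -> str:
    """Lowercase, then drop the protocol, a leading "www.", and trailing slashes."""
    url = re.sub(r"^https?://", "", url.strip().lower())
    return re.sub(r"^www\.", "", url).rstrip("/")


def normalize_id(value: str) -> str:
    """Keep only letters and digits, with letters uppercased."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()
```

With these helpers, "123 N. Main St, Ste 200" and "123 North Main Street Suite 200" normalize to the same string, so they would score 1.0 as an exact match after normalization.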