KYC & Onboarding Form Parser
1. Overview
This process reads a customer‑submitted KYC document (such as a passport, driver’s licence, or national ID) and extracts the essential data fields into a clear, plain‑text report. The report also highlights any missing information, format errors, or mismatches. The output is a concise, human‑readable report that a compliance analyst can use to confirm that the KYC submission meets the required standards.
2. Business Value
- Regulatory compliance – Ensures that every KYC submission satisfies anti‑money‑laundering (AML) and “Know Your Customer” regulations.
- Efficiency – Automates the extraction of key data, reducing the time analysts spend manually reviewing documents.
- Risk reduction – Early detection of missing or incorrect fields helps prevent fraud and onboarding delays.
- Customer experience – Faster verification speeds up onboarding for new customers.
3. Operational Context
- When to run – Whenever a new customer submits a KYC document for onboarding, or when an existing customer’s KYC information is being refreshed or audited.
- Who uses it – Compliance analysts, onboarding managers, and any team responsible for regulatory verification.
- Frequency – Each KYC submission triggers one run of the process (typically multiple times per day in a high‑volume fintech).
4. Inputs
| Name/Label | Type | Details Provided |
|---|
| KYC Document | PDF file | A scanned or digitally captured copy of the customer’s identification document (passport, driver’s licence, or national ID) and any supporting documents (e.g., proof‑of‑address). The PDF must be legible, contain all required pages, and be free of encryption. |
| Document Type (optional) | Text | The type of identification presented (e.g., “Passport”, “Driver’s Licence”, “National ID”). If not supplied, the process will infer the type from the document content. |
| Customer Name (optional) | Text | The full legal name of the customer as entered on the onboarding form. Used only for reference in the report. |
| Submission Date (optional) | Date (format YYYY‑MM‑DD) | The date on which the KYC document was submitted for processing. Helpful for tracking and audit purposes. |
All the above information must be supplied for a single run of the process. No external files or databases are required.
5. Outputs
The process generates a plain‑text report composed of three sections: an Extracted Fields Table, a Mismatches Summary Table, and an Overall Summary. No new system IDs are created.
Extracted Fields Table
| Name/Label | Contents | Formatting Rules |
|---|
| Extracted Fields Table | A list of every required data element (e.g., Full Name, Date of Birth, ID Number, Issue Date, Expiry Date, Address, etc.) with the value extracted from the document, a status indicator (Valid, Missing, Mismatch, or Mismatch – partial read), and a brief comment. | Plain‑text table with four columns: Field, Extracted Value, Status, Comment. Rows sorted alphabetically by Field. No extra symbols or IDs. |
Mismatch Summary Table
| Name/Label | Contents | Formatting Rules |
|---|
| Mismatch Summary Table | All fields that are missing, contain format errors, or conflict with any optional reference input (e.g., Customer Name) are listed with the issue type, expected format, actual value (if any), and a brief comment. | Plain‑text table with five columns: Field, Issue, Expected Format, Actual Value, Comment. Only fields with issues appear. |
Overall Summary
| Name/Label | Contents | Formatting Rules |
|---|
| Overall Summary | A short paragraph indicating whether the KYC document passes the compliance check (PASS) or fails (FAIL) and a count of issues detected. | Plain‑text paragraph. Use the word “PASS” when all fields are valid; otherwise “FAIL – X issue(s) detected”. No tables. |
6. Detailed Plan & Execution Steps
- Gather Input – Receive the KYC Document PDF and any optional inputs (Document Type, Customer Name, Submission Date). Verify the PDF opens and is not corrupted.
- Check Readability – Confirm that the PDF is legible. If it is unreadable, stop and return “Document unreadable – manual review required”.
- Identify Document Type – Use the supplied Document Type; if absent, infer the type from the document’s visual cues.
- Run OCR – Perform OCR on every page of the PDF to produce plain text.
- Locate Required Fields – Using the Required KYC Fields list (Appendix C), locate each field in the OCR text: Full Name, Date of Birth, ID Number, Date of Issue, Date of Expiry, Address, Proof of Address, etc.
- Validate Each Field:
a. Presence – If a field is not found, mark it Missing.
b. Format – Check the extracted value against the format rules in Appendix C (e.g., dates must be YYYY‑MM‑DD; ID number must be alphanumeric 6–12 characters). If the format is wrong, mark Mismatch.
c. Reference Matching – If a Customer Name was provided, compare it to the extracted Full Name (ignore case and surrounding spaces). Any difference = Mismatch.
- Populate Extracted Fields Table – List each field with its value, status, and a brief comment.
- Build Mismatch Summary Table – For each Missing or Mismatch entry, add a row with the field name, issue type, expected format, actual value (if any), and a short comment.
- Create Overall Summary – Count the total number of missing or mismatched items.
- If the count is zero, write “PASS – All required fields are present and correctly formatted.”
- If any issue exists, write “FAIL – X issue(s) detected” and note the count.
- Assemble Report – Combine the three sections in the order: Extracted Fields Table, Mismatch Summary Table, Overall Summary.
- Quality Check – Perform a quick visual scan of the report to verify that all tables are correctly formatted and that the counts in the Summary match the entries in the Mismatch table.
- Deliver Output – Present the full plain‑text report to the compliance analyst. No separate file is produced; the report is returned as plain text.
- Log Result – Record that the document has been processed, noting any failures (e.g., unreadable file) for internal tracking.
7. Validation & Quality Checks
- File Integrity – PDF opens without error and contains at least one page.
- Field Presence – Each required field appears in the Extracted Fields Table (with “Valid”, “Missing”, or “Mismatch” status).
- Format Conformance – Dates, ID numbers, and other fields follow the rules in Appendix C. Any deviation is flagged.
- Name Consistency – If a Customer Name was supplied, the Full Name must match; otherwise, flag as Mismatch.
- Table Structure – Both tables contain the correct headings and column order; rows are correctly aligned.
- Summary Accuracy – The number of issues reported in the Overall Summary matches the rows in the Mismatch Summary Table.
- Readability – The report uses plain language, no technical jargon, and no system‑generated IDs.
- Error Handling – Any failure (e.g., unreadable PDF, missing document) results in an explicit error message and no output report.
8. Special Rules / Edge Cases
- Unreadable PDF – If OCR cannot extract any text, stop and return “Document unreadable – manual review needed”.
- Missing Document – If no PDF is supplied, abort the process and report “No document provided”.
- Multiple IDs – When the PDF contains more than one ID, select the one with the highest priority: Passport > Driver’s Licence > National ID. Document any selection in the Comment column of the Extracted Fields Table.
- Partial Pages – If a page containing a required field is missing (e.g., back side of a driver’s licence), record the field as Missing.
- Non‑Latin Scripts – If the document is in an unsupported language, flag “Unsupported language – manual review required”.
- Conflicting Data – If two different values appear for the same field on separate pages, list both in the Comment and mark the field as Mismatch.
- Missing Supporting Document – If the KYC process requires a proof‑of‑address document but none is present, mark the field as Missing.
- Incorrect Date Format – Dates not in the format YYYY‑MM‑DD are flagged Mismatch with a comment indicating the required format.
- Invalid ID Pattern – IDs containing illegal characters (e.g., “#”, “%”) or with length outside 6–12 characters are flagged Mismatch with a comment “Invalid ID format”.
- No Document Type Provided – If the system cannot determine the document type, set the Document Type to “Unknown” and continue using the full list of required fields.
- Partial OCR Reads – When OCR yields partial data (e.g., “A12…” for an ID), label the status Mismatch – partial read, include the partially read value in Comment, and note the need for manual verification.
9. Example
Input (single run)
- KYC Document (PDF) – Scan of a passport for “Alice Smith”. The PDF contains:
- Full Name: “Alice Marie Smith”
- Date of Birth: “1990‑04‑15”
- Passport Number (ID Number): “A1234567”
- Date of Issue: “2020‑01‑01”
- Date of Expiry: “2030‑01‑01”
- Address: “123 Main Street, Springfield, USA”
- Missing – Proof‑of‑address document is not included.
- Document Type (optional) – “Passport”
- Customer Name (optional) – “Alice Smith”
- Submission Date (optional) – “2023‑06‑01”
Expected Output
Extracted Fields Table
| Field | Extracted Value | Status | Comment |
|---|
| Address | 123 Main Street, Springfield, USA | Mismatch | City missing |
| Date of Birth | 1990-04-15 | Valid | |
| Date of Expiry | 2030-01-01 | Valid | |
| Date of Issue | 2020-01-01 | Valid | |
| Full Name | Alice Marie Smith | Mismatch | Does not match Customer Name |
| ID Number | A1234567 | Valid | |
| Document Type | Passport | Valid | |
| Proof of Address | (none) | Missing | No proof‑of‑address document supplied |
| Signature | Present | Valid | |
| Country of Issue | United States | Valid | |
| Issuing Authority | Department of State | Valid | |
Mismatch Summary Table
| Field | Issue | Expected Format | Actual Value | Comment |
|---|
| Full Name | Mismatch with Customer Name | N/A | “Alice Marie Smith” | Customer name is “Alice Smith” |
| Address | Missing city component | Full street, city, country | “123 Main Street, USA” | city missing |
| Proof of Address | Missing | Proof‑of‑address document (e.g., utility bill) | None | Required for KYC |
| Signature | Missing | Hand‑written signature | None | Required for verification |
Overall Summary
FAIL – 4 issue(s) detected
- Name mismatch.
- Address missing city component.
- Missing proof‑of‑address document.
- Signature missing.
The compliance analyst must request a corrected name, a full address (including city), the missing proof‑of‑address document, and a signature before the submission can be approved.
Appendix A – FAQ
Q1: What if the PDF is blurry or the text cannot be read?
A: The process will attempt OCR. If no readable text is produced, the process stops and returns “Document unreadable – manual review required.” Request a clearer scan.
Q2: Are documents in languages other than English supported?
A: Only documents in languages supported by the OCR engine (e.g., English, Spanish, French) can be processed. If a language is not supported, the process flags “Unsupported language – manual review required”.
Q3: How is a “name mismatch” defined?
A: If the Full Name extracted from the ID does not exactly match the optional Customer Name (ignoring case and leading/trailing spaces), the field is marked Mismatch and noted in the report.
Q4: What date format must be used?
A: All dates must be in ISO format YYYY‑MM‑DD (e.g., 2023‑08‑15). Any other format (e.g., “12/31/2023”) will be flagged as a Mismatch.
Q5: What is the acceptable format for an ID number?
A: The ID must be alphanumeric, 6–12 characters, and contain no symbols (e.g., “A1234567”). Anything outside this pattern is a Mismatch.
Q6: What should I do if a required document (e.g., proof of address) is missing?
A: The missing document appears in the Mismatches table as “Missing”. The overall result will be “FAIL”. The analyst must request the missing document before approval.
Q7: How are multiple IDs in one PDF handled?
A: The process selects the ID with the highest priority (Passport > Driver’s Licence > National ID). Lower‑priority IDs are ignored unless required fields are missing, in which case a comment is added.
Q8: What does the “Status” column indicate?
A: Valid – field present and correct format.
Missing – field not found.
Mismatch – present but format or content is incorrect.
Mismatch – partial read – OCR captured part of the value; manual verification required.
Q9: How can the plain‑text report be used in other tools?
A: The report can be copied into a spreadsheet or database because the tables are plain‑text with a simple pipe‑delimited format.
Q10: Who should be notified if the process fails?
A: The compliance analyst receives the error message and must request a new or corrected document from the customer.
Appendix B – Glossary
- KYC (Know Your Customer) – Procedures used by financial institutions to verify a client’s identity and assess risk.
- Compliance Analyst – Staff member who checks KYC data against regulatory requirements.
- KYC Document – Any official identification (passport, driver’s licence, national ID) and supporting proof‑of‑address submitted by a customer.
- OCR (Optical Character Recognition) – Technology that converts scanned images of text into machine‑readable characters.
- Extraction – The act of locating and pulling a piece of data from a document.
- Validation – Checking that an extracted value meets a predefined rule or format.
- Mismatch – The extracted value does not meet a rule or does not match a reference value.
- Missing – A required data element was not found in the document.
- Proof of Address – A document that verifies a customer’s residential address (e.g., utility bill).
- Date Format – All dates must follow YYYY‑MM‑DD.
- ID Pattern – Alphanumeric string 6–12 characters, no spaces or symbols.
- Overall Summary – Short paragraph stating whether the KYC document passes or fails and how many issues were found.
- Priority Order for IDs – If multiple IDs exist, the order of precedence is: 1) Passport, 2) Driver’s Licence, 3) National ID.
- Signature – The handwritten signature on the ID; presence must be recorded but not validated for authenticity.
Appendix C – Reference Material
A. Required KYC Fields
| Field | Description | Expected Format | Example |
|---|
| Full Name | Legal name as printed on the ID | Text, 2–4 words, no numbers or symbols | “Alice Marie Smith” |
| Date of Birth | Birth date of the customer | YYYY‑MM‑DD | 1990‑04‑15 |
| ID Number | Official identification number | Alphanumeric, 6–12 characters, no symbols | “A1234567” |
| Date of Issue | Date the ID was issued | YYYY‑MM‑DD | 2020‑01‑01 |
| Date of Expiry | Expiry date of the ID | YYYY‑MM‑DD | 2030‑01‑01 |
| Address | Residential address (if on ID) | Text: street, city, country | “123 Main Street, Springfield, USA” |
| Document Type | Type of ID provided | “Passport”, “Driver’s Licence”, “National ID” | “Passport” |
| Proof of Address | Supporting document proving address (e.g., utility bill) | PDF or image; must show full address | N/A |
| Signature | Hand‑written signature on ID | Presence only; not validated for authenticity | N/A |
| Country of Issue | Country that issued the ID | Full country name | “United States” |
| Issuing Authority | Agency that issued the ID | Text | “Department of State” |
| Proof of Signature | Presence of a signature on the ID | Must be present | N/A |
B. Validation Rules
- Date Format – Must be exactly YYYY‑MM‑DD (e.g., 2023‑08‑15). No other separators or spaces.
- ID Pattern – Alphanumeric only, 6–12 characters, no spaces, hyphens, or special symbols.
- Name Matching – If a Customer Name is supplied, the Full Name extracted must match exactly, ignoring case and leading/trailing spaces.
- Proof of Address – Required if the KYC policy mandates address verification. Must contain a full street address, city, and country.
- Document Type Detection – Use visual cues (e.g., “Passport” label, government emblem) to identify the document. If not identified, set as “Unknown”.
- Signature – Only confirm presence; no authenticity verification is required.
- Multiple Pages – All pages are examined; any required field on any page is accepted.
- Language – Only Latin‑script languages (English, Spanish, French, etc.) are supported. Other scripts are flagged as “Unsupported language – manual review required”.
- Priority for Multiple IDs – If more than one ID is present, select the highest‑priority type: Passport > Driver’s Licence > National ID. Lower‑priority IDs are ignored unless the higher‑priority ID is missing required fields; in that case, note the missing data.
- Partial Reads – If OCR produces an incomplete value (e.g., “A12…”), mark the field as Mismatch – partial read and note the partial value in the Comment column.
C. Formatting Guide for the Report
- Header – “KYC & Onboarding Form Parsing Report” followed by optional Customer and Submission Date lines.
- Section Headings – Use “Extracted Fields”, “Mismatches”, and “Overall Summary”.
- Tables – Plain‑text tables, column headings on the first line, separated by vertical bars “|”. Align columns for readability; alignment does not affect parsing.
- Status Values – Use exactly: “Valid”, “Missing”, “Mismatch”, or “Mismatch – partial read”.
- Comment Column – Brief note; keep concise (a few words).
- Overall Summary – Must contain the word “PASS” if all fields are valid; otherwise “FAIL – X issue(s) detected”.
- No IDs – Do not generate any system or random IDs in the report; only use data extracted from the document.
D. Sample Valid Report
KYC & Onboarding Form Parsing Report
Customer: Alice Smith
Submission Date: 2023-06-01
Extracted Fields
| Field | Extracted Value | Status | Comment |
|----------------------|------------------------------|----------|--------|
| Address | 123 Main Street, Springfield, USA | Valid | |
| Date of Birth | 1990-04-15 | Valid | |
| Date of Expiry | 2030-01-01 | Valid | |
| Date of Issue | 2020-01-01 | Valid | |
| Full Name | Alice Smith | Valid | |
| ID Number | A1234567 | Valid | |
| Document Type | Passport | Valid | |
| Proof of Address | Uploaded (Utility Bill) | Valid | |
| Signature | Present | Valid | |
| Country of Issue | United States | Valid | |
| Issuing Authority | Department of State | Valid | |
Mismatches
[No rows – all fields passed validation]
Overall Summary
PASS – All required fields are present and correctly formatted.
E. Sample Report with Errors
KYC & Onboarding Form Parsing Report
Customer: Alice Smith
Submission Date: 2023-06-01
Extracted Fields
| Field | Extracted Value | Status | Comment |
|----------------------|------------------------------|----------|--------|
| Address | 123 Main St, USA | Mismatch | City missing |
| Date of Birth | 1990/04/15 | Mismatch | Wrong date format |
| Date of Expiry | 2030-01-01 | Valid | |
| Date of Issue | 2020-01-01 | Valid | |
| Full Name | Alice M. Smith | Mismatch | Does not match Customer Name |
| ID Number | A12345# | Mismatch | Invalid character |
| Document Type | Passport | Valid | |
| Proof of Address | None | Missing | Proof of address missing |
| Signature | Missing | Missing | |
| Country of Issue | United States | Valid | |
| Issuing Authority | Department of State | Valid | |
Mismatches
| Field | Issue | Expected Format | Actual Value | Comment |
|----------------|------------|-----------------------------|-------------|--------|
| Address | Missing city | Street, city, country | “123 St, USA” | City missing |
| Date of Birth | Wrong format | YYYY‑MM‑DD | 1990/04/15 | Use hyphens |
| Full Name | Mismatch with Customer Name | N/A | “Alice M. Smith” | Customer name: “Alice Smith” |
| ID Number | Invalid characters | Alphanumeric 6–12 characters | A12345# | Contains ‘#’ |
| Proof of Address | Missing | Proof of address document (e.g., utility bill) | None | Required for KYC |
| Signature | Missing | Hand‑written signature | None | Required for verification |
Overall Summary
FAIL – 6 issue(s) detected. The KYC submission requires additional documents and correction of data.
F. Edge‑Case Considerations
- Multiple‑page Documents – All pages are processed. If a required field appears on any page, it is accepted.
- Duplicate Information – If the same field appears on multiple pages with differing values, list the values in the Comment column and mark Mismatch.
- OCR Errors – If OCR fails for a single field, flag it as Missing and note “OCR failed – manual review”.
- Document with Watermark – Watermarks do not affect extraction; ignore them.
- File Size – The PDF should be under 10 MB to ensure timely processing. Larger files may cause time‑outs.
- Security – Do not store or transmit the PDF beyond the processing step. The report must contain only the required fields and no additional personal data.
The SOP is self‑contained, relies only on the inputs listed above, and does not require any external data sources. It can be executed manually or integrated into an automated workflow that follows the same steps.