LH-PVI — Insurance Geospatial Intelligence
ML Engineer & Data Architect · Insurance · Document AI · Geospatial · 2024–2025


LH-PVI is a data intelligence platform for a private insurance company that transforms unstructured policy documents, claims files, and inspection reports into structured geospatial intelligence — enabling 70% faster processing and surfacing risk patterns that were invisible in the original documents.

The insurance industry sits on decades of unstructured data — handwritten inspection notes, scanned PDFs, inconsistent address formats, free-text claim descriptions — that contains enormous analytical value locked behind format chaos. LH-PVI was built to unlock that value without requiring the insurer to change their existing workflows.

70% Faster Claims Processing
94% Extraction Accuracy
Zero Workflow Disruption
Full Portfolio Geospatial Coverage
01 — The Challenge

Decades of data that no one could query, analyze, or trust.

The insurer had accumulated a substantial historical dataset — policy documents, claims files, inspection reports, loss run summaries — across decades of operations. The problem was not a lack of data. The problem was that the data existed in formats that made it analytically useless: scanned PDFs with handwritten annotations, carbon copies of inspection reports, addresses formatted differently across every regional office, and free-text claim descriptions that required reading comprehension to interpret.

Manual processing was the only option, and manual processing was expensive, slow, and inconsistent. Complex claims required an analyst to cross-reference multiple documents against property records, maps, and regulatory requirements before a coverage decision could be made. There was also no spatial awareness in the portfolio at all — policies and claims existed as text records with no geospatial context, which meant risk clustering, flood zone exposure, and proximity-based underwriting factors simply weren't calculable.

6+ Hours
Average analyst time per complex claim, spent cross-referencing documents against property records and regulatory requirements.
40%
Of analyst time consumed by data re-entry — normalizing inconsistent formats across systems before any actual analysis could begin.
Zero
Geospatial risk visibility across the portfolio. Risk clustering, flood zone exposure, and proximity factors were analytically invisible.
02 — Research & Approach

Meet the data where it is. Add coordinates. Trust the output.

The core architectural principle was to design the pipeline around the insurer's actual document chaos — not around clean, well-formatted inputs that don't exist in production. Production ML means processing what exists. A system that requires clean inputs will never process the legacy archive. A system that handles the full range of real document messiness will.

The approach had three sequential layers: extract the meaning from the document (regardless of format), resolve the location (regardless of how the address was written), and enrich with geospatial context (flood zone, zoning, proximity factors) automatically. The insurer's analysts receive structured output that maps directly into their existing systems. No workflow change required.

  • Multi-stage document processing pipeline: ingest → OCR and extraction → entity resolution → geocoding → spatial enrichment → structured output
  • Gemini API for intelligent document extraction that handles inconsistent formats, handwritten annotations, multi-page policy documents, and scanned carbon copies
  • Fuzzy address matching and geocoding to resolve ambiguous location references — partial addresses, intersections, colloquial location descriptions — to precise coordinates
  • Geospatial enrichment layer automatically attaches flood zone, property boundary, zoning, and proximity data to every resolved record
  • Output schemas map directly to the insurer's existing systems — zero workflow change required for analysts receiving structured results
  • Batch and real-time modes: historical backfill for the existing archive, real-time processing for new document submissions
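The six stages above can be sketched as a chain of functions over a record dict. This is an illustrative sketch only: the stage bodies are stubs standing in for the real services (Gemini extraction, Maps geocoding, PostGIS enrichment), and the field names and values are assumptions, not the production schema.

```python
# Sketch of the six-stage pipeline: each stage takes the record produced
# by the previous one and appends its name to a provenance trail.

def ingest(doc: bytes) -> dict:
    """Wrap a raw document upload in a record with a stage trail."""
    return {"source_bytes": len(doc), "stages": ["ingest"]}

def ocr_extract(rec: dict) -> dict:
    """Stub for OCR + Gemini extraction of structured fields."""
    rec["fields"] = {"policy_number": "P-1234", "address": "123 Main St"}
    rec["stages"].append("extract")
    return rec

def resolve_entities(rec: dict) -> dict:
    """Stub for deduplicating and linking claimants, properties, policies."""
    rec["stages"].append("resolve")
    return rec

def geocode(rec: dict) -> dict:
    """Stub for fuzzy address matching + geocoding to coordinates."""
    rec["coords"] = (40.7128, -74.0060)  # placeholder coordinates
    rec["stages"].append("geocode")
    return rec

def enrich(rec: dict) -> dict:
    """Stub for attaching flood zone, zoning, and proximity data."""
    rec["flood_zone"] = "X"  # placeholder classification
    rec["stages"].append("enrich")
    return rec

def emit(rec: dict) -> dict:
    """Stub for writing the structured record to the output store."""
    rec["stages"].append("output")
    return rec

def run_pipeline(doc: bytes) -> dict:
    rec = ingest(doc)
    for stage in (ocr_extract, resolve_entities, geocode, enrich, emit):
        rec = stage(rec)
    return rec
```

Because each stage only consumes the previous stage's output, stages can be scaled, retried, and observed independently, which is the property the architecture section below relies on.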
03 — Technical Architecture

A pipeline from unstructured chaos to queryable, spatial records.

The pipeline is GCP-native throughout, with each stage designed to be independently scalable and observable. The critical design constraint was that no decrypted document ever touches persistent storage outside Cloud Storage — processing happens ephemerally in Cloud Functions, and only the structured output is retained long-term.

Ingestion
PDF and image upload to Cloud Storage triggers Cloud Function processing. Batch mode for historical archive, event-driven for new submissions. Supports scanned PDFs, handwritten documents, and multi-page files.
Extraction
Gemini API for document understanding — extracting policy numbers, claimant details, property addresses, damage descriptions, and coverage fields from unstructured text regardless of format. Pydantic schemas validate all extracted fields with confidence scores.
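The per-field validation contract can be sketched as follows. The production system uses Pydantic; stdlib dataclasses keep this sketch dependency-free, and the field names and the 0.85 review threshold are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedField:
    """One extracted value plus the metadata that makes it trustworthy."""
    name: str          # e.g. "policy_number" (illustrative)
    value: str
    confidence: float  # model-reported confidence in [0.0, 1.0]
    source_doc: str    # provenance: which document it came from
    source_page: int   # provenance: where in that document

    def __post_init__(self):
        # Reject out-of-range confidence at construction time, so a bad
        # score can never silently flow into the analyst-facing output.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")

REVIEW_THRESHOLD = 0.85  # assumed cutoff: below this, route to an analyst

def needs_review(field: ExtractedField) -> bool:
    return field.confidence < REVIEW_THRESHOLD
```

The point of validating at construction time is that downstream stages never have to re-check: any `ExtractedField` that exists is, by definition, well-formed and attributable.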
Geocoding
Google Maps Geocoding API for address resolution, with a fuzzy matching pre-processor (Jaro-Winkler + token normalization) that handles the inconsistent address formats across regional offices and decades of documents.
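A minimal sketch of the fuzzy pre-processor: token normalization expands common street-type abbreviations before comparison, and Jaro-Winkler scores the normalized strings. The abbreviation table here is a small illustrative subset, not the production mapping.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_hit = [False] * len(s1)
    s2_hit = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not s2_hit[j] and s2[j] == c:
                s1_hit[i] = s2_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):
        if s1_hit[i]:
            while not s2_hit[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    m, t = matches, transpositions / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost Jaro similarity for shared prefixes (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Illustrative subset of the normalization table.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road",
                 "blvd": "boulevard", "n": "north", "s": "south"}

def normalize(address: str) -> str:
    """Lowercase, strip punctuation, expand street-type abbreviations."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def address_similarity(a: str, b: str) -> float:
    return jaro_winkler(normalize(a), normalize(b))
```

Normalizing before scoring is what makes "123 Main St." and "123 main Street" compare as identical rather than merely similar; the similarity score then only has to absorb genuine typos and truncations.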
Spatial Enrichment
PostGIS spatial operations attach flood zone classification, parcel boundary, zoning designation, and proximity metrics to every geocoded record. Custom functions for flood zone analysis and proximity scoring to high-risk features.
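The proximity-scoring idea can be illustrated in pure Python. In production this runs as PostGIS `ST_Distance` / `ST_DWithin` over indexed geometries; the haversine stand-in below, and the 10 km cutoff, are illustrative assumptions.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points, in km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def proximity_score(record_coords, hazard_coords, cutoff_km: float = 10.0) -> float:
    """Score in [0, 1]: 1.0 at the nearest high-risk feature,
    decaying linearly to 0.0 at or beyond the cutoff distance."""
    nearest = min(haversine_km(*record_coords, *h) for h in hazard_coords)
    return max(0.0, 1.0 - nearest / cutoff_km)
```

The same shape of computation, distance to the nearest member of a hazard set, is what the PostGIS functions perform, but against parcel and flood-zone polygons rather than points.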
Output & Analytics
Structured records written to BigQuery with full provenance chain. Analyst-facing output maps to existing system schemas. Portfolio-level analytics available through BigQuery for risk clustering and exposure analysis.
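Portfolio-level concentration analysis reduces to an aggregation over geocoded records. In production this is a BigQuery query; the grid-binning sketch below (cell size and values are illustrative) shows the shape of it.

```python
from collections import Counter

def grid_cell(lat: float, lon: float, size: float = 0.1):
    """Snap coordinates to the ~0.1-degree grid cell containing them."""
    return (round(lat // size * size, 4), round(lon // size * size, 4))

def concentration(policies):
    """policies: iterable of (lat, lon, insured_value).
    Returns total insured value per grid cell, for exposure hot-spotting."""
    totals = Counter()
    for lat, lon, value in policies:
        totals[grid_cell(lat, lon)] += value
    return totals
```

Once every record carries coordinates, questions like "where is insured value geographically concentrated?" become one aggregation rather than a research project, which is the portfolio-level shift described in the impact section below.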
GCP · Vertex AI · Gemini API · Document AI · PostGIS · BigQuery · Cloud Functions · Cloud Storage · Python · Pydantic
04 — Measured Impact

Faster decisions, better risk visibility, no disruption.

70%
Faster Claims Processing
Complex claim handling time dropped from 6+ hours to under 2. Analysts receive pre-structured, geocoded records with all supporting documents cross-referenced before they open the file.
94%
Extraction Accuracy
Across the full range of document types — including handwritten inspection notes and scanned carbon copies. Every field carries a confidence score and source attribution for analyst review.
Full
Portfolio Geospatial Coverage
Risk patterns across the entire policy portfolio — flood zone clustering, proximity exposure, geographic concentration — are now queryable analytics, not invisible text fields.
Zero
Workflow Disruption
Analysts receive structured output that maps directly to their existing systems. No retraining, no new interface to learn. The pipeline adds intelligence without changing how claims teams work.
05 — Key Takeaways

What production document AI actually requires.

"Meet the Data Where It Is"
The biggest architectural win was designing the pipeline to handle the insurer's actual document chaos — inconsistent formats, handwritten notes, scanned carbon copies, partial addresses, regional formatting conventions — rather than requiring clean inputs. Production ML means processing what exists, not what you wish existed. A system that requires clean inputs never processes the archive that has the most analytical value.
"Geospatial Context Changes Everything"
Adding coordinates to insurance records transformed the analytical capability from document-level to portfolio-level. Risk clustering, flood exposure analysis, and proximity scoring were impossible before geocoding — and straightforward after. The most valuable insight the platform delivered wasn't about any individual claim. It was about the geographic distribution of risk across the entire portfolio, which no one had been able to see before.
"Validation Is the Product"
Pydantic output validation with structured error reporting and confidence scores meant the insurer could trust automated outputs from day one. Every extracted field has a confidence score and provenance chain pointing back to the specific document and location it was extracted from. The trust architecture — making it easy to verify any output against its source — was more important than the extraction accuracy number, because it determined whether analysts would actually use the output or revert to manual processing.
