NYCID — NYC Private Property Dossier
Sole Architect & Builder · Real Estate AI · Property Intelligence · NYC · 2025

NYCID is a private property intelligence platform that generates comprehensive dossiers for New York City residential properties — fusing municipal open data, building records, environmental assessments, and market signals into a single verified record per property. Built for real estate operators, investors, and underwriters who need depth beyond what any listing platform provides.

New York City publishes extraordinary amounts of property data across disconnected agencies and systems (DOB permits, DOF assessments, HPD violations, DEP environmental records, ACRIS ownership transfers, DCP zoning), but no system unifies them into a coherent property-level view. NYCID does.

8.1M NYC Properties Indexed
12+ Municipal Sources Unified
<3s Dossier Generation Time
97% Identity Match Rate
01 — The Challenge

12+ sources publishing data about the same properties, each in its own format.

New York City is one of the most data-rich property markets in the world. The Departments of Buildings, Finance, Housing Preservation and Development, Environmental Protection, and City Planning, together with the ACRIS system, publish an exceptional volume of data about the city's properties. The problem is that each agency designed its data independently, uses different property identifiers, formats addresses differently, and publishes at its own cadence with its own schema.

Existing platforms — StreetEasy, PropertyShark, CoStar — provide listing and transaction data, but not deep operational intelligence: permit history, violation patterns, environmental flags, assessment trends, ownership chain, zoning constraints. Assembling that intelligence for a single property manually takes hours across a dozen different government portals. For a portfolio of properties, it's impractical at any useful scale.

12+ Sources
Data about the same properties published by DOB, DOF, HPD, DEP, ACRIS, DCP, FDNY, and DOT, with no consistently populated shared identifier and no format standard.
8.1M Properties
The full NYC tax lot universe — including condos, co-ops, and complex multi-unit structures that create edge cases in every joining strategy.
Zero Unification
No existing system joins permit, violation, assessment, ownership, zoning, and environmental data into a coherent per-property record.
02 — Research & Approach

BBL as the universal join key. Fuzzy matching as the fallback.

The first architectural decision, and the one that made everything else possible, was to use the Borough-Block-Lot (BBL) identifier as the universal join key across all data sources. The BBL is how the city's tax system tracks each lot, which makes it the most stable identifier shared across agencies. Other identifiers (address, building ID, parcel number) vary by agency. The BBL doesn't.
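To make the join key concrete: a BBL is conventionally written as 10 digits (a 1-digit borough code, a 5-digit block, and a 4-digit lot). The sketch below shows one way differently spelled BBLs could be normalized to that form before joining. The `BBL` type, the `parse_bbl` function, and the set of accepted input spellings are illustrative assumptions, not NYCID's actual code.

```python
import re
from typing import NamedTuple

# Borough codes used in the first digit of a BBL.
BOROUGHS = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn", 4: "Queens", 5: "Staten Island"}


class BBL(NamedTuple):
    borough: int
    block: int
    lot: int

    def __str__(self) -> str:
        # Canonical 10-digit form: B + 5-digit block + 4-digit lot.
        return f"{self.borough}{self.block:05d}{self.lot:04d}"


# Assumed input spellings: "1-835-41", "1/00835/0041", "1 835 41", or "1008350041".
_SEPARATED = re.compile(r"([1-5])[-/\s]+([0-9]{1,5})[-/\s]+([0-9]{1,4})")
_COMPACT = re.compile(r"([1-5])([0-9]{5})([0-9]{4})")


def parse_bbl(raw: str) -> BBL:
    """Parse a raw BBL string into its three components, or raise ValueError."""
    text = raw.strip()
    match = _SEPARATED.fullmatch(text) or _COMPACT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognizable BBL: {raw!r}")
    borough, block, lot = (int(part) for part in match.groups())
    if block == 0 or lot == 0:
        # Assumption for this sketch: block 0 and lot 0 are treated as invalid.
        raise ValueError(f"block and lot must be non-zero: {raw!r}")
    return BBL(borough, block, lot)
```

Usage: `str(parse_bbl("1-835-41"))` and `str(parse_bbl("1008350041"))` both yield `"1008350041"`, so records from sources that spell the key differently land on the same join value.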

The problem is that not every dataset has BBL, and some datasets have it wrong. For those cases, a fuzzy address matching fallback chain resolves ambiguous references to probable BBL candidates, which are then verified through geocoding. This two-path identity resolution strategy handles the full range of data quality across the 12+ sources while maintaining a 97% overall match rate.
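The two-path strategy can be sketched as a small router: trust a BBL that is already known to the registry, otherwise fall back to fuzzy matching plus geocode confirmation, and send anything below a confidence threshold to review. Everything named here (`IdentityResolver`, `Resolution`, the injected `fuzzy_match` and `geocode_confirms` callables, and the 0.92 threshold) is a hypothetical stand-in for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Resolution:
    bbl: Optional[str]
    confidence: float
    path: str  # "bbl", "fuzzy", or "review"


@dataclass
class IdentityResolver:
    known_bbls: set
    # Hypothetical hook: address -> (candidate BBL or None, similarity score).
    fuzzy_match: Callable[[str], tuple]
    # Hypothetical hook: (address, candidate BBL) -> True if geocoding agrees.
    geocode_confirms: Callable[[str, str], bool]
    min_confidence: float = 0.92  # assumed threshold, not from the source
    review_queue: list = field(default_factory=list)

    def resolve(self, record: dict) -> Resolution:
        # Path 1: the record carries a BBL that exists in the registry.
        bbl = record.get("bbl")
        if bbl and bbl in self.known_bbls:
            return Resolution(bbl, 1.0, "bbl")

        # Path 2: fuzzy address match, then geocode verification.
        address = record.get("address", "")
        candidate, score = self.fuzzy_match(address)
        if (
            candidate
            and score >= self.min_confidence
            and self.geocode_confirms(address, candidate)
        ):
            return Resolution(candidate, score, "fuzzy")

        # Unresolved records are queued with their score, never dropped.
        self.review_queue.append(
            {"record": record, "candidate": candidate, "confidence": score}
        )
        return Resolution(None, score, "review")
```

Keeping the matcher and geocoder as injected callables means the routing logic can be tested without any external service.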

  • BBL-based canonical identity: every record in every source is resolved to a BBL before entering the registry. Records that can't be matched are held in a review queue with confidence scores, not silently dropped.
  • Fuzzy address matching fallback: Jaro-Winkler similarity over normalized tokens, followed by geocode verification to confirm candidate BBL assignments.
  • Multi-source ingestion pipeline: schemas from each agency are normalized into the canonical data model, with NYC Open Data APIs (Socrata), ACRIS bulk files, and the DOB NOW API each handled independently.
  • Change detection: when any source updates a record, the affected dossier is flagged for regeneration. Every change is logged with a full audit trail (what changed, from which source, and at what time).
  • Edge case handling for condos, co-ops, and tax lot splits (approximately 15% of the dataset): each gets a correct BBL assignment and appropriate data inheritance rules from its parent lot.
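For readers unfamiliar with the similarity measure named in the list above, here is a minimal pure-Python Jaro-Winkler, written from the standard definition (prefix scale 0.1, prefix capped at 4 characters). The `best_match` helper and its address-to-BBL dictionary are hypothetical; a production system would more likely use a tested library and an indexed candidate set.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def best_match(query: str, candidates: dict) -> tuple:
    """Return (bbl, score) for the closest address in a {address: bbl} map."""
    if not candidates:
        return None, 0.0
    address = max(candidates, key=lambda c: jaro_winkler(query, c))
    return candidates[address], jaro_winkler(query, address)
```

The prefix boost suits addresses reasonably well, since house numbers lead the string and a mismatch there should weigh heavily against a candidate.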
03 — Technical Architecture

GCP-native property registry at NYC scale.

The architecture is designed for both scale and freshness — the 8.1M property registry needs to stay current as agencies publish updates, and individual dossier requests need to resolve in under 3 seconds. These two requirements drove the key architectural decisions: BigQuery for the registry (batch-optimized, analytical), Firestore for user state and saved dossiers (real-time, query-optimized), and Cloud Run for the dossier generation engine (stateless, scalable).

Ingestion
Scheduled Cloud Functions pull from NYC Open Data APIs (Socrata), ACRIS bulk file releases, and DOB NOW API on agency-specific cadences. Each source has its own ingestion module with source-specific parsing and error handling.
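As a rough illustration of the Socrata side of ingestion, the sketch below builds paged request URLs using the SODA query parameters `$limit`, `$offset`, `$order`, and `$where`. It constructs URLs only and makes no network calls. The dataset ID `abcd-1234` and the function names are placeholders invented for this example; real dataset IDs come from the NYC Open Data catalog.

```python
from typing import Iterator, Optional
from urllib.parse import urlencode

NYC_OPEN_DATA = "data.cityofnewyork.us"


def soda_page_url(
    dataset_id: str,
    *,
    limit: int = 1000,
    offset: int = 0,
    where: Optional[str] = None,
    domain: str = NYC_OPEN_DATA,
) -> str:
    """Build one page URL for a Socrata dataset's JSON endpoint."""
    # Ordering by the system field :id keeps paging stable between requests.
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    if where:
        params["$where"] = where
    # Keep "$" and ":" readable; everything else is percent-encoded.
    query = urlencode(params, safe="$:")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


def iter_page_urls(dataset_id: str, total_rows: int, limit: int = 1000) -> Iterator[str]:
    """Yield one URL per page needed to cover total_rows rows."""
    for offset in range(0, total_rows, limit):
        yield soda_page_url(dataset_id, limit=limit, offset=offset)
```

Each source-specific module could wrap a builder like this with its own cadence, parsing, and error handling, which keeps one agency's schema quirks from leaking into another's.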
Identity Resolution
BBL-based join with fuzzy address matching fallback. Jaro-Winkler similarity with borough and street type normalization resolves ambiguous references. Geocode verification confirms candidate BBL assignments before committing.
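Borough and street type normalization might look something like the following. The abbreviation tables are deliberately tiny and are assumptions made for this sketch; a real table would be much larger and would need to handle ambiguous tokens.

```python
import re
from typing import Optional

# Illustrative, incomplete lookup tables (assumptions for this sketch).
STREET_TYPES = {
    "STREET": "ST", "STR": "ST", "AVENUE": "AVE", "AV": "AVE",
    "BOULEVARD": "BLVD", "PLACE": "PL", "ROAD": "RD", "DRIVE": "DR",
}
DIRECTIONS = {"EAST": "E", "WEST": "W", "NORTH": "N", "SOUTH": "S"}
BOROUGH_CODES = {
    "MANHATTAN": 1, "NEW YORK": 1, "MN": 1,
    "BRONX": 2, "BX": 2,
    "BROOKLYN": 3, "KINGS": 3, "BK": 3,
    "QUEENS": 4, "QN": 4,
    "STATEN ISLAND": 5, "RICHMOND": 5, "SI": 5,
}

_ORDINAL = re.compile(r"^([0-9]+)(ST|ND|RD|TH)$")


def normalize_address(raw: str) -> str:
    """Uppercase, strip punctuation, and canonicalize common tokens."""
    # Keep hyphens so Queens-style house numbers such as 37-10 survive.
    text = re.sub(r"[^\w\s-]", " ", raw.upper())
    tokens = []
    for token in text.split():
        token = _ORDINAL.sub(r"\1", token)  # 42ND -> 42
        token = STREET_TYPES.get(token, DIRECTIONS.get(token, token))
        tokens.append(token)
    return " ".join(tokens)


def borough_code(name: str) -> Optional[int]:
    """Map a borough or county name to its BBL borough digit, if known."""
    return BOROUGH_CODES.get(" ".join(name.upper().split()))
```

For example, `normalize_address("123 West 42nd Street")` gives `"123 W 42 ST"`, so two agencies' spellings of the same address score as identical before any fuzzy comparison runs.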
Property Registry
BigQuery holds the full 8.1M property dataset with temporal history — every version of every record from every source, timestamped. Enables point-in-time queries and change analysis across any time window.
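The versioning idea can be shown in miniature with an in-memory store: append a new version only when a record's content hash changes, flag the affected BBL for dossier regeneration, and answer point-in-time lookups by timestamp. This is a toy model of the behavior, not the BigQuery schema; `VersionedRegistry` and its methods are hypothetical names, and it assumes updates arrive in chronological order per source.

```python
import hashlib
import json
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime
from typing import Optional


class VersionedRegistry:
    def __init__(self) -> None:
        # (bbl, source) -> list of (seen_at, digest, payload), oldest first.
        self._versions = defaultdict(list)
        # BBLs whose dossiers need regeneration after a detected change.
        self.stale_dossiers = set()

    def upsert(self, bbl: str, source: str, payload: dict, seen_at: datetime) -> bool:
        """Store payload as a new version if it differs from the latest one."""
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        history = self._versions[(bbl, source)]
        if history and history[-1][1] == digest:
            return False  # unchanged: no new version, no regeneration
        history.append((seen_at, digest, payload))
        self.stale_dossiers.add(bbl)
        return True

    def as_of(self, bbl: str, source: str, when: datetime) -> Optional[dict]:
        """Return the version that was current at `when`, or None."""
        history = self._versions.get((bbl, source), [])
        index = bisect_right([seen_at for seen_at, _, _ in history], when)
        return history[index - 1][2] if index else None
```

Hashing the sorted JSON form makes change detection insensitive to key order, so a source that reshuffles its fields does not trigger spurious regenerations.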
Dossier Engine
Cloud Run service assembles permit history, violation records, assessment trends, ownership chain, zoning context, flood zone status, and environmental flags into structured JSON. PDF rendering for shareable reports. Sub-3-second generation via pre-computed summary tables in BigQuery.
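A minimal sketch of the assembly step, assuming the pre-computed summaries arrive as a dictionary keyed by section name. The section keys mirror the list in the description above, but the key spellings, the `assemble_dossier` function, and the `missing_sections` field are assumptions for illustration.

```python
import json

# Section names taken from the dossier description; key spellings are assumed.
SECTIONS = (
    "permit_history",
    "violation_records",
    "assessment_trends",
    "ownership_chain",
    "zoning_context",
    "flood_zone_status",
    "environmental_flags",
)


def assemble_dossier(bbl: str, summaries: dict) -> str:
    """Combine pre-computed section summaries into one JSON document."""
    sections = {name: summaries.get(name) for name in SECTIONS}
    # Report gaps explicitly instead of silently omitting a section.
    missing = [name for name, value in sections.items() if value is None]
    dossier = {"bbl": bbl, "sections": sections, "missing_sections": missing}
    return json.dumps(dossier, sort_keys=True)
```

Because each section is read from an already-summarized row rather than computed on request, assembly stays a cheap dictionary merge, which is consistent with the sub-3-second target.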
User Layer
FastAPI backend on Cloud Run. Firestore for saved dossiers, user state, and search history. Property search by address, BBL, or owner name — each resolving through the identity layer before returning dossier data.
GCP · BigQuery · Cloud Run · Cloud Functions · Firestore · PostGIS · Python · FastAPI · Pydantic · NYC Open Data
04 — Measured Impact

8.1 million properties. One coherent view of each.

8.1M
NYC Properties Indexed
The complete NYC tax lot universe — all boroughs, all property types, including the edge cases (condo units, co-ops, split lots) that account for roughly 15% of the dataset and most of the implementation complexity.
12+
Municipal Sources Unified
DOB, DOF, HPD, DEP, ACRIS, DCP, FDNY, DOT — each publishing independently, each normalized to the canonical property data model, all joinable through the BBL identity layer.
<3s
Dossier Generation
Per-property dossiers assemble in under 3 seconds via pre-computed summary tables in BigQuery. Full temporal history is available for deeper queries without impacting standard dossier response time.
97%
Identity Match Rate
Across the full dataset, including sources that use inconsistent address formats and lack BBL identifiers. Unmatched records are held in a review queue with confidence scores — not silently dropped from the registry.
05 — Key Takeaways

What building with municipal data at scale actually teaches you.

"Identity Resolution Is the Hardest Problem"
Joining data across NYC agencies sounds straightforward until you encounter the full reality: every agency formats addresses differently, some use BBL and some don't, and edge cases (condo units, co-op shares, tax lots that have been split) are not rare: they make up approximately 15% of the full dataset. The identity resolution layer took longer to build than every other component combined, and it's what determines whether the rest of the system is trustworthy. A fast dossier generated from a wrong join is worse than no dossier at all.
"Temporal History Is the Moat"
A snapshot of current property data is useful. A temporal record showing permit filings over 10 years, violation patterns across ownership changes, assessment trends before and after major renovations, and full chain of ownership transforms a data product into an intelligence product. We stored every version of every record from every source — not just the latest. That decision doubles the storage cost and significantly complicates ingestion, but it's what separates the platform from a scraper. Retroactive temporal history cannot be manufactured. It has to be accumulated.
"Municipal Data Is Messy But Rich"
NYC publishes an enormous amount of property data, and access is not the challenge: most of it is free through NYC Open Data. The challenge is normalization and trust. Three things made the difference between a prototype that engineers found interesting and a product that real estate professionals rely on for decisions involving real money: building confidence scores for each data source, displaying provenance on every field (which source, which publication date), and making it easy for users to trace any data point back to its original government record.
