NYCID

01 — The Challenge

12 agencies publishing data about the same properties in 12 different formats.

New York City is the most data-rich property market in the world. The Departments of Buildings, Finance, Housing Preservation and Development, Environmental Protection, City Planning, and the ACRIS system collectively publish more data about more properties than any other municipality on earth. The problem is that each agency designed its data independently, uses different property identifiers, formats addresses differently, and publishes at different cadences with different schemas.

Existing platforms — StreetEasy, PropertyShark, CoStar — provide listing and transaction data, but not deep operational intelligence: permit history, violation patterns, environmental flags, assessment trends, ownership chain, zoning constraints. Assembling that intelligence for a single property manually takes hours across a dozen different government portals. For a portfolio of properties, it's impractical at any useful scale.

12+ Sources

Data about the same properties published by DOB, DOF, HPD, DEP, ACRIS, DCP, FDNY, and DOT — with no shared identifier or format standard.

8.1M Properties

The full NYC tax lot universe — including condos, co-ops, and complex multi-unit structures that create edge cases in every joining strategy.

Zero Unification

No existing system joins permit, violation, assessment, ownership, zoning, and environmental data into a coherent per-property record.

02 — Research & Approach

BBL as the universal join key. Fuzzy matching as the fallback.

The first architectural decision — and the one that made everything else possible — was to use the Borough-Block-Lot (BBL) identifier as the universal join key across all data sources. The BBL is the one identifier that NYC property law requires to be consistent across agencies, because it's how the tax system tracks ownership. Every other identifier (address, building ID, parcel number) varies by agency. BBL doesn't.
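As an illustration of what a canonical BBL looks like in code, here is a minimal sketch that parses the common written forms into the standard 10-digit layout (1 borough digit, 5 block digits, 4 lot digits). The `BBL` type and `parse_bbl` helper are hypothetical names chosen for this example, not the platform's actual code.

```python
import re
from typing import NamedTuple

# Standard NYC borough codes used as the first BBL digit.
BOROUGHS = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn", 4: "Queens", 5: "Staten Island"}

# "1008350041" style: borough, zero-padded block, zero-padded lot.
_COMPACT = re.compile(r"([1-5])([0-9]{5})([0-9]{4})")
# "1-835-41", "1/835/41" or "1 835 41" style.
_DELIMITED = re.compile(r"([1-5])[-/ ]([0-9]{1,5})[-/ ]([0-9]{1,4})")


class BBL(NamedTuple):
    borough: int
    block: int
    lot: int

    def __str__(self) -> str:
        # Canonical form: 1 borough digit + 5 block digits + 4 lot digits.
        return f"{self.borough}{self.block:05d}{self.lot:04d}"


def parse_bbl(raw: str) -> BBL:
    """Parse a compact or delimited BBL string; raise ValueError otherwise."""
    text = raw.strip()
    match = _COMPACT.fullmatch(text) or _DELIMITED.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognizable BBL: {raw!r}")
    borough, block, lot = (int(part) for part in match.groups())
    return BBL(borough, block, lot)
```

Normalizing every incoming identifier to one string form before joining means a `1-835-41` in one source and a `1008350041` in another land on the same key.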

The problem is that not every dataset has BBL, and some datasets have it wrong. For those cases, a fuzzy address matching fallback chain resolves ambiguous references to probable BBL candidates, which are then verified through geocoding. This two-path identity resolution strategy handles the full range of data quality across the 12+ sources while maintaining a 97% overall match rate.
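The two-path strategy can be sketched roughly as follows. This is an illustration under stated assumptions: `fuzzy_candidates` and `geocode_confirms` are hypothetical callbacks standing in for the address matcher and the geocoder, and the 0.90 threshold is a placeholder rather than a tuned value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Resolution:
    bbl: Optional[str]
    confidence: float
    path: str  # "bbl", "fuzzy", or "review"


def resolve(
    record: dict,
    known_bbls: set,
    fuzzy_candidates: Callable[[str], list],      # address -> [(bbl, score), ...]
    geocode_confirms: Callable[[str, str], bool],  # (address, bbl) -> bool
    review_queue: list,
    threshold: float = 0.90,  # hypothetical cutoff
) -> Resolution:
    # Path 1: the record carries a BBL that exists in the registry.
    bbl = record.get("bbl")
    if bbl in known_bbls:
        return Resolution(bbl, 1.0, "bbl")

    # Path 2: fuzzy address match, then geocode verification of the best candidate.
    address = record.get("address", "")
    best = max(fuzzy_candidates(address), key=lambda c: c[1], default=None)
    if best is not None and best[1] >= threshold and geocode_confirms(address, best[0]):
        return Resolution(best[0], best[1], "fuzzy")

    # Neither path succeeded: keep the record for review instead of dropping it.
    confidence = best[1] if best is not None else 0.0
    review_queue.append({"record": record, "best_candidate": best, "confidence": confidence})
    return Resolution(None, confidence, "review")
```

Returning the path alongside the BBL keeps provenance visible downstream: a consumer can tell an exact identifier match from a verified fuzzy match.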

BBL-based canonical identity — every record in every source is resolved to a BBL before entering the registry. Records that can't be matched are held in a review queue with confidence scores, not silently dropped.
Fuzzy address matching fallback using Jaro-Winkler similarity with token normalization, followed by geocode verification to confirm candidate BBL assignments.
Multi-source ingestion pipeline that normalizes each agency's schema into the canonical data model, with separate handling for NYC Open Data APIs (Socrata), ACRIS bulk files, and the DOB NOW API.
Change detection system: when any source updates a record, the affected dossier is flagged for regeneration. Every change is logged with a full audit trail: what changed, from which source, and at what time.
Edge case handling for condos, co-ops, and tax lot splits (approximately 15% of the dataset) — each gets correct BBL assignment and appropriate data inheritance rules from parent lots
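The change detection item in the list above can be illustrated with a small in-memory sketch. It assumes records arrive as flat dictionaries keyed by a per-source record ID; `ChangeTracker` and `AuditEntry` are hypothetical names for this example, and a real deployment would persist the log rather than hold it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class AuditEntry:
    bbl: str
    source: str
    field: str
    old: Any
    new: Any
    changed_at: datetime


class ChangeTracker:
    def __init__(self) -> None:
        self.latest: dict = {}        # (source, record_id) -> last seen record
        self.audit_log: list = []     # every field-level change, in arrival order
        self.dirty_bbls: set = set()  # dossiers that need regeneration

    def ingest(self, source: str, record_id: str, bbl: str, record: dict,
               now: Optional[datetime] = None) -> list:
        """Diff the incoming record against the last version and log changes."""
        now = now or datetime.now(timezone.utc)
        previous = self.latest.get((source, record_id), {})
        entries = [
            AuditEntry(bbl, source, name, previous.get(name), record.get(name), now)
            for name in sorted(previous.keys() | record.keys())
            if previous.get(name) != record.get(name)
        ]
        if entries:
            self.audit_log.extend(entries)
            self.dirty_bbls.add(bbl)  # flag the affected dossier
            self.latest[(source, record_id)] = dict(record)
        return entries
```

Re-ingesting an unchanged record produces no audit entries and does not flag the dossier, so scheduled pulls that return identical data are cheap.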

03 — Technical Architecture

GCP-native property registry at NYC scale.

The architecture is designed for both scale and freshness — the 8.1M property registry needs to stay current as agencies publish updates, and individual dossier requests need to resolve in under 3 seconds. These two requirements drove the key architectural decisions: BigQuery for the registry (batch-optimized, analytical), Firestore for user state and saved dossiers (real-time, query-optimized), and Cloud Run for the dossier generation engine (stateless, scalable).

Ingestion

Scheduled Cloud Functions pull from NYC Open Data APIs (Socrata), ACRIS bulk file releases, and DOB NOW API on agency-specific cadences. Each source has its own ingestion module with source-specific parsing and error handling.
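A rough sketch of the per-source module pattern follows. The column names (`boro`, `block`, `lot`, `house_no`, `street`, `address`) and the source keys are assumptions made for illustration; the real agency schemas differ and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CanonicalRecord:
    source: str
    record_type: str
    bbl: Optional[str]  # None when the source row carries no usable BBL
    address: str
    payload: dict       # original row, kept for provenance


def parse_permit(row: dict) -> CanonicalRecord:
    # Hypothetical permit row with separate borough/block/lot columns.
    bbl = f"{int(row['boro'])}{int(row['block']):05d}{int(row['lot']):04d}"
    address = f"{row['house_no']} {row['street']}".strip().upper()
    return CanonicalRecord("DOB", "permit", bbl, address, row)


def parse_violation(row: dict) -> CanonicalRecord:
    # Hypothetical violation row that only has a free-text address.
    return CanonicalRecord("HPD", "violation", None, row["address"].strip().upper(), row)


# One parser per source; adding a source means adding one entry here.
PARSERS: dict = {
    "dob_permits": parse_permit,
    "hpd_violations": parse_violation,
}


def normalize(source_key: str, rows: list) -> tuple:
    """Return (canonical records, rejected rows with the reason)."""
    parser: Callable[[dict], CanonicalRecord] = PARSERS[source_key]
    good, errors = [], []
    for row in rows:
        try:
            good.append(parser(row))
        except (KeyError, ValueError) as exc:
            errors.append((row, repr(exc)))  # source-specific failure, not fatal
    return good, errors
```

Rows that fail to parse are returned alongside the reason rather than raising, so one malformed row does not abort a scheduled pull.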

Identity Resolution

BBL-based join with fuzzy address matching fallback. Jaro-Winkler similarity with borough and street type normalization resolves ambiguous references. Geocode verification confirms candidate BBL assignments before committing.
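To make the matching step concrete, here is a self-contained sketch of Jaro-Winkler similarity applied to normalized addresses. The token table is a small illustrative sample rather than the production normalization rules, and `address_similarity` is a hypothetical helper name.

```python
import re

# Illustrative sample of street-type, directional, and borough abbreviations.
TOKEN_MAP = {
    "ST": "STREET", "AVE": "AVENUE", "AV": "AVENUE", "BLVD": "BOULEVARD",
    "RD": "ROAD", "PL": "PLACE", "W": "WEST", "E": "EAST", "N": "NORTH",
    "S": "SOUTH", "BKLYN": "BROOKLYN", "BX": "BRONX", "MN": "MANHATTAN",
}


def normalize_address(raw: str) -> str:
    """Uppercase, drop punctuation, strip ordinal suffixes, expand abbreviations."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", raw.upper())
    tokens = []
    for tok in cleaned.split():
        tok = re.sub(r"^(\d+)(ST|ND|RD|TH)$", r"\1", tok)  # "34TH" -> "34"
        tokens.append(TOKEN_MAP.get(tok, tok))
    return " ".join(tokens)


def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, capped at 4 characters
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def address_similarity(a: str, b: str) -> float:
    return jaro_winkler(normalize_address(a), normalize_address(b))
```

Normalizing before scoring matters more than the metric itself: `350 W. 34th St` and `350 West 34 Street` become the same string and score 1.0, whereas comparing the raw strings would penalize the abbreviations.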

Property Registry

BigQuery holds the full 8.1M property dataset with temporal history — every version of every record from every source, timestamped. Enables point-in-time queries and change analysis across any time window.
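The point-in-time idea can be shown with a small in-memory stand-in for the registry. This sketch does not use BigQuery; `VersionedStore` is a hypothetical class that only illustrates how "latest version at or before a date" and "changes within a window" fall out of keeping every timestamped version.

```python
import bisect
from datetime import date
from typing import Any, Optional


class VersionedStore:
    """Keeps every version of a record, ordered by its effective date."""

    def __init__(self) -> None:
        self._versions: dict = {}  # key -> (sorted dates, records in same order)

    def append(self, key: str, effective: date, record: Any) -> None:
        dates, records = self._versions.setdefault(key, ([], []))
        idx = bisect.bisect_right(dates, effective)  # keep both lists sorted
        dates.insert(idx, effective)
        records.insert(idx, record)

    def as_of(self, key: str, when: date) -> Optional[Any]:
        """Latest version with an effective date on or before `when`."""
        dates, records = self._versions.get(key, ([], []))
        idx = bisect.bisect_right(dates, when)
        return records[idx - 1] if idx else None

    def changes_between(self, key: str, start: date, end: date) -> list:
        """All versions whose effective date falls in [start, end]."""
        dates, records = self._versions.get(key, ([], []))
        lo = bisect.bisect_left(dates, start)
        hi = bisect.bisect_right(dates, end)
        return list(zip(dates[lo:hi], records[lo:hi]))
```

For example, after appending assessments dated 2019, 2021, and 2023 for one BBL, `as_of(bbl, date(2022, 6, 1))` returns the 2021 version.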

Dossier Engine

Cloud Run service assembles permit history, violation records, assessment trends, ownership chain, zoning context, flood zone status, and environmental flags into structured JSON. PDF rendering for shareable reports. Sub-3-second generation via pre-computed summary tables in BigQuery.
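One way to picture why pre-computed summaries keep generation fast: assembling a dossier becomes a set of keyed lookups rather than aggregation at request time. The sketch below assumes each summary is already available as a mapping from BBL to a JSON-serializable summary; the section names and `build_dossier` are illustrative, not the engine's actual schema.

```python
import json

# Illustrative section names mirroring the dossier contents described above.
SECTIONS = (
    "permits", "violations", "assessments", "ownership",
    "zoning", "flood_zone", "environmental",
)


def build_dossier(bbl: str, summaries: dict) -> str:
    """Assemble one property's dossier from per-section summary lookups."""
    dossier = {"bbl": bbl, "sections": {}, "missing": []}
    for name in SECTIONS:
        section = summaries.get(name, {})
        if bbl in section:
            dossier["sections"][name] = section[bbl]
        else:
            # Record the gap explicitly instead of emitting an empty section.
            dossier["missing"].append(name)
    return json.dumps(dossier, sort_keys=True)
```

Listing missing sections explicitly lets a reader distinguish "no violations on record" from "violation data was not available for this property".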

User Layer

FastAPI backend on Cloud Run. Firestore for saved dossiers, user state, and search history. Property search by address, BBL, or owner name — each resolving through the identity layer before returning dossier data.

GCP, BigQuery, Cloud Run, Cloud Functions, Firestore, PostGIS, Python, FastAPI, Pydantic, NYC Open Data

04 — Measured Impact

8.1 million properties. One coherent view of each.

8.1M

NYC Properties Indexed

The complete NYC tax lot universe — all boroughs, all property types, including the edge cases (condo units, co-ops, split lots) that account for roughly 15% of the dataset and most of the implementation complexity.

12+

Municipal Sources Unified

DOB, DOF, HPD, DEP, ACRIS, DCP, FDNY, DOT — each publishing independently, each normalized to the canonical property data model, all joinable through the BBL identity layer.

<3s

Dossier Generation

Per-property dossiers assemble in under 3 seconds via pre-computed summary tables in BigQuery. Full temporal history is available for deeper queries without impacting standard dossier response time.

97%

Identity Match Rate

Across the full dataset, including sources that use inconsistent address formats and lack BBL identifiers. Unmatched records are held in a review queue with confidence scores — not silently dropped from the registry.

05 — Key Takeaways

What building with municipal data at scale actually teaches you.

"Identity Resolution Is the Hardest Problem"

Joining data across NYC agencies sounds straightforward until you encounter the full reality: every agency formats addresses differently, some use BBL and some don't, and edge cases (condo units, co-op shares, tax lots that have been split) are not rare. They make up approximately 15% of the full dataset. The identity resolution layer took longer to build than every other component combined, and it determines whether the rest of the system is trustworthy. A fast dossier generated from a wrong join is worse than no dossier at all.

"Temporal History Is the Moat"

A snapshot of current property data is useful. A temporal record showing permit filings over 10 years, violation patterns across ownership changes, assessment trends before and after major renovations, and the full chain of ownership transforms a data product into an intelligence product. We stored every version of every record from every source, not just the latest. That decision doubles the storage cost and significantly complicates ingestion, but it's what separates the platform from a scraper. Temporal history cannot be manufactured retroactively. It has to be accumulated.

"Municipal Data Is Messy But Rich"

NYC publishes more property data than any other city in the world. The challenge is not access: most of it is free through NYC Open Data. The challenge is normalization and trust. Three things made the difference between a prototype that engineers found interesting and a product that real estate professionals rely on for decisions involving real money: building confidence scores for each data source, displaying provenance on every field (which source, which publication date), and making it easy for users to trace any data point back to its original government record.
