Enterprise Reference · 2026
A practitioner's reference for building production AI systems on Claude and GCP — from Claude Code in CI/CD pipelines to cost engineering, multi-agent architecture, and security without JSON keys.
New to this stack? Start with the Google AI Studio Playbook for prototyping methodology, project dashboards, and the block-by-block build sequence. This guide picks up where that one ends.
The enterprise AI stack described here is not about vendor maximalism. It is about structural independence: a small, principled set of tools that compound over time, produce verifiable outputs, and do not require continuous reinvestment to maintain their position. Claude as the reasoning layer. GCP as the infrastructure layer. Markdown-based documentation as the knowledge layer. Git as the audit trail.
The distinguishing characteristic of this stack is that the knowledge it produces is portable. Your DASHBOARD.md, your EVOLUTION.md, your system instructions — these are plain text files that work in AI Studio, in Claude Code, in a local terminal, or in any future tool that can read a file. Platform lock-in is an architectural decision. This stack makes the opposite decision deliberately.
The guide that follows assumes you have shipped at least one real system, understand GCP IAM at the role level, and are comfortable with Python and the command line. It is a reference for practitioners, not a tutorial for beginners.
| Layer | Tool | Why This Choice |
|---|---|---|
| Reasoning | Claude (Anthropic) | Best-in-class on complex reasoning, instruction following, and code generation; verifiable safety constraints |
| Infrastructure | GCP — Cloud Run, Cloud SQL, BigQuery | Serverless-first, per-request billing, native IAM integration, strong compliance story |
| CI/CD Agent | Claude Code (headless) | Full file system access, structured JSON output, composable with standard shell tooling |
| Knowledge | Markdown + Git | Human-readable, diffs cleanly, works in every tool, survives every platform change |
| Secrets | GCP Secret Manager + WIF | No JSON keys, short-lived tokens, auditable access, zero long-lived credentials |
| Identity | Firebase Auth + GCP IAM | JWTs issued by Firebase, verified server-side via Google public keys; maps to GCP IAM conditions for resource-level access control |
The Anthropic Python SDK is the production interface to Claude. The patterns here are not tutorial content — they are the conventions that prevent the most common production failures: hardcoded model names, missing retry logic, and unstructured API responses that break downstream parsing.
import anthropic
import os
import logging

log = logging.getLogger(__name__)

# Never hardcode model names — pin here and import everywhere
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

def get_client() -> anthropic.Anthropic:
    """Return a configured Anthropic client. Reads ANTHROPIC_API_KEY from env."""
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise EnvironmentError("ANTHROPIC_API_KEY not set")
    return anthropic.Anthropic(api_key=api_key)

def complete(system: str, user: str, model: str = MODEL_SONNET, max_tokens: int = 2048) -> str:
    """Single-turn completion with structured logging."""
    client = get_client()
    log.info("claude_request", extra={"model": model, "system_len": len(system)})
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    log.info(
        "claude_response",
        extra={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
    )
    return response.content[0].text
Claude supports prompt caching for repeated large context (system instructions, DASHBOARD.md, schema docs). Cached content is charged at approximately 10% of the normal input price after cache creation. Minimum 1,024 tokens to be eligible. The TTL is 5 minutes by default — for stable system prompts, cache consistently.
response = client.messages.create(
model=MODEL_SONNET,
max_tokens=2048,
system=[
{
"type": "text",
"text": large_system_prompt, # DASHBOARD.md + schema docs
"cache_control": {"type": "ephemeral"} # marks this block for caching
}
],
messages=[{"role": "user", "content": user_prompt}],
)
# response.usage.cache_creation_input_tokens — tokens written to cache
# response.usage.cache_read_input_tokens — tokens read from cache (charged at ~10%)
Streaming is the correct pattern for any response the user waits for — it reduces perceived latency dramatically. It is also the pattern most likely to fail silently in production: a stream that drops mid-response, a client that disconnects, or a timeout that triggers before the stream completes. These failure modes require different handling than a standard API error.
import anthropic
from typing import Generator
def stream_completion(
client: anthropic.Anthropic,
system: str,
user: str,
model: str = "claude-sonnet-4-6",
) -> Generator[str, None, None]:
"""
Stream a completion. Yields text chunks as they arrive.
Handles mid-stream failures gracefully — logs partial output,
raises a typed exception the caller can handle.
"""
partial_output = []
try:
with client.messages.stream(
model=model,
max_tokens=2048,
system=system,
messages=[{"role": "user", "content": user}],
) as stream:
for text in stream.text_stream:
partial_output.append(text)
yield text
except anthropic.APIStatusError as e:
# Log partial output before raising — partial responses have value
log.warning(
"stream_interrupted",
extra={
"partial_chars": sum(len(t) for t in partial_output),
"error": str(e)
}
)
raise StreamInterruptedError(
f"Stream failed after {len(partial_output)} chunks: {e}",
partial="".join(partial_output)
) from e
class StreamInterruptedError(Exception):
"""Stream failed mid-response. partial attribute contains what arrived."""
def __init__(self, message: str, partial: str = ""):
super().__init__(message)
self.partial = partial
# FastAPI SSE endpoint pattern
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
@app.get("/stream/{query}")
async def stream_response(query: str):
"""Server-Sent Events endpoint for streaming Claude responses."""
def event_generator():
for chunk in stream_completion(client, SYSTEM_PROMPT, query):
yield f"data: {chunk}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(event_generator(), media_type="text/event-stream")
AI Studio's Build tab preview breaks with streaming responses — the iframe cannot handle SSE. Prototype all streaming features with non-streaming calls in AI Studio, switch to streaming only when deploying to Cloud Run. Log this in EVOLUTION.md § What Was Tried and Abandoned.
Claude Code is the command-line interface for Claude. In interactive mode, it is a powerful coding assistant. In headless mode, it is a scriptable AI agent that can be embedded in CI/CD pipelines, pre-commit hooks, and automated review workflows. The distinction matters: most documentation covers the interactive case. This section covers the headless case.
Claude Code sessions have context limits. For large codebases, git worktrees let you run Claude Code on an isolated copy of the repo — changes are made in the worktree and merged back when complete. This prevents the common failure mode of a long session corrupting the working tree.
# Create an isolated worktree for a Claude Code session
git worktree add /tmp/claude-session-01 -b claude/feature-branch

# Run Claude Code against the worktree
cd /tmp/claude-session-01
claude "implement the payment webhook handler"

# Review changes, then merge back
git diff main
git checkout main && git merge claude/feature-branch

# Clean up
git worktree remove /tmp/claude-session-01
# Required for non-interactive CI/CD — skips permission prompts
claude --dangerously-skip-permissions --headless --print "run tests"

# Structured JSON output — for parsing in scripts
claude --output-format stream-json "analyze this diff"

# Combine for CI pipeline use
claude --dangerously-skip-permissions \
  --headless \
  --output-format stream-json \
  --model claude-haiku-4-5-20251001 \
  "classify the severity of issues in this diff" | jq '.content[0].text'
--dangerously-skip-permissions gives Claude Code full file system access without prompting. Only use in sandboxed CI environments. Never on a machine with production credentials.
#!/bin/bash
# Lightweight pre-commit review — runs on staged changes only
DIFF=$(git diff --cached)
if [ -z "$DIFF" ]; then exit 0; fi
RESULT=$(echo "$DIFF" | claude \
  --headless \
  --output-format stream-json \
  --model claude-haiku-4-5-20251001 \
  "Review this diff. Flag: hardcoded secrets, SQL injection risks, missing error handling. Output JSON: {issues: [], severity: 'low|medium|high'}" \
  | jq -r '.content[0].text')
SEVERITY=$(echo "$RESULT" | jq -r '.severity')
if [ "$SEVERITY" = "high" ]; then
  echo "Claude Code: high severity issues found. Review before committing."
  echo "$RESULT" | jq '.issues'
  exit 1
fi
When Claude Code or a Gemini agent needs to operate on your GCP project, the authentication pattern matters as much as the code it generates. An agent with the wrong credentials — too broad, too narrow, or using long-lived JSON keys — is either a security liability or a source of permission failures that are slow to diagnose. The pattern below gives agents exactly what they need, nothing more, and uses short-lived tokens rather than keys.
| Agent Type | Auth Method | GCP Permissions | Use Case |
|---|---|---|---|
| Interactive Claude Code (dev) | gcloud ADC impersonating claude-code-agent SA | BigQuery read, Storage read | Local development, read-only GCP access |
| Headless Claude Code (CI/CD) | Workload Identity Federation via GitHub OIDC | Storage write, Artifact Registry write | Automated builds, deployments |
| Production Cloud Run service | Attached service account (api-runner SA) | Cloud SQL, Secret Manager, Vertex AI | Runtime API calls, database access |
# Give Claude Code the exact permissions it has in production — no more
gcloud auth application-default login \
  --impersonate-service-account=claude-code-agent@PROJECT_ID.iam.gserviceaccount.com

# Verify what Claude Code can and cannot do
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:claude-code-agent" \
  --format="table(bindings.role)"

# Add to CLAUDE.md so every session knows its permission boundary:
# "Auth: impersonating claude-code-agent SA — BigQuery read, Storage read only
#  Cannot: write to production DB, deploy services, access secrets"
"Before making any GCP API call, check DASHBOARD.md § Service Accounts to confirm the active SA has the required role listed. If the task requires a permission not in the SA's role list:
- Do NOT attempt the call
- Add a TODO comment: # NEEDS_PERMISSION: roles/[required-role]
- Continue with the parts of the task that are within scope
- Report the permission gap at the end of the session
Never attempt to grant permissions or modify IAM from within a session."
Every time a new IAM role is granted to any service account, add a dated entry to EVOLUTION.md § Architecture Decisions with: what role was granted, to which SA, why it was needed, and what the minimum viable alternative was. IAM drift — roles accumulating without documentation — is the most common security debt in long-running GCP projects.
Multi-agent systems become necessary when a task exceeds a single context window, when parallelization is needed for throughput, or when you need different models optimized for different sub-tasks. The patterns here cover the three most common production configurations.
One Claude Sonnet instance acts as orchestrator — it receives the user request, breaks it into sub-tasks, dispatches to specialist agents (Haiku for classification, Sonnet for generation, Opus for complex reasoning), and assembles the final response. The orchestrator never does the work itself; it routes and assembles.
from anthropic_client import get_client, MODEL_SONNET, MODEL_HAIKU

def orchestrate(user_request: str) -> str:
    client = get_client()
    # Step 1: classify with Haiku (fast, cheap)
    task_type = classify_request(client, user_request)
    # Step 2: dispatch to appropriate model
    if task_type == "classification":
        return haiku_complete(client, user_request)
    elif task_type == "generation":
        return sonnet_complete(client, user_request)
    else:
        return sonnet_complete(client, user_request, complex_mode=True)
When one agent hands off to another, the receiving agent needs sufficient context to continue without hallucinating the history. Pass a structured summary, not the raw conversation. The raw conversation is too long and contains noise. A structured handoff is faster, cheaper, and more reliable.
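A minimal sketch of what a structured handoff might look like. The field names here are illustrative, not prescribed by any SDK; the point is that the receiving agent gets decisions and constraints, not transcript noise:

```python
from dataclasses import dataclass, field

@dataclass
class AgentHandoff:
    """Structured context passed between agents instead of the raw conversation."""
    task: str                     # what the receiving agent must do
    decisions_so_far: list[str]   # conclusions already reached upstream
    constraints: list[str]        # hard requirements the next agent must honor
    artifacts: dict[str, str] = field(default_factory=dict)  # e.g. {"schema": "..."}

def handoff_prompt(h: AgentHandoff) -> str:
    """Render the handoff as a compact block for the next agent's prompt."""
    lines = [f"TASK: {h.task}", "DECISIONS:"]
    lines += [f"- {d}" for d in h.decisions_so_far]
    lines.append("CONSTRAINTS:")
    lines += [f"- {c}" for c in h.constraints]
    for name, content in h.artifacts.items():
        lines.append(f"{name.upper()}:\n{content}")
    return "\n".join(lines)

h = AgentHandoff(
    task="Generate the SQL migration",
    decisions_so_far=["Use pgvector", "SRID 4326"],
    constraints=["All DDL must be idempotent"],
)
print(handoff_prompt(h))
```

A handoff like this is typically a few hundred tokens regardless of how long the upstream conversation ran, which is what makes it cheaper and more reliable than replaying history.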
Retrieval-Augmented Generation is the production pattern for grounding Claude responses in your domain-specific data. The GCP implementation uses Cloud SQL with pgvector (or AlloyDB for higher throughput), Cloud Storage for document ingestion, and Cloud Run for the retrieval service.
Documents land in a Cloud Storage bucket. A Cloud Run job triggers on upload, chunks the document, and generates embeddings via the Vertex AI Embeddings API.
Embeddings are written to Cloud SQL with pgvector. Each row: document ID, chunk text, embedding vector, metadata (source, date, version).
On query, generate a query embedding, run a cosine similarity search against pgvector, return top-k chunks.
Pass retrieved chunks as context in Claude's system prompt. Log the retrieved chunk IDs with every response for auditability.
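The query-time step can be sketched in memory to show the logic; in production the same ranking runs inside pgvector (roughly `ORDER BY embedding <=> :query_vec LIMIT :k`). The chunk data below is fabricated for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    """chunks: list of (chunk_id, text, embedding). Return best-k by cosine similarity."""
    scored = [(cosine_similarity(query_vec, emb), cid, text) for cid, text, emb in chunks]
    scored.sort(reverse=True)
    return scored[:k]

# Toy 2-dimensional embeddings — real ones have hundreds of dimensions
chunks = [
    ("c1", "zoning rules",   [1.0, 0.0]),
    ("c2", "permit fees",    [0.0, 1.0]),
    ("c3", "zoning appeals", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], chunks, k=2)
# Log the returned chunk IDs with the response for auditability
print([cid for _, cid, _ in results])  # → ['c1', 'c3']
```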
The most common RAG failure is poor chunking strategy, not poor retrieval. Chunk by semantic unit (paragraph or section), not by token count. A 512-token chunk that cuts across a sentence boundary is worse than a 650-token chunk that preserves it.
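One way to implement paragraph-boundary chunking is to pack whole paragraphs until a rough token budget is hit, closing each chunk only at a boundary. This sketch estimates tokens by character count; swap in a real tokenizer for production:

```python
def chunk_by_paragraph(text: str, max_tokens: int = 600, est_chars_per_token: int = 4):
    """Chunk on paragraph boundaries, packing paragraphs until a rough
    token budget is reached. Never splits inside a paragraph."""
    budget_chars = max_tokens * est_chars_per_token
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > budget_chars and current:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Three paragraphs; the budget forces a break, but only between paragraphs
doc = ("A" * 1000) + "\n\n" + ("B" * 1000) + "\n\n" + ("C" * 2000)
chunks = chunk_by_paragraph(doc, max_tokens=600)
print(len(chunks))  # → 2
```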
AI systems require observability at two layers: the infrastructure layer (latency, error rates, memory, CPU) and the model layer (token usage, cache hit rates, cost per request, output quality signals). Most teams instrument the first and skip the second. The second is where the expensive surprises come from.
import structlog

log = structlog.get_logger()

def log_llm_call(response, model: str, task_type: str):
    log.info(
        "llm_call",
        model=model,
        task_type=task_type,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0),
        cost_usd=estimate_cost(response.usage, model),
        stop_reason=response.stop_reason,
    )
At scale, AI inference cost is a primary engineering constraint. The patterns in this section are not optimizations to apply later — they are architectural decisions that compound. A system designed without cost awareness from day one will require a rewrite at scale.
Use the cheapest model that can reliably complete the task. For most classification, extraction, and structured output tasks, Haiku is sufficient at a fraction of Sonnet's per-token price. Reserve Sonnet for generation, reasoning, and complex instruction-following. Reserve Opus for multi-constraint architectural decisions where quality matters more than cost.
| Task | Model | Reasoning |
|---|---|---|
| Classification / extraction | Haiku | Deterministic, short output, no reasoning needed |
| Structured JSON generation | Haiku / Sonnet | Haiku if schema is simple; Sonnet if schema is complex |
| Code generation | Sonnet | Requires instruction following and context awareness |
| Architecture review | Sonnet + extended thinking | Multi-constraint reasoning benefits from thinking budget |
| Complex legal / compliance review | Opus | Highest accuracy on nuanced tradeoffs |
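The routing rule in the table above can be sketched as a small dispatch function. The task-type strings and the Opus model ID are illustrative placeholders; pin your actual model IDs in one module as recommended earlier:

```python
# Model IDs as pinned earlier in this guide
MODEL_HAIKU = "claude-haiku-4-5-20251001"
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_OPUS = "claude-opus-4"  # illustrative placeholder — pin your actual Opus ID

def pick_model(task: str, schema_complex: bool = False) -> str:
    """Route each task type to the cheapest model that can reliably handle it."""
    if task in ("classification", "extraction"):
        return MODEL_HAIKU
    if task == "structured_json":
        return MODEL_SONNET if schema_complex else MODEL_HAIKU
    if task in ("code_generation", "architecture_review"):
        return MODEL_SONNET
    if task == "compliance_review":
        return MODEL_OPUS
    return MODEL_SONNET  # safe default: mid-tier

print(pick_model("structured_json", schema_complex=True))  # → claude-sonnet-4-6
```

Centralizing the routing decision also gives you one place to log which tier handled each request, which feeds directly into the cost attribution pattern below.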
# Count tokens before sending — one line, prevents surprise bills
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)
print(f"This request will use ~{token_count.input_tokens} input tokens")

# At Sonnet pricing: input_tokens * $0.000003 = estimated cost
# Set a threshold:
if token_count.input_tokens > 50000:
    use_cache_or_truncate()
Use token counting in development to audit expensive prompts before they hit production. Set hard thresholds in code that route to a smaller model or apply truncation before dispatching oversized requests.
When Sonnet's first answer is insufficient for complex tradeoffs — multi-constraint architecture decisions, nuanced compliance questions, deeply nested conditional logic — extended thinking lets the model reason through the problem before responding. The reasoning chain is visible in the response. It costs more tokens but is dramatically better on problems that require holding multiple constraints simultaneously.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # how much to spend on reasoning
    },
    messages=[{"role": "user", "content": architecture_question}],
)
# response.content will include thinking blocks + response blocks
thinking_block = [b for b in response.content if b.type == "thinking"][0]
Use extended thinking for architecture decisions, not for routine generation. The cost difference is significant — budget only what the problem complexity justifies.
Token cost without attribution is useless for optimization. Knowing your monthly Claude bill is $800 tells you nothing about which feature to optimize. Knowing that your document classification feature costs $340/month and your summary generation costs $460/month tells you exactly where to focus. One metadata field on every API call makes this possible.
import anthropic
import logging
from dataclasses import dataclass
log = logging.getLogger(__name__)
@dataclass
class RequestContext:
"""Tag every LLM call with context for cost attribution."""
feature: str # "document_classification", "summary_generation", etc.
user_id: str # for per-user cost tracking
request_type: str # "classify", "generate", "extract", "reason"
environment: str # "prod", "staging", "dev"
def tracked_complete(
client: anthropic.Anthropic,
system: str,
user: str,
model: str,
ctx: RequestContext,
) -> str:
"""Complete with full cost attribution logging."""
response = client.messages.create(
model=model,
max_tokens=2048,
system=system,
messages=[{"role": "user", "content": user}],
)
# Log structured cost data — queryable in Cloud Logging
cost_usd = (
response.usage.input_tokens * 0.000003 + # Sonnet input price
response.usage.output_tokens * 0.000015 # Sonnet output price
)
log.info("llm_cost", extra={
"feature": ctx.feature,
"user_id": ctx.user_id,
"request_type": ctx.request_type,
"environment": ctx.environment,
"model": model,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
"estimated_cost_usd": round(cost_usd, 6),
})
return response.content[0].text
# Usage — cost is visible and attributable at the call site
result = tracked_complete(
client, system, user, MODEL_SONNET,
ctx=RequestContext(
feature="document_classification",
user_id=user_id,
request_type="classify",
environment="prod"
)
)
Structured logs flow to Cloud Logging automatically on Cloud Run. Export to BigQuery with a log sink and you have a queryable cost dashboard:

SELECT feature, SUM(estimated_cost_usd) AS total_cost
FROM llm_cost_logs
GROUP BY feature
ORDER BY total_cost DESC;

One Cloud Logging export, zero additional instrumentation.
Rate limit errors break every production system eventually, and nobody adds handling until they do. The pattern below belongs in every project that calls the Anthropic API at scale. Jitter prevents the thundering herd problem when multiple workers hit limits simultaneously.
import time
import random
import anthropic
from anthropic import RateLimitError

def call_with_backoff(client, max_retries=5, **kwargs):
    """Exponential backoff with jitter for rate limit handling."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # jitter
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt+1}/{max_retries})")
            time.sleep(wait)
AI systems require two complementary testing approaches: conventional unit and integration tests for the infrastructure layer, and evaluation-based testing for the model layer. The second is harder, less familiar, and more important as the system matures.
Model output quality cannot be unit tested — it requires an evaluation set (a curated collection of inputs with expected output characteristics) and a grader (often another model or a deterministic scoring function). Build your eval set from real user requests as early as possible. Synthetic evals will not catch the failure modes that matter in production.
Start with 50 real examples, human-graded on a 1–5 scale for correctness and quality. Run evals before every significant prompt change. A prompt that improves one dimension while degrading another will not be visible without a baseline.
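A deterministic grader is the simplest place to start. The sketch below checks outputs for required characteristics and returns a pass rate; the `fake_model` stand-in and the cases are fabricated for illustration, and in practice `generate` wraps a Claude call while the grader may itself be a model:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    must_contain: list[str]  # output characteristics the grader checks for

def run_evals(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Run a deterministic grader over the eval set; return the pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.input)
        if all(term.lower() in output.lower() for term in case.must_contain):
            passed += 1
    return passed / len(cases)

# Stand-in for the model under test
fake_model = lambda q: "The GIST index enables ST_DWithin lookups."
cases = [
    EvalCase("Why index geometry columns?", ["gist", "st_dwithin"]),
    EvalCase("Name the SRID for GPS coordinates.", ["4326"]),
]
rate = run_evals(cases, fake_model)
print(rate)  # → 0.5
```

Commit the eval set and the baseline pass rate alongside the prompt; a regression is then a diff, not an anecdote.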
The documentation layer of this stack is not optional. It is the mechanism by which the AI system's behavior becomes auditable, recoverable, and transferable. The two core documents are DASHBOARD.md (the current state reference) and EVOLUTION.md (the decision log).
Spatial columns are not like other columns. A geometry column without an explicit SRID, without a GIST index, or with the wrong coordinate system produces results that are wrong in ways that are hard to detect — proximity queries that return nothing, distance calculations that are off by orders of magnitude, or joins that silently exclude valid records. These errors cannot be fixed by application code. They require schema changes, data migration, and re-indexing. Schema-first design prevents all of this. Define the spatial schema completely — SRID, geometry type, index — before any application layer reads from it.
-- Run at deploy time, every time. Idempotent by design.
-- Log the PostGIS version in DASHBOARD.md after first run.
-- Enable extension (idempotent)
CREATE EXTENSION IF NOT EXISTS postgis;
-- Verify version — log this in DASHBOARD.md § Active Services
SELECT PostGIS_Version();
-- Create spatial table with explicit SRID (idempotent via IF NOT EXISTS)
CREATE TABLE IF NOT EXISTS properties (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
address TEXT,
property_type TEXT,
geom geometry(Point, 4326), -- WGS84, always explicit
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- GIST index — create before loading data, not after
CREATE INDEX IF NOT EXISTS idx_properties_geom
ON properties USING GIST(geom);
-- Verify the index exists
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'properties' AND indexname = 'idx_properties_geom';
[DATE] Schema: properties table with PostGIS geometry
Decision: geometry(Point, 4326) with GIST index
SRID: 4326 (WGS84) — matches GPS coordinates and all public municipal datasets
Index: GIST created before data load — enables ST_DWithin index use
Alternatives: geography type (rejected: less flexible for joins),
no SRID specified (rejected: produces wrong distance results silently)
Idempotency: all DDL uses IF NOT EXISTS — safe to run in CI/CD migrations
Revisit: if switching to 3D geometry or non-point types
Cloud Run is the deployment target for this stack because it matches the billing model of AI inference: you pay per request, not per hour. A system that is idle for 22 hours and busy for 2 costs proportionally — unlike a VM or a Kubernetes node.
gcloud run deploy [SERVICE_NAME] \
  --image gcr.io/[PROJECT_ID]/[IMAGE] \
  --region us-central1 \
  --platform managed \
  --service-account [SA_NAME]@[PROJECT_ID].iam.gserviceaccount.com \
  --memory 512Mi \
  --cpu 1 \
  --min-instances 0 \
  --max-instances 20 \
  --concurrency 80 \
  --no-allow-unauthenticated \
  --set-env-vars "PROJECT_ID=[PROJECT_ID],REGION=us-central1" \
  --set-secrets "ANTHROPIC_API_KEY=anthropic-api-key:latest"
min-instances=0 means cold starts on first request after inactivity (2–8 seconds). For latency-sensitive workloads, set min-instances=1. Document the cost implication in EVOLUTION.md.
The Anthropic API is reliable but not infallible. At scale, you will encounter rate limits, transient errors, and model-specific failures. The patterns here are not optional enhancements — they are production requirements for any system handling real traffic.
A circuit breaker prevents cascade failures: when error rates exceed a threshold, it stops sending requests to the model and returns a fallback response. This protects both your system and the API from thundering herd conditions when you're at or near your rate limit.
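A minimal sketch of the mechanism, assuming a consecutive-failure threshold and a half-open probe after a cooldown (thresholds and class names are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; after `cooldown`
    seconds, allow one trial request (half-open) to probe recovery."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: let one request probe the API
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(threshold=2, cooldown=60)
breaker.record_failure()
print(breaker.allow())  # → True (still closed after one failure)
breaker.record_failure()
print(breaker.allow())  # → False (open: serve the fallback instead)
```

Wrap the Claude call site so that `allow()` is checked before dispatch and `record_success` / `record_failure` are called on the result; when `allow()` is False, go straight to the fallback hierarchy below.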
Design every Claude-powered feature with a fallback for when the model is unavailable. Cached responses for common queries. Rule-based fallbacks for classification tasks. Clear user-facing error messages that do not expose internal failure details.
Primary model → cached response → cheaper model → rule-based fallback → informative error. Design the fallback hierarchy before you need it, not during an incident.
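The ladder can be expressed as one function that walks the stages in order. Each stage here is a callable that returns a string, returns None, or raises on failure; the stage names are illustrative:

```python
def answer_with_fallbacks(query, primary, cache, cheap_model, rules):
    """Walk the degradation ladder: primary model -> cached response ->
    cheaper model -> rule-based fallback -> informative error."""
    try:
        return primary(query)
    except Exception:
        pass  # primary model unavailable — degrade
    cached = cache.get(query)
    if cached is not None:
        return cached
    try:
        return cheap_model(query)
    except Exception:
        pass  # cheaper model also unavailable
    ruled = rules(query)
    if ruled is not None:
        return ruled
    # Last resort: a clear message that exposes no internal failure details
    return "The service is temporarily degraded. Please retry shortly."

def failing(_query):
    raise RuntimeError("rate limited")

result = answer_with_fallbacks("status?", failing, {}, lambda q: "cheap: ok", lambda q: None)
print(result)  # → cheap: ok
```

Because every stage has the same signature, the ladder can be tested in isolation by injecting failing stand-ins, which is exactly how to rehearse an incident before one happens.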
Security in a Claude + GCP stack has two layers: protecting your API credentials from exposure, and correctly scoping the GCP permissions that your AI-powered services use. Both are simpler to get right initially than to fix after an incident.
Service account JSON keys are a security antipattern: they are long-lived, can be leaked in source control or logs, and cannot be scoped to a specific workload. Workload Identity Federation (WIF) lets GitHub Actions — or any OIDC provider — authenticate as a GCP service account without a key file. Short-lived tokens, auditable access, zero long-lived credentials.
# Create the workload identity pool
gcloud iam workload-identity-pools create "github-pool" \
  --project="${PROJECT_ID}" \
  --location="global"

# Create the OIDC provider
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"

# In GitHub Actions — no secrets needed
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: 'projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
    service_account: 'deploy-bot@PROJECT_ID.iam.gserviceaccount.com'
Add a "uses WIF, no key file" annotation to the relevant service account entry in DASHBOARD.md § Service Accounts. Future maintainers should never need to guess how authentication works.
"The goal is not to be unhackable. The goal is to ensure that any breach is limited in scope, auditable in retrospect, and recoverable without catastrophic data loss."
— Principle of least privilege applied to AI systems
The stack described in this guide is designed to be auditable. Every tool choice, every IAM decision, every architectural tradeoff should have a paper trail in EVOLUTION.md. This is not administrative overhead — it is the mechanism by which AI-generated systems become defensible, transferable, and recoverable. The VERA framework formalizes this approach into a maturity model for organizations building on AI infrastructure.
The verification layer starts with a discipline, not a certification. The discipline is: document every architectural decision at the moment it is made, while the reasoning is fresh and the alternatives are still visible.
The VERA framework is not a retrospective exercise. Every architectural decision in this guide — Firebase Auth over custom auth, Cloud Run over Cloud Functions, ST_DWithin over ST_Distance, WIF over JSON keys — should have a dated EVOLUTION.md entry at the moment it is made. Not after the project ships. Not in a documentation sprint. At the moment the decision is made, while the reasoning is fresh and the alternatives are still visible.
## 02. Architecture Decisions
# Format: [DATE] DECISION | alternatives considered + rejected | cost/risk | revisit trigger
[2026-04-07] Auth: Firebase Auth + Cloud Run verification
Decision: Firebase Auth JWTs verified server-side via google-auth library
Alternatives:
- Custom JWT (rejected: key rotation complexity, no OAuth provider integration)
- Supabase Auth (rejected: adds platform dependency outside GCP ecosystem)
- No auth on public endpoints (rejected: gated content requires identity)
Cost: Zero additional GCP cost — verification uses Google's public keys
IAM impact: Cloud Run services remain --no-allow-unauthenticated for protected routes
Revisit when: Firebase Auth pricing changes or enterprise SSO requirement emerges
[2026-04-07] Spatial: ST_DWithin over ST_Distance for proximity queries
Decision: All proximity queries use ST_DWithin(geom::geography, point, radius)
Alternatives:
- ST_Distance in WHERE clause (rejected: does not use GIST index, full table scan)
- Application-side bounding box filter (rejected: imprecise, more code)
Performance: <50ms on 100k rows with GIST index vs >2000ms without
Idempotency: no migration needed, query-level decision
Revisit: Never. This is a PostGIS fundamental.
[2026-04-07] Deployment: Cloud Run over Cloud Functions
Decision: All backend services on Cloud Run (containerized FastAPI)
Alternatives:
- Cloud Functions (rejected: 9MB unzipped limit breaks most Python deps,
no persistent connections, cold start per-invocation model)
- GKE (rejected: operational overhead, cost at this scale)
- App Engine (rejected: deprecated patterns, less flexible runtime)
Cost: Per-request billing, min-instances=0 for dev, min-instances=1 for prod
Revisit when: request volume exceeds ~10M/month (GKE cost crossover)
The EVOLUTION.md pattern described here is the foundation of the VERA maturity model. An organization with consistently documented architecture decisions, verified outputs, and sovereign knowledge infrastructure is at VERA maturity level 3. The formal framework, including the Ed25519 session certificate methodology and multi-party attestation tier, is at veraframework.com and danielflugger.com/vera.html.