Enterprise Reference  ·  2026

The Enterprise Stack:
Claude, GCP, and the Sovereign Builder

A practitioner's reference for building production AI systems on Claude and GCP — from Claude Code in CI/CD pipelines to cost engineering, multi-agent architecture, and security without JSON keys.

Daniel Flügger  ·  2026  ·  Google Cloud Partner


Start Here

New to this stack? Start with the Google AI Studio Playbook for prototyping methodology, project dashboards, and the block-by-block build sequence. This guide picks up where that one ends.

Contents

01 — Foundation

The Sovereign Stack Philosophy

The enterprise AI stack described here is not about vendor maximalism. It is about structural independence: a small, principled set of tools that compound over time, produce verifiable outputs, and do not require continuous reinvestment to maintain their position. Claude as the reasoning layer. GCP as the infrastructure layer. Markdown-based documentation as the knowledge layer. Git as the audit trail.

The distinguishing characteristic of this stack is that the knowledge it produces is portable. Your DASHBOARD.md, your EVOLUTION.md, your system instructions — these are plain text files that work in AI Studio, in Claude Code, in a local terminal, or in any future tool that can read a file. Platform lock-in is an architectural decision. This stack makes the opposite decision deliberately.

The guide that follows assumes you have shipped at least one real system, understand GCP IAM at the role level, and are comfortable with Python and the command line. It is a reference for practitioners, not a tutorial for beginners.

| Layer | Tool | Why This Choice |
|---|---|---|
| Reasoning | Claude (Anthropic) | Best-in-class on complex reasoning, instruction following, and code generation; verifiable safety constraints |
| Infrastructure | GCP — Cloud Run, Cloud SQL, BigQuery | Serverless-first, per-request billing, native IAM integration, strong compliance story |
| CI/CD Agent | Claude Code (headless) | Full file system access, structured JSON output, composable with standard shell tooling |
| Knowledge | Markdown + Git | Human-readable, diffs cleanly, works in every tool, survives every platform change |
| Secrets | GCP Secret Manager + WIF | No JSON keys, short-lived tokens, auditable access, zero long-lived credentials |
| Identity | Firebase Auth + GCP IAM | JWTs issued by Firebase, verified server-side via Google public keys; maps to GCP IAM conditions for resource-level access control |
02 — Client

Anthropic SDK — Python Client Setup

The Anthropic Python SDK is the production interface to Claude. The patterns here are not tutorial content — they are the conventions that prevent the most common production failures: hardcoded model names, missing retry logic, and unstructured API responses that break downstream parsing.

Centralized client configuration

anthropic_client.py — one place for all SDK config
import anthropic
import os
import logging

log = logging.getLogger(__name__)

# Never hardcode model names — pin here and import everywhere
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU  = "claude-haiku-4-5-20251001"

def get_client() -> anthropic.Anthropic:
    """Return a configured Anthropic client. Reads ANTHROPIC_API_KEY from env."""
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise EnvironmentError("ANTHROPIC_API_KEY not set")
    return anthropic.Anthropic(api_key=api_key)

def complete(system: str, user: str, model: str = MODEL_SONNET, max_tokens: int = 2048) -> str:
    """Single-turn completion with structured logging."""
    client = get_client()
    log.info("claude_request", extra={"model": model, "system_len": len(system)})
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    log.info("claude_response", extra={"input_tokens": response.usage.input_tokens,
                                       "output_tokens": response.usage.output_tokens})
    return response.content[0].text

Prompt caching — 90% cost reduction on repeated context

Claude supports prompt caching for repeated large context (system instructions, DASHBOARD.md, schema docs). Cached content is charged at approximately 10% of the normal input price after cache creation. Minimum 1,024 tokens to be eligible. The TTL is 5 minutes by default — for stable system prompts, cache consistently.

Prompt caching — cache_control on large system context
response = client.messages.create(
    model=MODEL_SONNET,
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,    # DASHBOARD.md + schema docs
            "cache_control": {"type": "ephemeral"}  # marks this block for caching
        }
    ],
    messages=[{"role": "user", "content": user_prompt}],
)
# response.usage.cache_creation_input_tokens — tokens written to cache
# response.usage.cache_read_input_tokens — tokens read from cache (charged at ~10%)

Streaming responses in production

Streaming is the correct pattern for any response the user waits for — it reduces perceived latency dramatically. It is also the pattern most likely to fail silently in production: a stream that drops mid-response, a client that disconnects, or a timeout that triggers before the stream completes. These failure modes require different handling than a standard API error.

Streaming with error handling and partial recovery
import anthropic
import logging
from typing import Generator

log = logging.getLogger(__name__)

def stream_completion(
    client: anthropic.Anthropic,
    system: str,
    user: str,
    model: str = "claude-sonnet-4-6",
) -> Generator[str, None, None]:
    """
    Stream a completion. Yields text chunks as they arrive.
    Handles mid-stream failures gracefully — logs partial output,
    raises a typed exception the caller can handle.
    """
    partial_output = []
    try:
        with client.messages.stream(
            model=model,
            max_tokens=2048,
            system=system,
            messages=[{"role": "user", "content": user}],
        ) as stream:
            for text in stream.text_stream:
                partial_output.append(text)
                yield text
    except anthropic.APIStatusError as e:
        # Log partial output before raising — partial responses have value
        log.warning(
            "stream_interrupted",
            extra={
                "partial_chars": sum(len(t) for t in partial_output),
                "error": str(e)
            }
        )
        raise StreamInterruptedError(
            f"Stream failed after {len(partial_output)} chunks: {e}",
            partial="".join(partial_output)
        ) from e

class StreamInterruptedError(Exception):
    """Stream failed mid-response. partial attribute contains what arrived."""
    def __init__(self, message: str, partial: str = ""):
        super().__init__(message)
        self.partial = partial

# FastAPI SSE endpoint pattern
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream/{query}")
async def stream_response(query: str):
    """Server-Sent Events endpoint for streaming Claude responses."""
    def event_generator():
        for chunk in stream_completion(client, SYSTEM_PROMPT, query):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_generator(), media_type="text/event-stream")
Do not stream in AI Studio preview

AI Studio's Build tab preview breaks with streaming responses — the iframe cannot handle SSE. Prototype all streaming features with non-streaming calls in AI Studio, switch to streaming only when deploying to Cloud Run. Log this in EVOLUTION.md § What Was Tried and Abandoned.

03 — Automation

Claude Code — CI/CD and Headless Mode

Claude Code is the command-line interface for Claude. In interactive mode, it is a powerful coding assistant. In headless mode, it is a scriptable AI agent that can be embedded in CI/CD pipelines, pre-commit hooks, and automated review workflows. The distinction matters: most documentation covers the interactive case. This section covers the headless case.

Git worktrees for parallel sessions

Claude Code sessions have context limits. For large codebases, git worktrees let you run Claude Code on an isolated copy of the repo — changes are made in the worktree and merged back when complete. This prevents the common failure mode of a long session corrupting the working tree.

Git worktree pattern — isolated Claude Code session
# Create an isolated worktree for a Claude Code session
git worktree add /tmp/claude-session-01 -b claude/feature-branch

# Run Claude Code against the worktree
cd /tmp/claude-session-01
claude "implement the payment webhook handler"

# Review changes, then merge back
git diff main
git checkout main && git merge claude/feature-branch

# Clean up
git worktree remove /tmp/claude-session-01

Headless mode flags you need to know

Headless mode — CI/CD usage
# Required for non-interactive CI/CD — -p / --print runs a single prompt
# and exits; --dangerously-skip-permissions suppresses approval prompts
claude -p --dangerously-skip-permissions "run tests"

# Structured JSON output — for parsing in scripts
# (--output-format stream-json emits newline-delimited events instead)
claude -p --output-format json "analyze this diff"

# Combine for CI pipeline use
claude -p \
       --dangerously-skip-permissions \
       --output-format json \
       --model claude-haiku-4-5-20251001 \
       "classify the severity of issues in this diff" | jq -r '.result'
Security warning

--dangerously-skip-permissions gives Claude Code full file system access without prompting. Only use in sandboxed CI environments. Never on a machine with production credentials.

Pre-commit hook integration

.git/hooks/pre-commit — automated diff review
#!/bin/bash
# Lightweight pre-commit review — runs on staged changes only
DIFF=$(git diff --cached)
if [ -z "$DIFF" ]; then exit 0; fi

RESULT=$(echo "$DIFF" | claude \
  -p \
  --output-format json \
  --model claude-haiku-4-5-20251001 \
  "Review this diff. Flag: hardcoded secrets, SQL injection risks, missing error handling. Output JSON: {issues: [], severity: 'low|medium|high'}" \
  | jq -r '.result')

SEVERITY=$(echo "$RESULT" | jq -r '.severity')
if [ "$SEVERITY" = "high" ]; then
  echo "Claude Code: high severity issues found. Review before committing."
  echo "$RESULT" | jq '.issues'
  exit 1
fi
03b — Auth Pattern

Agentic Handshakes — How Agents Auth into Your Project

When Claude Code or a Gemini agent needs to operate on your GCP project, the authentication pattern matters as much as the code it generates. An agent with the wrong credentials — too broad, too narrow, or using long-lived JSON keys — is either a security liability or a source of permission failures that are slow to diagnose. The pattern below gives agents exactly what they need, nothing more, and uses short-lived tokens rather than keys.

The three-level auth hierarchy

| Agent Type | Auth Method | GCP Permissions | Use Case |
|---|---|---|---|
| Interactive Claude Code (dev) | gcloud ADC impersonating claude-code-agent SA | BigQuery read, Storage read | Local development, read-only GCP access |
| Headless Claude Code (CI/CD) | Workload Identity Federation via GitHub OIDC | Storage write, Artifact Registry write | Automated builds, deployments |
| Production Cloud Run service | Attached service account (api-runner SA) | Cloud SQL, Secret Manager, Vertex AI | Runtime API calls, database access |
Dev setup — impersonate scoped SA for Claude Code
# Give Claude Code the exact permissions it has in production — no more
gcloud auth application-default login \
  --impersonate-service-account=claude-code-agent@PROJECT_ID.iam.gserviceaccount.com

# Verify what Claude Code can and cannot do
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:claude-code-agent" \
  --format="table(bindings.role)"

# Add to CLAUDE.md so every session knows its permission boundary:
# "Auth: impersonating claude-code-agent SA — BigQuery read, Storage read only
#  Cannot: write to production DB, deploy services, access secrets"
Specific prompt pattern for Claude Code — auth-aware tasks
"Before making any GCP API call, check DASHBOARD.md § Service Accounts
to confirm the active SA has the required role listed.

If the task requires a permission not in the SA's role list:
- Do NOT attempt the call
- Add a TODO comment: # NEEDS_PERMISSION: roles/[required-role]
- Continue with the parts of the task that are within scope
- Report the permission gap at the end of the session

Never attempt to grant permissions or modify IAM from within a session."
EVOLUTION.md requirement

Every time a new IAM role is granted to any service account, add a dated entry to EVOLUTION.md § Architecture Decisions with: what role was granted, to which SA, why it was needed, and what the minimum viable alternative was. IAM drift — roles accumulating without documentation — is the most common security debt in long-running GCP projects.

04 — Architecture

Multi-Agent Architecture

Multi-agent systems become necessary when a task exceeds a single context window, when parallelization is needed for throughput, or when you need different models optimized for different sub-tasks. The patterns here cover the three most common production configurations.

Orchestrator + specialist pattern

One Claude Sonnet instance acts as orchestrator — it receives the user request, breaks it into sub-tasks, dispatches to specialist agents (Haiku for classification, Sonnet for generation, Opus for complex reasoning), and assembles the final response. The orchestrator never does the work itself; it routes and assembles.

Orchestrator pattern — model routing by task
from anthropic_client import get_client, MODEL_SONNET, MODEL_HAIKU

def classify_request(client, req: str) -> str:
    """Step 1 triage: Haiku returns a single routing label."""
    resp = client.messages.create(
        model=MODEL_HAIKU,
        max_tokens=8,
        system="Classify the request. Reply with exactly one word: "
               "classification, generation, or reasoning.",
        messages=[{"role": "user", "content": req}],
    )
    return resp.content[0].text.strip().lower()

def orchestrate(user_request: str) -> str:
    client = get_client()

    # Step 1: classify with Haiku (fast, cheap)
    task_type = classify_request(client, user_request)

    # Step 2: dispatch to the appropriate model.
    # haiku_complete / sonnet_complete (not shown) are thin wrappers
    # around messages.create with task-specific system prompts.
    if task_type == "classification":
        return haiku_complete(client, user_request)
    elif task_type == "generation":
        return sonnet_complete(client, user_request)
    else:
        return sonnet_complete(client, user_request, complex_mode=True)

Context handoff between agents

When one agent hands off to another, the receiving agent needs sufficient context to continue without hallucinating the history. Pass a structured summary, not the raw conversation. The raw conversation is too long and contains noise. A structured handoff is faster, cheaper, and more reliable.
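A minimal sketch of a structured handoff object, with illustrative field names (the class and its fields are assumptions, not a prescribed schema): settled decisions, open questions, and artifact references travel forward; the raw transcript does not.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured context passed from one agent to the next (illustrative)."""
    objective: str                                           # what the receiver must do
    decisions: list[str] = field(default_factory=list)       # settled facts, not transcript
    open_questions: list[str] = field(default_factory=list)  # unresolved items
    artifacts: dict[str, str] = field(default_factory=dict)  # name -> path or content

    def to_prompt(self) -> str:
        """Render the handoff as a compact block for the next agent's prompt."""
        lines = [f"OBJECTIVE: {self.objective}", "DECISIONS:"]
        lines += [f"- {d}" for d in self.decisions]
        lines.append("OPEN QUESTIONS:")
        lines += [f"- {q}" for q in self.open_questions]
        lines.append("ARTIFACTS:")
        lines += [f"- {k}: {v}" for k, v in self.artifacts.items()]
        return "\n".join(lines)
```

The receiving agent gets `handoff.to_prompt()` prepended to its system context: a few hundred tokens instead of a full conversation history.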

05 — Retrieval

RAG Architecture on GCP

Retrieval-Augmented Generation is the production pattern for grounding Claude responses in your domain-specific data. The GCP implementation uses Cloud SQL with pgvector (or AlloyDB for higher throughput), Cloud Storage for document ingestion, and Cloud Run for the retrieval service.

The embedding pipeline

01
Ingest

Documents land in a Cloud Storage bucket. A Cloud Run job triggers on upload, chunks the document, and generates embeddings via the Vertex AI Embeddings API.

02
Store

Embeddings are written to Cloud SQL with pgvector. Each row: document ID, chunk text, embedding vector, metadata (source, date, version).

03
Retrieve

On query, generate a query embedding, run a cosine similarity search against pgvector, return top-k chunks.

04
Generate

Pass retrieved chunks as context in Claude's system prompt. Log the retrieved chunk IDs with every response for auditability.
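The retrieve step above can be sketched as a parameterized pgvector query. The table and column names (`doc_chunks`, `embedding`) are assumptions; `<=>` is pgvector's cosine distance operator. The function takes any DB-API cursor (e.g. from psycopg connected to Cloud SQL), so the SQL stays in one auditable place.

```python
# Hypothetical table: doc_chunks(chunk_id, chunk_text, embedding vector(768))
TOP_K_SQL = """
SELECT chunk_id, chunk_text, 1 - (embedding <=> %(qvec)s::vector) AS score
FROM doc_chunks
ORDER BY embedding <=> %(qvec)s::vector
LIMIT %(k)s
"""

def retrieve_chunks(cursor, query_embedding: list[float], k: int = 5):
    """Top-k cosine similarity search; cursor is any DB-API cursor."""
    # pgvector accepts the bracketed text form for vector literals
    qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cursor.execute(TOP_K_SQL, {"qvec": qvec, "k": k})
    return cursor.fetchall()  # [(chunk_id, chunk_text, score), ...]
```

Log the returned `chunk_id` values alongside the Claude response, as step 04 requires, so every answer is traceable to its sources.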

Retrieval quality

The most common RAG failure is poor chunking strategy, not poor retrieval. Chunk by semantic unit (paragraph or section), not by token count. A 512-token chunk that cuts across a sentence boundary is worse than a 650-token chunk that preserves it.
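A minimal sketch of paragraph-preserving chunking, under the simplifying assumption that one token is roughly four characters (the function and its parameters are illustrative): whole paragraphs are packed greedily into chunks and never cut mid-unit.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 650,
                       chars_per_token: int = 4) -> list[str]:
    """Greedy semantic chunking: split on blank lines (paragraphs) and pack
    whole paragraphs into chunks, never cutting inside one.
    Token count is approximated as len(text) / chars_per_token."""
    budget = max_tokens * chars_per_token   # budget in characters
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        # flush the current chunk if adding this paragraph would overflow it
        if current and size + len(para) > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

An oversized single paragraph still becomes its own chunk rather than being split, which matches the principle above: a slightly long chunk that preserves the semantic unit beats a short one that cuts it.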

06 — Observability

Monitoring and Observability

AI systems require observability at two layers: the infrastructure layer (latency, error rates, memory, CPU) and the model layer (token usage, cache hit rates, cost per request, output quality signals). Most teams instrument the first and skip the second. The second is where the expensive surprises come from.

Structured logging for every LLM call

Structured logging — minimum fields per request
import structlog
log = structlog.get_logger()

def log_llm_call(response, model: str, task_type: str):
    log.info("llm_call",
        model=model,
        task_type=task_type,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0),
        cost_usd=estimate_cost(response.usage, model),
        stop_reason=response.stop_reason,
    )

GCP Cloud Monitoring dashboard fields

07 — Cost Engineering

Cost Engineering

At scale, AI inference cost is a primary engineering constraint. The patterns in this section are not optimizations to apply later — they are architectural decisions that compound. A system designed without cost awareness from day one will require a rewrite at scale.

Model routing by task complexity

Use the cheapest model that can reliably complete the task. For most classification, extraction, and structured output tasks, Haiku is sufficient and costs roughly 20x less than Sonnet. Reserve Sonnet for generation, reasoning, and complex instruction-following. Reserve Opus for multi-constraint architectural decisions where quality matters more than cost.

| Task | Model | Reasoning |
|---|---|---|
| Classification / extraction | Haiku | Deterministic, short output, no reasoning needed |
| Structured JSON generation | Haiku / Sonnet | Haiku if schema is simple; Sonnet if schema is complex |
| Code generation | Sonnet | Requires instruction following and context awareness |
| Architecture review | Sonnet + extended thinking | Multi-constraint reasoning benefits from thinking budget |
| Complex legal / compliance review | Opus | Highest accuracy on nuanced tradeoffs |

Count tokens before expensive calls

Token counting — prevent surprise bills
# Count tokens before sending — one line, prevents surprise bills
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}]
)
print(f"This request will use ~{token_count.input_tokens} input tokens")
# At Sonnet pricing: input_tokens * $0.000003 = estimated cost
# Set a threshold: if token_count.input_tokens > 50000: use_cache_or_truncate()

Use token counting in development to audit expensive prompts before they hit production. Set hard thresholds in code that route to a smaller model or apply truncation before dispatching oversized requests.
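A minimal sketch of such a hard threshold at the dispatch point. The threshold values and the returned labels are illustrative assumptions; tune them against your own cost data.

```python
SMALL_TASK_TOKENS = 4_000     # below this, a cheaper model is usually enough
HARD_CEILING_TOKENS = 50_000  # above this, truncate or cache before sending

def route_by_budget(input_tokens: int) -> str:
    """Return a dispatch decision for a request of the given size.
    Feed it the result of client.messages.count_tokens(...) shown above."""
    if input_tokens > HARD_CEILING_TOKENS:
        return "truncate_or_cache"          # never dispatch oversized as-is
    if input_tokens <= SMALL_TASK_TOKENS:
        return "claude-haiku-4-5-20251001"  # cheap model first
    return "claude-sonnet-4-6"

# At the call site:
# decision = route_by_budget(token_count.input_tokens)
```

Size is a guardrail, not a substitute for the task-complexity routing described above; the two checks compose.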

Extended thinking for complex architectural reasoning

When Sonnet's first answer is insufficient for complex tradeoffs — multi-constraint architecture decisions, nuanced compliance questions, deeply nested conditional logic — extended thinking lets the model reason through the problem before responding. The reasoning chain is visible in the response. It costs more tokens but is dramatically better on problems that require holding multiple constraints simultaneously.

Extended thinking — thinking parameter
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,              # must be larger than the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # how much to spend on reasoning
    },
    messages=[{"role": "user", "content": architecture_question}],
)
# response.content will include thinking blocks + text blocks
thinking_block = next(b for b in response.content if b.type == "thinking")

Use extended thinking for architecture decisions, not for routine generation. The cost difference is significant — budget only what the problem complexity justifies.

Cost attribution — tag every request

Token cost without attribution is useless for optimization. Knowing your monthly Claude bill is $800 tells you nothing about which feature to optimize. Knowing that your document classification feature costs $340/month and your summary generation costs $460/month tells you exactly where to focus. One metadata field on every API call makes this possible.

Cost attribution — metadata tagging pattern
import anthropic
import logging
from dataclasses import dataclass

log = logging.getLogger(__name__)

@dataclass
class RequestContext:
    """Tag every LLM call with context for cost attribution."""
    feature: str       # "document_classification", "summary_generation", etc.
    user_id: str       # for per-user cost tracking
    request_type: str  # "classify", "generate", "extract", "reason"
    environment: str   # "prod", "staging", "dev"

def tracked_complete(
    client: anthropic.Anthropic,
    system: str,
    user: str,
    model: str,
    ctx: RequestContext,
) -> str:
    """Complete with full cost attribution logging."""
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": user}],
    )

    # Log structured cost data — queryable in Cloud Logging
    cost_usd = (
        response.usage.input_tokens * 0.000003 +   # Sonnet input price
        response.usage.output_tokens * 0.000015    # Sonnet output price
    )
    log.info("llm_cost", extra={
        "feature": ctx.feature,
        "user_id": ctx.user_id,
        "request_type": ctx.request_type,
        "environment": ctx.environment,
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
        "estimated_cost_usd": round(cost_usd, 6),
    })
    return response.content[0].text

# Usage — cost is visible and attributable at the call site
result = tracked_complete(
    client, system, user, MODEL_SONNET,
    ctx=RequestContext(
        feature="document_classification",
        user_id=user_id,
        request_type="classify",
        environment="prod"
    )
)
BigQuery cost dashboard

Structured logs flow to Cloud Logging automatically on Cloud Run. Export to BigQuery with a log sink and you have a queryable cost dashboard: SELECT feature, SUM(estimated_cost_usd) as total_cost FROM llm_cost_logs GROUP BY feature ORDER BY total_cost DESC. One Cloud Logging export, zero additional instrumentation.

Rate limit handling — exponential backoff with jitter

Rate limit errors eventually hit every production system, and nobody adds handling until they do. The pattern below belongs in every project that calls the Anthropic API at scale. Jitter prevents the thundering herd problem when multiple workers hit limits simultaneously.

Exponential backoff with jitter
import time, random, anthropic
from anthropic import RateLimitError

def call_with_backoff(client, max_retries=5, **kwargs):
    """Exponential backoff with jitter for rate limit handling."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # jitter
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt+1}/{max_retries})")
            time.sleep(wait)
08 — Quality

Testing Strategies

AI systems require two complementary testing approaches: conventional unit and integration tests for the infrastructure layer, and evaluation-based testing for the model layer. The second is harder, less familiar, and more important as the system matures.

What to unit test

What requires evaluation

Model output quality cannot be unit tested — it requires an evaluation set (a curated collection of inputs with expected output characteristics) and a grader (often another model or a deterministic scoring function). Build your eval set from real user requests as early as possible. Synthetic evals will not catch the failure modes that matter in production.

Eval minimum viable set

Start with 50 real examples, human-graded on a 1–5 scale for correctness and quality. Run evals before every significant prompt change. A prompt that improves one dimension while degrading another will not be visible without a baseline.
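A minimal harness sketch for running that eval set and producing the baseline score. All names here are illustrative; the `trait_grader` is a deterministic stand-in for the human or model grader described above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_traits: list[str]   # characteristics a good output must show

def run_evals(cases: list[EvalCase],
              generate: Callable[[str], str],
              grade: Callable[[str, EvalCase], float]) -> float:
    """Return the mean score (1-5 scale) over the eval set."""
    scores = [grade(generate(c.input), c) for c in cases]
    return sum(scores) / len(scores)

def trait_grader(output: str, case: EvalCase) -> float:
    """Deterministic stand-in grader: fraction of expected traits present,
    mapped onto the 1-5 scale."""
    hits = sum(1 for t in case.expected_traits if t.lower() in output.lower())
    return 1 + 4 * hits / max(len(case.expected_traits), 1)
```

Run this before and after every prompt change and diff the two means; a regression on the baseline is the signal the paragraph above warns about.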

09 — Documentation

Version Control and Documentation

The documentation layer of this stack is not optional. It is the mechanism by which the AI system's behavior becomes auditable, recoverable, and transferable. The two core documents are DASHBOARD.md (the current state reference) and EVOLUTION.md (the decision log).

DASHBOARD.md structure

EVOLUTION.md structure

09b — Spatial

PostGIS + LLMs — Spatial Schema First

Schema-First is non-negotiable for spatial data

Spatial columns are not like other columns. A geometry column without an explicit SRID, without a GIST index, or with the wrong coordinate system produces results that are wrong in ways that are hard to detect — proximity queries that return nothing, distance calculations that are off by orders of magnitude, or joins that silently exclude valid records. These errors cannot be fixed by application code. They require schema changes, data migration, and re-indexing. Schema-first design prevents all of this. Define the spatial schema completely — SRID, geometry type, index — before any application layer reads from it.

Idempotent PostGIS schema — safe to run multiple times
-- Run at deploy time, every time. Idempotent by design.
-- Log the PostGIS version in DASHBOARD.md after first run.

-- Enable extension (idempotent)
CREATE EXTENSION IF NOT EXISTS postgis;

-- Verify version — log this in DASHBOARD.md § Active Services
SELECT PostGIS_Version();

-- Create spatial table with explicit SRID (idempotent via IF NOT EXISTS)
CREATE TABLE IF NOT EXISTS properties (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        TEXT NOT NULL,
    address     TEXT,
    property_type TEXT,
    geom        geometry(Point, 4326),  -- WGS84, always explicit
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- GIST index — create before loading data, not after
CREATE INDEX IF NOT EXISTS idx_properties_geom
    ON properties USING GIST(geom);

-- Verify the index exists
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'properties' AND indexname = 'idx_properties_geom';
EVOLUTION.md entry template for spatial schema

[DATE] Schema: properties table with PostGIS geometry
Decision: geometry(Point, 4326) with GIST index
SRID: 4326 (WGS84) — matches GPS coordinates and all public municipal datasets
Index: GIST created before data load — enables ST_DWithin index use
Alternatives: geography type (rejected: less flexible for joins); no SRID specified (rejected: produces wrong distance results silently)
Idempotency: all DDL uses IF NOT EXISTS — safe to run in CI/CD migrations
Revisit: if switching to 3D geometry or non-point types

10 — Deployment

Deployment on Cloud Run

Cloud Run is the deployment target for this stack because it matches the billing model of AI inference: you pay per request, not per hour. A system that is idle for 22 hours and busy for 2 costs proportionally — unlike a VM or a Kubernetes node.

Production deployment flags

Cloud Run deploy — production configuration
gcloud run deploy [SERVICE_NAME] \
  --image gcr.io/[PROJECT_ID]/[IMAGE] \
  --region us-central1 \
  --platform managed \
  --service-account [SA_NAME]@[PROJECT_ID].iam.gserviceaccount.com \
  --memory 512Mi \
  --cpu 1 \
  --min-instances 0 \
  --max-instances 20 \
  --concurrency 80 \
  --no-allow-unauthenticated \
  --set-env-vars "PROJECT_ID=[PROJECT_ID],REGION=us-central1" \
  --set-secrets "ANTHROPIC_API_KEY=anthropic-api-key:latest"
Cold start

min-instances=0 means cold starts on first request after inactivity (2–8 seconds). For latency-sensitive workloads, set min-instances=1. Document the cost implication in EVOLUTION.md.

11 — Resilience

Resilience Patterns

The Anthropic API is reliable but not infallible. At scale, you will encounter rate limits, transient errors, and model-specific failures. The patterns here are not optional enhancements — they are production requirements for any system handling real traffic.

Circuit breaker

A circuit breaker prevents cascade failures: when error rates exceed a threshold, it stops sending requests to the model and returns a fallback response. This protects both your system and the API from thundering herd conditions when you're at or near your rate limit.
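A minimal breaker sketch under the usual three-state model (closed, open, half-open). The thresholds are illustrative, and the injectable clock exists so the cooldown logic is testable; the caller checks `allow()` before each request and reports the outcome back.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; permit one probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """True if a request may be sent (closed, or half-open probe)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None   # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()   # trip open
```

When `allow()` returns False, serve the fallback response described below instead of calling the API.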

Graceful degradation

Design every Claude-powered feature with a fallback for when the model is unavailable. Cached responses for common queries. Rule-based fallbacks for classification tasks. Clear user-facing error messages that do not expose internal failure details.

Fallback hierarchy

Primary model → cached response → cheaper model → rule-based fallback → informative error. Design the fallback hierarchy before you need it, not during an incident.
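The hierarchy above can be sketched as an ordered chain of handlers: the first stage that succeeds wins, and the terminal stage never raises. The helper and its names are illustrative.

```python
from typing import Callable

def with_fallbacks(stages: list[Callable[[str], str]],
                   terminal: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an ordered list of handlers (primary model, cached response,
    cheaper model, rule-based) into one callable; `terminal` produces
    the informative user-facing error."""
    def handle(query: str) -> str:
        for stage in stages:
            try:
                return stage(query)
            except Exception:
                continue   # in production: log the failure, then fall through
        return terminal(query)
    return handle
```

Because the chain is declared in one place, the fallback order is reviewable before an incident, which is the point of designing it in advance.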

12 — Security

Security and IAM

Security in a Claude + GCP stack has two layers: protecting your API credentials from exposure, and correctly scoping the GCP permissions that your AI-powered services use. Both are simpler to get right initially than to fix after an incident.

IAM rules for AI services

Workload Identity Federation — no JSON keys in production

Service account JSON keys are a security antipattern: they are long-lived, can be leaked in source control or logs, and cannot be scoped to a specific workload. Workload Identity Federation (WIF) lets GitHub Actions — or any OIDC provider — authenticate as a GCP service account without a key file. Short-lived tokens, auditable access, zero long-lived credentials.

WIF setup — gcloud CLI
# Create the workload identity pool
gcloud iam workload-identity-pools create "github-pool" \
  --project="${PROJECT_ID}" \
  --location="global"

# Create the OIDC provider
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"

# Allow the GitHub repo to impersonate the deploy SA (required binding)
gcloud iam service-accounts add-iam-policy-binding \
  "deploy-bot@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-pool/attribute.repository/ORG/REPO"

# In GitHub Actions — no secrets needed
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: 'projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
    service_account: 'deploy-bot@PROJECT_ID.iam.gserviceaccount.com'

Add a "uses WIF, no key file" annotation to the relevant service account entry in DASHBOARD.md § Service Accounts. Future maintainers should never need to guess how authentication works.

Anthropic API key management

"The goal is not to be unhackable. The goal is to ensure that any breach is limited in scope, auditable in retrospect, and recoverable without catastrophic data loss."

— Principle of least privilege applied to AI systems
13 — Verification

VERA — Verification and the Sovereign Knowledge Layer

The stack described in this guide is designed to be auditable. Every tool choice, every IAM decision, every architectural tradeoff should have a paper trail in EVOLUTION.md. This is not administrative overhead — it is the mechanism by which AI-generated systems become defensible, transferable, and recoverable. The VERA framework formalizes this approach into a maturity model for organizations building on AI infrastructure.

The verification layer starts with a discipline, not a certification. The discipline is: document every architectural decision at the moment it is made, while the reasoning is fresh and the alternatives are still visible.

Linking every architectural decision to EVOLUTION.md

The VERA framework is not a retrospective exercise. Every architectural decision in this guide — Firebase Auth over custom auth, Cloud Run over Cloud Functions, ST_DWithin over ST_Distance, WIF over JSON keys — should have a dated EVOLUTION.md entry at the moment it is made. Not after the project ships. Not in a documentation sprint. At the moment the decision is made, while the reasoning is fresh and the alternatives are still visible.

EVOLUTION.md — required entry format for every architectural decision
## 02. Architecture Decisions

# Format: [DATE] DECISION | alternatives considered + rejected | cost/risk | revisit trigger

[2026-04-07] Auth: Firebase Auth + Cloud Run verification
Decision: Firebase Auth JWTs verified server-side via google-auth library
Alternatives:
  - Custom JWT (rejected: key rotation complexity, no OAuth provider integration)
  - Supabase Auth (rejected: adds platform dependency outside GCP ecosystem)
  - No auth on public endpoints (rejected: gated content requires identity)
Cost: Zero additional GCP cost — verification uses Google's public keys
IAM impact: Cloud Run services remain --no-allow-unauthenticated for protected routes
Revisit when: Firebase Auth pricing changes or enterprise SSO requirement emerges

[2026-04-07] Spatial: ST_DWithin over ST_Distance for proximity queries
Decision: All proximity queries use ST_DWithin(geom::geography, point, radius)
Alternatives:
  - ST_Distance in WHERE clause (rejected: does not use GIST index, full table scan)
  - Application-side bounding box filter (rejected: imprecise, more code)
Performance: <50ms on 100k rows with GIST index vs >2000ms without
Idempotency: no migration needed, query-level decision
Revisit: Never. This is a PostGIS fundamental.

[2026-04-07] Deployment: Cloud Run over Cloud Functions
Decision: All backend services on Cloud Run (containerized FastAPI)
Alternatives:
  - Cloud Functions (rejected: 9MB unzipped limit breaks most Python deps,
    no persistent connections, cold start per-invocation model)
  - GKE (rejected: operational overhead, cost at this scale)
  - App Engine (rejected: deprecated patterns, less flexible runtime)
Cost: Per-request billing, min-instances=0 for dev, min-instances=1 for prod
Revisit when: request volume exceeds ~10M/month (GKE cost crossover)
VERA certification path

The EVOLUTION.md pattern described here is the foundation of the VERA maturity model. An organization with consistently documented architecture decisions, verified outputs, and sovereign knowledge infrastructure is at VERA maturity level 3. The formal framework, including the Ed25519 session certificate methodology and multi-party attestation tier, is at veraframework.com and danielflugger.com/vera.html.