ADR-004 — Build on DuckDB (+ Iceberg/S3 at scale) over Snowflake / Databricks

Status: Accepted (POC). Phase 2 review trigger documented below. Date: 2026-05-28.

Context

srDB needs a queryable, durable, cost-efficient substrate that lets analysts, ops, and (per the agent-first thesis) AI agents run SQL or tool-mediated queries against the canonical position layer. Five realistic vendor + build options:

Snowflake — the modal alts-manager "data platform" choice. Mature, hosted, separates storage from compute.
Databricks SQL / Photon on S3 — comparable to Snowflake; stronger on ML/Spark; weaker on classical analytical SQL idioms.
BigQuery — strong for ad-hoc analytics; weak for the transactional + bitemporal patterns this substrate needs.
Custom Postgres / Aurora — flexible, ACID, but requires manual columnar/partitioning work for the analytical workload.
DuckDB (POC) → DuckDB + Iceberg-on-S3 (production) — embedded engine, file-format-portable, schema-as-code, agent-embeddable.

The substrate-first thesis (ADR-001/002/003) puts real load on the engine: bitemporal predicates on every read, polymorphic position queries, recursive lookthrough traversal, columnar scans on series_obs. The substrate also has to be cheap enough to run on developer laptops and in CI without standing up paid infrastructure.

Decision

Build the substrate on DuckDB for the POC, and Iceberg-on-S3 (read via DuckDB or Trino) for production scale. Schema is SQL files in schema/*.sql, versioned in git, applied via the same idempotent load_schema.py loader at every scale.

Phase 2 trigger to flip to Snowflake (or any managed warehouse): when concurrent analyst+agent queries > ~50 sustained, or when column-store storage cost on Iceberg/S3 exceeds the Snowflake bill at our query mix, or when regulatory residency requires a vendor with the audit certifications we don't want to build in-house.

Consequences

Positive:

Schema-as-code in git. Branches, code review, PR-bound migration discipline. Snowflake schemas are account-managed; "what was the schema yesterday?" is a vendor support ticket.
$0 to run. Developer laptops, CI, demo machines all run the full substrate locally. Snowflake POC is a credit-card decision.
DuckDB embeds in the MCP server (srdb_mcp/tools/_db.py). The agent's runtime is a Python process holding a read-only DB handle. Snowflake's connection-pool dance complicates the agent-first claim.
Iceberg-on-S3 at production scale gives us multi-engine reads (DuckDB, Trino, Spark, Snowflake itself if we change our mind) with the canonical store living in our S3 bucket. No vendor lock on the data.

Negative:

DuckDB is a single-node engine. Cross-node concurrency at production scale needs careful query routing (Iceberg + Trino or a DuckDB cluster orchestrator like MotherDuck).
No managed query optimizer for billion-row joins. We have to write the joins.
Ops surface area: backups, S3 lifecycle, Iceberg compaction, audit logging — all on the team rather than a vendor.

Alternatives Considered

Snowflake. Right for an asset manager whose books are 90% liquid securities and whose query mix is well-known dashboards. Wrong for srDB's profile: heterogeneous private + crypto, schema iteration velocity high during the first year, agent embedding cheaper with DuckDB, cost matters at POC scale, and "schema-as-code in git" is the team-onboarding artifact we're not willing to give up. We'd consider flipping once the substrate stabilizes and the team grows past ~5 engineers querying concurrently.
Databricks. Better than Snowflake for the ML/Spark workloads we don't have. Worse than Snowflake for the analytical SQL idiom-set we use heavily. Same vendor-managed-schema concern. Skip.
BigQuery. Strong ad-hoc analytics. Weak on the bitemporal correlated-subquery patterns the substrate needs. Strong on cost-per-query for large scans; the substrate's query mix is small scans on indexed columns. Wrong tool.
Custom Postgres / Aurora. Considered. The bitemporal + polymorphic + recursive-lookthrough workload would force us to build column-store and indexing patterns that DuckDB gives us free. ACID is a nice-to-have we don't strictly need (the substrate is built from idempotent ingestion pipelines). Reasonable fallback if Iceberg becomes a constraint.