← All ADRs

ADR-006 — Build the agent layer with Anthropic Claude (via MCP) over Bedrock / OpenAI / local

Status: Accepted (POC). The MCP abstraction means the model-host choice is reversible. Date: 2026-05-28.

Context

The agent-first thesis requires a model + protocol choice. Four meaningful options:

  1. Anthropic Claude (direct SDK + MCP) — Claude Opus/Sonnet via the Anthropic SDK, with MCP as the tool surface protocol.
  2. OpenAI (GPT-5/4o via Responses API) — comparable model quality, comparable tool-use; MCP support newer.
  3. AWS Bedrock (multi-model) — Claude + Llama + Mistral + Titan behind one IAM surface. AWS-native.
  4. Local models (Llama 3.3 70B, Mistral Large 2, etc.) — runs on our infra, no per-token cost.

The decision is anchored on two questions: which model produces correct structured output and grounded reasoning on financial questions? and what's the switching cost if we change our mind?

Decision

Build with Anthropic Claude via the Anthropic SDK, and MCP as the tool protocol (see srdb_mcp/). Claude calls MCP tools to query srDB; the substrate never sees prompts directly.

Switching to a different model host requires changing the agent runtime, not the substrate. The MCP server is the abstraction; any agent host that speaks MCP is compatible.

Consequences

Positive:

  • Strong current quality on structured output reliability in financial domain in our hands-on use. Claude Opus + Sonnet 4.x have been consistent on typed-tool responses in our development; OpenAI Responses + GPT-5 are also strong and have closed the gap. This is an opinion based on the author's traces; a real production decision should re-benchmark on representative tasks.
  • MCP originated with Anthropic and has the longest-running first-party support. OpenAI shipped MCP support in the Responses API in 2025; it works in production. The differences today are configuration ergonomics, not protocol gaps. Either provider is a defensible choice; we picked Anthropic for the POC because the author has more hands-on familiarity with their tool-use surface and MCP integration. A production decision should weigh team familiarity, procurement, and benchmarks on representative tasks.
  • Tool-use semantics are first-class in the Anthropic SDK. The tools parameter on messages.create matches MCP's tool surface 1:1.
  • The protocol is the moat, not the model. If Anthropic disappears tomorrow, the substrate doesn't care. The MCP server is unchanged. We swap the agent host.

Negative:

  • Pricing + rate-limit dependence on Anthropic. Multi-host fallback (e.g. Bedrock-hosted Claude + direct-API Claude) is on the Phase 2 list.
  • AWS-shop teams may prefer Bedrock for IAM + procurement reasons. The Anthropic direct SDK is a separate contract.
  • We're not testing the substrate's agent surface with GPT or Gemini regularly. The MCP abstraction guarantees it should work, but "guarantees" without test coverage are aspirational. Periodic cross-model agent testing is on the Phase 2 list.

Alternatives Considered

  • OpenAI direct (GPT-5). Comparable quality on the substrate's question class; Responses API has shipped MCP support. Reasonable alternative — the choice between Anthropic and OpenAI is primarily about team familiarity and benchmarks on representative tasks, not protocol availability. Pure model + ergonomics call, no architectural lock either way.
  • Bedrock (Claude-on-Bedrock or multi-model). AWS-native, single IAM, regional residency, audit story bundled with AWS. Adds proxy hops vs direct Anthropic API (the practical latency cost varies by region and workload and is worth measuring on your own traffic before deciding). Right if SRHG standardizes on AWS for procurement and the IAM/audit story matters more than direct-API ergonomics. The realistic shape is direct Anthropic API as default, Bedrock as alternate for IAM-bound workloads; the MCP abstraction makes that fan-out cheap.
  • Gemini. Tool-use is mature but MCP support is younger and Gemini's reasoning on long structured contexts (the kind of tool-response stream the substrate produces) is less consistent than Claude's. We'd consider it for cost-sensitive workloads where the question class is narrower.
  • Local models (Llama / Mistral). Per-token cost is zero; latency is single-digit-ms on a GPU we already own. But structured-output reliability and tool-use composability are not at the bar the substrate needs. Worth re-evaluating annually. The substrate's MCP surface is model-agnostic so the switching cost is hours, not weeks.