Shipping AI Products That Don't Break in Prod

The notebook lies

Every model looks brilliant in a notebook. The cold truth shows up the moment a real user types something you didn't anticipate, and the eval set you spent two weeks curating turns out to miss roughly 40% of reality.

If it isn't observable, it isn't shipped.

— an SRE, probably

p99 latency budget: 350ms
Eval coverage gap: ~40%
Cost / 1k req: $0.18

Boring infra that saved us

→ Per-request token + cost logging
→ Shadow eval on a sample of live traffic
→ Hard timeouts, soft fallbacks
→ Prompt versioning in git, not Notion
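
A rough sketch of the first two bullets, assuming a TypeScript service. Every name here (Pricing, RequestMeta, RequestLog, costUsd, inShadowSample, buildLog) is a hypothetical placeholder, not part of any provider's SDK, and the prices are whatever your provider actually charges. The idea: compute cost from token counts at request time, stamp the prompt version, and pick shadow-eval traffic by hashing the request id so the decision is reproducible.

```typescript
// Hypothetical price table (USD per 1k tokens). Fill in your provider's real rates.
interface Pricing {
  inputPer1k: number;
  outputPer1k: number;
}

// What the caller knows once the model call returns.
interface RequestMeta {
  requestId: string;
  promptVersion: string; // assumption: a git commit SHA or tag for the prompt file
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

// One structured log record per request.
interface RequestLog extends RequestMeta {
  costUsd: number;
  shadowEval: boolean;
}

// Cost of a single request from its token counts.
function costUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens / 1000) * p.inputPer1k + (outputTokens / 1000) * p.outputPer1k;
}

// Deterministic sampling: an FNV-1a-style hash over the id's UTF-16 code units,
// mapped into [0, 1). The same request id always gets the same decision.
function inShadowSample(requestId: string, rate: number): boolean {
  let h = 2166136261; // 32-bit FNV offset basis
  for (let i = 0; i < requestId.length; i++) {
    h ^= requestId.charCodeAt(i);
    h = Math.imul(h, 16777619); // 32-bit FNV prime
  }
  return (h >>> 0) / 4294967296 < rate;
}

// Assemble the record; the caller decides where it goes (stdout, a queue, a table).
function buildLog(meta: RequestMeta, pricing: Pricing, sampleRate: number): RequestLog {
  return {
    ...meta,
    costUsd: costUsd(meta.inputTokens, meta.outputTokens, pricing),
    shadowEval: inShadowSample(meta.requestId, sampleRate),
  };
}
```

Writing each record as one JSON line, for example `console.log(JSON.stringify(buildLog(meta, pricing, 0.05)))`, is usually enough to feed a dashboard. Hashing the id instead of calling `Math.random()` means a replayed request lands in the same bucket.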

// always wrap the model call
const out = await withTimeout(
  llm.invoke(prompt),
  { ms: 8000, fallback: cachedAnswer },
);
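
The snippet above shows only the call site, so here is one way `withTimeout` might look. This is a sketch under assumptions, not the original helper: the `TimeoutOptions` type is made up for illustration, and treating a rejected model call the same as a timeout (return the fallback) is a design choice the original does not confirm.

```typescript
// Hypothetical options type; the field names mirror the call site above.
interface TimeoutOptions<T> {
  ms: number; // hard deadline in milliseconds
  fallback: T; // soft fallback returned when the deadline passes or the call fails
}

async function withTimeout<T>(work: Promise<T>, opts: TimeoutOptions<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with the fallback once the deadline passes.
  const deadline = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(opts.fallback), opts.ms);
  });

  try {
    // Whichever promise settles first wins.
    return await Promise.race([work, deadline]);
  } catch {
    // Assumption: a failed model call degrades to the fallback instead of throwing.
    return opts.fallback;
  } finally {
    // Clear the timer so it does not keep the process alive after a fast response.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that this only stops waiting; the underlying request keeps running in the background. Cancelling it for real would need something like an `AbortController` signal passed into the model client, if the client supports one.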

Build the dashboard before the demo. You will need it on day three, not day thirty.