The Drift Problem in AI Systems
Most production AI systems don’t fail. They drift.
The difference matters. A failure is visible — an error, a crash, a hallucination flagged by a user. Drift is invisible. The output gets a little worse each week. The relevance drops slightly. The tone shifts. The accuracy degrades at the margins. Nobody notices because nobody is measuring.
I’ve seen this play out three ways:
Data drift. The documents the agent searches over change — new versions replace old ones, pages get updated, permissions shift. The retrieval layer returns different chunks than it did during development. The prompts were tuned for the old chunks. Nobody re-tuned them for the new ones.
Model drift. If you’re using a hosted model API, the provider updates the model. Not major version bumps — those you’d notice. Minor updates. Weights change. Behavior shifts at the edges. Your carefully tested prompts produce subtly different output. You didn’t change anything, but the foundation moved underneath you.
Expectation drift. The stakeholders who approved the pilot had specific expectations. Over time, usage patterns diverge from those expectations. People use the agent for things it wasn’t designed for. It handles them poorly. But nobody goes back to the original scope to check whether the current usage matches what was tested.
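The retrieval half of this is checkable. A minimal sketch, assuming a `retrieve` function that maps a query to chunk IDs in rank order, and a JSON baseline of the IDs recorded back when the prompts were tuned (both names are mine, not from any particular framework):

```python
import json

def retrieval_drift_report(retrieve, probe_queries, baseline_path):
    """Compare the chunk IDs returned today against a recorded baseline.

    `retrieve` is assumed to take a query string and return a list of
    chunk IDs in rank order; `baseline_path` holds the IDs captured
    when the prompts were last tuned.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)  # {query: [chunk_id, ...]}

    report = {}
    for query in probe_queries:
        expected = set(baseline.get(query, []))
        actual = set(retrieve(query))
        # Fraction of the baseline chunks still being returned:
        # 1.0 means the retrieval layer is unchanged, 0.0 fully drifted.
        report[query] = (
            len(expected & actual) / len(expected) if expected else 0.0
        )
    return report
```

Run it against a fixed set of probe queries whenever the corpus is re-indexed; a score dropping toward zero means the prompts are now tuned for chunks the agent no longer sees.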
The common thread: no monitoring. Most AI deployments have health checks for uptime and latency. Almost none have quality checks for output relevance, accuracy, or tone. The system is “up” but quietly getting worse.
The fix is boring: periodic evaluation. Sample the output. Compare it to a baseline. Check whether the retrieval layer is returning what you expect. Do this monthly, not annually.
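The evaluation doesn't have to be elaborate. A sketch of the monthly check, assuming an `agent` callable and a dictionary of baseline reference answers (both hypothetical); the token-overlap score here is a crude stand-in for whatever judge, human or model-based, you actually trust:

```python
import random

def monthly_quality_check(agent, baseline_cases, sample_size=20, threshold=0.6):
    """Sample test cases, score the agent's output against recorded
    baseline answers, and flag the run if the mean score dips.

    `agent` maps a query to an answer; `baseline_cases` maps queries
    to reference answers captured when the system was approved.
    """
    queries = random.sample(
        list(baseline_cases), min(sample_size, len(baseline_cases))
    )
    scores = []
    for query in queries:
        ref = set(baseline_cases[query].lower().split())
        out = set(agent(query).lower().split())
        # Crude overlap score: what fraction of reference tokens survive.
        scores.append(len(ref & out) / len(ref) if ref else 0.0)
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "drifted": mean < threshold, "n": len(scores)}
```

Thirty lines, a cron job, and a threshold you argue about once. The point is not the scoring function; it is that the comparison happens at all, on a schedule, against a baseline someone wrote down.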
Nobody wants to build the monitoring. Everyone wants to build the agent. That’s how you end up with a system that worked great in March and subtly fails by September.