Your AI feature is live. Users are hitting it. And somewhere, quietly, it is getting worse.
This is the part of AI development no one talks about during the build phase. A chatbot that answers questions well in August can start giving subtly wrong answers by November, not because anyone changed the code, but because the world changed around it. Monitoring is the difference between catching that drift in week two and finding out about it from a frustrated customer in month four.
Here is what to track, how to decide when to act, and what it actually costs to keep an AI feature running well.
## What should I track once an AI feature is in production?
The honest starting point: you need four numbers, not forty.
Output accuracy is the first number: how often the AI gives a response that is actually correct or useful. You will not be able to check every single response, but you can sample. Pull 50–100 responses per week and have a team member rate them. This sounds manual because it is. There is no perfect automated substitute for a human reading an AI output and saying "yes, that is right" or "no, that missed the point." LangChain published internal benchmarks in 2023 showing that automated evaluation metrics alone missed roughly 30% of meaningful quality regressions that human spot-checks caught.
The second number is user rejection rate. Track every time a user copies the response and rewrites it, hits the thumbs-down button, immediately asks the same question again, or abandons the session after the AI responds. Any of those signals tells you the output was not useful. This is easier to collect than it sounds: it is mostly click and session data your analytics tool already captures.
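As a sketch, the rejection rate reduces to counting responses that carry any one of those signals. The field names below (`thumbs_down`, `retried`, `abandoned`) are illustrative, not a standard analytics schema:

```python
# Hypothetical per-response event records; in practice these come
# from your analytics export.
events = [
    {"session": "a1", "thumbs_down": False, "retried": False, "abandoned": False},
    {"session": "a2", "thumbs_down": True,  "retried": False, "abandoned": False},
    {"session": "a3", "thumbs_down": False, "retried": True,  "abandoned": False},
    {"session": "a4", "thumbs_down": False, "retried": False, "abandoned": True},
    {"session": "a5", "thumbs_down": False, "retried": False, "abandoned": False},
]

def rejection_rate(events):
    """Share of responses carrying at least one rejection signal."""
    rejected = sum(
        1 for e in events
        if e["thumbs_down"] or e["retried"] or e["abandoned"]
    )
    return rejected / len(events)

print(rejection_rate(events))  # 3 of 5 responses rejected -> 0.6
```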
Third: latency, meaning how long users wait for a response. If your feature took two seconds in testing and now takes six, something has changed, either on the model provider's end or in how you are passing context to the model. Users will not file a support ticket about this. They will just stop using the feature.
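A minimal way to capture latency is a server-side timer wrapped around every model call. The `fake_model_call` below is a stand-in for a real provider API, used here only so the sketch runs on its own:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Wrap any model call and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def fake_model_call(prompt):
    """Stand-in for a real provider API call."""
    time.sleep(0.05)  # simulate network plus inference latency
    return f"response to: {prompt}"

result, latency_ms = timed_call(fake_model_call, "What is your return policy?")
print(f"{latency_ms:.0f} ms")  # log this number on every call
```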
Fourth: cost per query. AI inference is not free. OpenAI's GPT-4 pricing in mid-2023 was roughly $0.03 per 1,000 tokens on input and $0.06 per 1,000 tokens on output. If your average query doubles in size over three months because you are accidentally passing more context than you need, your bill doubles too. Catching that drift in cost data is cheaper than an unexpected invoice.
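The arithmetic is simple enough to sanity-check by hand. Using the mid-2023 list prices quoted above (substitute your provider's current rates):

```python
# Mid-2023 GPT-4 list prices from the text (USD per 1,000 tokens);
# substitute your provider's current rates.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

def cost_per_query(input_tokens, output_tokens):
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A query with a 2,000-token prompt and a 500-token answer:
print(round(cost_per_query(2000, 500), 4))  # 0.06 + 0.03 = 0.09
```

If the average prompt quietly grows from 2,000 to 4,000 tokens, this number doubles, which is exactly the drift worth alerting on.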
| Metric | What it tells you | How to collect it |
|---|---|---|
| Output accuracy | Whether responses are still useful | Weekly human spot-check of 50–100 samples |
| User rejection rate | Whether users trust the output | Thumbs-down clicks, session abandonment, rewrites |
| Response latency | Whether the feature is still fast | Server-side timer on every API call |
| Cost per query | Whether token usage is creeping up | Model provider dashboard or billing API |
Start with these four. Once you have a stable baseline for each one, you will know when something is wrong. That baseline, established in the first two weeks after launch, is the most important thing you can build.
## How does model drift affect AI performance over time?
Model drift is not a bug. No one breaks anything. The model just stops being as good at your specific task, and it happens for three distinct reasons.
Your data changes. The documents, product catalog, customer profiles, or policy text that your AI references were accurate when you launched. Three months later, some of it is outdated. If your AI is summarizing your return policy and the return policy changed, the AI will confidently summarize the old one. This is data drift, and it accounts for the majority of quality regressions in AI features that connect to real-world content.
Your users change. The questions people actually ask in month three are different from the questions they asked in month one. Early users tend to be curious early adopters; later users tend to be more goal-oriented and less forgiving. The feature may have been tuned to handle exploratory questions well but struggles with direct, task-focused requests. This is behavioral drift, and it shows up as an increase in user rejection rate before it shows up in any other metric.
The model itself changes. Model providers update their underlying models and sometimes change behavior without publishing release notes. Google updated its Bard model twice in the second half of 2023 with changes that affected tone and refusal behavior. OpenAI documented similar updates. If you deploy against a specific model version, this risk is lower, but version pinning is not always an option across every provider.
The MIT-IBM Watson AI Lab published research in 2022 showing that real-world AI models lose an average of 11% accuracy within six months of deployment without any active retraining or prompt maintenance. The decay is faster in domains where the underlying data changes frequently, such as pricing, inventory, or news.
## When should monitoring trigger a rollback or retraining?
Set thresholds before you launch, not after something breaks.
A useful rule for most AI features: if output accuracy drops more than 10 percentage points below your launch baseline, that is a retraining trigger. If user rejection rate doubles from your week-two baseline, that is a rollback trigger because something has degraded badly enough that users notice. If latency doubles, that warrants an immediate investigation into whether the model provider has a problem or whether your prompts have grown out of control.
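Those three rules of thumb can be encoded directly. The metric names and thresholds below are the suggestions from this section, not an industry standard; tune them per feature:

```python
def triggered_actions(baseline, current):
    """Apply the rules of thumb above to a baseline and a current
    snapshot of the four core metrics."""
    actions = []
    if baseline["accuracy"] - current["accuracy"] > 0.10:
        actions.append("retrain")      # accuracy fell more than 10 points
    if current["rejection_rate"] >= 2 * baseline["rejection_rate"]:
        actions.append("rollback")     # rejection rate doubled
    if current["latency_ms"] >= 2 * baseline["latency_ms"]:
        actions.append("investigate")  # latency doubled
    return actions

baseline = {"accuracy": 0.92, "rejection_rate": 0.05, "latency_ms": 2000}
current  = {"accuracy": 0.78, "rejection_rate": 0.12, "latency_ms": 2400}
print(triggered_actions(baseline, current))  # ['retrain', 'rollback']
```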
Rollback and retraining are different responses to different problems. A rollback makes sense when a recent change you made caused the regression. If you updated your prompts last Tuesday and quality dropped by Thursday, roll back the prompts. This takes hours, not weeks. Retraining (or, for a hosted model, re-tuning your prompts or refreshing your retrieval data) makes sense when the underlying content or user behavior has drifted. That takes longer: two to four weeks for a proper cycle of data collection, evaluation, and update.
The most expensive mistake founders make here is waiting until users complain before checking the numbers. By the time a user files a support ticket about an AI feature giving wrong answers, the problem has usually been present for weeks. Automated alerts on your four core metrics mean you find out first.
A practical setup: alert when any metric crosses its threshold two days in a row. A single bad day can be noise. Two consecutive days is a trend worth acting on.
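The two-consecutive-days rule reduces to a one-line check over a per-day breach flag:

```python
def should_alert(daily_breaches):
    """True only when a threshold is breached two days in a row.
    `daily_breaches` is one boolean per day, oldest first."""
    return any(a and b for a, b in zip(daily_breaches, daily_breaches[1:]))

print(should_alert([False, True, False, True]))  # isolated spikes: False
print(should_alert([False, True, True]))         # two-day trend: True
```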
## Can I reuse existing observability tools for AI monitoring?
Yes, with one important caveat.
If you already use Datadog, New Relic, or Grafana, you can track latency and cost from day one without any new tooling. Those platforms handle time-series metrics well, and response time plus cost per query are just numbers you log on every API call. Connecting your model provider's billing API to your existing dashboard takes a few hours of engineering time.
Where existing tools fall short is on output quality. Datadog cannot tell you whether a response was good. It can only tell you that a response arrived. For the accuracy and user rejection metrics, you need something closer to your application layer.
In 2023, the practical options were lightweight and mostly self-built. The typical setup was: log every AI request and response to a database, build a simple internal tool where a team member could rate samples each week, and track thumbs-down events in your existing analytics platform. Purpose-built AI observability platforms like Arize AI and Weights & Biases were available but added $500–$2,000 per month in cost, which made sense for teams running multiple AI features at scale, not for a founder with one feature in production.
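A minimal version of that self-built setup, sketched here with SQLite and illustrative table and column names (any database works; the shape is what matters):

```python
import datetime
import random
import sqlite3

# One table, one insert per AI call, plus a weekly random sample
# for human review. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_log (ts TEXT, prompt TEXT, response TEXT, rating TEXT)"
)

def log_call(prompt, response):
    """Record one request/response pair; `rating` stays NULL until reviewed."""
    conn.execute(
        "INSERT INTO ai_log (ts, prompt, response) VALUES (?, ?, ?)",
        (datetime.datetime.now(datetime.timezone.utc).isoformat(),
         prompt, response),
    )

# Simulate a week of traffic.
for i in range(500):
    log_call(f"question {i}", f"answer {i}")

# Weekly spot-check: pull a random sample for a team member to rate.
rows = conn.execute("SELECT rowid, prompt, response FROM ai_log").fetchall()
sample = random.sample(rows, 50)
print(len(sample), "responses queued for review")
```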
The practical answer for most early-stage products: reuse your existing infrastructure for latency and cost, and build a minimal quality-sampling workflow by hand. Budget two to four hours of engineering time to set it up.
## What does a basic AI monitoring setup cost?
A basic setup that covers all four core metrics costs $200–$600 per month in tooling, with four to eight hours of engineering time to configure it.
The cost breaks down like this. If you are using a hosted model through OpenAI or a similar provider, the model cost itself is separate from monitoring. For monitoring infrastructure, a Datadog starter plan runs about $15 per host per month and handles latency and cost tracking well. A simple database for logging AI inputs and outputs costs $50–$150 per month on any major cloud provider at early-stage volume. An analytics tool to capture user rejection events is often already in your stack at no additional cost.
| Component | What it does | Monthly cost |
|---|---|---|
| Observability platform (e.g., Datadog) | Tracks latency and cost per query | $15–$50 |
| Log storage for AI inputs and outputs | Stores samples for quality review | $50–$150 |
| Analytics events (thumbs-down, abandonment) | Captures user rejection signals | Usually $0 (existing tool) |
| Human review time (weekly spot-check) | Rates output quality on 50–100 samples | 1–2 hrs/week of staff time |
A Western agency that sets up AI monitoring as a separate engagement typically charges $3,000–$8,000 for the initial configuration, plus $500–$1,500 per month for ongoing management. At Timespade, monitoring setup is part of the AI feature build, not a separate line item. The alerts, log storage, and quality sampling process are ready on the same day the feature goes live.
The one thing money cannot shortcut is the weekly review habit. Someone on your team needs to look at the numbers. An alert that fires and gets ignored is the same as no alert. Budget one to two hours per week for whoever owns the AI feature to review the metrics and decide whether anything warrants action. That habit, maintained consistently, is worth more than any monitoring platform.
If you are building an AI feature and want monitoring built into the delivery rather than bolted on after launch, book a free discovery call.
