Published May 12th, 2026
A few years back, I was running a time-series pipeline that scored incoming product reviews on a 1-10 scale. The scorer was a Large Language Model (LLM): reviews rolled in continuously, and the ratings flowed into a dashboard that the product team checked every Monday morning. Everything ran clean for months. Then one Monday the chart had a step in it.
Reviews from the prior week averaged 6.4. The current week averaged 7.6, for the same product and the same customers. When I went back to read them, the reviews looked indistinguishable from what we had been getting all year.
The model had changed. The provider had pushed a quiet update to the weights, and the LLM that gave us 6.4-equivalent scores last week was now giving 7.6-equivalent scores for the same content. Every historical comparison in that dashboard was silently invalid. The cleanup took a week. The harder conversation was about how much of our reporting had been real in the first place.
That kind of failure is the default behavior of LLMs in production, and trying to engineer it away with tighter parameters or pinned versions is a losing fight. The job is to design with those failures in mind. I learned this lesson twice: from the reviews pipeline, and from raising two kids.
If you have lived through the toddler years, you have run this experiment a few hundred times without calling it that. The lunch you packed all last week, the one that came home empty every day, suddenly gets pushed off the table on Tuesday with full commitment. The bedtime story that worked for six straight nights stops working on the seventh. The nap routine the babysitter swore was solid breaks the moment you start calling it a “rule.”
Experienced parents eventually stop trying to force determinism on the kid. Patterns and trends still matter, but you stop expecting any individual input to produce any individual output. Instead, you build a system that absorbs the variance and doesn’t fight it. This is the same shift AI engineers make in production, usually after their first calibration regression.
In that reviews pipeline, the LLM wasn’t generating content; it was grading it. Each incoming review went to the model with a rubric, and the model returned a 1-10 score that rolled up into the weekly dashboard. That makes the LLM a judge: a model whose only job is to evaluate other inputs against a standard. And the pipeline taught me the judge can be the most fragile thing in the system. The model being evaluated can drift. The model doing the evaluating can drift too. And unless you have something stable to anchor against, you can’t tell which model is causing the change.
The pattern that works is a small held-out set of inputs with known, human-validated scores, and the habit of re-running it on a regular cadence. You can call this set of inputs the “calibration set,” and it’s likely you already have enough data to establish your own. You only need 20 to 50 examples to test.
When a model changes, re-score the calibration set. If the average jumps despite no other changes, like ours did from 6.4 to 7.6, you know the judge moved, not the data. Without a calibration set to validate against, the same diagnosis could have taken weeks of reading individual reviews and arguing about what changed.
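A minimal sketch of that check, assuming you keep the judge's scores from a trusted baseline run alongside the scores from the current run over the same calibration examples. The function name `calibration_drift` and the 0.5-point tolerance are illustrative choices, not values from the original pipeline.

```python
from statistics import mean

def calibration_drift(baseline_scores, current_scores, tolerance=0.5):
    """Compare two judge runs over the SAME calibration examples.

    baseline_scores: scores from a run you trust (hypothetical input).
    current_scores:  scores from the re-run, in the same order.
    tolerance:       largest mean shift you accept (assumed, tune per rubric).
    """
    if not baseline_scores:
        raise ValueError("calibration set is empty")
    if len(baseline_scores) != len(current_scores):
        raise ValueError("runs must cover the same calibration examples")

    # The inputs did not change, so any shift in the mean belongs to the judge.
    shift = mean(current_scores) - mean(baseline_scores)
    return {"shift": shift, "drifted": abs(shift) > tolerance}
```

Because the calibration inputs are fixed, a result like `{"shift": 1.25, "drifted": True}` points at the judge rather than at the incoming reviews.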
This is where AgentControl’s offline evaluations are most useful. You upload your calibration set as a dataset, point a judge at it, and re-run on a cadence or before any variation change. The discipline I had to implement manually (keep the judge anchored, keep its inputs comparable, watch the distribution rather than any single response) becomes a property of the configuration, instead of a script someone has to remember to run.
The parenting version of this example is the pencil marks on a doorframe. The doorframe does not move. Every few months you put the kid against it, shoes off and back to the wall. If the line jumps three inches and you realize the kid is wearing sneakers, you take the shoes off and measure again before you believe the kid has grown three inches. The doorframe is your calibration set. The shoes-off rule is the discipline that keeps re-runs comparable.
Some models let you set a temperature. At zero, the model picks the highest-probability token at each step, which removes sampling randomness and makes outputs far more repeatable. But 100% determinism is not guaranteed. My reviews pipeline had been running at temperature zero and was still hit by the provider’s model change. Temperature zero makes the output close to deterministic for a given model, but it can’t protect you when the model itself is swapped out. Once the weights changed, greedy decoding was faithfully selecting the highest-probability tokens from a different model, so the scores shifted no matter how the temperature was set.
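A toy illustration of that point, using made-up logits for two hypothetical model versions. Nothing here calls a real model; it only shows that lowering the temperature concentrates probability on the top token, and that the top token depends on the weights.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """The temperature-zero limit: always take the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

TOKENS = ["6", "7", "8"]

# Invented logits for the same prompt under two sets of weights.
old_model_logits = [2.1, 1.9, 0.3]
new_model_logits = [1.2, 2.4, 0.9]

old_choice = TOKENS[greedy_pick(old_model_logits)]
new_choice = TOKENS[greedy_pick(new_model_logits)]
```

Greedy decoding is perfectly repeatable within each list of logits, yet `old_choice` and `new_choice` differ because the logits themselves changed.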
Temperature zero compresses the variance you can see during testing, which makes you feel safer, but does nothing about the variance that actually breaks production. Design as if the model can produce a different valid output every time, because eventually it will.
Before you ship the model that does a new thing well, ship the path that runs when the model does the new thing badly, slowly, or not at all. It’s easy to skip this step, but that’s part of what makes it so important.
For an LLM endpoint, preparing a fallback path can include:
You must assume the model will misbehave at some point and define useful response behaviors for when it does.
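One way to sketch a fallback path that covers the "badly, slowly, or not at all" cases. The names `score_with_fallback`, `primary`, and `fallback` are hypothetical, and the 1-10 integer check assumes the rubric from the reviews pipeline; treat it as a starting point, not a finished implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def is_valid_score(value):
    """Accept only integers from 1 to 10 (bool is excluded on purpose)."""
    return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 10

def score_with_fallback(review, primary, fallback, timeout_s=5.0):
    """Return (score, source); source says which path produced the score.

    primary and fallback are caller-supplied callables (hypothetical here).
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary, review)
    try:
        score = future.result(timeout=timeout_s)
        if is_valid_score(score):
            return score, "primary"
        reason = "invalid_output"   # the model answered badly
    except FutureTimeout:
        reason = "timeout"          # the model answered slowly
    except Exception:
        reason = "error"            # the model did not answer at all
    finally:
        # Do not block on a stuck call; the worker thread may still finish
        # in the background after we have moved on.
        executor.shutdown(wait=False)
    return fallback(review), reason
```

Returning the reason alongside the score lets you count how often each failure mode fires, which is itself a production signal worth watching.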
You can implement these responses in an adaptive system that watches the production signal itself and mitigates bad performance without a human in the loop. This is what configuration-driven LLM tooling is built for. With LaunchDarkly AgentControl, different models live as configuration rather than code, traffic can shift between them without a deploy, and a guarded release can use an online evaluation score, or any other AgentControl metric, to determine whether a variation advances, pauses, or reverts. When the judge sees scores regress past a threshold, the rollout reverses itself. The fallback stops being a piece of code that needs human implementation and becomes the architecture, watching itself.
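The decision logic behind an advance, pause, or revert rule can be sketched in plain Python. This is not the AgentControl API; `rollout_decision`, `min_samples`, and `max_regression` are hypothetical names and thresholds chosen only to illustrate the idea.

```python
from statistics import mean

def rollout_decision(baseline_scores, candidate_scores,
                     min_samples=30, max_regression=0.5):
    """Decide what a guarded rollout should do next.

    baseline_scores:  judge scores for the current variation.
    candidate_scores: judge scores for the variation being rolled out.
    min_samples:      evidence required before acting (assumed value).
    max_regression:   largest tolerated drop in mean score (assumed value).
    """
    if not baseline_scores:
        raise ValueError("need baseline scores to compare against")
    if len(candidate_scores) < min_samples:
        return "pause"  # not enough evidence yet, hold traffic where it is

    regression = mean(baseline_scores) - mean(candidate_scores)
    if regression > max_regression:
        return "revert"  # candidate scores worse than the threshold allows
    return "advance"
```

The sample-size guard matters as much as the threshold: a handful of scores is a single sample, not a distribution.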
Parents already operate this way. The grocery store meltdown will happen. The school will call at 11am about a “low-grade fever.” You have a snack pre-stuffed in your bag and a backup babysitter on text. The fallback is the architecture. The happy path is the bonus.
After the reviews pipeline incident, my work changed. I started logging the model version returned in every API response. I built a calibration set. I stopped trusting any single eval run as a verdict. The boring fallback path now ships before the impressive demo path.
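As a rough sketch of the version logging, assume each API response is available as a dict with a `"model"` field naming the version that served it. That field name is an assumption; check what your provider actually returns. `ModelVersionTracker` is a hypothetical helper.

```python
import logging

logger = logging.getLogger("judge")

class ModelVersionTracker:
    """Remember the last model version seen and flag any change."""

    def __init__(self):
        self.last_seen = None

    def observe(self, response):
        """Log the version in this response; return True if it changed."""
        # "model" is an assumed key; responses without it are logged as unknown.
        version = response.get("model", "unknown")
        changed = self.last_seen is not None and version != self.last_seen
        if changed:
            logger.warning("judge model changed: %s -> %s",
                           self.last_seen, version)
        else:
            logger.info("judge model version: %s", version)
        self.last_seen = version
        return changed
```

A change flagged here is a natural trigger for re-scoring the calibration set before trusting the next batch of scores.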
Thinking and working this way produces robust, failure-tolerant systems and results you can trust. You just have to accept what every parent of a small kid already knows: the interesting measurement is the distribution, not a single sample.