# Measuring What Matters: A/B Testing a Model Upgrade in Production
When Google shipped Gemini 3.1 Pro with "2x reasoning improvement," the obvious question wasn't *should we upgrade?* — it was *how would we know if it actually helped?*
Most AI applications upgrade models blindly. Swap the model ID, run a vibe check, ship it. This works when your output is creative text or chat responses where "better" is subjective. It doesn't work when your application makes forensic claims about human behaviour, assigns severity ratings, and produces findings that someone might take to a lawyer.
We run a multi-stage psychological analysis pipeline — 25 LLM calls across 7 stages — that analyses email threads for gaslighting and manipulation patterns. The pipeline looks like this:
| Stage | What it does | LLM calls | Model |
|---|---|---|---|
| 0. Thread Parsing | Structure raw email into indexed messages | 1 | Flash |
| 1A. Narrative Summary | Extract tactics, contradictions, ghost authorities | 1 | Flash |
| 1B. DARVO Analysis | Detect Deny/Attack/Reverse-Victim-Offender patterns | 3 | Flash |
| 1C. Deep Analysis | Psychological reasoning — emotional dynamics, power structures, hidden narratives | 1 | 3.1 Pro (high thinking) |
| 2. Semantic Practices | Obligation asymmetries, dead-end patterns, circular logic, semantic transformations | 5 | Flash |
| 3. Evidence Anchoring | Chronological narrative, disputed fact identification | 1 | Flash |
| 4. Claim Verification | Verify contested facts against the thread corpus | 1 | 3.1 Pro (low thinking) |
| 5. Synthesis Report | Final report assembly with governance findings | 1 | Flash |
Every finding is anchored to specific email references. Every claim has a verification status. This is structured reasoning, not vibes.
So we built the infrastructure to measure the upgrade properly. Here's what we found.
## The Upgrade Path
The model upgrade touched three layers:
Framework: @ax-llm/ax from v16 to v18. Two major version bumps, both clean drop-ins — zero source changes required. The breaking changes (AxJSInterpreter renamed, AxAgent RLM redesign) didn't affect our codebase because we use the generator and signature APIs, not the agent layer. Ax made this A/B test practical: its StageModelConfig pattern lets you reroute individual pipeline stages to different models without touching the prompt logic, so we could swap models surgically and compare outputs with identical prompts.
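The stage-routing idea can be sketched in a few lines. The stage names below come from the pipeline table above; the config shape and the `modelFor` helper are illustrative assumptions, not Ax's actual `StageModelConfig` API:

```typescript
// Hypothetical per-stage model routing: each pipeline stage declares which
// model it runs on and how much thinking budget it gets. Swapping a stage's
// model for an A/B test is then a one-line config change.
type ThinkingLevel = "none" | "low" | "medium" | "high";

interface StageModelConfig {
  model: string; // provider model ID
  thinking: ThinkingLevel;
}

const stageModels: Record<string, StageModelConfig> = {
  threadParsing:     { model: "gemini-flash",   thinking: "none" },
  deepAnalysis:      { model: "gemini-3.1-pro", thinking: "high" },
  claimVerification: { model: "gemini-3.1-pro", thinking: "low" },
  synthesisReport:   { model: "gemini-flash",   thinking: "none" },
};

// Resolve the config for a stage, falling back to Flash for unlisted stages.
function modelFor(stage: string): StageModelConfig {
  return stageModels[stage] ?? { model: "gemini-flash", thinking: "none" };
}
```

Because prompt logic never touches this map, the before/after runs differ only in which model each stage resolves to.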
Model routing: Not every stage benefits from a reasoning upgrade. We analysed all 25 LLM calls and made surgical changes:
- Deep analysis (Stage 1C): `Gemini3Pro` with high thinking → `Gemini31Pro` with high thinking. This is the psychological reasoning stage — the strongest candidate for improved reasoning.
- Claim verification (Stage 4): `Gemini3Flash` with no thinking → `Gemini31Pro` with low thinking. This was the critical fix. Verifying contested facts is reasoning, not extraction. Flash was the wrong model for this task entirely.
- Everything else: Left on Flash. Narrative extraction, evidence anchoring, semantic practices, report assembly — these are structured extraction tasks where Flash's speed matters more than deep reasoning.
A note on "thinking": Gemini 3.1 Pro supports an internal chain-of-thought mode where the model allocates reasoning tokens before producing its visible output. You configure this as a budget level — low, medium, or high — rather than a specific token count. High thinking gives the model more room to reason through complex psychological patterns; low thinking provides a lighter reasoning pass for tasks like claim verification where the logic is simpler but still benefits from deliberation. When thinking is enabled, you must not set maxTokens — the model manages its own output budget.
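The "no maxTokens with thinking" constraint is easy to encode so an invalid combination fails loudly instead of silently. A minimal sketch — the factory name echoes the one mentioned in this post, but the option shape is an assumption, not the actual Gemini SDK surface:

```typescript
// Illustrative factory: thinking-enabled configs manage their own output
// budget, so passing maxTokens alongside a thinking level is rejected
// at construction time rather than silently dropped.
type ThinkingLevel = "low" | "medium" | "high";

interface LLMOptions {
  model: string;
  thinkingLevel?: ThinkingLevel; // set => model allocates reasoning tokens
  maxTokens?: number;            // only valid when thinking is disabled
}

function createThinkingGeminiLLM(opts: LLMOptions): LLMOptions {
  if (opts.thinkingLevel && opts.maxTokens !== undefined) {
    throw new Error("maxTokens must not be set when thinking is enabled");
  }
  if (!opts.thinkingLevel) {
    throw new Error("use the non-thinking factory for extraction stages");
  }
  return opts;
}

const deepAnalysis = createThinkingGeminiLLM({
  model: "gemini-3.1-pro",
  thinkingLevel: "high",
});
```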
The bug that lived in the shadows: The model swap revealed that `claimVerification.ts` was calling `createGeminiLLM` instead of `createThinkingGeminiLLM`. The thinking configuration was being silently ignored — the model was running without its reasoning capability even when we thought we'd enabled it. The upgrade forced the audit that caught the bug.
## The Comparison
We ran the same email thread through the pipeline before and after the upgrade, then built internal Convex queries to pull both reports for structured comparison.
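A structural diff of two reports can be mechanical once both are pulled as data. A minimal sketch — the field names are illustrative (the real reports carry far more structure), but the shape of the comparison is the point:

```typescript
// Diff two analysis reports field by field instead of eyeballing prose.
interface Finding {
  name: string;
  severity: "low" | "moderate" | "severe";
}

interface Report {
  verdict: string;
  findings: Finding[];
  verificationStatuses: Record<string, string>; // claim id -> status
}

function diffReports(before: Report, after: Report): string[] {
  const changes: string[] = [];
  if (before.verdict !== after.verdict)
    changes.push(`verdict: ${before.verdict} -> ${after.verdict}`);
  if (before.findings.length !== after.findings.length)
    changes.push(
      `finding count: ${before.findings.length} -> ${after.findings.length}`
    );
  for (const [claim, status] of Object.entries(after.verificationStatuses)) {
    const prev = before.verificationStatuses[claim];
    if (prev !== undefined && prev !== status)
      changes.push(`${claim}: ${prev} -> ${status}`);
  }
  return changes;
}
```

An empty diff at the verdict level plus a non-empty diff at the status level is exactly the "same shape, sharper edges" outcome described below.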
### What improved
More findings, better calibrated. The post-upgrade report identified 5 detailed findings versus 4, adding "Character Attack" as a distinct isolation tactic. More importantly, it upgraded "Reality Distortion" from moderate to severe — a calibration change that matters when your output informs real decisions.
Richer tactic detection. The narrative analysis found 10 named tactics with multiple instances each, versus 4 tactics pre-upgrade. The model isn't hallucinating new patterns — it's disaggregating compound behaviours that the previous model collapsed into single categories.
More precise verification statuses. This is where the 3.1 Pro thinking mode showed the clearest improvement. Pre-upgrade, the claim about a "governance lawyer" was tagged `unsubstantiated_leverage`. Post-upgrade, it's `requires_external_evidence`. These are meaningfully different epistemic judgments. The first says "this is a manipulation tactic." The second says "this claim could be true but can't be verified from this corpus." The post-upgrade assessment is more forensically honest.
Similarly, a claim about stolen email addresses moved from `unsubstantiated_leverage` to `unverifiable_within_thread`. Again, more precise — the model is distinguishing between "this is being used as leverage" and "we genuinely cannot determine this from the available evidence."
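The distinction these statuses encode can be made explicit in the type system. A sketch using only the three statuses that appear in this comparison (the real vocabulary is presumably larger; the helper name is hypothetical):

```typescript
// The status vocabulary separates "this claim is a manipulation tactic"
// from "this claim cannot be checked from the available evidence".
type VerificationStatus =
  | "unsubstantiated_leverage"   // used as pressure, no support found
  | "requires_external_evidence" // could be true; not checkable from corpus
  | "unverifiable_within_thread"; // undeterminable from this thread alone

// Statuses that defer judgment rather than assert manipulation.
function isEpistemicGap(s: VerificationStatus): boolean {
  return s === "requires_external_evidence" || s === "unverifiable_within_thread";
}
```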
Deeper psychological framing. The post-upgrade deep analysis identified "Severe DARVO coupled with Coercive Control" as the primary pattern, versus just "DARVO" pre-upgrade. Coercive control is a recognised compound pattern in psychological literature — the model is reaching for more clinically precise categorisation.
Double the obligation asymmetries. The semantic practices stage found 6 obligation asymmetries versus 3, detecting subtler power imbalances in how demands flow between parties.
### What stayed the same
The top-level verdict was identical: severe gaslighting, high confidence, DARVO as primary pattern. Both reports found the same contradictions (3), the same ghost authorities (12), the same escalation paths (6). The chronological narrative covered the same ground with comparable accuracy. Email reference precision was consistent across both.
The ghost authorities result is worth examining. Twelve is a high count, and both models found exactly twelve. This suggests the extraction was already saturated — the previous model had identified every unverified authority claim present in the corpus. A better reasoning model can't extract entities that aren't there. When your extraction pipeline is already finding everything, the upgrade's value shows up not in what it finds but in how it reasons about what it found. The ghost authorities count didn't change; what changed was how those authorities were woven into the governance findings narrative and linked to circular logic patterns.
This is the right outcome. A model upgrade should sharpen the edges, not change the shape of the picture.
## The Uncomfortable Lesson
The most valuable discovery wasn't about Gemini 3.1 Pro. It was that our claim verification stage had been running on the wrong model and the wrong factory function for its entire lifetime. The thinking configuration we'd specified in the stage config was being silently dropped because the code called a factory function that doesn't support thinking mode.
This is a class of bug that no amount of testing catches. The output was plausible — Flash is competent at claim verification, just not optimal. The verification statuses were valid enum values. The reasoning text was coherent. Nothing failed. Nothing threw an error. The system just produced subtly worse judgments than it should have, and we had no baseline to notice.
The upgrade forced us to audit every LLM call's wiring against the stage configuration, which is what caught the bug. The lesson: model upgrades are also code audits. Treat them as an opportunity to verify that your actual runtime behaviour matches your intended configuration.
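That audit can be mechanised: cross-check each stage's declared configuration against what its runtime client actually supports. A sketch under assumed names — the interfaces here are illustrative, not the pipeline's real types:

```typescript
// Cross-check: does every stage that declares a thinking level actually run
// on a thinking-capable client? A silent mismatch is exactly the bug class
// described above — nothing throws, output stays plausible.
interface StageConfig {
  stage: string;
  thinking: "none" | "low" | "high";
}

interface RuntimeClient {
  stage: string;
  supportsThinking: boolean;
}

function auditWiring(
  configs: StageConfig[],
  clients: RuntimeClient[]
): string[] {
  const byStage = new Map(clients.map((c) => [c.stage, c]));
  const mismatches: string[] = [];
  for (const cfg of configs) {
    const client = byStage.get(cfg.stage);
    if (!client) {
      mismatches.push(`${cfg.stage}: no runtime client wired`);
    } else if (cfg.thinking !== "none" && !client.supportsThinking) {
      mismatches.push(`${cfg.stage}: thinking configured but client ignores it`);
    }
  }
  return mismatches;
}
```

Run as a startup assertion or CI check, this turns "intended configuration matches runtime behaviour" from a hope into an invariant.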
## How to Measure Model Upgrades
If your AI application produces structured output with measurable properties, here's the approach:
Build comparison infrastructure first. We added internal Convex queries (no auth gate) that return full analysis data via CLI. This lets us pull any two reports and diff them structurally — not just eyeball the UI.
Run the same input through both models. Not different inputs, not synthetic benchmarks. The same real data your users actually submit.
Compare at every granularity level. Top-level verdict (same? good), finding count and severity (more precise? good), verification statuses (more epistemically honest? good), tactic detection depth (more disaggregated? good). A single "better/worse" judgment misses the nuance.
Check for reasoning precision, not just reasoning volume. The post-upgrade report isn't longer. It's more precise. `requires_external_evidence` is a better judgment than `unsubstantiated_leverage` for a claim that references a governance lawyer who may or may not exist. More words isn't the signal. Better distinctions are.
Use the upgrade to audit your wiring. Every model swap is a chance to verify that your code actually does what your configuration says it should.
## The Bigger Picture
Google's "2x reasoning improvement" benchmark number (77.1% on ARC-AGI-2) is a useful signal that something changed. But benchmarks measure benchmark performance. What matters is whether your specific application, with your specific prompts, producing your specific output structures, gets meaningfully better.
For forensic analysis with structured findings — yes, measurably. For extraction tasks where the output is already well-bounded — no observable difference, and Flash's speed advantage remains.
The right model upgrade strategy isn't "put the best model everywhere." It's "put the reasoning model where you need reasoning, keep the fast model where you need speed, and verify your wiring actually connects the two."
## Key Takeaways
- Model upgrades are code audits. The most valuable discovery was a silent wiring bug that had been degrading output quality for months. The upgrade forced the audit that caught it.
- Better reasoning means better distinctions, not more words. The post-upgrade report isn't longer. It distinguishes `requires_external_evidence` from `unsubstantiated_leverage` — a more epistemically honest judgment.
- Route models by task type. Reasoning models for reasoning stages, fast models for extraction stages. Not every node in your pipeline needs a brain upgrade.
- Extraction saturates; reasoning doesn't. When both models find identical entity counts, the upgrade's value is in how it reasons about those entities, not in finding more of them.
- Build comparison infrastructure before you upgrade. Internal queries, CLI access, structured diffs. You can't measure what you can't pull apart.
This essay documents the upgrade of a psychological analysis pipeline from Gemini 3 Pro/Flash to Gemini 3.1 Pro, running on Convex with the Ax LLM framework. The A/B comparison was conducted on the same email corpus across 25 LLM calls spanning 7 analysis stages.