
Regression testing for AI agents: catch reasoning drift before it ships

You shipped a prompt tweak. The agent still returns an answer — it just quietly stopped verifying claims on the way there. This guide is a practical walkthrough of AI agent regression testing: record a golden run, diff a candidate's execution structure against it, and fail the build when the reasoning drifts. The examples use DProvenanceKit — open source, Apache-2.0, zero third-party dependencies in the core.

# 01 · the failure mode

Output tests pass. The reasoning regressed.

The nastiest agent bugs don't throw. You reword a system prompt, bump a model version, or reorder a tool list, and the agent keeps producing plausible answers. Nothing crashes. No assertion fires. But somewhere inside the run it stopped calling its verification step, or started looping its search tool three times where one call used to do.

Traditional testing can't see this, because asserting on final output means matching a string or schema an LLM produced — flaky by construction, and blind to how the answer was assembled. An answer produced without the verify step can look identical to one that was checked. The regression isn't in the output; it's in the execution path. That is the gap LLM regression testing exists to close: treat the reasoning itself — the sequence of steps, tools, and engines — as the thing under test.
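To make the gap concrete, here is a minimal sketch in plain Python. It does not use DProvenanceKit; the `Run` record, the step names, and the answer text are all hypothetical. Two runs return the same answer, so an output assertion passes, while comparing the recorded step sequence shows that the verify step is gone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical record of one agent run: ordered steps plus the final answer."""
    steps: List[str]
    answer: str


# Known-good run: the claim is verified before the agent decides.
golden = Run(
    steps=["plan", "search", "rank", "verify", "decide"],
    answer="The library is licensed under Apache-2.0.",
)

# Run after a prompt tweak: same answer, but the verify step never ran.
candidate = Run(
    steps=["plan", "search", "rank", "decide"],
    answer="The library is licensed under Apache-2.0.",
)

# An output-level check cannot tell the two runs apart.
output_check_passes = candidate.answer == golden.answer

# A structural check compares how the answer was assembled.
structure_check_passes = candidate.steps == golden.steps

print(output_check_passes, structure_check_passes)  # prints: True False
```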

# 02 · the definition

What is prompt regression testing?

Prompt regression testing — also called LLM regression testing — is the practice of comparing a candidate run of an LLM application against a known-good golden run and failing the build when the execution structure changes: a dropped tool call, a looping step, a changed path. It gates on how the agent worked, not on string-matching what it said, so the verdict stays deterministic even when outputs vary token to token.

The framing works because structure is far more stable than text. Two healthy runs of the same agent rarely produce identical tokens, but they take a recognizable path: plan, search, rank, verify, decide. When that path changes without you intending it to, something in your prompt, model, or orchestration changed the agent's behavior — and that is exactly what a regression test should flag. It is not a replacement for evals (more on that in the FAQ); it's the structural complement that can run deterministically on every pull request.

# 03 · golden runs

The golden-run workflow

The mechanics mirror snapshot testing, applied to reasoning instead of rendered output:

1. Record a run you trust as a trace: every step, in order, with its span structure.
2. Save that trace as the golden baseline (a plain SQLite file you can commit).
3. Re-run the agent after every prompt edit, model bump, or refactor.
4. Diff the candidate against the golden and fail when the structure drifts.

Here is the whole loop with a single import — this is the README's quick start, unedited:

five_minute_wow.py · python

from dprovenancekit import trace

# 1. Record an execution
with trace("Agent Workflow"):
    with trace("Retrieve Documents"):
        # your retrieval code here
        pass
    with trace("Verify Claims"):
        # your verification code here
        pass

# 2. Save the trace
trace.save("golden_run.sqlite")

# 3. Print a structural explanation
trace.explain()
# --- Execution Trace (b4f8d2…) ---
# ▶ Started Agent Workflow
#   ▶ Started Retrieve Documents
#   ✔ Finished Retrieve Documents
#   ▶ Started Verify Claims
#   ✔ Finished Verify Claims
# ✔ Finished Agent Workflow

# 4. Catch regressions when the logic changes: rerun the (now buggy) workflow,
#    then diff the current run against the saved golden baseline
trace.diff("golden_run.sqlite")
# --- Trace Diff (Golden vs Current) ---
# ❌ Missing step: Verify Claims

The same record → explain → diff loop, with more context, is on the main page. Under the hood it powers anomaly detection, CI gating, and visual trace analysis.

# 04 · fingerprints

Run fingerprints: the cheap signal

Before you reach for a full step-by-step diff, there's a one-string-compare shortcut: the run fingerprint, a structural identity of the execution path. DProvenanceKit computes it as a SHA-1 over each event's type-and-engine signature in sequence order — so equal fingerprints mean the same typed steps ran through the same engines in the same order.

Reordered tools, a skipped retrieval step, a dropped verify step: all of them change the fingerprint. That makes it a cheap regression signal you can check constantly — equal fingerprints and you're done; different fingerprints and you run the full alignment to see exactly which steps were added, removed, or reordered, and how severe that is. Because the hash covers structure rather than token output, two runs that phrase their answers differently but take the same path fingerprint identically.
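The idea can be sketched with the standard library alone. This is an illustration of a structural hash, not DProvenanceKit's implementation: the `type|engine` serialization, the newline separator, and the engine names `llm` and `tool` are assumptions made for the example.

```python
import hashlib
from typing import List, Tuple

# Each event is reduced to (event_type, engine); payloads and token output are ignored.
Event = Tuple[str, str]


def run_fingerprint(events: List[Event]) -> str:
    """SHA-1 over each event's type-and-engine signature, in sequence order."""
    digest = hashlib.sha1()
    for event_type, engine in events:
        # Assumed serialization: "type|engine" followed by a newline per event.
        digest.update(f"{event_type}|{engine}\n".encode("utf-8"))
    return digest.hexdigest()


golden = [
    ("plan", "llm"),
    ("search.web", "tool"),
    ("verify.claims", "llm"),
    ("decide", "llm"),
]

# Same typed steps through the same engines: wording may differ, structure does not.
same_path = list(golden)

# Structural changes: a dropped step and a reordered pair of steps.
dropped_verify = [event for event in golden if event[0] != "verify.claims"]
reordered = [golden[0], golden[2], golden[1], golden[3]]

assert run_fingerprint(golden) == run_fingerprint(same_path)
assert run_fingerprint(golden) != run_fingerprint(dropped_verify)
assert run_fingerprint(golden) != run_fingerprint(reordered)
```

Because only the type and engine of each event feed the hash, anything carried in payloads (generated text, token counts) cannot change the result in this sketch.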

# 05 · a real catch

Worked example: the research agent that stopped verifying

A research agent, recorded twice into one local SQLite file. The golden run takes the path plan → search.web → rank → verify.claims → decide. The candidate — after a PR that "just" reworded the planner prompt — loops search.web three times, then skips verify.claims entirely. The server-less gate CLI selects the newest run per context id and diffs them; no backend, no account:

research-agent · golden vs candidate · dprovenancekit 0.4.0 · real output

$ dprovenancekit gate --db traces.sqlite \
    --golden-context golden --candidate-context candidate

Regression gate: FAIL
  severity: high (strength 0.95); max allowed: none
  fingerprint: differs (5e616565a219… vs b6e1023b183b…)
  per-step changes:
    added: search.web, search.web
    removed: verify.claims
  engine: Critical reasoning steps removed: verify.claims

$ echo $?
1

Verbatim real output — captured from dprovenancekit gate (v0.4.0) run against two actually-recorded traces; the fingerprints are the real hashes.
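For intuition about where the per-step changes come from, the same two paths can be aligned with Python's `difflib`. This is a rough sketch, not the alignment algorithm DProvenanceKit uses: it works on bare step names and knows nothing about spans, engines, or severity. The helper name `diff_steps` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def diff_steps(golden: List[str], candidate: List[str]) -> Dict[str, List[str]]:
    """Align two step sequences and list steps added to or removed from the candidate."""
    added: List[str] = []
    removed: List[str] = []
    matcher = SequenceMatcher(a=golden, b=candidate, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(golden[i1:i2])     # present in golden, missing in candidate
        if tag in ("insert", "replace"):
            added.extend(candidate[j1:j2])    # present in candidate, missing in golden
    return {"added": added, "removed": removed}


golden = ["plan", "search.web", "rank", "verify.claims", "decide"]
candidate = ["plan", "search.web", "search.web", "search.web", "rank", "decide"]

changes = diff_steps(golden, candidate)
print(changes)
# {'added': ['search.web', 'search.web'], 'removed': ['verify.claims']}
```

A script built on this could exit non-zero whenever either list is non-empty, which is the behavior that makes the real gate usable in CI.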

The non-zero exit code is the point: this command drops into any pipeline as a hard gate (the CI gate guide covers GitHub Actions and GitLab wiring end to end). For invariants that should hold on every run, not just against a baseline, two anomaly rules ship out of the box: Tool Drop (a required step never ran) and Looping (a step repeated past a threshold). Both are configured through a JSON rule registry:

rules.json · json

{
  "rules": [
    { "type": "tool_drop", "required_step": "safety_check" },
    { "type": "looping", "step": "web_search", "max_repeats": 5 }
  ]
}

Run the rules locally or on every PR, with no golden required:

$ dprovenancekit anomalies --db traces.sqlite --rules rules.json
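What the two rules check can be sketched in a few lines. This is an illustration under stated assumptions, not the library's rule engine: the function `find_anomalies` is hypothetical, a run is simplified to a list of step names, and "repeated past a threshold" is read here as the total number of occurrences exceeding `max_repeats`.

```python
import json
from collections import Counter
from typing import List

# Same rule shape as rules.json above.
RULES_JSON = """
{
  "rules": [
    { "type": "tool_drop", "required_step": "safety_check" },
    { "type": "looping", "step": "web_search", "max_repeats": 5 }
  ]
}
"""


def find_anomalies(steps: List[str], rules: List[dict]) -> List[str]:
    """Return one finding per violated rule; an empty list means the run is clean."""
    counts = Counter(steps)
    findings: List[str] = []
    for rule in rules:
        if rule["type"] == "tool_drop":
            if counts[rule["required_step"]] == 0:
                findings.append(f"tool_drop: {rule['required_step']} never ran")
        elif rule["type"] == "looping":
            seen = counts[rule["step"]]
            # Assumption: the threshold applies to total occurrences in the run.
            if seen > rule["max_repeats"]:
                findings.append(f"looping: {rule['step']} ran {seen} times")
    return findings


rules = json.loads(RULES_JSON)["rules"]

healthy = ["plan", "web_search", "safety_check", "answer"]
drifted = ["plan"] + ["web_search"] * 6 + ["answer"]

healthy_findings = find_anomalies(healthy, rules)
drifted_findings = find_anomalies(drifted, rules)
```

Neither check needs a baseline: each rule is evaluated against a single run, which is why these invariants can hold for every execution rather than only for golden comparisons.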

# 06 · in the test suite

LLM regression testing in pytest

The same gate is one assertion inside any test. Give it a golden run and a candidate run; it aligns them and fails with a readable diagnostic when the candidate diverged:

test_gate.py · python

from dprovenancekit.testing import RegressionGate, assert_no_regression

# golden_run and candidate_run are two previously recorded runs
# (recording and loading them is not shown here).

# Strict by default: any removed, added, or changed (ambiguous) step fails,
# and a removed CRITICAL step is a HIGH-severity regression.
assert_no_regression(golden=golden_run, candidate=candidate_run)

# Or build a reusable policy and inspect the verdict without raising:
report = RegressionGate(allow_divergent_steps=True).check(golden_run, candidate_run)
assert report.passed, report.reasoning

Regressions carry a severity — none / low / medium / high — and a removed or reordered CRITICAL step is a HIGH-severity regression (reordering a critical step can invert a dependency). Loosen the gate with max_regression_level (gate on severity only) or allow_divergent_steps (tolerate benign per-step changes), or pass a custom evaluator to define what "equivalent" means — for example, ignore volatile payload fields like token counts. One subtlety worth knowing: detecting reordered steps requires a span-aware profile such as AlignmentProfile.developer_debug_v1; the default linear profile treats a pure reorder as still-matching, so pass the span-aware profile when order is part of what you're guarding.
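As an illustration of what "ignore volatile payload fields" could mean, here is a stdlib-only sketch of an equivalence check over step payloads. The field names and the function signature are hypothetical; the callable that DProvenanceKit actually expects for a custom evaluator is not shown in this guide, so treat this as the comparison logic only.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical payload keys that vary between runs without changing behavior.
VOLATILE_FIELDS: FrozenSet[str] = frozenset({"token_count", "latency_ms", "timestamp"})


def strip_volatile(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Drop fields that are expected to differ from run to run."""
    return {key: value for key, value in payload.items() if key not in VOLATILE_FIELDS}


def payloads_equivalent(golden: Dict[str, Any], candidate: Dict[str, Any]) -> bool:
    """Treat two payloads as equivalent when they match after volatile fields are removed."""
    return strip_volatile(golden) == strip_volatile(candidate)


golden_payload = {"tool": "search.web", "query": "golden runs", "token_count": 412}
noisy_payload = {"tool": "search.web", "query": "golden runs", "token_count": 397}
changed_payload = {"tool": "search.web", "query": "something else", "token_count": 412}

assert payloads_equivalent(golden_payload, noisy_payload)        # only noise differs
assert not payloads_equivalent(golden_payload, changed_payload)  # the query changed
```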

The bundled pytest plugin removes the remaining plumbing. One fixture records the block as a candidate run and pins the baseline for you:

test_research_agent.py · python

def test_research_agent(golden_trace):
    with golden_trace("research-agent"):
        run_my_agent()   # anything using @traced / record_event / an adapter

shell · sh

$ pytest --dprov-update-golden   # record (or intentionally update) the baseline, commit it
$ pytest                         # every later run gates against tests/goldens/research-agent.sqlite

Baselines are SQLite files committed next to your tests (directory configurable via the dprov_golden_dir ini option). Gate options can be set per test — e.g. golden_trace("name", max_regression_level="high").

# 07 · your stack

Works with your stack — or with no framework at all

Everything above operates on traces, so the only integration question is how runs get recorded. For a hand-written agent loop, the core instrumentation ships with zero dependencies:

instrumented_agent.py · python

from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

@traced
def search(query): ...

@traced
def answer(question, sources): ...

question = "How do I rotate an API key?"  # example input

store = InMemoryTraceStore()
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)

@traced records a <name>.start / .end / .error event pair per call in its own span, and works on plain functions, async def, generators, and async generators. Instrumentation never changes behavior — exceptions pass through unchanged, and outside a traced_run the decorator is transparent. Already on a framework? Adapters record the same trace shape:

LangChain / LangGraph — pip install "dprovenancekit[langchain]", then attach DProvenanceTracer / DProvenanceCallbackHandler.
OpenAI Agents SDK — pip install "dprovenancekit[openai-agents]"; officially listed in the OpenAI Agents SDK docs. See the integration page.
LlamaIndex — pip install "dprovenancekit[llama-index]".
CrewAI — pip install "dprovenancekit[crewai]".
Plus integration modules for Google GenAI, FastAPI, Jupyter, and MCP.

# 08 · faq

FAQ

What is prompt regression testing?

Prompt regression testing (also called LLM regression testing) compares a candidate run of an LLM application against a known-good golden run and fails the build when the execution structure changes — a dropped tool call, a looping step, a changed path — instead of string-matching final outputs.

How do I test AI agents before production?

Record a golden trace of a known-good run and commit it as a baseline. On every prompt, model, or code change, re-run the agent and diff the candidate's execution structure against the golden — a run fingerprint for the cheap check, step-level alignment for the details — and block the release when critical steps are dropped, reordered, or looping. In practice that's one golden_trace pytest fixture during development plus a gate in CI that exits non-zero on drift.

How is this different from evals?

Evals score the quality of outputs against a rubric or dataset; regression testing detects structural drift between runs. Evals tell you whether answers are good; regression tests tell you whether this build changed how the agent works. They're complementary — run both. Comparing tools in this space? See DProvenanceKit vs LangSmith.

Structural diffing vs LLM-as-a-judge in CI?

Structural diffing is deterministic, fast, and free: the same two traces always produce the same verdict, so it can gate a pull request with zero judge flakiness and zero per-run API cost. LLM-as-a-judge complements it for semantic output quality, but a probabilistic judge makes a poor sole gate — a flaky verdict on a merge-blocking check erodes trust fast.

Does nondeterminism make fingerprints flaky?

No — the fingerprint hashes the execution structure (typed steps and engines in sequence order), not token output, so two runs that take the same path fingerprint identically even when the wording differs. When the path itself legitimately varies, tune strictness instead of abandoning the gate: max_regression_level gates on severity only, allow_divergent_steps tolerates benign per-step changes, and a custom evaluator defines what counts as equivalent.

# 09 · get started

Get started

shell

$ pip install dprovenancekit

v0.4.0 on PyPI · Python 3.9+ · zero third-party dependencies in the core · Apache-2.0.
Validated: 284 tests; precision / recall / F1 of 1.000 across 8 standard + 5 adversarial regression-detection scenarios (dprovenancekit evaluate), matching the original Swift implementation case-for-case.
CI-ready: a GitHub Action on the Marketplace and a GitLab CI template in the repo.


The SDK, CLI, and Action are free and open source. A hosted trace visualizer (span tree, payload inspector, side-by-side semantic diff) is in early access at app.dprovenance.dev.