Workflow · June 30, 2026
Turn a Bug Report + Fix Diff into a Regression Test Suite
The task
You just merged a fix for a nasty bug. Now you need regression tests so it never comes back — covering not just the exact reproduction, but the neighboring cases that share the same root cause. This is the kind of work that gets skipped at 5pm on Friday, which is exactly when the bug shows up again six months later.
Before AI
The honest baseline: most developers write one test that mirrors the bug report, label it test_regression_issue_1234, and move on. Thorough engineers spend 30-60 minutes reasoning about adjacent inputs, boundary conditions, and what else the diff touched. Either way it's friction, and the thorough version usually loses to the next ticket in the queue.
The workflow
The flow: feed a model the bug report plus the fix diff, get a ranked list of test scenarios, then convert the top ones into runnable test code. You review and commit. The model is doing the boring expansion work — you're still the one deciding what ships.
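The chaining itself is small enough to sketch. The snippet below is a minimal illustration, not a reference implementation: `run_workflow`, the three `STEP*_PROMPT` placeholders, and the `complete` callable are all hypothetical names, and it assumes stateless model calls, so each earlier output is passed forward explicitly rather than relying on conversation history.

```python
from typing import Callable

# Placeholder templates: in practice each would hold the full prompt text
# for the corresponding step. The names are illustrative, not from any library.
STEP1_PROMPT = "Analyze the root cause of the fixed bug. Here is the input:"
STEP2_PROMPT = "Using the root-cause analysis below, rank regression test scenarios."
STEP3_PROMPT = "Write pytest code for every P0 and P1 scenario below."


def run_workflow(
    bug_report: str,
    fix_diff: str,
    complete: Callable[[str], str],
) -> dict[str, str]:
    """Chain the three prompts, feeding each step's output into the next.

    `complete` is any function that sends one prompt to a model and returns
    its text reply; wiring it to a real client is left to the caller.
    """
    analysis = complete(f"{STEP1_PROMPT}\n{bug_report}\n---\nFIX DIFF\n{fix_diff}")
    scenarios = complete(f"{STEP2_PROMPT}\n\n{analysis}")
    test_code = complete(f"{STEP3_PROMPT}\n\n{scenarios}")
    # Nothing is written to disk here: review still happens before commit.
    return {"analysis": analysis, "scenarios": scenarios, "test_code": test_code}
```

Keeping all three intermediate outputs, rather than only the final test file, makes it easier to see which step went wrong when a generated test looks off.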
Step 1 — Extract the failure mode from the bug + diff
The first prompt's job is to force the model to articulate, in its own words, what was actually broken and what the fix actually changed. This matters because a bug report describes symptoms and a diff describes mechanics — the failure mode lives in between, and skipping this step gives you tests that are either too literal (just reproduces the ticket) or too vague (tests the whole function).
You are a senior engineer doing root-cause analysis on a bug that was just fixed. Below is the bug report followed by the unified diff of the fix.

Produce a structured analysis with these sections:

1. **Observed failure** — one sentence on what the user saw.
2. **Root cause** — what in the old code produced that behavior. Be specific about the line(s) or condition(s).
3. **Fix mechanism** — what the diff changes about the control flow, data, or invariants.
4. **Class of bug** — name the general category (off-by-one, null handling, race condition, type coercion, unhandled state, etc.) in 2-4 words.
5. **Blast radius** — other inputs or code paths that share the same root cause and could fail similarly. List 3-6 of them as bullet points. Be concrete (specific input shapes, not "edge cases").

Do not write any test code yet. Keep the whole analysis under 400 words.

Here is the input:
BUG REPORT #4471
Title: Retry budget exhausted on first 429 when stream=True
Reporter: Priya M.
Severity: High
Description:
When calling client.chat.completions.create(..., stream=True) and the
upstream returns HTTP 429 on the very first chunk, the SDK reports
"retry budget exhausted" immediately instead of retrying. Non-streaming
calls retry correctly (we see 3 attempts in logs). Repro: hit a rate-
limited key with stream=True, observe a single attempt in network trace.
Expected: same retry behavior as non-streaming (3 attempts, exp backoff).
Actual: 1 attempt, then RetryBudgetExceeded raised to caller.
---
FIX DIFF (sdk/_transport/stream.py)
@@ -82,7 +82,7 @@ class StreamingTransport:
     def _open(self, request: Request) -> Response:
         attempt = 0
-        budget = self._retry_budget
+        budget = self._retry_budget.clone()
         while True:
             resp = self._http.send(request)
             if resp.status_code < 400:
                 return resp
-            if not budget.consume(cost=resp.status_code):
+            if not budget.consume(cost=1):
                 raise RetryBudgetExceeded(last=resp)
             attempt += 1
             self._sleep(self._backoff(attempt))
NOTES FROM AUTHOR (Sam K.):
Two issues fused into one ticket. (1) _retry_budget was a shared
instance across streaming and non-streaming paths, so prior non-stream
calls were draining the budget seen by stream calls. .clone() gives
each call its own copy. (2) cost=resp.status_code meant a 429 cost
429 budget units against a default budget of 3, so it tripped on the
first failure. cost=1 matches the non-streaming path.
Step 2 — Generate ranked test scenarios
Now take the analysis and turn it into a prioritized scenario list. Ranked, because you almost certainly won't implement all of them and you want the high-leverage ones first.
Using the root-cause analysis above, produce a ranked list of regression test scenarios.

For each scenario include:
- **Name** — snake_case, descriptive (e.g. `test_stream_retries_on_first_429`).
- **What it verifies** — one sentence.
- **Inputs / setup** — concrete: HTTP status codes, call args, prior state.
- **Assertion** — what must be true for the test to pass.
- **Priority** — P0 (must have, directly guards the fix), P1 (guards the bug class), P2 (related invariants worth pinning).
- **Risk if skipped** — one phrase on what regression slips through.

Rules:
- Include at least one test that would have failed against the OLD code and passes against the NEW code. Mark it explicitly.
- Include at least one test for state isolation between calls (the shared-instance issue).
- Include at least one boundary test (budget=0, budget=1, exactly-at-budget).
- Do NOT include tests for unrelated functionality. Stay inside the blast radius.
- Aim for 6-9 scenarios total. Quality over quantity.

Output as a numbered markdown list.
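Because the prompt pins the output to a numbered markdown list with a Priority field, picking out the scenarios worth implementing can be scripted. The sketch below is one way to do it, assuming the model follows the shape seen in our sample run (a bold, backticked test name on the numbered line and a `**Priority:** P0` style bullet). `Scenario`, `parse_scenarios`, and `select` are hypothetical helper names, and a model that formats its list differently would need a different pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    priority: str


# Numbered item whose title is a bold, backticked test name,
# e.g. "1. **`test_stream_retries_on_first_429`**".
_NAME_RE = re.compile(r"^\s*\d+\.\s+\*\*`(test_\w+)`\*\*", re.MULTILINE)
# Priority bullet, e.g. "- **Priority:** P0".
_PRIORITY_RE = re.compile(r"\*\*Priority:\*\*\s*(P[0-2])")


def parse_scenarios(markdown: str) -> list[Scenario]:
    """Extract (name, priority) pairs from the step 2 markdown list."""
    matches = list(_NAME_RE.finditer(markdown))
    scenarios = []
    for i, match in enumerate(matches):
        # A scenario's body runs until the next numbered title (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        priority = _PRIORITY_RE.search(markdown[match.end():end])
        scenarios.append(
            Scenario(match.group(1), priority.group(1) if priority else "unranked")
        )
    return scenarios


def select(scenarios: list[Scenario], keep: tuple[str, ...] = ("P0", "P1")) -> list[Scenario]:
    """Keep only the scenarios whose priority is in `keep`."""
    return [s for s in scenarios if s.priority in keep]
```

Scenarios with no recognizable priority are labeled "unranked" rather than dropped, so a formatting slip in the model output shows up in review instead of silently shrinking the list.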
Step 3 — Convert the top scenarios into runnable test code
Last step: turn the P0 and P1 scenarios into actual code. We constrain the framework and mocking style explicitly — otherwise the model picks whatever is most common on the internet, which may not match your repo. If your stack differs, swap the framework/mocking lines below before running.
Write Python test code for every P0 and P1 scenario from the list above.

Constraints:
- Framework: pytest.
- HTTP mocking: respx (or describe equivalent if respx doesn't fit a given case).
- Each test is a standalone function. No shared fixtures unless two or more tests need the exact same setup, in which case define a fixture once.
- Use realistic but synthetic data. No real API keys, hostnames, or user data.
- Include a brief docstring on each test stating which scenario from the list it implements and (if applicable) noting "would fail against pre-fix code".
- At the top of the file, add a one-paragraph module docstring linking these tests to the original bug ID (#4471) and the fix commit (placeholder: FIXME-commit-sha).
- Do not invent SDK methods or attributes that weren't referenced in the bug report or diff. If a test needs something not shown, add a `# TODO: confirm` comment instead of guessing.

Output a single Python file, ready to drop into `tests/regression/`.
You can see the broader pattern this fits into — using diffs as structured prompts — discussed in GitHub's own writeup on generating tests from pull request context. The workflow above is the manual, model-agnostic version of the same idea.
Gotchas
- The model will invent SDK surface area. Especially attribute names on internal classes. The "# TODO: confirm" instruction in step 3 helps, but read every test before committing: if budget.remaining doesn't exist, the test won't run.
- Mock-shape drift. If your codebase uses a different HTTP mocking library (pytest-httpx, responses, vcrpy), the generated tests need a translation pass. Tell the model your library explicitly in step 3.
- P0 doesn't always mean correct. The model's priority ranking is a starting point. The "would have failed against old code" test is the one you must keep. Verify it by tracing the pre-fix code through the test, or better, by running the test on the parent commit of the fix and confirming it fails.
- Concurrency bugs are out of scope. If the root cause is a race, generated tests will look plausible but rarely reproduce the race deterministically. Use this workflow for logic bugs; reach for stress tests and deterministic schedulers for races.
- Two-bugs-in-one-ticket cases. Like the sample — the model sometimes collapses them into one root cause. If step 1's "root cause" section reads as a single sentence for a multi-issue fix, push back and rerun.
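The "would have failed against old code" check is easier to trust when it is executed rather than imagined. The sketch below models the sample bug with stand-in classes so the check runs without the real SDK. `ToyBudget`, `BudgetExceeded`, and `open_stream` are hypothetical, and the `consume` semantics (deduct, then report whether anything is left) are an assumption chosen to match the "3 attempts" in the bug report, since the diff does not show the real implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyBudget:
    """Stand-in for the SDK's retry budget; not the real class."""

    remaining: int

    def consume(self, cost: int) -> bool:
        # Assumed semantics: deduct the cost, then report whether budget remains.
        self.remaining -= cost
        return self.remaining > 0

    def clone(self) -> "ToyBudget":
        return ToyBudget(self.remaining)


class BudgetExceeded(Exception):
    """Stand-in for RetryBudgetExceeded; carries the attempt count."""


def open_stream(statuses: list[int], shared: ToyBudget, *, fixed: bool) -> int:
    """Replay scripted HTTP statuses through the pre-fix or post-fix retry loop.

    Returns the number of attempts made before a success.
    """
    budget = shared.clone() if fixed else shared  # defect 1: shared instance
    for attempts, status in enumerate(statuses, start=1):
        if status < 400:
            return attempts
        cost = 1 if fixed else status  # defect 2: status code used as cost
        if not budget.consume(cost):
            raise BudgetExceeded(attempts)
    raise AssertionError("scripted responses ran out")


def p0_passes(fixed: bool) -> bool:
    """test_stream_retries_on_first_429: [429, 429, 200] should take 3 attempts."""
    try:
        return open_stream([429, 429, 200], ToyBudget(3), fixed=fixed) == 3
    except BudgetExceeded:
        return False
```

Under these assumptions, `p0_passes(fixed=False)` is false and `p0_passes(fixed=True)` is true, which is the property the P0 test has to have. On a real codebase the equivalent check is to run the generated test on the commit before the fix and confirm it goes red.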
Time saved
Roughly 30-45 minutes per bug fix, assuming the thorough baseline (sit down, enumerate adjacent inputs, write 5-8 tests). For developers who would otherwise write a single literal-reproduction test, the savings are smaller in wall-clock time but the coverage uplift is the real win — you end up with the test suite you should have written, in the time you actually had. Estimate, not a measurement; varies heavily with how clean the diff is.
What we got when we ran it
We executed every prompt above, in order, against the sample input (model: claude-sonnet-4-6). Outputs below are real and unedited apart from length truncation.
Step 1 output
Root-Cause Analysis — Bug #4471
### 1. Observed Failure
A streaming call receiving HTTP 429 on its first attempt immediately raised `RetryBudgetExceeded` instead of retrying up to the configured limit.
### 2. Root Cause
Two distinct defects in `_open()` (lines ~84–90):
- **Shared budget instance**: `budget = self._retry_budget` assigned the *same* object shared across all calls. Every non-streaming call that retried drained units from this shared budget, leaving it partially or fully depleted before a streaming call ever began.
- **Wrong cost value**: `budget.consume(cost=resp.status_code)` passed the HTTP status code (e.g., `429`) as the cost. With a default budget of `3`, a single 429 response consumed 429 units, instantly exhausting the budget regardless of remaining capacity.
Either defect alone could trigger the symptom; together they guaranteed failure on the first 429 in any streaming call that followed prior retried requests.
### 3. Fix Mechanism
- `self._retry_budget.clone()` creates a fresh, full-capacity budget per call, isolating each invocation's retry accounting from all others.
- `cost=1` makes each failed attempt consume exactly one unit, matching the non-streaming path's semantics and the intended budget size of 3 attempts.
### 4. Class of Bug
**Shared mutable state** + **semantic misuse of constant**
### 5. Blast Radius
- **Sequential non-streaming calls on the same client instance**: later calls begin with a partially drained budget if earlier calls consumed retries, silently reducing their retry allowance.
- **Concurrent requests sharing one client**: a race between two threads where one triggers retries can deplete the shared budget seen by
…output truncated at 250 words.
Step 2 output
Regression Test Scenarios — Bug #4471
---
1. **`test_stream_retries_on_first_429`** *(⚠️ Would have FAILED against old code, passes against new code)*
   - **What it verifies:** A streaming call that receives HTTP 429 on the first attempt retries up to the configured budget rather than raising immediately.
   - **Inputs / setup:** Fresh client, `budget=3`, `stream=True`, mock HTTP sequence: `[429, 429, 200]`.
   - **Assertion:** No exception raised; exactly 3 HTTP sends recorded; final response has status 200.
   - **Priority:** P0
   - **Risk if skipped:** The primary regression — single-attempt failure on 429 — ships silently.
---
2. **`test_stream_budget_exhausted_after_correct_attempt_count`**
   - **What it verifies:** A streaming call raises `RetryBudgetExceeded` only after consuming exactly `budget` attempts, not before.
   - **Inputs / setup:** Fresh client, `budget=3`, `stream=True`, mock HTTP sequence: `[429, 429, 429, 429]` (more failures than budget).
   - **Assertion:** `RetryBudgetExceeded` is raised; exactly 3 sends recorded (budget=3 means 3 total attempts, not 3 retries — pin whichever the spec says); `last` response on the exception has status 429.
   - **Priority:** P0
   - **Risk if skipped:** Off-by-one in attempt counting goes undetected.
---
3. **`test_stream_budget_isolated_from_non_stream_retries`** *(State isolation test)*
   - **What it verifies:** Prior non-streaming retries do not drain the budget available to a subsequent streaming call.
   - **Inputs / setup:** Single shared client, `budget=3`. First call: `stream=False`, mock sequence `[429, 429, 200]` (exhausts 2 units on non-stream path). Second call: `stream=True`, mock sequence `[429, 200]`.
   - **Assertion:** Second (streaming) call succeeds without raising; its send count is exactly 2, independent of the first call's retries.
   -
…output truncated at 250 words.
Step 3 output
```python
"""
Regression tests for Bug #4471 — "Retry budget exhausted on first 429 when stream=True".

These tests guard the two defects identified in the root-cause analysis:

1. _retry_budget was a shared mutable instance across all calls; prior
   non-streaming retries drained the budget seen by later streaming calls.
2. budget.consume(cost=resp.status_code) used the HTTP status code as the cost,
   so a single 429 consumed 429 units against a default budget of 3.

Fix commit: FIXME-commit-sha
Source: sdk/_transport/stream.py, lines ~84-90
Bug ID: #4471
Reporter: Priya M.
"""

from __future__ import annotations

import threading
from typing import Generator
from unittest.mock import MagicMock, call, patch

import pytest

# ---------------------------------------------------------------------------
# Imports from the SDK under test.
# Adjust the import paths to match the actual package layout.
# ---------------------------------------------------------------------------
from sdk._transport.stream import StreamingTransport  # TODO: confirm path
from sdk.exceptions import RetryBudgetExceeded  # TODO: confirm path
from sdk.retry import RetryBudget  # TODO: confirm path

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _make_response(status_code: int) -> MagicMock:
    """Return a minimal mock HTTP response with the given status code."""
    resp = MagicMock()
    resp.status_code = status_code
    return resp


def _make_transport(budget: int) -> StreamingTransport:
    """
    Construct a StreamingTransport with a real RetryBudget of the given size
    and a mock HTTP sender. The caller is responsible for configuring
    `transport._http.send.side_effect`.
    """
    retry_budget = RetryBudget(budget)  # TODO: confirm constructor signature
    http = MagicMock()
    transport = StreamingTransport(  # TODO: confirm constructor signature
        http=http,
        retry_budget=retry_budget,
    )
    return transport


def _make_request() -> MagicMock:
    """Return a minimal mock Request object."""
    req = MagicMock()
    req.method = "POST"
    req.url
```
…output truncated at 250 words.
Source: Agentic Daily