Invariant Tests: The Missing Layer Between Unit Tests and Production

Mar 28, 2026

Your unit tests pass. Your integration tests pass. CI is green. The PR merges.

A customer gets charged twice.

This is not a testing failure. It is a testing gap. The unit tests verified that each function returns the correct output for its input. They did not verify that a property of the system — "a customer is charged exactly once per order" — holds across all paths, all retries, and all race conditions.

That property is an invariant. And invariant tests are the highest-leverage gate you can add for AI-generated code.

Why Unit Tests Are Not Enough

A unit test checks: does this function return the right value for this input? An invariant test checks: does this property hold for every input, including the ones I did not think of?

The difference matters because of how AI-generated code fails. It does not fail randomly. It fails systematically. SWE-CI (arXiv:2603.03823, March 2026) tested 18 AI models maintaining code across 100 codebases over 233 days. Seventy-five percent of the models broke previously working code during maintenance. Only 2 of 18 exceeded a 0.5 zero-regression rate.

The failure mode is consistent: the AI generates a function that passes every existing test. The function is locally correct. But it violates a global property — an ordering constraint, an idempotency guarantee, a state transition rule — that no individual test was designed to check. The invariant was implicit. The AI did not know it existed. Neither did the test suite.

CodeRabbit's analysis of 470 PRs found AI-generated code produces 10.83 issues per PR versus 6.45 for human-written code. Logic errors were 75% more frequent. These are not surface-level bugs that linters catch. They are semantic violations that only surface when you test the system's properties, not its functions.

The Three Diagnostic Questions

Before you move on from any feature, ask three questions:

1. What must never happen twice?

Charges. Notifications. Database migrations. Order placements. Webhook deliveries. If the answer is "nothing," you are not thinking hard enough. Every system that processes external events has at least one idempotency requirement.

2. What must always be true after this operation completes?

Account balance equals sum of transactions. User count in the index equals user count in the database. The state machine is in a valid terminal state. If you cannot state the postcondition, the feature is not well enough understood to ship safely.

3. What breaks if operations run out of order?

Message processing. State transitions. Multi-step workflows. If your system assumes events arrive in order but the queue does not guarantee ordering, you have an invariant that is not enforced.

If those questions have clear answers, they define your invariant tests. If they do not, the feature needs more specification before it needs more code.
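Questions 2 and 3 translate almost directly into executable checks. The sketch below is a minimal illustration, not a real implementation: the `Account` class, the order states, and the transition table are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical order state machine; the states and allowed transitions
# are illustrative assumptions, not taken from a real system.
ALLOWED_TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),
    "refunded": set(),
    "cancelled": set(),
}


@dataclass
class Account:
    balance: int = 0  # amounts in cents
    transactions: list = field(default_factory=list)

    def apply(self, amount: int) -> None:
        self.transactions.append(amount)
        self.balance += amount


def balance_invariant_holds(account: Account) -> bool:
    """Question 2: the balance must equal the sum of recorded transactions."""
    return account.balance == sum(account.transactions)


def transition(state: str, new_state: str) -> str:
    """Question 3: reject any transition the state machine does not allow."""
    if new_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state


# Check the postcondition after every operation, not just at the end.
account = Account()
for amount in (500, -200, 1250):
    account.apply(amount)
    assert balance_invariant_holds(account)
```

The point of the shape is that the invariant is a named function you can call after any operation, so every new code path gets checked against the same rule.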

The Idempotent Webhook Receiver: A Concrete Case

Consider a webhook receiver that processes payment confirmations. The payment provider sends a POST when a payment succeeds. Your system records the payment, updates the order status, and sends a confirmation email.

The unit tests verify: given a valid webhook payload, the function records the payment, updates the status, and returns 200.

The unit tests pass. Ship it.

Then the payment provider retries. Their documentation says they retry up to three times if they do not receive a 200 within five seconds. Your server took longer than five seconds to respond, so the provider retried, and your system processed the same webhook twice. The customer was charged once but received two confirmation emails. Or worse: the system recorded two payments, and the accounting reconciliation breaks at end of month.

The invariant: For any given payment ID, the system must process exactly one payment record, regardless of how many times the webhook is delivered.

The invariant test:

`For any valid payment payload P, calling process_webhook(P) N times (where N is 1, 2, 5, 100) produces exactly one payment record in the database and exactly one confirmation email sent.`

This test does not check a single function's return value. It checks a property of the system, one that must hold for every input, every retry count, and every timing scenario. A property-based testing framework like Hypothesis (Python) or fast-check (JavaScript/TypeScript) can generate hundreds of random payment payloads and retry counts and check that the property holds for each one.

An AI model generating the webhook handler will produce a function that works. It will not spontaneously add idempotency protection unless the spec or the prompt explicitly demands it. The invariant test is the gate that catches the omission before production does.
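Here is a minimal sketch of what the webhook invariant could look like as a plain test, with no framework required. Everything in it is an assumption made for illustration: `WebhookReceiver` is an in-memory stand-in for the database and the email service, and the payload fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WebhookReceiver:
    """In-memory stand-in for the database and the email service."""
    payments: dict = field(default_factory=dict)  # payment_id -> payload
    emails_sent: list = field(default_factory=list)

    def process_webhook(self, payload: dict) -> int:
        payment_id = payload["payment_id"]
        # Idempotency guard: a redelivered webhook is acknowledged
        # but causes no second record and no second email.
        if payment_id in self.payments:
            return 200
        self.payments[payment_id] = payload
        self.emails_sent.append(payment_id)
        return 200


def check_idempotency(payload: dict, deliveries: int) -> None:
    """Deliver the same payload repeatedly and assert the invariant."""
    receiver = WebhookReceiver()
    for _ in range(deliveries):
        assert receiver.process_webhook(payload) == 200
    assert len(receiver.payments) == 1
    assert receiver.emails_sent == [payload["payment_id"]]


payload = {"payment_id": "pay_123", "order_id": "ord_9", "amount_cents": 4999}
for n in (1, 2, 5, 100):
    check_idempotency(payload, n)
```

Delete the guard and the check fails on the second delivery. In a real deployment the check-then-insert would need to be atomic, for example through a unique constraint on the payment ID, because two concurrent deliveries could both pass an in-memory check.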

Where Invariants Catch What Unit Tests Cannot

The Amazon incidents from 2026 had this geometry. Code that passed CI. Static analysis clean. Tests covering expected paths. What the pipeline did not catch was the interaction between the change and the live system.

In The Delivery Gap's incident database, Tier 3 (policy gates) was missing in 60% of the 15 public incidents mapped. The Chevrolet chatbot that agreed to sell a car for a dollar is an invariant violation. The rule nobody encoded: no offer below X% of MSRP without human approval.

These are not exotic edge cases. They are business rules that everyone on the team knows but nobody encoded as a test. The rules lived in people's heads. AI-generated code does not have access to people's heads.
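Encoding a rule like the MSRP floor can be a few lines. The sketch below is hypothetical throughout: the function names are invented, and the 90 percent floor is an arbitrary example value, since the real threshold is a business decision.

```python
def needs_human_approval(offer_cents: int, msrp_cents: int, min_percent: int) -> bool:
    """True when the offer is below min_percent of MSRP."""
    if not 0 < min_percent <= 100:
        raise ValueError("min_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return offer_cents * 100 < msrp_cents * min_percent


def can_auto_accept(offer_cents: int, msrp_cents: int, min_percent: int) -> bool:
    """The invariant: nothing below the floor is accepted without a human."""
    return not needs_human_approval(offer_cents, msrp_cents, min_percent)


# Example values are arbitrary; the real floor is a business decision.
MSRP_CENTS = 4_000_000
FLOOR_PERCENT = 90
assert not can_auto_accept(100, MSRP_CENTS, FLOOR_PERCENT)  # the one-dollar car
assert can_auto_accept(3_900_000, MSRP_CENTS, FLOOR_PERCENT)
```

Once the rule is a function, it can sit between the model's output and the customer, and the same function doubles as the assertion in the invariant test.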

How to Start

You do not need a property-based testing framework on day one. Start with standard integration tests that explicitly assert invariants.

Step 1: Run the three diagnostic questions on your three most critical features. Payment processing, user registration, order fulfillment — whatever generates the most incidents or rework. Write down the answers.

Step 2: For each answer, write one test that asserts the property directly. Not "given this input, expect this output." Instead: "given any valid input processed N times, this property holds." Even without a property-based framework, you can write a loop that calls the function five times with the same input and asserts the database has exactly one record.

Step 3: Add the invariant tests to CI as blockers. Not warnings. Blockers. A failed invariant test means the property is violated. That is not negotiable.

Step 4: Graduate to property-based testing. Hypothesis and fast-check generate random inputs and verify invariants hold across all of them. A property-based test for idempotency generates hundreds of random orders and verifies the property holds for each. This is where the leverage compounds — the framework finds edge cases you would never write by hand.
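For Step 4, a Hypothesis version of the idempotency check might look like the sketch below. It assumes Hypothesis is installed; `PaymentStore` and the payload fields are hypothetical stand-ins for your own handler and schema.

```python
from hypothesis import given, strategies as st

# Hypothetical payload shape; the field names are assumptions for illustration.
payloads = st.fixed_dictionaries({
    "payment_id": st.text(min_size=1),
    "amount_cents": st.integers(min_value=1, max_value=10_000_000),
})


class PaymentStore:
    """In-memory stand-in for the real handler and its database."""

    def __init__(self):
        self.records = {}

    def process_webhook(self, payload):
        # setdefault keeps the first record and ignores redeliveries.
        self.records.setdefault(payload["payment_id"], payload)
        return 200


@given(payload=payloads, deliveries=st.integers(min_value=1, max_value=100))
def test_webhook_is_idempotent(payload, deliveries):
    store = PaymentStore()
    for _ in range(deliveries):
        assert store.process_webhook(payload) == 200
    # Exactly one record, keyed by the payment ID, however many deliveries.
    assert list(store.records) == [payload["payment_id"]]
```

A test runner such as pytest picks this up like any other test, and calling `test_webhook_is_idempotent()` directly also runs it. When a case fails, Hypothesis shrinks it to a small counterexample, which is usually the fastest route to the missing guard.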

Why This Is the Highest-Leverage Gate for AI-Generated Code

AI models produce code that is locally correct and globally fragile. Unit tests verify local correctness. Linters verify syntax. Contract gates verify interfaces. But invariant gates verify the properties that define whether the system actually works.

The unit test says: this function processes a payment correctly. The invariant test says: no matter what happens, a customer is charged exactly once.

One of those catches the bug that AI introduced. The other watches it sail past.

If you add one gate to your pipeline this quarter, make it an invariant gate on your highest-risk business rule. The three diagnostic questions take ten minutes. The first invariant test takes an hour. The cost of not having it is the incident you will spend a week cleaning up.

This post draws from Chapters 7 and 9 of [The Delivery Gap](https://leanpub.com/the-delivery-gap), which cover invariant gates as Tier 2 of the quality gate framework and the practical implementation appendix.
