Release Quality

Fixtures Turn AI Lessons Into Release Gates

May 24, 2026 · 8 min read

Kam AI

Product and research

5th Grade Summary

A fixture is a saved test case.

When Kam learns from a mistake, it should save that lesson as a fixture.

Then every release can check that the mistake does not come back.

That is how labels become protection.

An approved label is not the end of the loop.

It is the beginning of regression protection.

If Kam learns that a team trend answer failed because it skipped the historical denominator, that lesson should not stay in a review queue. It should become a fixture that future releases must pass.

What a fixture contains

A fixture should be concrete enough to replay.

It should include the input, expected route, expected entities, required reads, forbidden behavior, grader set, and answer contract. It should also include artifact pointers, not just prose.

Fixture anatomy

Fixture part	Example	Purpose
Workload	chat.team_trends.v1	Defines product obligation
Prompt	"Which games counted?"	Replays user scenario
Context	selected team, sport, trend, saved read	Avoids blank-prompt testing
Expected route	team_trends_denominator	Checks routing
Required reads	HISTORICAL_DENOMINATOR	Checks data contract
Required fields	date, opponent, closing spread, final score	Checks auditability
Forbidden behavior	generic trend answer, source mixing	Blocks known failures
Graders	route, entity, denominator, freshness, answer shape	Turns expectation into checks
Approval	reviewer, timestamp, label id	Preserves lineage

Takeaway: A fixture should preserve the product truth that made the label worth approving.
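The anatomy above can be sketched as a small data structure. This is a minimal illustration, not Kam's actual schema: the class names, field names, and every placeholder value (team, reviewer, label id, artifact path) are assumptions made for the example. Only the workload, prompt, route, and read identifiers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    """Lineage of the human review that approved the label (hypothetical shape)."""
    reviewer: str
    timestamp: str  # ISO 8601 text, kept as a string for portability
    label_id: str


@dataclass(frozen=True)
class Fixture:
    """One replayable test case, mirroring the rows of the anatomy table."""
    workload: str
    prompt: str
    context: dict
    expected_route: str
    required_reads: tuple
    required_fields: tuple
    forbidden_behavior: tuple
    graders: tuple
    approval: Approval
    artifact_pointers: tuple = ()  # pointers to saved traces or reads, not prose


# All values below marked "example-" or "placeholder" are illustrative only.
fixture = Fixture(
    workload="chat.team_trends.v1",
    prompt="Which games counted?",
    context={
        "team": "example-team",
        "sport": "example-sport",
        "trend": "example-trend",
        "saved_read": "example-read",
    },
    expected_route="team_trends_denominator",
    required_reads=("HISTORICAL_DENOMINATOR",),
    required_fields=("date", "opponent", "closing_spread", "final_score"),
    forbidden_behavior=("generic_trend_answer", "source_mixing"),
    graders=("route", "entity", "denominator", "freshness", "answer_shape"),
    approval=Approval(
        reviewer="placeholder-reviewer",
        timestamp="2000-01-01T00:00:00Z",
        label_id="placeholder-label-id",
    ),
    artifact_pointers=("placeholder/path/to/trace.json",),
)
```

Freezing the dataclass is a design choice for this sketch: an approved fixture should not change silently after review.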

From review to gate

Fixture promotion flow

The release gate should only enforce fixtures after the evidence is approved and the graders are stable.

01 evidence: Trace fails
Production behavior exposes a route, source, freshness, denominator, or usefulness failure.

02 scope: Label approved
Human review confirms the expected behavior and failure taxonomy.

03 answer: Fixture created
The input, context, expected contract, and graders are packaged for replay.

04 answer: Gate enforced
The workload scorecard blocks releases that regress approved fixtures.

Do not gate on unreviewed labels. Do not leave reviewed labels unprotected.
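The promotion rule can be sketched as a small status function. This is an illustration under stated assumptions: the Label class, its boolean fields, and the status strings are hypothetical names, not part of any Kam API.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """A reviewed (or not yet reviewed) lesson from a failed trace. Hypothetical shape."""
    label_id: str
    approved: bool        # human review confirmed the expected behavior
    has_fixture: bool     # input, context, contract, and graders are packaged
    graders_stable: bool  # graders give consistent results on replay


def gate_status(label: Label) -> str:
    """Return where a label sits in the promotion flow."""
    if not label.approved:
        return "not_gated"        # never gate on unreviewed labels
    if not label.has_fixture:
        return "unprotected"      # reviewed label that still needs a fixture
    if not label.graders_stable:
        return "pending_graders"  # fixture exists, but the checks are not yet trustworthy
    return "enforced"
```

A status like "unprotected" is the one worth surfacing on a dashboard, because it marks a lesson the team has confirmed but no release is yet required to respect.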

Before and after release quality

Before:

tests pass
build passes
ship
wait for user reports

After:

tests pass
fixtures pass by workload
scorecards stay within threshold
release packet records evidence
ship
monitor trace drift

The second path is more work, but it reduces repeat failures.

Release gate priorities

Deterministic fixtures: Gate first
Trace replay: High value
LLM judge samples: Selective
Manual spot checks: Still needed

Takeaway: The strongest gate is a reviewed fixture with deterministic checks and known production lineage.
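A deterministic check, in this sense, is one that compares recorded facts rather than asking a model for an opinion. The sketch below assumes a hypothetical trace shape (a dict with "route", "reads", and "rows" keys) and hypothetical field names; the sample values are placeholders, not real data.

```python
def grade_trace(expected: dict, trace: dict) -> list:
    """Return the names of the deterministic checks that failed."""
    failures = []

    # Route check: the answer must have gone through the expected route.
    if trace.get("route") != expected["expected_route"]:
        failures.append("route")

    # Data contract check: every required read must appear in the trace.
    missing_reads = set(expected["required_reads"]) - set(trace.get("reads", []))
    if missing_reads:
        failures.append("required_reads")

    # Auditability check: every returned row must carry the required fields.
    rows = trace.get("rows", [])
    if any(name not in row for row in rows for name in expected["required_fields"]):
        failures.append("required_fields")

    return failures


expected = {
    "expected_route": "team_trends_denominator",
    "required_reads": ["HISTORICAL_DENOMINATOR"],
    "required_fields": ["date", "opponent", "closing_spread", "final_score"],
}

# Placeholder traces for illustration only.
good_trace = {
    "route": "team_trends_denominator",
    "reads": ["HISTORICAL_DENOMINATOR"],
    "rows": [
        {"date": "2000-01-01", "opponent": "example-opponent",
         "closing_spread": -3.5, "final_score": "24-17"},
    ],
}
bad_trace = {"route": "generic_trend", "reads": [], "rows": []}

print(grade_trace(expected, good_trace))  # []
print(grade_trace(expected, bad_trace))   # ['route', 'required_reads']
```

The same inputs always produce the same failures, which is what makes this kind of check safe to gate on first.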

What gates should measure

A release gate should be specific.

Global pass rate can hide damage. If the overall suite is healthy but chat.market_shape.v1 regresses, the release still creates user-facing risk.

Better gates include:

per-workload fixture pass rate
severe label regression count
source separation failures
stale hot-read confidence failures
missing denominator failures
answer path fallback rate
judge disagreement rate
human-review hold count
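A short worked example shows how a global pass rate can hide a workload regression. The counts below are invented for illustration, and pass_rates is a hypothetical helper, not an existing Kam function.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the fixture pass rate per workload from (workload, passed) pairs."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for workload, ok in results:
        totals[workload] += 1
        passed[workload] += int(ok)
    return {workload: passed[workload] / totals[workload] for workload in totals}


# Invented counts: 95 passing fixtures in one workload,
# and 2 passing plus 3 failing fixtures in another.
results = (
    [("chat.team_trends.v1", True)] * 95
    + [("chat.market_shape.v1", True)] * 2
    + [("chat.market_shape.v1", False)] * 3
)

global_rate = sum(ok for _, ok in results) / len(results)
by_workload = pass_rates(results)

print(f"global: {global_rate:.2f}")  # global: 0.97
for workload, rate in by_workload.items():
    print(f"{workload}: {rate:.2f}")
# chat.team_trends.v1: 1.00
# chat.market_shape.v1: 0.40
```

With these numbers the suite looks healthy at 97 percent overall, while chat.market_shape.v1 passes only 2 of its 5 fixtures.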

Gate levels

Blocker

Wrong route, source mixing, missing denominator, unsafe confidence, or fixture failure in a critical workload.

Warning

Judge score dips, longer latency, or drift in a lower-risk answer family.

Observe

New workload has low sample size but no confirmed severe failures yet.

Takeaway: A gate should tell the team whether to block, warn, or monitor.
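The three levels can be sketched as a classification over release signals. The signal names, the "pass" outcome for a clean release, and the function itself are assumptions made for this example; the grouping follows the level descriptions above.

```python
# Hypothetical signal names, grouped by the gate levels described above.
BLOCKER_SIGNALS = {
    "wrong_route",
    "source_mixing",
    "missing_denominator",
    "unsafe_confidence",
    "critical_fixture_failure",
}
WARNING_SIGNALS = {"judge_score_dip", "latency_increase", "low_risk_drift"}
OBSERVE_SIGNALS = {"low_sample_size"}


def gate_level(signals: set) -> str:
    """Map a release's signals to a decision; the most severe level wins."""
    if signals & BLOCKER_SIGNALS:
        return "block"
    if signals & WARNING_SIGNALS:
        return "warn"
    if signals & OBSERVE_SIGNALS:
        return "observe"
    return "pass"  # assumption: no signals means nothing to act on


print(gate_level({"judge_score_dip", "source_mixing"}))  # block
print(gate_level({"latency_increase"}))                  # warn
print(gate_level({"low_sample_size"}))                   # observe
```

Checking blockers first means a warning can never mask a severe failure in the same release.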

Why this is better than generic evals

Generic evals can say whether an answer seems good.

Kam fixtures preserve the exact product failure the team already saw. Each one records the sport, route, team, market, source family, and missing contract, which makes it hard to replace with off-the-shelf tests.

Open-source tools can help run or organize evals, but the fixture content is Kam's asset.

The lesson

The better Kam framework turns lessons into gates.

A trace without a label is a clue. A label without a fixture is memory. A fixture without a release gate is optional. The full loop creates protection.

The next action is to make fixture promotion a first-class KamOps workflow with release-gate status visible on every approved label.
