5th Grade Summary

A fixture is a saved test case.

When Kam learns from a mistake, it should save that lesson as a fixture.

Then every release can check that the mistake does not come back.

That is how labels become protection.

An approved label is not the end of the loop.

It is the beginning of regression protection.

If Kam learns that a team trend answer failed because it skipped the historical denominator, that lesson should not stay in a review queue. It should become a fixture that future releases must pass.

What a fixture contains

A fixture should be concrete enough to replay.

It should include the input, expected route, expected entities, required reads, forbidden behavior, grader set, and answer contract. It should also include artifact pointers, not just prose.

Fixture anatomy

| Fixture part | Example | Purpose |
| --- | --- | --- |
| Workload | chat.team_trends.v1 | Defines product obligation |
| Prompt | "Which games counted?" | Replays user scenario |
| Context | selected team, sport, trend, saved read | Avoids blank-prompt testing |
| Expected route | team_trends_denominator | Checks routing |
| Required reads | HISTORICAL_DENOMINATOR | Checks data contract |
| Required fields | date, opponent, closing spread, final score | Checks auditability |
| Forbidden behavior | generic trend answer, source mixing | Blocks known failures |
| Graders | route, entity, denominator, freshness, answer shape | Turns expectation into checks |
| Approval | reviewer, timestamp, label id | Preserves lineage |

Takeaway: A fixture should preserve the product truth that made the label worth approving.
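
One way to make the anatomy above concrete is a small record type. The sketch below is an illustration only, not Kam's actual schema: the class names, field names, the artifact_pointers field, and every placeholder value in the approval and context are assumptions. The workload, prompt, route, reads, fields, forbidden behaviors, and graders come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    # Lineage of the approved label (all values below are placeholders).
    reviewer: str
    timestamp: str  # ISO 8601 string
    label_id: str


@dataclass(frozen=True)
class Fixture:
    # Hypothetical shape; one field per row of the anatomy table.
    workload: str
    prompt: str
    context: dict
    expected_route: str
    required_reads: tuple
    required_fields: tuple
    forbidden_behavior: tuple
    graders: tuple
    approval: Approval
    artifact_pointers: tuple = ()  # pointers to traces or saved reads, not prose


denominator_fixture = Fixture(
    workload="chat.team_trends.v1",
    prompt="Which games counted?",
    context={
        "team": "<selected team>",
        "sport": "<sport>",
        "trend": "<trend>",
        "saved_read": "<saved read>",
    },
    expected_route="team_trends_denominator",
    required_reads=("HISTORICAL_DENOMINATOR",),
    required_fields=("date", "opponent", "closing spread", "final score"),
    forbidden_behavior=("generic trend answer", "source mixing"),
    graders=("route", "entity", "denominator", "freshness", "answer shape"),
    approval=Approval(
        reviewer="<reviewer>",
        timestamp="<timestamp>",
        label_id="<label id>",
    ),
)
```

Freezing the record is a design choice in this sketch: once a label is approved, the fixture should not drift silently between releases.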

From review to gate

Fixture promotion flow

The release gate should only enforce fixtures after the evidence is approved and the graders are stable.

  1. Trace fails

    Production behavior exposes a route, source, freshness, denominator, or usefulness failure.

  2. Label approved

    Human review confirms the expected behavior and failure taxonomy.

  3. Fixture created

    The input, context, expected contract, and graders are packaged for replay.

  4. Gate enforced

    The workload scorecard blocks releases that regress approved fixtures.

Do not gate on unreviewed labels. Do not leave reviewed labels unprotected.
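
The promotion flow can be read as a small state machine. The sketch below is a minimal illustration under assumptions: the Stage enum, the next_stage function, and its two boolean inputs are hypothetical names, not part of any existing KamOps API.

```python
from enum import Enum


class Stage(Enum):
    TRACE_FAILED = 1
    LABEL_APPROVED = 2
    FIXTURE_CREATED = 3
    GATE_ENFORCED = 4


def next_stage(stage, *, label_approved, graders_stable):
    """Return the stage a lesson may advance to, or the same stage if it is held."""
    if stage is Stage.TRACE_FAILED:
        # Do not gate on unreviewed labels: nothing moves without approval.
        return Stage.LABEL_APPROVED if label_approved else stage
    if stage is Stage.LABEL_APPROVED:
        # Do not leave reviewed labels unprotected: always package a fixture.
        return Stage.FIXTURE_CREATED
    if stage is Stage.FIXTURE_CREATED:
        # Enforce only once the evidence is approved and the graders are stable.
        if label_approved and graders_stable:
            return Stage.GATE_ENFORCED
        return stage
    return stage  # GATE_ENFORCED is terminal in this sketch
```

For example, a fixture whose graders are still flaky stays at FIXTURE_CREATED, so it can be replayed for information without blocking a release.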

Before and after release quality

Before:

tests pass
build passes
ship
wait for user reports

After:

tests pass
fixtures pass by workload
scorecards stay within threshold
release packet records evidence
ship
monitor trace drift

The second path is more work, but it reduces repeat failures.

Release gate priorities

  • Deterministic fixtures: gate first
  • Trace replay: high value
  • LLM judge samples: selective
  • Manual spot checks: still needed

Takeaway: The strongest gate is a reviewed fixture with deterministic checks and known production lineage.
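
A deterministic check of this kind can be very small. The sketch below assumes a hypothetical trace shape (a dict with route, reads, and behaviors keys) and a hypothetical grade function; the observed route name team_trends_generic is invented for the example. It compares what a replayed answer did against what the fixture expects, with no model in the loop.

```python
def grade(expected, observed):
    """Return a list of failure strings; an empty list means the fixture passed."""
    failures = []
    if observed.get("route") != expected["route"]:
        failures.append("wrong_route")
    # Every required read must appear in the replayed trace.
    missing = set(expected["required_reads"]) - set(observed.get("reads", []))
    failures.extend(f"missing_read:{name}" for name in sorted(missing))
    # No forbidden behavior may appear in the replayed trace.
    hit = set(expected["forbidden_behavior"]) & set(observed.get("behaviors", []))
    failures.extend(f"forbidden:{name}" for name in sorted(hit))
    return failures


expected = {
    "route": "team_trends_denominator",
    "required_reads": ["HISTORICAL_DENOMINATOR"],
    "forbidden_behavior": ["generic trend answer", "source mixing"],
}

# Hypothetical replay of the original failure: wrong route, no denominator read.
regressed = {
    "route": "team_trends_generic",
    "reads": [],
    "behaviors": ["generic trend answer"],
}

failures = grade(expected, regressed)
```

Because the same inputs always produce the same failure list, a check like this is safe to gate on first.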

What gates should measure

A release gate should be specific.

Global pass rate can hide damage. If the overall suite is healthy but chat.market_shape.v1 regresses, the release still creates user-facing risk.

Better gates include:

  • per-workload fixture pass rate
  • severe label regression count
  • source separation failures
  • stale hot-read confidence failures
  • missing denominator failures
  • answer path fallback rate
  • judge disagreement rate
  • human-review hold count
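
To see how a global pass rate can hide a workload regression, the sketch below computes both numbers from the same results. The pass_rates function is a hypothetical helper, and the counts are made up for illustration, not measured data.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (workload, passed) pairs.

    Returns (global_rate, per_workload_rates).
    """
    totals = defaultdict(lambda: [0, 0])  # workload -> [passed, total]
    for workload, passed in results:
        totals[workload][0] += int(passed)
        totals[workload][1] += 1
    if not totals:
        raise ValueError("no fixture results to score")
    per_workload = {w: p / t for w, (p, t) in totals.items()}
    all_passed = sum(p for p, _ in totals.values())
    all_total = sum(t for _, t in totals.values())
    return all_passed / all_total, per_workload


# Illustrative numbers only: 95 of 95 pass in one workload, 2 of 5 in another.
results = (
    [("chat.team_trends.v1", True)] * 95
    + [("chat.market_shape.v1", True)] * 2
    + [("chat.market_shape.v1", False)] * 3
)
global_rate, per_workload = pass_rates(results)
```

With these numbers the suite passes 97 of 100 fixtures overall, while chat.market_shape.v1 passes only 2 of 5. A gate keyed to the global rate would ship; a per-workload gate would not.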

Gate levels

Blocker

Wrong route, source mixing, missing denominator, unsafe confidence, or fixture failure in a critical workload.

Warning

Judge score dips, longer latency, or drift in a lower-risk answer family.

Observe

New workload has low sample size but no confirmed severe failures yet.

Takeaway: A gate should tell the team whether to block, warn, or monitor.
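
The three levels can be expressed as an ordered classification. This is a sketch under assumptions: the GateLevel enum, the failure and signal names, and the classify function are hypothetical, and the PASS level is added here only so the function has a value to return when nothing is wrong.

```python
from enum import Enum


class GateLevel(Enum):
    BLOCKER = "block"
    WARNING = "warn"
    OBSERVE = "observe"
    PASS = "pass"  # assumption: not one of the three levels named in the text


# Hypothetical identifiers for the conditions listed under each level.
BLOCKING_FAILURES = {
    "wrong_route",
    "source_mixing",
    "missing_denominator",
    "unsafe_confidence",
}
WARNING_SIGNALS = {"judge_score_dip", "latency_increase", "low_risk_drift"}


def classify(failures, signals, *, critical_fixture_failed=False, low_sample=False):
    """Map a workload's findings to the most severe applicable gate level."""
    if critical_fixture_failed or BLOCKING_FAILURES & set(failures):
        return GateLevel.BLOCKER
    if WARNING_SIGNALS & set(signals):
        return GateLevel.WARNING
    if low_sample:
        # New workload: too little data to judge, no confirmed severe failures.
        return GateLevel.OBSERVE
    return GateLevel.PASS
```

Checking the most severe condition first means a blocker is never downgraded because a milder signal is also present.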

Why this is better than generic evals

Generic evals can say whether an answer seems good.

Kam fixtures preserve the exact product failure the team already saw. Each one records the sport, route, team, market, source family, and missing contract, which makes it hard to replace with off-the-shelf tests.

Open-source tools can help run or organize evals, but the fixture content is Kam's asset.

The lesson

A better Kam framework turns lessons into gates.

A trace without a label is a clue. A label without a fixture is memory. A fixture without a release gate is optional. The full loop creates protection.

The next action is to make fixture promotion a first-class KamOps workflow with release-gate status visible on every approved label.
