Kam Evals

Deterministic Graders Before LLM Judges

May 24, 2026 · 8 min read

Kam AI

Product and research

5th Grade Summary

An LLM judge can say an answer sounds good.

Kam also needs to know if the answer used the right route, team, sport, source, data freshness, and denominator.

Those checks should be deterministic.

Kam asks the judge only after the facts are checked.

Raw LLM judging is not enough for Kam.

That does not mean LLM judges are useless. It means they should sit in the right place in the ladder. They are helpful for usefulness, clarity, and nuanced answer quality. They are not the best first tool for checking whether the answer used the right team, mixed source families, skipped the required hot read, or failed to expose the historical denominator.

What deterministic means

A deterministic grader checks facts that should not depend on taste.

Examples:

Did the route match the workload?
Did the answer resolve the correct sport?
Did it use the right team or player?
Did it separate sportsbook odds from prediction-market context?
Did it load the required hot read?
Did it show as_of freshness?
Did it include the games behind an aggregate trend?
Did it avoid an agentic workflow when normal chat should answer quickly?

These are not vibes.

They are product contracts.
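A minimal sketch of what a few of these checks could look like as plain functions. The `Trace` record, its field names, and the failure codes are all hypothetical, chosen only for illustration; Kam's actual trace format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trace:
    """Hypothetical record of one answered request."""
    expected_workload: str
    actual_workload: str
    expected_sport: str
    resolved_sport: Optional[str]
    as_of: Optional[str]  # freshness timestamp shown in the answer, if any


def grade_route(trace: Trace) -> Optional[str]:
    """Return a failure code, or None when the route matches the workload."""
    if trace.actual_workload != trace.expected_workload:
        return "wrong_route"
    return None


def grade_sport(trace: Trace) -> Optional[str]:
    """Fail when the sport was never resolved or resolved incorrectly."""
    if trace.resolved_sport is None:
        return "unresolved_sport"
    if trace.resolved_sport != trace.expected_sport:
        return "wrong_sport"
    return None


def grade_freshness(trace: Trace) -> Optional[str]:
    """Fail when the answer shows no as_of timestamp."""
    if not trace.as_of:
        return "missing_as_of"
    return None


def run_contract_checks(trace: Trace) -> list[str]:
    """Run every check and collect the failure codes, in check order."""
    checks = (grade_route, grade_sport, grade_freshness)
    return [code for check in checks if (code := check(trace)) is not None]
```

Each function has the same inputs and the same answer every time, which is the point: a failed check is a fact about the trace, not an opinion about the prose.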

What should be deterministic

Check	Deterministic rule	Why an LLM judge is too late
Route	Expected workload equals actual workload	A fluent wrong route can still sound helpful
Entity	Team, player, game, or market matches expected entity	The judge may miss sports-specific identity drift
Source	Sportsbook and prediction-market evidence are separated	Mixed sources can create false confidence
Freshness	Required timestamp or stale warning is present	Staleness is a contract, not a writing preference
Denominator	Aggregate trend names sample, games, and dates	Trend claims need auditability
Tool policy	Disallowed tools or workflows were not used	Automation should follow bounded rules
Answer shape	Required caveat and next check are present	Users need visible uncertainty before action

Takeaway: The judge should evaluate the answer after Kam proves the answer was eligible to be trusted.
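Two of the rows above, Source and Denominator, can be sketched as follows. This is one possible reading, not Kam's implementation: it assumes "separated" means no single claim draws on both source families, and that a trend's stated sample size must equal the number of games listed. The `Claim` and `Trend` types and the failure codes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

SOURCE_FAMILIES = {"sportsbook", "prediction_market"}


@dataclass
class Claim:
    """Hypothetical: one statement in the answer and the families it cites."""
    text: str
    source_families: set[str]


@dataclass
class Trend:
    """Hypothetical: the evidence attached to an aggregate trend claim."""
    sample_size: Optional[int] = None
    game_ids: list[str] = field(default_factory=list)
    start_date: Optional[str] = None
    end_date: Optional[str] = None


def grade_source_separation(claims: list[Claim]) -> Optional[str]:
    """Fail if any single claim blends sportsbook and prediction-market evidence."""
    for claim in claims:
        if len(claim.source_families & SOURCE_FAMILIES) > 1:
            return "mixed_sources"
    return None


def grade_denominator(trend: Trend) -> Optional[str]:
    """Fail unless the trend names its sample, its games, and its dates."""
    if trend.sample_size is None or not trend.game_ids:
        return "missing_denominator"
    if trend.sample_size != len(trend.game_ids):
        return "denominator_mismatch"
    if not (trend.start_date and trend.end_date):
        return "missing_date_range"
    return None
```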

The ladder

Kam judge ladder

A good eval path fails as close to the product contract as possible.

01 (scope) Contract grader: Checks route, entity, workload, required fields, and forbidden paths.
02 (evidence) Source grader: Checks sportsbook, prediction-market, hot-read, freshness, and denominator evidence.
03 (answer) LLM judge: Checks usefulness, clarity, calibration, and whether the answer would help a human decide next steps.
04 (answer) Human approval: Approves the label, fixture, rubric, or release decision when judgment matters.

Do not spend judge tokens on answers that already failed deterministic rules.
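The short-circuit behavior of the ladder can be sketched like this. It is an illustration under assumptions: `run_ladder`, the dict-shaped trace, and the result keys are hypothetical, and the judge is passed in as a plain callable so no particular LLM API is implied.

```python
from typing import Callable, Optional, Sequence

# A grader returns a failure code, or None when the trace passes.
Grader = Callable[[dict], Optional[str]]


def run_ladder(
    trace: dict,
    contract_graders: Sequence[Grader],
    source_graders: Sequence[Grader],
    judge: Callable[[dict], float],
) -> dict:
    """Run deterministic stages first; call the judge only if all of them pass."""
    stages = (("scope", contract_graders), ("evidence", source_graders))
    for stage, graders in stages:
        for grader in graders:
            code = grader(trace)
            if code is not None:
                # Stop at the first failure: no judge tokens are spent.
                return {"stage": stage, "failure": code, "judge_score": None}
    return {"stage": "answer", "failure": None, "judge_score": judge(trace)}
```

Human approval is left out of the sketch on purpose. It applies to labels, fixtures, rubrics, and release decisions rather than to every individual trace.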

Why this matters in sports

Sports questions are compact.

The user may ask:

Why did this move?

That question is only answerable if the system knows what "this" means. It needs the selected event, market, book, current line, prior line, timestamp, saved read, and source family. If the assistant guesses, the answer can become persuasive fiction.

If any of that context is missing, the deterministic grader should fail the answer before style review begins.
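As a rough sketch, the context check for "Why did this move?" could be a required-field test over the eight items named above. The key names and the `unresolved_context` failure code are assumptions made for this example.

```python
from typing import Optional

# Hypothetical key names for the context a line-move question depends on.
REQUIRED_CONTEXT = (
    "event",
    "market",
    "book",
    "current_line",
    "prior_line",
    "timestamp",
    "saved_read",
    "source_family",
)


def missing_context(context: dict) -> list[str]:
    """List required keys that are absent or explicitly None."""
    return [key for key in REQUIRED_CONTEXT if context.get(key) is None]


def grade_line_move_context(context: dict) -> Optional[str]:
    """Return a failure code naming the missing fields, or None if complete."""
    missing = missing_context(context)
    if missing:
        return "unresolved_context:" + ",".join(missing)
    return None
```

The check compares against `None` rather than testing truthiness, because a line of `0` is a legitimate value and should not count as missing.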

Before and after

Before:

LLM answer -> LLM judge -> score

After:

trace -> route grader -> source grader -> denominator grader -> judge -> human approval

The "after" pipeline is slower to design but faster to operate, because each stage gives engineers a precise failure code instead of a vague low score.
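One practical payoff of failure codes is that a batch of runs can be tallied directly. The sketch below assumes each result is a dict with a `failure` key holding a code or `None`; that shape is hypothetical.

```python
from collections import Counter


def summarize_failures(results: list[dict]) -> list[tuple[str, int]]:
    """Count failure codes across a batch, most frequent first."""
    counts = Counter(r["failure"] for r in results if r.get("failure"))
    return counts.most_common()
```

A single averaged judge score cannot be broken down this way. A tally of codes shows which contract is failing most often.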

Where each method is strongest

Route/entity checks: Deterministic
Freshness/source checks: Deterministic
Usefulness review: Judge helps
Final promotion: Human

Takeaway: The right mix is not anti-judge. It is judge-after-contract.

What the LLM judge should do

The judge still matters.

It should answer questions like:

Did the answer explain uncertainty in a useful way?
Did it help the user decide whether to wait, pass, save, or investigate?
Did it avoid overconfident betting language?
Did it connect the source evidence to the user question?
Was the answer concise enough for the product surface?

Best use of KamJudge

Usefulness: Would this answer help a human make a better next check?
Calibration: Does confidence match source quality, freshness, and missing data?
Communication: Is the answer clear without hiding caveats or overloading the user?

Takeaway: KamJudge is most valuable when deterministic evidence has already narrowed the question.
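One way to keep the judge focused is to hand it the already-verified facts along with the three rubric questions. The sketch below only builds a prompt string. The function name, the prompt wording, and the 1 to 5 scale are assumptions; how KamJudge is actually prompted is not described here.

```python
# Rubric questions taken from the three dimensions above.
RUBRIC = {
    "usefulness": "Would this answer help a human make a better next check?",
    "calibration": "Does confidence match source quality, freshness, and missing data?",
    "communication": "Is the answer clear without hiding caveats or overloading the user?",
}


def build_judge_prompt(answer: str, verified_facts: dict[str, str]) -> str:
    """Assemble a judge prompt that separates verified facts from judgment calls."""
    facts = "\n".join(f"- {key}: {value}" for key, value in sorted(verified_facts.items()))
    questions = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        "Facts already verified by deterministic graders (do not re-check):\n"
        f"{facts}\n\n"
        "Score each dimension from 1 to 5:\n"
        f"{questions}\n\n"
        f"Answer to review:\n{answer}\n"
    )
```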

The lesson

The better Kam framework makes graders boring on purpose.

Every deterministic contract that can be checked without a judge should be checked without a judge. That lowers cost, improves diagnosis, and keeps LLM review focused on judgment rather than bookkeeping.

The next action is to expand grader coverage around the highest-risk labels: wrong route, unresolved sport, mixed sources, stale hot reads, missing denominator, and unsafe workflow escalation.
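A small coverage report can make that next action measurable. This sketch assumes fixtures are dicts with a `label` key and uses hypothetical label strings derived from the list above.

```python
# Hypothetical label strings for the six highest-risk failure types.
HIGH_RISK_LABELS = (
    "wrong_route",
    "unresolved_sport",
    "mixed_sources",
    "stale_hot_read",
    "missing_denominator",
    "unsafe_workflow_escalation",
)


def coverage_gaps(fixtures: list[dict], minimum: int = 1) -> dict[str, int]:
    """Return high-risk labels with fewer than `minimum` fixtures, with their counts."""
    counts = {label: 0 for label in HIGH_RISK_LABELS}
    for fixture in fixtures:
        label = fixture.get("label")
        if label in counts:
            counts[label] += 1
    return {label: n for label, n in counts.items() if n < minimum}
```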
