Kam Evals
Deterministic Graders Before LLM Judges


Kam AI
Product and research


An LLM judge can say an answer sounds good.
Kam also needs to know whether the answer used the right route, team, sport, and source, and whether it respected data freshness and the denominator.
Those checks should be deterministic.
Kam asks the judge only after the facts are checked.
Raw LLM judging is not enough for Kam.
That does not mean LLM judges are useless. It means they should sit in the right place in the ladder. They are helpful for usefulness, clarity, and nuanced answer quality. They are not the best first tool for checking whether the answer used the right team, mixed source families, skipped the required hot read, or failed to expose the historical denominator.
A deterministic grader checks facts that should not depend on taste.
Examples:
as_of freshness?
These are not vibes.
They are product contracts.
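A freshness contract is one of the easier checks to make deterministic. The sketch below is a minimal illustration, not Kam's implementation: the `is_fresh` function and the five-minute `MAX_AGE` budget are assumptions chosen for the example, and only the `as_of` name comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budget; the real threshold is a product decision.
MAX_AGE = timedelta(minutes=5)


def is_fresh(as_of: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True when as_of is not in the future and is at most max_age old."""
    age = now - as_of
    return timedelta(0) <= age <= max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=2), now))   # True
print(is_fresh(now - timedelta(minutes=30), now))  # False
```

The check has no opinion about wording. It passes or fails on the timestamp alone, which is what makes it a contract rather than a review.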
What should be deterministic
Takeaway: The judge should evaluate the answer after Kam proves the answer was eligible to be trusted.
A good eval path fails as close to the product contract as possible.
Checks route, entity, workload, required fields, and forbidden paths.
Checks sportsbook, prediction-market, hot-read, freshness, and denominator evidence.
Checks usefulness, clarity, calibration, and whether the answer would help a human decide next steps.
Approves the label, fixture, rubric, or release decision when judgment matters.
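The first rung can be sketched as a plain comparison between a trace and a contract. This is an illustrative sketch under stated assumptions: the `Contract` dataclass, the trace dictionary layout, and the failure strings are hypothetical and are not taken from Kam's codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical product contract for one kind of question."""
    route: str
    entity: str
    workload: str
    required_fields: frozenset
    forbidden_paths: frozenset = frozenset()


def grade_contract(trace: dict, contract: Contract) -> list[str]:
    """Return a list of failure strings; an empty list means the trace passes."""
    failures = []
    # Route, entity, and workload must match the contract exactly.
    for key in ("route", "entity", "workload"):
        if trace.get(key) != getattr(contract, key):
            failures.append(f"wrong_{key}")
    # Every required field must be present in the trace.
    missing = contract.required_fields - set(trace.get("fields", {}))
    failures += [f"missing_field:{name}" for name in sorted(missing)]
    # No forbidden path may appear among the paths the answer used.
    hit = contract.forbidden_paths & set(trace.get("paths", []))
    failures += [f"forbidden_path:{path}" for path in sorted(hit)]
    return failures


contract = Contract(
    route="line_move",
    entity="team:BOS",
    workload="explain",
    required_fields=frozenset({"as_of", "denominator"}),
    forbidden_paths=frozenset({"cache_only"}),
)
trace = {
    "route": "line_move",
    "entity": "team:NYK",  # wrong team
    "workload": "explain",
    "fields": {"as_of": "2024-01-01T12:00:00Z"},
    "paths": ["hot_read", "cache_only"],
}
print(grade_contract(trace, contract))
# ['wrong_entity', 'missing_field:denominator', 'forbidden_path:cache_only']
```

Each failure string names the contract that broke, so an engineer can go straight to the route, the field, or the path without reading the answer.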
Sports questions are compact.
The user may ask:
Why did this move?
That question is only answerable if the system knows what "this" means. It needs the selected event, market, book, current line, prior line, timestamp, saved read, and source family. If the assistant guesses, the answer can become persuasive fiction.
The deterministic grader should fail the answer before style review.
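A minimal sketch of that check, assuming the resolved context arrives as a dictionary. The key names mirror the list above, but the dictionary shape, the `grade_reference` function, and the `FAIL_UNRESOLVED_REFERENCE` code are hypothetical.

```python
# Context the system must resolve before "this" is answerable.
REQUIRED_CONTEXT = (
    "event", "market", "book", "current_line",
    "prior_line", "timestamp", "saved_read", "source_family",
)


def grade_reference(context: dict) -> tuple[str, list[str]]:
    """Fail with the list of unresolved keys, or pass with an empty list."""
    missing = [key for key in REQUIRED_CONTEXT if context.get(key) is None]
    if missing:
        return "FAIL_UNRESOLVED_REFERENCE", missing
    return "PASS", []


context = {
    "event": "BOS@NYK",
    "market": "spread",
    "book": "example_book",
    "current_line": -3.5,
    "prior_line": None,       # the assistant never loaded the prior line
    "timestamp": "2024-01-01T12:00:00Z",
    "saved_read": "read_123",
    # source_family is absent entirely
}
print(grade_reference(context))
# ('FAIL_UNRESOLVED_REFERENCE', ['prior_line', 'source_family'])
```

An answer built on this context would be a guess about what moved and from where, so it fails here, before anyone scores its style.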
Before:
LLM answer -> LLM judge -> score
After:
trace -> route grader -> source grader -> denominator grader -> judge -> human approval
The after version is slower to design but faster to operate. It gives engineers a precise failure code instead of a vague low score.
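The ordering can be sketched as a short-circuiting pipeline. The grader names follow the flow above, but everything else is an assumption for illustration: the trace keys, the failure codes, the rule that an answer should draw on exactly one source family, and the stubbed judge.

```python
from typing import Callable, Optional

# A grader returns a failure code, or None when the trace passes.
Grader = Callable[[dict], Optional[str]]


def route_grader(trace: dict) -> Optional[str]:
    return None if trace.get("route") == trace.get("expected_route") else "WRONG_ROUTE"


def source_grader(trace: dict) -> Optional[str]:
    # Assumption: an answer should draw on exactly one source family.
    families = set(trace.get("source_families", []))
    return None if len(families) == 1 else "MIXED_OR_MISSING_SOURCES"


def denominator_grader(trace: dict) -> Optional[str]:
    return None if trace.get("denominator") is not None else "MISSING_DENOMINATOR"


LADDER: tuple = (route_grader, source_grader, denominator_grader)


def run_ladder(trace: dict, judge: Callable[[dict], float], graders=LADDER) -> dict:
    """Run deterministic graders in order; call the judge only if all pass."""
    for grader in graders:
        code = grader(trace)
        if code is not None:
            # Stop at the first broken contract and report its code.
            return {"status": "failed", "failure_code": code, "judge_score": None}
    # The judge score is advisory; a human still approves promotion.
    return {"status": "awaiting_human_approval", "failure_code": None,
            "judge_score": judge(trace)}


def stub_judge(trace: dict) -> float:
    """Stand-in for an LLM judge call; returns a fixed score."""
    return 0.8


good = {"route": "line_move", "expected_route": "line_move",
        "source_families": ["sportsbook"], "denominator": 120}
bad = {**good, "denominator": None}

print(run_ladder(bad, stub_judge)["failure_code"])  # MISSING_DENOMINATOR
print(run_ladder(good, stub_judge)["status"])       # awaiting_human_approval
```

In this sketch the judge is never invoked for a trace that fails a contract, which is where the cost saving and the precise failure code both come from.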
Where each method is strongest
Route/entity checks: Deterministic
Freshness/source checks: Deterministic
Usefulness review: Judge helps
Final promotion: Human
Takeaway: The right mix is not anti-judge. It is judge-after-contract.
The judge still matters.
It should answer questions like:
Best use of KamJudge
Would this answer help a human make a better next check?
Does confidence match source quality, freshness, and missing data?
Is the answer clear without hiding caveats or overloading the user?
Takeaway: KamJudge is most valuable when deterministic evidence has already narrowed the question.
The better Kam framework makes graders boring on purpose.
Every deterministic contract that can be checked without a judge should be checked without a judge. That lowers cost, improves diagnosis, and keeps LLM review focused on judgment rather than bookkeeping.
The next action is to expand grader coverage around the highest-risk labels: wrong route, unresolved sport, mixed sources, stale hot reads, missing denominator, and unsafe workflow escalation.
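One way to track that expansion is a coverage report over the labels just listed. This is a rough sketch: the label identifiers are snake_case renderings of the labels above, and the fixture record shape (a dictionary with a `label` key) is an assumption, not Kam's fixture format.

```python
# High-risk labels named in the text, rendered as hypothetical identifiers.
HIGH_RISK_LABELS = frozenset({
    "wrong_route",
    "unresolved_sport",
    "mixed_sources",
    "stale_hot_read",
    "missing_denominator",
    "unsafe_workflow_escalation",
})


def coverage_gaps(fixtures: list[dict]) -> list[str]:
    """Return high-risk labels that no grader fixture exercises yet, sorted."""
    covered = {fixture["label"] for fixture in fixtures}
    return sorted(HIGH_RISK_LABELS - covered)


fixtures = [{"label": "wrong_route"}, {"label": "mixed_sources"}]
print(coverage_gaps(fixtures))
# ['missing_denominator', 'stale_hot_read', 'unresolved_sport', 'unsafe_workflow_escalation']
```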