5th Grade Summary

Kam checks answers before users trust them.

First, it checks if the question went to the right kind of answer.

Then it checks the data, tools, freshness, saved read, and answer shape.

If an answer is wrong, Kam labels the mistake and turns it into a future test.

Most AI products test whether the app responds.

Kam has to test whether the response deserves trust.

That is a different job.

Customers should not have to know the word "eval."

They should feel something simpler:

Kam checked the move before it asked me to care.

Sports-market research is full of questions that sound simple but hide risk:

Why did this line move?
Is my bet covering?
Any gap left?
What moved since I last checked?
Which games counted?

Those questions are short because the user assumes the product already knows the board, the selected event, the market, the book, the ticket, the prior thesis, and the freshness state. If the system guesses at any of that context, the answer can sound confident while being wrong.

That is why Kam's eval system is not one giant "did the model sound good?" test.

It is a ladder.

The ladder starts with deterministic checks that fail close to the source of the problem. It ends with live answer review and production readiness. The goal is not to make evals impressive. The goal is to make bad answers expensive to ship.

Why normal evals are not enough

Generic AI evals often start at the final answer.

Was the answer helpful?

Was the answer accurate?

Was the answer concise?

Those questions matter, but they are too late. By the time the final answer exists, the system already made several product decisions:

  • it interpreted the user's question
  • it selected a route
  • it loaded or skipped memory
  • it chose tools
  • it decided whether data was fresh enough
  • it constructed a prompt
  • it accepted or rejected missing context
  • it shaped the answer

If those steps are wrong, the final answer review becomes a crime-scene investigation.

Kam needs the eval to fail earlier.

If a user asks, "Which games counted?" after an aggregate ATS trend, the eval should not wait for a human to notice the answer is vague. It should already know that the route must drill down into the historical games behind the aggregate. It should know the forbidden route. It should know the required fields: date, opponent, closing spread, final score, ATS result, cover margin, sample size, and as_of.

That is the difference between testing a chatbot and testing a research system.

The ladder

Kam's eval ladder moves from deterministic structure to live judgment.

user prompt
  -> resolver reachability
  -> route
  -> read-model plan
  -> screen-chat parity
  -> tool plan
  -> prompt contract
  -> AgentTask lifecycle
  -> skill trial
  -> fixture E2E
  -> trace replay
  -> answer-path family
  -> live E2E
  -> human or judged review

The order matters.

Do not ask a model judge whether an answer was useful if the route was wrong.

Do not ask a human whether the writing was smooth if the answer used stale data too confidently.

Do not celebrate a passing live run if the same prompt fails on the second try.

The ladder is designed to separate "the stack functioned" from "the product helped a user make a safer decision."
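
To make the fail-fast idea concrete, here is a minimal sketch of a ladder runner. Every name in it, from the layer labels to the case fields, is illustrative rather than Kam's actual code; the only point is that the run stops at the first broken layer.

  # Minimal sketch of a fail-fast eval ladder. All names are illustrative.
  from typing import Callable

  def run_ladder(case: dict, layers: list[tuple[str, Callable[[dict], bool]]]) -> dict:
      # Run deterministic layers in order and stop at the first failure,
      # so the report points at the earliest broken step, not the final answer.
      for name, check in layers:
          if not check(case):
              return {"status": "fail", "failed_layer": name}
      return {"status": "pass", "failed_layer": None}

  # Example layer order mirroring the top of the ladder; the check bodies are stubs.
  LAYERS = [
      ("resolver_reachability", lambda c: c.get("skill_reachable", False)),
      ("route", lambda c: c.get("route") == c.get("expected_route")),
      ("read_model_plan", lambda c: c.get("read_model_fresh", False)),
      ("tool_plan", lambda c: c.get("tool_plan_ok", False)),
      ("prompt_contract", lambda c: c.get("contract_rules_present", False)),
  ]

A case that breaks at the route layer never reaches a judge; it fails with a named layer instead.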

Visual artifact

A simple eval ladder

A useful eval suite should fail as close to the real problem as possible. Do not wait for a final answer judge when route, data, or freshness already failed.

  1. Scope: route matched the question

    The prompt reached the right skill instead of a generic answer path.

  2. Evidence: required data was present

    The answer had the right event, market, line, score, source, or saved read.

  3. Evidence: freshness was acceptable

    Stale, delayed, missing, and unsafe-to-rank states were visible before confidence.

  4. Answer: the answer had a safe next step

    The final response explained uncertainty and gave a practical next check.

Evals are not a vanity score. They are the trust system that stops bad answers from shipping.

What each eval layer protects

Layer | What it catches | Why it matters
Resolver | Skill exists but cannot be reached from user language | Prevents dark skills
Route | User text maps to the wrong skill | Stops bad answers before tools run
Tool plan | Controller chooses the wrong path | Prevents expensive or irrelevant calls
Read model plan | Product truth object is missing, stale, or bypassed | Keeps chat and screens aligned
Prompt contract | Required rules are absent from the compiled prompt | Makes answer behavior auditable
AgentTask lifecycle | Work pauses, resumes, retries, or finishes in the wrong state | Prevents broken workflows
Skill trial | One skill is flaky across repeated deterministic cases | Stops lucky one-off passes
Trace replay | A saved turn no longer satisfies current contracts | Turns production behavior into regression coverage
Answer path | Follow-up context is preserved or reset incorrectly | Tests real multi-turn journeys
Human review | Product judgment, usefulness, and writing quality | Converts taste into durable rules

Takeaway: Kam evals start with deterministic product facts, then move toward live answer quality.

The vocabulary matters

Kam uses a simple eval vocabulary internally:

  • A case is one user scenario, skill scenario, fixture, or replayed trace.
  • An experiment is a named suite that bundles cases.
  • An evaluator is the check applied to a case.
  • A task adapter is the harness that turns the case into output and answer-path data.
  • An answer path is the route, tools, trace events, task state, and readable turn story.
  • A report is the artifact used to make a release decision.

This vocabulary prevents vague debates.

Instead of saying "the eval failed," the team can say:

Case: aggregate ATS follow-up
Experiment: answer-path family
Evaluator: route + required fields
Failure: wrong_skill
Next action: add route expectation and forbidden route

That is more useful than a score with no diagnosis.
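
One way to keep that diagnosis structured is a small record type. A minimal sketch, with field names that mirror the vocabulary above but are otherwise assumed:

  # Illustrative failure-diagnosis record using the eval vocabulary.
  # Field names are assumptions, not Kam's real schema.
  from dataclasses import dataclass

  @dataclass
  class EvalFailure:
      case: str           # one user scenario, fixture, or replayed trace
      experiment: str     # the named suite that bundles the case
      evaluator: str      # the check that was applied
      failure_label: str  # e.g. "wrong_skill", "missing_table"
      next_action: str    # the smallest source fix to schedule

  failure = EvalFailure(
      case="aggregate ATS follow-up",
      experiment="answer-path family",
      evaluator="route + required fields",
      failure_label="wrong_skill",
      next_action="add route expectation and forbidden route",
  )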

Start before the model call

The most important Kam evals do not require a model call.

That is intentional.

If a route is broken, the model cannot fix it. If a skill capsule is unreachable, the answer will never use it. If the tool policy is incomplete, the system may wander into the wrong data source. If the product read model is stale, the model may write beautifully about the wrong truth.

So Kam starts with the boring checks.

Can the resolver reach the skill?

Does the priority table know when to use it?

Does the route have fallback behavior?

Does the skill declare its required object families?

Does it know its freshness SLA?

Does it know when to stop and ask a user for scope?

Those checks are not glamorous, but they are high leverage.
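
A sketch of what those pre-model checks could look like in code. The skill-capsule field names here are assumptions made for illustration, not Kam's real schema:

  # Sketch of pre-model checks that require no model call.
  # The capsule fields below are assumed for illustration.
  def check_skill_capsule(skill: dict) -> list[str]:
      problems = []
      if not skill.get("triggers"):
          problems.append("resolver cannot reach the skill from user language")
      if not skill.get("required_object_families"):
          problems.append("skill does not declare its required object families")
      if skill.get("freshness_sla_seconds") is None:
          problems.append("skill has no freshness SLA")
      if not skill.get("fallback_route"):
          problems.append("route has no fallback behavior")
      if not skill.get("clarification_rule"):
          problems.append("skill does not know when to stop and ask for scope")
      return problems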

The practical split

Before model

Resolver reachability, route expectations, tool policy, read-model contracts, prompt-contract assembly, and task lifecycle.

During model

Live answer generation, provider behavior, tool-call adherence, writing contract, and response shape.

After model

Human review, failure labels, trace promotion, readiness scoring, postdeploy canaries, and recurring regression packs.

Takeaway: The earlier a regression is caught, the cheaper it is to fix and the less likely it is to become prompt sprawl.

A real failure pattern

Consider this user path:

Kam: The Lakers and Thunder are 11-9 ATS in the sample.
User: Which games counted?

A weak system may route the follow-up back to today's ATS board because it sees "games" and "ATS."

That is wrong.

The user is asking for the historical drilldown behind a previous aggregate answer.

The eval should specify:

  • expected route: get_betting_trend_game_details
  • forbidden route: get_ats_board
  • forbidden phrase: No graded ATS results are available today
  • required fields: date, opponent, closing spread, final score, ATS result, cover margin, sample size, as_of

This is not an abstract language problem.

It is a product continuity problem.

The answer is only safe if Kam carries the aggregate scope into the follow-up, resolves the drilldown, and names the sample that created the first claim.
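
Written as a fixture, that expectation might look roughly like this. The route names and forbidden phrase come from the spec above; the surrounding keys are illustrative, and the required fields are snake_cased for the sketch:

  # Illustrative fixture for the aggregate ATS drilldown follow-up.
  CASE = {
      "conversation": [
          "The Lakers and Thunder are 11-9 ATS in the sample.",  # prior Kam turn
          "Which games counted?",                                # user follow-up
      ],
      "expected_route": "get_betting_trend_game_details",
      "forbidden_routes": ["get_ats_board"],
      "forbidden_phrases": ["No graded ATS results are available today"],
      "required_fields": [
          "date", "opponent", "closing_spread", "final_score",
          "ats_result", "cover_margin", "sample_size", "as_of",
      ],
  }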

Product-value buckets

Kam evals should prove value in the questions users actually ask.

The core buckets are:

  1. Board
  2. My bets
  3. Movement
  4. Trends

Each bucket has a different failure mode.

Board answers can over-rank stale or incomplete boards.

My-bet answers can grade the wrong line.

Movement answers can describe line movement from the book's perspective instead of the bettor's perspective.

Trend answers can use percentages without a loaded denominator.

The product-value eval buckets

Bucket | User asks | Kam must prove
Board | Are favorites or dogs covering today? | Grounded board category summary with denominators only when loaded
My bets | I took Warriors +5.5. Am I covering? | Exact ticket state, entry line, score, cover margin, and one next action
Movement | Did the market move in my favor? | Bettor-side open-to-close value, not generic line-moved prose
Trends | Is LeBron covering spreads lately? | Grounded trend when supported, or a clean missing-data stop

Takeaway: The eval question should look like a bettor's question, not a developer label.

This is why Kam treats unsupported but valuable questions as eval material.

If a user asks a valuable question and the data is not ready, the correct behavior is not to delete the scenario. The correct behavior is to route to a missing-data guardrail.

A missing-data stop can be a good answer.

A fake edge is not.

Hard betting rules

Some rules should never be left to vibes.

For example:

  • moneyline grades as outright win or loss
  • spread grades from the user's entry line
  • spread margin is team_score + entry_spread - opponent_score
  • positive margin covered, zero pushed, negative failed to cover
  • do not grade an opening ticket with the closing line unless the user says they bet the close
  • do not use live odds to decide whether a pregame ticket got value, covered, or lost
  • use percent framing only when numerator and denominator are loaded

These rules make the evals product-specific.

They also make the product honest.

If a user says, "I took Warriors +5.5. Am I covering?" Kam should not answer with generic market commentary. It should lead with the current cover state, show the margin, and give one next action.

If the score or entry line is missing, Kam should ask for the exact missing field. It should not invent a result.
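
The spread rule is small enough to pin down directly. A minimal sketch of the margin formula above, not Kam's actual grader:

  # Minimal sketch of the spread margin rule; not Kam's actual grader.
  def grade_spread(team_score: int, opponent_score: int, entry_spread: float) -> str:
      margin = team_score + entry_spread - opponent_score
      if margin > 0:
          return "covered"
      if margin == 0:
          return "pushed"
      return "failed_to_cover"

  # "I took Warriors +5.5" with Warriors 110, opponent 114:
  # 110 + 5.5 - 114 = 1.5, so the ticket is currently covering.
  assert grade_spread(110, 114, 5.5) == "covered"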

The answer review loop

Deterministic evals catch structure.

Human review catches judgment.

Kam's review loop is straightforward:

answer
  -> review against checklist
  -> label mistake
  -> write ideal answer
  -> add contract, fixture, or replay
  -> rerun

The important part is the label.

"Bad answer" is not enough.

The label should name the product failure:

  • wrong_skill
  • missing_table
  • unsupported_causal_claim
  • ignored_preferred_book
  • stale_data_overconfidence
  • unsafe_to_rank
  • missing_next_action
  • asked_more_than_one_question

Labels turn taste into backlog.

They also stop prompt edits from becoming unstructured patches.

The Kam answer review order

  • Structural correctness: did the system route, load context, and use tools correctly?
  • Contract correctness: did the answer follow the skill rulebook?
  • Factual correctness: did the answer use grounded facts and freshness state?
  • Judgment quality: was it useful, appropriately cautious, and easy to act on?

Writing style comes last.

A polished wrong answer is still wrong.

Trace replay

An E2E eval report is useful once.

A replayable trace is useful forever.

Kam's trace loop looks like this:

production turn
  -> captured trace
  -> inspect in Kam Ops
  -> export fixture JSON
  -> replay against validator
  -> commit meaningful fixtures into eval history

A replayable trace should include the fields below (a minimal fixture shape is sketched after this list):

  • trace id
  • request id
  • surface
  • use case
  • skill id
  • model id
  • prompt profile
  • context pack
  • tool plan
  • controller facts
  • prompt contract
  • final answer
  • object refs
  • events
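
A minimal sketch of what such a fixture could look like. Only the field list comes from above; the key names, nesting, and placeholder values are assumptions:

  # One possible shape for a replayable trace fixture. Keys and nesting are assumed.
  TRACE_FIXTURE = {
      "trace_id": "trace-0001",
      "request_id": "req-0001",
      "surface": "chat",
      "use_case": "line_movement",
      "skill_id": "explain_line_move",          # illustrative skill name
      "model_id": "example-model",
      "prompt_profile": "default",
      "context_pack": {"selected_event": "evt-123", "selected_book": "book-1"},
      "tool_plan": ["get_line_history"],        # illustrative tool name
      "controller_facts": {"freshness": "fresh"},
      "prompt_contract": ["lead_with_cover_state", "one_next_action"],
      "final_answer": "placeholder answer text",
      "object_refs": ["event:evt-123"],
      "events": ["route_resolved", "tools_called", "answer_emitted"],
  }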

Trust receipt

What Kam should prove before confidence

A useful answer should leave a small receipt: route, scope, freshness, evidence, missing data, and confidence state.

Route: Line movement eval

Scope: Selected NBA game / spread market / opening and current line

Freshness: Current line updated within the accepted freshness window

Evidence loaded:

  • Opening spread is present
  • Current spread is present
  • Selected sportsbook is known
  • Game state is not final

Missing or caveated:

  • Injury source timing may be unavailable
  • Prediction-market comparison may be missing
  • Market-volume data may be unavailable

Status: Partial confidence until cause data is confirmed

It should also derive an answer-path artifact:

  • before: conversation state, selected event, selected book
  • routing: intent, route, reason, prefetch tool
  • tools: compact inputs and output summaries
  • truth objects: event, board, team, ticket, or watchlist objects used
  • answer: final text and answer shape
  • after: continuity state
  • validation: parity pass/fail and mismatches
  • performance: latency and step sequence

Kam's current replay mode is contract-only. It validates that the saved turn still has the required answer fields and that deterministic trace receipts still match the generated parity contract.

That is already valuable.

It means a production miss can become a saved case. It means the same mistake can fail tomorrow's build instead of becoming a Slack memory.
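
As a rough sketch, contract-only replay comes down to two questions: are the required answer fields still present, and do the saved parity receipts still match? The field and contract names below are assumed:

  # Rough sketch of a contract-only replay check: no model call, just
  # "does the saved turn still satisfy the current contract?"
  # Field and contract names are illustrative.
  def replay_contract_check(trace: dict, contract: dict) -> list[str]:
      failures = []
      for field in contract.get("required_answer_fields", []):
          if field not in trace.get("final_answer_fields", []):
              failures.append(f"missing answer field: {field}")
      expected = contract.get("expected_parity_receipts", {})
      actual = trace.get("parity_receipts", {})
      for fact_id, value in expected.items():
          if actual.get(fact_id) != value:
              failures.append(f"parity mismatch on {fact_id}")
      return failures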

Screen-chat parity

Sports research apps have a dangerous failure mode:

the screen says one thing and chat says another.

That cannot happen.

If Game Detail shows one spread, one freshness state, one source reference, or one market-alignment value, chat should not invent a second version of truth. It should explain the same product object.

That is why Kam uses parity checks around user-visible facts:

  • fact_id
  • display_value
  • coverage_status
  • source_refs
  • as_of

The target is simple:

The screen and chat should share one read-model truth.

Raw tool fallback is allowed only when the product read model is missing, stale, incomplete for the requested lens, or outside the hot-path contract. Even then, the fallback reason should be visible in the trace or answer.

That turns fallback from a hidden accident into an auditable product decision.
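
A parity check over those fields can be nearly mechanical. A sketch, assuming both surfaces can emit fact records keyed by fact_id:

  # Sketch of a screen-chat parity check over the user-visible fact fields above.
  # Assumes both surfaces expose fact records keyed by fact_id; names are illustrative.
  PARITY_FIELDS = ["display_value", "coverage_status", "source_refs", "as_of"]

  def parity_mismatches(screen_facts: dict, chat_facts: dict) -> list[str]:
      mismatches = []
      for fact_id, screen in screen_facts.items():
          chat = chat_facts.get(fact_id)
          if chat is None:
              mismatches.append(f"{fact_id}: missing from chat answer")
              continue
          for field in PARITY_FIELDS:
              if screen.get(field) != chat.get(field):
                  mismatches.append(f"{fact_id}: {field} differs between screen and chat")
      return mismatches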

AgentTask lifecycle is product quality

A correct final sentence does not compensate for a wrong task state.

If a task should wait for the user, it should not finish.

If the backend should retrieve an archive, Kam should not ask the user to provide backend-only data.

If a resume token is stale, wrong, terminal, or from the wrong provider continuation, it should be rejected.

These lifecycle checks matter because Kam is not only answering questions. It is managing research workflows.

For AgentTask-backed flows, evals should verify:

  • pause
  • resume
  • waiting for user
  • cancel
  • retry
  • done
  • stale-token rejection
  • wrong-task rejection
  • provider-switch rejection
  • normalized timeline selectors

That may sound operational, but users feel it directly.

Broken task state becomes repeated questions, lost context, or a product that looks like it forgot what it was doing.
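
One blunt way to test the lifecycle is an allowed-transition table. The state names follow the list above; the transition map itself is an assumption made for illustration:

  # Sketch of a lifecycle check: only transitions in ALLOWED may appear in a
  # task timeline. The transition map is assumed, not Kam's real state machine.
  ALLOWED = {
      "running": {"paused", "waiting_for_user", "retrying", "done", "cancelled"},
      "paused": {"running", "cancelled"},
      "waiting_for_user": {"running", "cancelled"},
      "retrying": {"running", "done", "cancelled"},
      "done": set(),       # terminal
      "cancelled": set(),  # terminal
  }

  def lifecycle_violations(timeline: list[str]) -> list[str]:
      violations = []
      for prev, nxt in zip(timeline, timeline[1:]):
          if nxt not in ALLOWED.get(prev, set()):
              violations.append(f"illegal transition: {prev} -> {nxt}")
      return violations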

Skill trials

A skill is not shippable just because it works once.

Kam tracks both pass@k and pass^k.

pass@k asks: did the agent succeed at least once?

pass^k asks: did every repeated run succeed?

For customer-facing chat, pass^k matters more.

One lucky run is not enough when the user is making a decision.
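
The two metrics are easy to state precisely over repeated runs of one case:

  # pass@k: at least one of k repeated runs passed.
  # pass^k: every one of the k repeated runs passed.
  def pass_at_k(results: list[bool]) -> bool:
      return any(results)

  def pass_hat_k(results: list[bool]) -> bool:
      return all(results)

  runs = [True, False, True]        # three trials of one skill case
  assert pass_at_k(runs) is True    # looks fine if one success is enough
  assert pass_hat_k(runs) is False  # fails the stricter customer-facing bar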

What production readiness should reward

  • Deterministic route and contract coverage: the first line of defense
  • Repeated high-risk flow pass^k: stability over luck
  • Trace replay promotion: real misses become tests
  • Human failure labels: taste becomes rules
  • Live E2E alone: useful, but too late on its own

Takeaway: A production eval loop should value repeatable structure and regression coverage more than one fluent live answer.

Trajectory evals

Single-turn evals are necessary.

They are not enough.

Users do not ask one perfect prompt and leave.

They move through journeys:

board -> open game -> event trends -> line move -> decision

or:

saved bet -> live state -> value movement -> why-you-liked-it review -> postgame lesson

Trajectory evals test whether Kam behaves like one coherent product across those turns.

The core rule is:

If Kam can resolve scope from loaded state or tools, act.
If not, ask one narrow question.

That rule sounds simple. It is hard in practice.

Kam has to preserve scope when the user continues. It has to reset scope when the user starts fresh. It has to avoid stale context substitution. It has to ask at most one narrow clarification when blocked. It has to avoid asking users for backend data the system should produce.
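
One way to encode that is a journey case with per-turn expectations, scored with pass^k at the family level. The route names and scope keys below are illustrative:

  # Illustrative multi-turn journey case. Route names and scope keys are assumed.
  JOURNEY = {
      "family": "board -> open game -> line move",
      "turns": [
          {"user": "What are the NBA odds today?",
           "expect": {"route": "get_board", "scope_reset": True}},
          {"user": "Open the Lakers game.",
           "expect": {"route": "open_game", "scope_carryover": ["board"]}},
          {"user": "Why did this line move?",
           "expect": {"route": "explain_line_move",
                      "scope_carryover": ["selected_event", "selected_book"],
                      "max_clarifying_questions": 1}},
      ],
  }

  def family_passes(turn_results: list[bool]) -> bool:
      # pass^k at the family level: one blocked turn breaks the journey
      return all(turn_results)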

Trajectory failures are product failures

Failure label | What happened | User impact
failed_scope_carryover | Follow-up lost the selected event or board | User repeats context
failed_scope_reset | New question incorrectly reused old context | User gets answer for wrong object
answered_before_object_resolution | Kam guessed before resolving the game, line, or ticket | Fake confidence
asked_more_than_one_question | Kam turned a narrow block into an interview | Friction
stale_context_substitution | Kam used old state because it was available | Unsafe answer
asked_user_for_backend_data | Kam asked for data the system should retrieve | Broken workflow

Takeaway: Trajectory evals test whether the product remembers, forgets, pauses, and resumes at the right time.

Family-level scoring uses pass^k.

One blocked turn breaks the journey.

That is strict by design. A multi-turn workflow is only as trustworthy as the step that loses scope.

Production readiness

The final question is not "did the latest eval pass?"

The better question is:

Is the eval system production ready?

Kam's production-readiness report looks for patterns:

  • failure taxonomy
  • worst skills and scenarios
  • repeatability and flake risk
  • trace replay coverage
  • personalization coverage
  • postdeploy canary readiness
  • production trace promotion into regression fixtures
  • scenario coverage across high-value user jobs

The rough score interpretation:

  • 90+: production-grade eval loop
  • 80-89: release-ready with monitoring
  • 70-79: solid predeploy framework
  • 55-69: needs hardening
  • below 55: missing core production eval coverage

The score is not a replacement for judgment.

It is a map.

It tells the team where the risk is hiding.

Scenario coverage

Kam does not need a giant random pile of prompts.

It needs coverage that matches real product jobs.

A narrow beta floor is around 60 unique scenarios. A production-level starting point is around 80. A broad target is around 150. The important part is balance:

Production scenario coverage target

Bucket | Minimum unique scenarios
Board | 20
My bets / ticket state | 20
Movement / CLV | 15
Trends / player / team | 10
System state / blocked / follow-up | 10

Takeaway: The prompt bank should represent the user jobs and failures that matter, not just the prompts that are easy to write.

The most valuable families are the obvious ones:

  • "What are the NBA odds today?"
  • "Tell me whether books and prediction markets agree today."
  • "Should I open this game?"
  • "Why did this line move?"
  • "What moved since I last checked?"
  • "FanDuel only."
  • "Just underdogs."
  • no games, stale games, missing prediction-market data, missing workspace
  • user correction paths where the user says the answer is stale or wrong

Simple prompts are not simple evals.

They are where most product trust is won or lost.

Postdeploy canaries

Local gates cannot catch every production issue.

They cannot fully prove environment health, production data shape, provider behavior, or live endpoint availability.

So Kam needs small postdeploy canaries:

  • daily board
  • open game
  • line move
  • market alignment
  • workspace delta
  • home research endpoint shape

A good canary is small enough to run every deploy and specific enough to fail for a real reason.

For example, the Home research canary checks that production rows have unique headlines, source badges, matching source counts, dive-deeper prompts, valid confidence, valid display shape, exactly one default-expanded row, and explicit risk notes when coverage is sparse.
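
A sketch of that row-shape canary, assuming each row is a record with roughly those fields; the exact keys are assumptions:

  # Sketch of the Home research row canary. Row keys are assumed for illustration.
  def home_research_canary(rows: list[dict]) -> list[str]:
      problems = []
      headlines = [r.get("headline") for r in rows]
      if len(headlines) != len(set(headlines)):
          problems.append("duplicate headlines")
      if sum(1 for r in rows if r.get("default_expanded")) != 1:
          problems.append("not exactly one default-expanded row")
      for r in rows:
          if not r.get("source_badges"):
              problems.append("row missing source badges")
          elif len(r["source_badges"]) != r.get("source_count"):
              problems.append("source count does not match badges")
          if not r.get("dive_deeper_prompt"):
              problems.append("row missing dive-deeper prompt")
          if r.get("confidence") not in {"low", "medium", "high"}:
              problems.append("invalid confidence value")
          if r.get("coverage") == "sparse" and not r.get("risk_note"):
              problems.append("sparse coverage without an explicit risk note")
      return problems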

That is not a chatbot eval.

That is a product-health eval.

Visual evals

Visuals are useful only when they clarify grounded numeric data.

Kam should not add a chart because an answer is long.

It should add a chart when the chart helps the user inspect movement, deltas, or comparisons.

The visual pack checks:

  • one compact chart when numeric sports data supports a visual
  • no chart when game, sport, workspace, or last-checked marker is missing
  • line or bar charts for movement, deltas, and comparisons
  • no pie charts for odds movement
  • title, takeaway, and text explanation for accessibility
  • a next action after the visual

That is the right standard.

Visual polish without grounded data is decoration. Grounded visuals shorten decision time.

What this means for product velocity

The best eval systems make teams faster, not slower.

They do that by making failures smaller.

If a bug is a route problem, fix the route.

If it is a missing read model, fix the data contract.

If it is a task-state problem, fix the lifecycle.

If it is a writing problem, update the answer contract.

If it is a repeated judgment miss, add a review label and promote it into a fixture.

This prevents the default AI-product failure mode:

bad answer -> add prompt text -> prompt gets bigger -> behavior gets less legible

Kam's eval system should create the opposite loop:

bad answer -> labeled failure -> smallest source fix -> regression test -> readiness report

That is how product quality compounds.

The operating principle

Kam is not trying to make an AI that always sounds confident.

Kam is trying to make a research system that knows when confidence has been earned.

That requires a different kind of eval culture:

  • deterministic before judged
  • route before answer
  • facts before prose
  • traces before memory
  • labels before opinions
  • repeated pass before lucky pass
  • production misses become tests
  • user-value buckets shape the scenario bank

The point is not to remove human judgment.

The point is to put human judgment in the right place.

Humans should decide what good research feels like, which failures matter, and which answer shape users trust. Evals should preserve those decisions so the product does not relearn the same lessons every week.

Eval architecture

Trust is a system property

Kam's answer quality comes from resolver gates, read-model contracts, trace replay, human labels, and production readiness reports working together. The visible answer is only the final artifact.

The short version

Good evals catch one bad answer.

Production evals catch patterns.

Kam needs both.

The deterministic ladder protects the product before model calls. Skill trials prove repeated stability. Trace replay turns real behavior into fixtures. Human review turns taste into labels. Trajectory evals test whether the product survives the journey. Readiness reports show where the risk is concentrated.

That is how Kam gets better without pretending that fluency equals trust.
