5th Grade Summary

Kam checks answers before users trust them.

First, it checks if the question went to the right kind of answer.

Then it checks the data, tools, freshness, saved read, and answer shape.

If an answer is wrong, Kam labels the mistake and turns it into a future test.

Most AI products test whether the app responds.

Kam has to test whether the response deserves trust.

That is a different job.

Customers should not have to know the word "eval."

They should feel something simpler:

Kam checked the move before it asked me to care.

Sports-market research is full of questions that sound simple but hide risk:

Why did this line move?
Is my bet covering?
Any gap left?
What moved since I last checked?
Which games counted?

Those questions are short because the user assumes the product already knows the board, the selected event, the market, the book, the ticket, the prior thesis, and the freshness state. If the system guesses at any of that context, the answer can sound confident while being wrong.

That is why Kam's eval system is not one giant "did the model sound good?" test.

It is a ladder.

The ladder starts with deterministic checks that fail close to the source of the problem. It ends with live answer review and production readiness. The goal is not to make evals impressive. The goal is to make bad answers expensive to ship.

Why normal evals are not enough

Generic AI evals often start at the final answer.

Was the answer helpful?

Was the answer accurate?

Was the answer concise?

Those questions matter, but they are too late. By the time the final answer exists, the system already made several product decisions:

  • it interpreted the user's question
  • it selected a route
  • it loaded or skipped memory
  • it chose tools
  • it decided whether data was fresh enough
  • it constructed a prompt
  • it accepted or rejected missing context
  • it shaped the answer

If those steps are wrong, the final answer review becomes a crime-scene investigation.

Kam needs the eval to fail earlier.

If a user asks, "Which games counted?" after an aggregate ATS trend, the eval should not wait for a human to notice the answer is vague. It should already know that the route must drill down into the historical games behind the aggregate. It should know the forbidden route. It should know the required fields: date, opponent, closing spread, final score, ATS result, cover margin, sample size, and as_of.

That is the difference between testing a chatbot and testing a research system.

The ladder

Kam's eval ladder moves from deterministic structure to live judgment.

user prompt
  -> resolver reachability
  -> route
  -> read-model plan
  -> screen-chat parity
  -> tool plan
  -> prompt contract
  -> AgentTask lifecycle
  -> skill trial
  -> fixture E2E
  -> trace replay
  -> answer-path family
  -> live E2E
  -> human or judged review

The order matters.

Do not ask a model judge whether an answer was useful if the route was wrong.

Do not ask a human whether the writing was smooth if the answer used stale data too confidently.

Do not celebrate a passing live run if the same prompt fails on the second try.

The ladder is designed to separate "the stack functioned" from "the product helped a user make a safer decision."
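
To make the fail-fast idea concrete, here is a minimal sketch of a ladder runner. Every name in it, from the layer labels to the case fields, is illustrative rather than Kam's actual code; the only point is that the run stops at the first broken layer.

  # Minimal sketch of a fail-fast eval ladder. All names are illustrative.
  from typing import Callable

  def run_ladder(case: dict, layers: list[tuple[str, Callable[[dict], bool]]]) -> dict:
      # Run deterministic layers in order and stop at the first failure,
      # so the report points at the earliest broken step, not the final answer.
      for name, check in layers:
          if not check(case):
              return {"status": "fail", "failed_layer": name}
      return {"status": "pass", "failed_layer": None}

  # Example layer order mirroring the top of the ladder; the check bodies are stubs.
  LAYERS = [
      ("resolver_reachability", lambda c: c.get("skill_reachable", False)),
      ("route", lambda c: c.get("route") == c.get("expected_route")),
      ("read_model_plan", lambda c: c.get("read_model_fresh", False)),
      ("tool_plan", lambda c: c.get("tool_plan_ok", False)),
      ("prompt_contract", lambda c: c.get("contract_rules_present", False)),
  ]

A case that breaks at the route layer never reaches a judge; it fails with a named layer instead.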

Visual artifact

A simple eval ladder

A useful eval suite should fail as close to the real problem as possible. Do not wait for a final answer judge when route, data, or freshness already failed.

  1. Scope: route matched the question

    The prompt reached the right skill instead of a generic answer path.

  2. Evidence: required data was present

    The answer had the right event, market, line, score, source, or saved read.

  3. Evidence: freshness was acceptable

    Stale, delayed, missing, and unsafe-to-rank states were visible before confidence.

  4. Answer: the answer had a safe next step

    The final response explained uncertainty and gave a practical next check.

Evals are not a vanity score. They are the trust system that stops bad answers from shipping.

What each eval layer protects

Layer | What it catches | Why it matters
Resolver | Skill exists but cannot be reached from user language | Prevents dark skills
Route | User text maps to the wrong skill | Stops bad answers before tools run
Tool plan | Controller chooses the wrong path | Prevents expensive or irrelevant calls
Read model plan | Product truth object is missing, stale, or bypassed | Keeps chat and screens aligned
Prompt contract | Required rules are absent from the compiled prompt | Makes answer behavior auditable
AgentTask lifecycle | Work pauses, resumes, retries, or finishes in the wrong state | Prevents broken workflows
Skill trial | One skill is flaky across repeated deterministic cases | Stops lucky one-off passes
Trace replay | A saved turn no longer satisfies current contracts | Turns production behavior into regression coverage
Answer path | Follow-up context is preserved or reset incorrectly | Tests real multi-turn journeys
Human review | Product judgment, usefulness, and writing quality | Converts taste into durable rules

Takeaway: Kam evals start with deterministic product facts, then move toward live answer quality.

The vocabulary matters

Kam uses a simple eval vocabulary internally:

  • A case is one user scenario, skill scenario, fixture, or replayed trace.
  • An experiment is a named suite that bundles cases.
  • An evaluator is the check applied to a case.
  • A task adapter is the harness that turns the case into output and answer-path data.
  • An answer path is the route, tools, trace events, task state, and readable turn story.
  • A report is the artifact used to make a release decision.

This vocabulary prevents vague debates.

Instead of saying "the eval failed," the team can say:

Case: aggregate ATS follow-up
Experiment: answer-path family
Evaluator: route + required fields
Failure: wrong_skill
Next action: add route expectation and forbidden route

That is more useful than a score with no diagnosis.
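
One way to keep that diagnosis structured is a small record type. A minimal sketch, with field names that mirror the vocabulary above but are otherwise assumed:

  # Illustrative failure-diagnosis record using the eval vocabulary.
  # Field names are assumptions, not Kam's real schema.
  from dataclasses import dataclass

  @dataclass
  class EvalFailure:
      case: str           # one user scenario, fixture, or replayed trace
      experiment: str     # the named suite that bundles the case
      evaluator: str      # the check that was applied
      failure_label: str  # e.g. "wrong_skill", "missing_table"
      next_action: str    # the smallest source fix to schedule

  failure = EvalFailure(
      case="aggregate ATS follow-up",
      experiment="answer-path family",
      evaluator="route + required fields",
      failure_label="wrong_skill",
      next_action="add route expectation and forbidden route",
  )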

Start before the model call

The most important Kam evals do not require a model call.

That is intentional.

If a route is broken, the model cannot fix it. If a skill capsule is unreachable, the answer will never use it. If the tool policy is incomplete, the system may wander into the wrong data source. If the product read model is stale, the model may write beautifully about the wrong truth.

So Kam starts with the boring checks.

Can the resolver reach the skill?

Does the priority table know when to use it?

Does the route have fallback behavior?

Does the skill declare its required object families?

Does it know its freshness SLA?

Does it know when to stop and ask a user for scope?

Those checks are not glamorous, but they are high leverage.
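
A sketch of what those pre-model checks could look like in code. The skill-capsule field names here are assumptions made for illustration, not Kam's real schema:

  # Sketch of pre-model checks that require no model call.
  # The capsule fields below are assumed for illustration.
  def check_skill_capsule(skill: dict) -> list[str]:
      problems = []
      if not skill.get("triggers"):
          problems.append("resolver cannot reach the skill from user language")
      if not skill.get("required_object_families"):
          problems.append("skill does not declare its required object families")
      if skill.get("freshness_sla_seconds") is None:
          problems.append("skill has no freshness SLA")
      if not skill.get("fallback_route"):
          problems.append("route has no fallback behavior")
      if not skill.get("clarification_rule"):
          problems.append("skill does not know when to stop and ask for scope")
      return problems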

The practical split

Before model

Resolver reachability, route expectations, tool policy, read-model contracts, prompt-contract assembly, and task lifecycle.

During model

Live answer generation, provider behavior, tool-call adherence, writing contract, and response shape.

After model

Human review, failure labels, trace promotion, readiness scoring, postdeploy canaries, and recurring regression packs.

Takeaway: The earlier a regression is caught, the cheaper it is to fix and the less likely it is to become prompt sprawl.

A real failure pattern

Consider this user path:

Kam: The Lakers and Thunder are 11-9 ATS in the sample.
User: Which games counted?

A weak system may route the follow-up back to today's ATS board because it sees "games" and "ATS."

That is wrong.

The user is asking for the historical drilldown behind a previous aggregate answer.

The eval should specify:

  • expected route: get_betting_trend_game_details
  • forbidden route: get_ats_board
  • forbidden phrase: No graded ATS results are available today
  • required fields: date, opponent, closing spread, final score, ATS result, cover margin, sample size, as_of

This is not an abstract language problem.

It is a product continuity problem.

The answer is only safe if Kam carries the aggregate scope into the follow-up, resolves the drilldown, and names the sample that created the first claim.
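
Written as a fixture, that expectation might look roughly like this. The route names and forbidden phrase come from the spec above; the surrounding keys are illustrative, and the required fields are snake_cased for the sketch:

  # Illustrative fixture for the aggregate ATS drilldown follow-up.
  CASE = {
      "conversation": [
          "The Lakers and Thunder are 11-9 ATS in the sample.",  # prior Kam turn
          "Which games counted?",                                # user follow-up
      ],
      "expected_route": "get_betting_trend_game_details",
      "forbidden_routes": ["get_ats_board"],
      "forbidden_phrases": ["No graded ATS results are available today"],
      "required_fields": [
          "date", "opponent", "closing_spread", "final_score",
          "ats_result", "cover_margin", "sample_size", "as_of",
      ],
  }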

Product-value buckets

Kam evals should prove value in the questions users actually ask.

The core buckets are:

  1. Board
  2. My bets
  3. Movement
  4. Trends

Each bucket has a different failure mode.

Board answers can over-rank stale or incomplete boards.

My-bet answers can grade the wrong line.

Movement answers can describe line movement from the book's perspective instead of the bettor's perspective.

Trend answers can use percentages without a loaded denominator.

The product-value eval buckets

Bucket | User asks | Kam must prove
Board | Are favorites or dogs covering today? | Grounded board category summary with denominators only when loaded
My bets | I took Warriors +5.5. Am I covering? | Exact ticket state, entry line, score, cover margin, and one next action
Movement | Did the market move in my favor? | Bettor-side open-to-close value, not generic line-moved prose
Trends | Is LeBron covering spreads lately? | Grounded trend when supported, or a clean missing-data stop

Takeaway: The eval question should look like a bettor's question, not a developer label.

This is why Kam treats unsupported but valuable questions as eval material.

If a user asks a valuable question and the data is not ready, the correct behavior is not to delete the scenario. The correct behavior is to route to a missing-data guardrail.

A missing-data stop can be a good answer.

A fake edge is not.

Hard betting rules

Some rules should never be left to vibes.

For example:

  • moneyline grades as outright win or loss
  • spread grades from the user's entry line
  • spread margin is team_score + entry_spread - opponent_score
  • positive margin covered, zero pushed, negative failed to cover
  • do not grade an opening ticket with the closing line unless the user says they bet the close
  • do not use live odds to decide whether a pregame ticket got value, covered, or lost
  • use percent framing only when numerator and denominator are loaded

These rules make the evals product-specific.

They also make the product honest.

If a user says, "I took Warriors +5.5. Am I covering?" Kam should not answer with generic market commentary. It should lead with the current cover state, show the margin, and give one next action.

If the score or entry line is missing, Kam should ask for the exact missing field. It should not invent a result.
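
The spread rule is small enough to pin down directly. A minimal sketch of the margin formula above, not Kam's actual grader:

  # Minimal sketch of the spread margin rule; not Kam's actual grader.
  def grade_spread(team_score: int, opponent_score: int, entry_spread: float) -> str:
      margin = team_score + entry_spread - opponent_score
      if margin > 0:
          return "covered"
      if margin == 0:
          return "pushed"
      return "failed_to_cover"

  # "I took Warriors +5.5" with Warriors 110, opponent 114:
  # 110 + 5.5 - 114 = 1.5, so the ticket is currently covering.
  assert grade_spread(110, 114, 5.5) == "covered"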

The answer review loop

Deterministic evals catch structure.

Human review catches judgment.

Kam's review loop is straightforward:

answer
  -> review against checklist
  -> label mistake
  -> write ideal answer
  -> add contract, fixture, or replay
  -> rerun

The important part is the label.

"Bad answer" is not enough.

The label should name the product failure:

  • wrong_skill
  • missing_table
  • unsupported_causal_claim
  • ignored_preferred_book
  • stale_data_overconfidence
  • unsafe_to_rank
  • missing_next_action
  • asked_more_than_one_question

Labels turn taste into backlog.

They also stop prompt edits from becoming unstructured patches.

The Kam answer review order

  • Structural correctness: did the system route, load context, and use tools correctly?
  • Contract correctness: did the answer follow the skill rulebook?
  • Factual correctness: did the answer use grounded facts and freshness state?
  • Judgment quality: was it useful, appropriately cautious, and easy to act on?

Writing style comes last.

A polished wrong answer is still wrong.

Trace replay

An E2E eval report is useful once.

A replayable trace is useful forever.

Kam's trace loop looks like this:

production turn
  -> captured trace
  -> inspect in Kam Ops
  -> export fixture JSON
  -> replay against validator
  -> commit meaningful fixtures into eval history

A replayable trace should include the fields below (a minimal fixture shape is sketched after this list):

  • trace id
  • request id
  • surface
  • use case
  • skill id
  • model id
  • prompt profile
  • context pack
  • tool plan
  • controller facts
  • prompt contract
  • final answer
  • object refs
  • events
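
A minimal sketch of what such a fixture could look like. Only the field list comes from above; the key names, nesting, and placeholder values are assumptions:

  # One possible shape for a replayable trace fixture. Keys and nesting are assumed.
  TRACE_FIXTURE = {
      "trace_id": "trace-0001",
      "request_id": "req-0001",
      "surface": "chat",
      "use_case": "line_movement",
      "skill_id": "explain_line_move",          # illustrative skill name
      "model_id": "example-model",
      "prompt_profile": "default",
      "context_pack": {"selected_event": "evt-123", "selected_book": "book-1"},
      "tool_plan": ["get_line_history"],        # illustrative tool name
      "controller_facts": {"freshness": "fresh"},
      "prompt_contract": ["lead_with_cover_state", "one_next_action"],
      "final_answer": "placeholder answer text",
      "object_refs": ["event:evt-123"],
      "events": ["route_resolved", "tools_called", "answer_emitted"],
  }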

Trust receipt

What Kam should prove before confidence

A useful answer should leave a small receipt: route, scope, freshness, evidence, missing data, and confidence state.

Route: Line movement eval

Scope: Selected NBA game / spread market / opening and current line

Freshness: Current line updated within the accepted freshness window

Evidence loaded:

  • Opening spread is present
  • Current spread is present
  • Selected sportsbook is known
  • Game state is not final

Missing or caveated:

  • Injury source timing may be unavailable
  • Prediction-market comparison may be missing
  • Market-volume data may be unavailable

Status: Partial confidence until cause data is confirmed

It should also derive an answer-path artifact:

  • before: conversation state, selected event, selected book
  • routing: intent, route, reason, prefetch tool
  • tools: compact inputs and output summaries
  • truth objects: event, board, team, ticket, or watchlist objects used
  • answer: final text and answer shape
  • after: continuity state
  • validation: parity pass/fail and mismatches
  • performance: latency and step sequence

Kam's current replay mode is contract-only. It validates that the saved turn still has the required answer fields and that deterministic trace receipts still match the generated parity contract.

That is already valuable.

It means a production miss can become a saved case. It means the same mistake can fail tomorrow's build instead of becoming a Slack memory.
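
As a rough sketch, contract-only replay comes down to two questions: are the required answer fields still present, and do the saved parity receipts still match? The field and contract names below are assumed:

  # Rough sketch of a contract-only replay check: no model call, just
  # "does the saved turn still satisfy the current contract?"
  # Field and contract names are illustrative.
  def replay_contract_check(trace: dict, contract: dict) -> list[str]:
      failures = []
      for field in contract.get("required_answer_fields", []):
          if field not in trace.get("final_answer_fields", []):
              failures.append(f"missing answer field: {field}")
      expected = contract.get("expected_parity_receipts", {})
      actual = trace.get("parity_receipts", {})
      for fact_id, value in expected.items():
          if actual.get(fact_id) != value:
              failures.append(f"parity mismatch on {fact_id}")
      return failures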

Screen-chat parity

Sports research apps have a dangerous failure mode:

the screen says one thing and chat says another.

That cannot happen.

If Game Detail shows one spread, one freshness state, one source reference, or one market-alignment value, chat should not invent a second version of truth. It should explain the same product object.

That is why Kam uses parity checks around user-visible facts:

  • fact_id
  • display_value
  • coverage_status
  • source_refs
  • as_of

The target is simple:

The screen and chat should share one read-model truth.

Raw tool fallback is allowed only when the product read model is missing, stale, incomplete for the requested lens, or outside the hot-path contract. Even then, the fallback reason should be visible in the trace or answer.

That turns fallback from a hidden accident into an auditable product decision.
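
A parity check over those fields can be nearly mechanical. A sketch, assuming both surfaces can emit fact records keyed by fact_id:

  # Sketch of a screen-chat parity check over the user-visible fact fields above.
  # Assumes both surfaces expose fact records keyed by fact_id; names are illustrative.
  PARITY_FIELDS = ["display_value", "coverage_status", "source_refs", "as_of"]

  def parity_mismatches(screen_facts: dict, chat_facts: dict) -> list[str]:
      mismatches = []
      for fact_id, screen in screen_facts.items():
          chat = chat_facts.get(fact_id)
          if chat is None:
              mismatches.append(f"{fact_id}: missing from chat answer")
              continue
          for field in PARITY_FIELDS:
              if screen.get(field) != chat.get(field):
                  mismatches.append(f"{fact_id}: {field} differs between screen and chat")
      return mismatches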

AgentTask lifecycle is product quality

A correct final sentence does not compensate for a wrong task state.

If a task should wait for the user, it should not finish.

If the backend should retrieve an archive, Kam should not ask the user to provide backend-only data.

If a resume token is stale, wrong, terminal, or from the wrong provider continuation, it should be rejected.

These lifecycle checks matter because Kam is not only answering questions. It is managing research workflows.

For AgentTask-backed flows, evals should verify:

  • pause
  • resume
  • waiting for user
  • cancel
  • retry
  • done
  • stale-token rejection
  • wrong-task rejection
  • provider-switch rejection
  • normalized timeline selectors

That may sound operational, but users feel it directly.

Broken task state becomes repeated questions, lost context, or a product that looks like it forgot what it was doing.
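
One blunt way to test the lifecycle is an allowed-transition table. The state names follow the list above; the transition map itself is an assumption made for illustration:

  # Sketch of a lifecycle check: only transitions in ALLOWED may appear in a
  # task timeline. The transition map is assumed, not Kam's real state machine.
  ALLOWED = {
      "running": {"paused", "waiting_for_user", "retrying", "done", "cancelled"},
      "paused": {"running", "cancelled"},
      "waiting_for_user": {"running", "cancelled"},
      "retrying": {"running", "done", "cancelled"},
      "done": set(),       # terminal
      "cancelled": set(),  # terminal
  }

  def lifecycle_violations(timeline: list[str]) -> list[str]:
      violations = []
      for prev, nxt in zip(timeline, timeline[1:]):
          if nxt not in ALLOWED.get(prev, set()):
              violations.append(f"illegal transition: {prev} -> {nxt}")
      return violations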

Skill trials

A skill is not shippable just because it works once.

Kam tracks both pass@k and pass^k.

pass@k asks: did the agent succeed at least once?

pass^k asks: did every repeated run succeed?

For customer-facing chat, pass^k matters more.

One lucky run is not enough when the user is making a decision.
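
The two metrics are easy to state precisely over repeated runs of one case:

  # pass@k: at least one of k repeated runs passed.
  # pass^k: every one of the k repeated runs passed.
  def pass_at_k(results: list[bool]) -> bool:
      return any(results)

  def pass_hat_k(results: list[bool]) -> bool:
      return all(results)

  runs = [True, False, True]        # three trials of one skill case
  assert pass_at_k(runs) is True    # looks fine if one success is enough
  assert pass_hat_k(runs) is False  # fails the stricter customer-facing bar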

What production readiness should reward

  • Deterministic route and contract coverage: the first line of defense
  • Repeated high-risk flow pass^k: stability over luck
  • Trace replay promotion: real misses become tests
  • Human failure labels: taste becomes rules
  • Live E2E alone: useful, but too late on its own

Takeaway: A production eval loop should value repeatable structure and regression coverage more than one fluent live answer.

Trajectory evals

Single-turn evals are necessary.

They are not enough.

Users do not ask one perfect prompt and leave.

They move through journeys:

board -> open game -> event trends -> line move -> decision

or:

saved bet -> live state -> value movement -> why-you-liked-it review -> postgame lesson

Trajectory evals test whether Kam behaves like one coherent product across those turns.

The core rule is:

If Kam can resolve scope from loaded state or tools, act.
If not, ask one narrow question.

That rule sounds simple. It is hard in practice.

Kam has to preserve scope when the user continues. It has to reset scope when the user starts fresh. It has to avoid stale context substitution. It has to ask at most one narrow clarification when blocked. It has to avoid asking users for backend data the system should produce.
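
One way to encode that is a journey case with per-turn expectations, scored with pass^k at the family level. The route names and scope keys below are illustrative:

  # Illustrative multi-turn journey case. Route names and scope keys are assumed.
  JOURNEY = {
      "family": "board -> open game -> line move",
      "turns": [
          {"user": "What are the NBA odds today?",
           "expect": {"route": "get_board", "scope_reset": True}},
          {"user": "Open the Lakers game.",
           "expect": {"route": "open_game", "scope_carryover": ["board"]}},
          {"user": "Why did this line move?",
           "expect": {"route": "explain_line_move",
                      "scope_carryover": ["selected_event", "selected_book"],
                      "max_clarifying_questions": 1}},
      ],
  }

  def family_passes(turn_results: list[bool]) -> bool:
      # pass^k at the family level: one blocked turn breaks the journey
      return all(turn_results)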

Trajectory failures are product failures

Failure label | What happened | User impact
failed_scope_carryover | Follow-up lost the selected event or board | User repeats context
failed_scope_reset | New question incorrectly reused old context | User gets answer for wrong object
answered_before_object_resolution | Kam guessed before resolving the game, line, or ticket | Fake confidence
asked_more_than_one_question | Kam turned a narrow block into an interview | Friction
stale_context_substitution | Kam used old state because it was available | Unsafe answer
asked_user_for_backend_data | Kam asked for data the system should retrieve | Broken workflow

Takeaway: Trajectory evals test whether the product remembers, forgets, pauses, and resumes at the right time.

Family-level scoring uses pass^k.

One blocked turn breaks the journey.

That is strict by design. A multi-turn workflow is only as trustworthy as the step that loses scope.

Production readiness

The final question is not "did the latest eval pass?"

The better question is:

Is the eval system production ready?

Kam's production-readiness report looks for patterns:

  • failure taxonomy
  • worst skills and scenarios
  • repeatability and flake risk
  • trace replay coverage
  • personalization coverage
  • postdeploy canary readiness
  • production trace promotion into regression fixtures
  • scenario coverage across high-value user jobs

The rough score interpretation:

  • 90+: production-grade eval loop
  • 80-89: release-ready with monitoring
  • 70-79: solid predeploy framework
  • 55-69: needs hardening
  • below 55: missing core production eval coverage

The score is not a replacement for judgment.

It is a map.

It tells the team where the risk is hiding.

Scenario coverage

Kam does not need a giant random pile of prompts.

It needs coverage that matches real product jobs.

A narrow beta floor is around 60 unique scenarios. A production-level starting point is around 80. A broad target is around 150. The important part is balance:

Production scenario coverage target

Bucket | Minimum unique scenarios
Board | 20
My bets / ticket state | 20
Movement / CLV | 15
Trends / player / team | 10
System state / blocked / follow-up | 10

Takeaway: The prompt bank should represent the user jobs and failures that matter, not just the prompts that are easy to write.

The most valuable families are the obvious ones:

  • "What are the NBA odds today?"
  • "Tell me whether books and prediction markets agree today."
  • "Should I open this game?"
  • "Why did this line move?"
  • "What moved since I last checked?"
  • "FanDuel only."
  • "Just underdogs."
  • no games, stale games, missing prediction-market data, missing workspace
  • user correction paths where the user says the answer is stale or wrong

Simple prompts are not simple evals.

They are where most product trust is won or lost.

Postdeploy canaries

Local gates cannot catch every production issue.

They cannot fully prove environment health, production data shape, provider behavior, or live endpoint availability.

So Kam needs small postdeploy canaries:

  • daily board
  • open game
  • line move
  • market alignment
  • workspace delta
  • home research endpoint shape

A good canary is small enough to run every deploy and specific enough to fail for a real reason.

For example, the Home research canary checks that production rows have unique headlines, source badges, matching source counts, dive-deeper prompts, valid confidence, valid display shape, exactly one default-expanded row, and explicit risk notes when coverage is sparse.
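
A sketch of that row-shape canary, assuming each row is a record with roughly those fields; the exact keys are assumptions:

  # Sketch of the Home research row canary. Row keys are assumed for illustration.
  def home_research_canary(rows: list[dict]) -> list[str]:
      problems = []
      headlines = [r.get("headline") for r in rows]
      if len(headlines) != len(set(headlines)):
          problems.append("duplicate headlines")
      if sum(1 for r in rows if r.get("default_expanded")) != 1:
          problems.append("not exactly one default-expanded row")
      for r in rows:
          if not r.get("source_badges"):
              problems.append("row missing source badges")
          elif len(r["source_badges"]) != r.get("source_count"):
              problems.append("source count does not match badges")
          if not r.get("dive_deeper_prompt"):
              problems.append("row missing dive-deeper prompt")
          if r.get("confidence") not in {"low", "medium", "high"}:
              problems.append("invalid confidence value")
          if r.get("coverage") == "sparse" and not r.get("risk_note"):
              problems.append("sparse coverage without an explicit risk note")
      return problems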

That is not a chatbot eval.

That is a product-health eval.

Visual evals

Visuals are useful only when they clarify grounded numeric data.

Kam should not add a chart because an answer is long.

It should add a chart when the chart helps the user inspect movement, deltas, or comparisons.

The visual pack checks:

  • one compact chart when numeric sports data supports a visual
  • no chart when game, sport, workspace, or last-checked marker is missing
  • line or bar charts for movement, deltas, and comparisons
  • no pie charts for odds movement
  • title, takeaway, and text explanation for accessibility
  • a next action after the visual

That is the right standard.

Visual polish without grounded data is decoration. Grounded visuals shorten decision time.

What this means for product velocity

The best eval systems make teams faster, not slower.

They do that by making failures smaller.

If a bug is a route problem, fix the route.

If it is a missing read model, fix the data contract.

If it is a task-state problem, fix the lifecycle.

If it is a writing problem, update the answer contract.

If it is a repeated judgment miss, add a review label and promote it into a fixture.

This prevents the default AI-product failure mode:

bad answer -> add prompt text -> prompt gets bigger -> behavior gets less legible

Kam's eval system should create the opposite loop:

bad answer -> labeled failure -> smallest source fix -> regression test -> readiness report

That is how product quality compounds.

The operating principle

Kam is not trying to make an AI that always sounds confident.

Kam is trying to make a research system that knows when confidence has been earned.

That requires a different kind of eval culture:

  • deterministic before judged
  • route before answer
  • facts before prose
  • traces before memory
  • labels before opinions
  • repeated pass before lucky pass
  • production misses become tests
  • user-value buckets shape the scenario bank

The point is not to remove human judgment.

The point is to put human judgment in the right place.

Humans should decide what good research feels like, which failures matter, and which answer shape users trust. Evals should preserve those decisions so the product does not relearn the same lessons every week.

Eval architecture

Trust is a system property

Kam's answer quality comes from resolver gates, read-model contracts, trace replay, human labels, and production readiness reports working together. The visible answer is only the final artifact.

The short version

Good evals catch one bad answer.

Production evals catch patterns.

Kam needs both.

The deterministic ladder protects the product before model calls. Skill trials prove repeated stability. Trace replay turns real behavior into fixtures. Human review turns taste into labels. Trajectory evals test whether the product survives the journey. Readiness reports show where the risk is concentrated.

That is how Kam gets better without pretending that fluency equals trust.
