Trust
How Kam Checks an Answer Before You Trust It


Kam AI
Product and research

Kam checks answers before users trust them.
First, it checks whether the question was routed to the right kind of answer.
Then it checks the data, tools, freshness, saved read, and answer shape.
If an answer is wrong, Kam labels the mistake and turns it into a future test.
Most AI products test whether the app responds.
Kam has to test whether the response deserves trust.
That is a different job.
Customers should not have to know the word "eval."
They should feel something simpler:
Kam checked the move before it asked me to care.
Sports-market research is full of questions that sound simple but hide risk:
Why did this line move?
Is my bet covering?
Any gap left?
What moved since I last checked?
Which games counted?
Those questions are short because the user assumes the product already knows the board, the selected event, the market, the book, the ticket, the prior thesis, and the freshness state. If the system guesses at any of that context, the answer can sound confident while being wrong.
That is why Kam's eval system is not one giant "did the model sound good?" test.
It is a ladder.
The ladder starts with deterministic checks that fail close to the source of the problem. It ends with live answer review and production readiness. The goal is not to make evals impressive. The goal is to make bad answers expensive to ship.
Generic AI evals often start at the final answer.
Was the answer helpful?
Was the answer accurate?
Was the answer concise?
Those questions matter, but they are too late. By the time the final answer exists, the system has already made several product decisions: which skill handled the question, which read model it loaded, which tools it called, and how fresh the underlying data had to be.
If those steps are wrong, the final answer review becomes a crime-scene investigation.
Kam needs the eval to fail earlier.
If a user asks, "Which games counted?" after an aggregate ATS trend, the eval should not wait for a human to notice the answer is vague. It should already know that the route must drill down into the historical games behind the aggregate. It should know the forbidden route. It should know the required fields: date, opponent, closing spread, final score, ATS result, cover margin, sample size, and as_of.
That is the difference between testing a chatbot and testing a research system.
Kam's eval ladder moves from deterministic structure to live judgment.
user prompt
-> resolver reachability
-> route
-> read-model plan
-> screen-chat parity
-> tool plan
-> prompt contract
-> AgentTask lifecycle
-> skill trial
-> fixture E2E
-> trace replay
-> answer-path family
-> live E2E
-> human or judged review
The order matters.
Do not ask a model judge whether an answer was useful if the route was wrong.
Do not ask a human whether the writing was smooth if the answer used stale data too confidently.
Do not celebrate a passing live run if the same prompt fails on the second try.
The ladder is designed to separate "the stack functioned" from "the product helped a user make a safer decision."
Visual artifact
A useful eval suite should fail as close to the real problem as possible. Do not wait for a final-answer judge when route, data, or freshness have already failed. Four checks should have passed first:
The prompt reached the right skill instead of a generic answer path.
The answer had the right event, market, line, score, source, or saved read.
Stale, delayed, missing, and unsafe-to-rank states were visible before confidence.
The final response explained uncertainty and gave a practical next check.
What each eval layer protects
Takeaway: Kam evals start with deterministic product facts, then move toward live answer quality.
Kam uses a simple eval vocabulary internally: case, experiment, evaluator, failure label, and next action.
This vocabulary prevents vague debates.
Instead of saying "the eval failed," the team can say:
Case: aggregate ATS follow-up
Experiment: answer-path family
Evaluator: route + required fields
Failure: wrong_skill
Next action: add route expectation and forbidden route
That is more useful than a score with no diagnosis.
The most important Kam evals do not require a model call.
That is intentional.
If a route is broken, the model cannot fix it. If a skill capsule is unreachable, the answer will never use it. If the tool policy is incomplete, the system may wander into the wrong data source. If the product read model is stale, the model may write beautifully about the wrong truth.
So Kam starts with the boring checks.
Can the resolver reach the skill?
Does the priority table know when to use it?
Does the route have fallback behavior?
Does the skill declare its required object families?
Does it know its freshness SLA?
Does it know when to stop and ask a user for scope?
Those checks are not glamorous, but they are high leverage.
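As an illustration, here is a minimal sketch of such a pre-model gate in TypeScript. The SkillCapsule shape, its field names, and the resolver table are hypothetical, not Kam's actual schema; the point is that every check runs before a model is ever called.

```typescript
// Hypothetical skill declaration; field names are illustrative only.
interface SkillCapsule {
  id: string;
  routes: string[];                 // routes the resolver can reach this skill through
  requiredObjectFamilies: string[]; // e.g. "game", "market", "ticket"
  freshnessSlaSeconds: number;      // max data age before the skill must caveat
  asksForScopeWhenAmbiguous: boolean;
}

// Deterministic gate: fails before any model call if the skill is misconfigured.
// resolverTable maps a route name to the skill id it resolves to.
function gateSkill(skill: SkillCapsule, resolverTable: Map<string, string>): string[] {
  const failures: string[] = [];
  if (!skill.routes.some((r) => resolverTable.get(r) === skill.id)) {
    failures.push(`unreachable: no resolver route points at ${skill.id}`);
  }
  if (skill.requiredObjectFamilies.length === 0) {
    failures.push("no declared object families: skill cannot state what data it needs");
  }
  if (skill.freshnessSlaSeconds <= 0) {
    failures.push("missing freshness SLA: staleness can never be flagged");
  }
  if (!skill.asksForScopeWhenAmbiguous) {
    failures.push("no scope question: skill will guess instead of asking");
  }
  return failures; // empty array = gate passed
}
```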
The practical split
Deterministic gates, no model call: resolver reachability, route expectations, tool policy, read-model contracts, prompt-contract assembly, and task lifecycle.
Live evals, model in the loop: live answer generation, provider behavior, tool-call adherence, writing contract, and response shape.
The ongoing loop: human review, failure labels, trace promotion, readiness scoring, postdeploy canaries, and recurring regression packs.
Takeaway: The earlier a regression is caught, the cheaper it is to fix and the less likely it is to become prompt sprawl.
Consider this user path:
Kam: The Lakers and Thunder are 11-9 ATS in the sample.
User: Which games counted?
A weak system may route the follow-up back to today's ATS board because it sees "games" and "ATS."
That is wrong.
The user is asking for the historical drilldown behind a previous aggregate answer.
The eval should specify:
The required route: get_betting_trend_game_details.
The forbidden route: get_ats_board.
The missing-data copy when nothing is graded: "No graded ATS results are available today."
The required as_of stamp.
This is not an abstract language problem.
It is a product continuity problem.
The answer is only safe if Kam carries the aggregate scope into the follow-up, resolves the drilldown, and names the sample that created the first claim.
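A sketch of what that expectation could look like as a deterministic test case. The route names and required fields come from the scenario above; the AnswerPathCase shape and checker are hypothetical.

```typescript
// Sketch of an answer-path expectation for the "Which games counted?" follow-up.
interface AnswerPathCase {
  prompt: string;
  requiredRoute: string;
  forbiddenRoutes: string[];
  requiredFields: string[];
}

const atsDrilldownCase: AnswerPathCase = {
  prompt: "Which games counted?",
  requiredRoute: "get_betting_trend_game_details",
  forbiddenRoutes: ["get_ats_board"],
  requiredFields: [
    "date", "opponent", "closing_spread", "final_score",
    "ats_result", "cover_margin", "sample_size", "as_of",
  ],
};

// Deterministic check: fails on route before anyone judges the prose.
function checkAnswerPath(
  c: AnswerPathCase,
  actualRoute: string,
  answerFields: Set<string>,
): string[] {
  const failures: string[] = [];
  if (c.forbiddenRoutes.includes(actualRoute)) {
    failures.push(`wrong_skill: took forbidden route ${actualRoute}`);
  } else if (actualRoute !== c.requiredRoute) {
    failures.push(`wrong_skill: expected ${c.requiredRoute}, got ${actualRoute}`);
  }
  for (const f of c.requiredFields) {
    if (!answerFields.has(f)) failures.push(`missing field: ${f}`);
  }
  return failures;
}
```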
Kam evals should prove value in the questions users actually ask.
The core buckets are board answers, my-bet answers, movement answers, and trend answers.
Each bucket has a different failure mode.
Board answers can over-rank stale or incomplete boards.
My-bet answers can grade the wrong line.
Movement answers can describe line movement from the book's perspective instead of the bettor's perspective.
Trend answers can use percentages without a loaded denominator.
The product-value eval buckets
Takeaway: The eval question should look like a bettor's question, not a developer label.
This is why Kam treats unsupported but valuable questions as eval material.
If a user asks a valuable question and the data is not ready, the correct behavior is not to delete the scenario. The correct behavior is to route to a missing-data guardrail.
A missing-data stop can be a good answer.
A fake edge is not.
Some rules should never be left to vibes.
For example, the cover margin on a spread ticket is team_score + entry_spread - opponent_score.
These rules make the evals product-specific.
They also make the product honest.
If a user says, "I took Warriors +5.5. Am I covering?" Kam should not answer with generic market commentary. It should lead with the current cover state, show the margin, and give one next action.
If the score or entry line is missing, Kam should ask for the exact missing field. It should not invent a result.
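A minimal sketch of that grading rule, using the cover-margin formula above. The function names and the example score are illustrative; the arithmetic is the product rule.

```typescript
// Cover state for a spread ticket, using the margin rule above:
// cover_margin = team_score + entry_spread - opponent_score.
type CoverState = "covering" | "push" | "not_covering";

function coverMargin(teamScore: number, entrySpread: number, opponentScore: number): number {
  return teamScore + entrySpread - opponentScore;
}

function coverState(margin: number): CoverState {
  if (margin > 0) return "covering";
  if (margin === 0) return "push";
  return "not_covering";
}

// "I took Warriors +5.5" with a hypothetical 110-114 score:
// margin = 110 + 5.5 - 114 = +1.5, so the ticket is covering.
// If the score or entry line is missing, callers must ask for that
// exact field instead of substituting a default.
const margin = coverMargin(110, 5.5, 114);
console.log(coverState(margin), margin); // "covering" 1.5
```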
Deterministic evals catch structure.
Human review catches judgment.
Kam's review loop is straightforward:
answer
-> review against checklist
-> label mistake
-> write ideal answer
-> add contract, fixture, or replay
-> rerun
The important part is the label.
"Bad answer" is not enough.
The label should name the product failure:
wrong_skill
missing_table
unsupported_causal_claim
ignored_preferred_book
stale_data_overconfidence
unsafe_to_rank
missing_next_action
asked_more_than_one_question
Labels turn taste into backlog.
They also stop prompt edits from becoming unstructured patches.
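For illustration, those labels can live in code as a closed vocabulary, so a review cannot file a miss without naming the product failure. The ReviewRecord shape is hypothetical.

```typescript
// The failure labels above as a closed vocabulary; "bad answer" is not a member.
type FailureLabel =
  | "wrong_skill"
  | "missing_table"
  | "unsupported_causal_claim"
  | "ignored_preferred_book"
  | "stale_data_overconfidence"
  | "unsafe_to_rank"
  | "missing_next_action"
  | "asked_more_than_one_question";

// Hypothetical review record: every labeled miss carries its ideal answer and next fix.
interface ReviewRecord {
  traceId: string;
  label: FailureLabel;
  idealAnswer: string;
  nextAction: string; // e.g. "add route expectation and forbidden route"
}
```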
Writing style comes last.
A polished wrong answer is still wrong.
An E2E eval report is useful once.
A replayable trace is useful forever.
Kam's trace loop looks like this:
production turn
-> captured trace
-> inspect in Kam Ops
-> export fixture JSON
-> replay against validator
-> commit meaningful fixtures into eval history
A replayable trace should include the fields that make the answer auditable.
Trust receipt
A useful answer should leave a small receipt: route, scope, freshness, evidence, missing data, and confidence state.
Route: Line movement eval
Scope: Selected NBA game / spread market / opening and current line
Freshness: Current line updated within the accepted freshness window
Evidence: Loaded, missing, or caveated
It should also derive an answer-path artifact.
Kam's current replay mode is contract-only. It validates that the saved turn still has the required answer fields and that deterministic trace receipts still match the generated parity contract.
That is already valuable.
It means a production miss can become a saved case. It means the same mistake can fail tomorrow's build instead of becoming a Slack memory.
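A sketch of what contract-only replay can validate, assuming a saved-turn fixture with answer fields and deterministic receipts. Field names are illustrative, not Kam's fixture schema.

```typescript
// Contract-only replay: re-validate a saved turn's structure, not its prose.
interface SavedTurn {
  fields: Record<string, unknown>;  // answer fields captured at production time
  receipts: Record<string, string>; // deterministic trace receipts (route, scope, as_of, ...)
}

interface ParityContract {
  requiredFields: string[];
  expectedReceipts: Record<string, string>;
}

function replayContractOnly(turn: SavedTurn, contract: ParityContract): string[] {
  const failures: string[] = [];
  for (const f of contract.requiredFields) {
    if (!(f in turn.fields)) failures.push(`missing answer field: ${f}`);
  }
  for (const [key, expected] of Object.entries(contract.expectedReceipts)) {
    if (turn.receipts[key] !== expected) {
      failures.push(
        `receipt drift on ${key}: expected "${expected}", got "${turn.receipts[key] ?? "absent"}"`,
      );
    }
  }
  return failures; // a nonempty result fails tomorrow's build
}
```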
Sports research apps have a dangerous failure mode:
the screen says one thing and chat says another.
That cannot happen.
If Game Detail shows one spread, one freshness state, one source reference, or one market-alignment value, chat should not invent a second version of truth. It should explain the same product object.
That is why Kam uses parity checks around user-visible facts:
fact_id
display_value
coverage_status
source_refs
as_of
The target is simple:
The screen and chat should share one read-model truth.
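As a sketch, a parity check over those fields might look like this. The field names come from the list above; the comparison itself is an assumed implementation.

```typescript
// Parity check between a screen fact and the same fact as chat would state it.
interface UserVisibleFact {
  fact_id: string;
  display_value: string;
  coverage_status: string;
  source_refs: string[];
  as_of: string;
}

function factsMatch(screen: UserVisibleFact, chat: UserVisibleFact): boolean {
  return (
    screen.fact_id === chat.fact_id &&
    screen.display_value === chat.display_value &&
    screen.coverage_status === chat.coverage_status &&
    screen.as_of === chat.as_of &&
    screen.source_refs.length === chat.source_refs.length &&
    screen.source_refs.every((ref, i) => ref === chat.source_refs[i])
  );
}
```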
Raw tool fallback is allowed only when the product read model is missing, stale, incomplete for the requested lens, or outside the hot-path contract. Even then, the fallback reason should be visible in the trace or answer.
That turns fallback from a hidden accident into an auditable product decision.
A correct final sentence does not compensate for a wrong task state.
If a task should wait for the user, it should not finish.
If the backend should retrieve an archive, Kam should not ask the user to provide backend-only data.
If a resume token is stale, wrong, terminal, or from the wrong provider continuation, it should be rejected.
These lifecycle checks matter because Kam is not only answering questions. It is managing research workflows.
For AgentTask-backed flows, evals should verify exactly that lifecycle discipline: terminal states, waiting states, backend retrieval responsibilities, and resume-token validity.
That may sound operational, but users feel it directly.
Broken task state becomes repeated questions, lost context, or a product that looks like it forgot what it was doing.
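For illustration, here is a resume-token gate covering the four rejection cases named above. Task states, token fields, and the age threshold are hypothetical.

```typescript
// Resume-token gate for AgentTask-backed flows.
type TaskState = "running" | "waiting_for_user" | "done" | "failed";

interface ResumeToken {
  taskId: string;
  provider: string;
  issuedAt: number; // epoch ms
}

function canResume(
  token: ResumeToken,
  task: { id: string; provider: string; state: TaskState },
  now: number,
  maxAgeMs: number,
): { ok: boolean; reason?: string } {
  if (token.taskId !== task.id) return { ok: false, reason: "wrong task" };
  if (token.provider !== task.provider) return { ok: false, reason: "wrong provider continuation" };
  if (task.state === "done" || task.state === "failed") return { ok: false, reason: "terminal task" };
  if (now - token.issuedAt > maxAgeMs) return { ok: false, reason: "stale token" };
  return { ok: true };
}
```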
A skill is not shippable just because it works once.
Kam tracks both pass@k and pass^k.
pass@k asks: did the agent succeed at least once?
pass^k asks: did every repeated run succeed?
For customer-facing chat, pass^k matters more.
One lucky run is not enough when the user is making a decision.
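The two metrics are easy to state precisely. A minimal sketch over k boolean run outcomes:

```typescript
// pass@k: at least one of k repeated runs succeeded.
// pass^k: every one of k repeated runs succeeded.
function passAtK(runs: boolean[]): boolean {
  return runs.some(Boolean);
}

function passHatK(runs: boolean[]): boolean {
  return runs.every(Boolean);
}

const runs = [true, false, true]; // one flaky failure out of three
console.log(passAtK(runs));  // true  — fine if one lucky run is enough
console.log(passHatK(runs)); // false — fails the customer-facing bar
```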
What production readiness should reward
Deterministic route and contract coverage: the first line of defense.
Repeated high-risk flow pass^k: stability over luck.
Trace replay promotion: real misses become tests.
Human failure labels: taste becomes rules.
Live E2E alone: useful, but too late on its own.
Takeaway: A production eval loop should value repeatable structure and regression coverage more than one fluent live answer.
Single-turn evals are necessary.
They are not enough.
Users do not ask one perfect prompt and leave.
They move through journeys:
board -> open game -> event trends -> line move -> decision
or:
saved bet -> live state -> value movement -> why-you-liked-it review -> postgame lesson
Trajectory evals test whether Kam behaves like one coherent product across those turns.
The core rule is:
If Kam can resolve scope from loaded state or tools, act.
If not, ask one narrow question.
That rule sounds simple. It is hard in practice.
Kam has to preserve scope when the user continues. It has to reset scope when the user starts fresh. It has to avoid stale context substitution. It has to ask at most one narrow clarification when blocked. It has to avoid asking users for backend data the system should produce.
Trajectory failures are product failures
Takeaway: Trajectory evals test whether the product remembers, forgets, pauses, and resumes at the right time.
Family-level scoring uses pass^k.
One blocked turn breaks the journey.
That is strict by design. A multi-turn workflow is only as trustworthy as the step that loses scope.
The final question is not "did the latest eval pass?"
The better question is:
Is the eval system production ready?
Kam's production-readiness report looks for patterns across those layers and condenses them into a rough readiness score.
The score is not a replacement for judgment.
It is a map.
It tells the team where the risk is hiding.
Kam does not need a giant random pile of prompts.
It needs coverage that matches real product jobs.
A narrow beta floor is around 60 unique scenarios. A production-level starting point is around 80. A broad target is around 150. The important part is balance.
Production scenario coverage target
Takeaway: The prompt bank should represent the user jobs and failures that matter, not just the prompts that are easy to write.
The most valuable families are the obvious ones: boards, my-bet states, line moves, trends, and missing-data stops.
Simple prompts are not simple evals.
They are where most product trust is won or lost.
Local gates cannot catch every production issue.
They cannot fully prove environment health, production data shape, provider behavior, or live endpoint availability.
So Kam needs small postdeploy canaries.
A good canary is small enough to run every deploy and specific enough to fail for a real reason.
For example, the Home research canary checks that production rows have unique headlines, source badges, matching source counts, dive-deeper prompts, valid confidence, valid display shape, exactly one default-expanded row, and explicit risk notes when coverage is sparse.
That is not a chatbot eval.
That is a product-health eval.
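A sketch of those row checks as code, assuming a hypothetical ResearchRow shape; the assertions mirror the list above.

```typescript
// Hypothetical Home research row; field names are illustrative only.
interface ResearchRow {
  headline: string;
  sourceBadges: string[];
  sourceCount: number;
  diveDeeperPrompt: string;
  confidence: number; // assumed to be 0..1
  defaultExpanded: boolean;
  coverageSparse: boolean;
  riskNote?: string;
}

// Small enough to run every deploy; each failure names a real reason.
function homeResearchCanary(rows: ResearchRow[]): string[] {
  const failures: string[] = [];
  const headlines = new Set(rows.map((r) => r.headline));
  if (headlines.size !== rows.length) failures.push("duplicate headlines");
  if (rows.filter((r) => r.defaultExpanded).length !== 1) {
    failures.push("expected exactly one default-expanded row");
  }
  for (const r of rows) {
    if (r.sourceBadges.length === 0) failures.push(`no source badges: ${r.headline}`);
    if (r.sourceBadges.length !== r.sourceCount) failures.push(`source count mismatch: ${r.headline}`);
    if (!r.diveDeeperPrompt) failures.push(`missing dive-deeper prompt: ${r.headline}`);
    if (r.confidence < 0 || r.confidence > 1) failures.push(`invalid confidence: ${r.headline}`);
    if (r.coverageSparse && !r.riskNote) failures.push(`sparse coverage without risk note: ${r.headline}`);
  }
  return failures;
}
```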
Visuals are useful only when they clarify grounded numeric data.
Kam should not add a chart because an answer is long.
It should add a chart when the chart helps the user inspect movement, deltas, or comparisons.
The visual pack checks for exactly that grounding.
That is the right standard.
Visual polish without grounded data is decoration. Grounded visuals shorten decision time.
The best eval systems make teams faster, not slower.
They do that by making failures smaller.
If a bug is a route problem, fix the route.
If it is a missing read model, fix the data contract.
If it is a task-state problem, fix the lifecycle.
If it is a writing problem, update the answer contract.
If it is a repeated judgment miss, add a review label and promote it into a fixture.
This prevents the default AI-product failure mode:
bad answer -> add prompt text -> prompt gets bigger -> behavior gets less legible
Kam's eval system should create the opposite loop:
bad answer -> labeled failure -> smallest source fix -> regression test -> readiness report
That is how product quality compounds.
Kam is not trying to make an AI that always sounds confident.
Kam is trying to make a research system that knows when confidence has been earned.
That requires a different kind of eval culture.
The point is not to remove human judgment.
The point is to put human judgment in the right place.
Humans should decide what good research feels like, which failures matter, and which answer shape users trust. Evals should preserve those decisions so the product does not relearn the same lessons every week.
Eval architecture
Kam's answer quality comes from resolver gates, read-model contracts, trace replay, human labels, and production readiness reports working together. The visible answer is only the final artifact.
Good evals catch one bad answer.
Production evals catch patterns.
Kam needs both.
The deterministic ladder protects the product before model calls. Skill trials prove repeated stability. Trace replay turns real behavior into fixtures. Human review turns taste into labels. Trajectory evals test whether the product survives the journey. Readiness reports show where the risk is concentrated.
That is how Kam gets better without pretending that fluency equals trust.