Tasks & Grading
The shared contract that makes 17 heterogeneous tasks gradeable and both lanes comparable: every task returns a small structured object validated by Zod and checked by a deterministic grader against known ground truth. Never grade open-ended prose.
The pieces
A task is a TaskType (src/shared/contract.ts) with five responsibilities:
interface TaskType {
  id: string;                           // 'damage-assessment'
  label: string;                        // 'Damage Assess'
  icon: string;                         // '📦'
  modality: 'vision' | 'document' | 'text' | 'video';
  difficulty: 1 | 2 | 3;                // drives points + the visible speed gap
  outputSchema: ZodSchema;              // THE single source of truth for this task
  buildPrompt(s): Promise<ChatMessages>; // multimodal-aware prompt assembly
  grade(output, truth): GradeResult;    // deterministic comparison
  focusFields?: FocusFieldSpec[];       // which rows the focus card renders
  humanControls?: HumanControl[];       // the "I Wanna Play" overlay inputs
}
The Zod schema is the single source of truth. From it derives: the prompt's
format instruction, the parser, the grader's expected fields, and (via
focusFields) which rows light up on the focus card. Define the schema once and
everything else follows.
The 17 shipped tasks
Grouped by modality (the "warehouse sortation line" world):
| Modality | Tasks |
|---|---|
| Vision | label-parse, damage-assessment, hazmat-detection, seal-tamper, dim-weight, pallet-check, handwritten-label |
| Document | customs-invoice, manifest-recon, docs-completeness |
| Text | tariff-classification, exception-routing, address-validation, carrier-select, sla-risk, rma-disposition, restricted-screening |
Each is a config object in src/tasks/types/index.ts, ~15 lines, wiring a
schema (src/tasks/schemas.ts) to a grader (src/tasks/graders.ts).
How grading works
Each grader is a pure function (output, truth, difficulty) => GradeResult that
compares the agent's parsed output to ground truth field-by-field:
interface GradeResult {
  correct: boolean;    // did it match closely enough to count?
  partial: number;     // 0..1, fraction of fields correct
  scoreDelta: number;  // partial × difficulty × BASE_POINTS
  detail: string;      // readout for the focus card
  // per-field breakdown; the `unknown` value types are an assumption
  fields?: { key: string; label: string; ok: boolean; expected?: unknown; got?: unknown }[];
  correctVerdict?: Verdict;
  correctSummary?: string;
}
Graders are tolerant by design: comparisons ignore case and whitespace. Because the same model runs on both lanes by default, any format quirk hits both lanes equally, so this tolerance cannot bias the race (a bonus of the same-model default).
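As a rough illustration of the field-by-field comparison and the scoreDelta formula, here is a minimal sketch. It is not the code in src/tasks/graders.ts: the value of BASE_POINTS, the helper names looseEq and gradeFields, and the rule that correct means "every field matched" are all assumptions made for this example.

```typescript
// Hypothetical value: the docs name BASE_POINTS but do not state it.
const BASE_POINTS = 100;

type Difficulty = 1 | 2 | 3;

interface FieldResult {
  key: string;
  ok: boolean;
  expected?: unknown;
  got?: unknown;
}

interface SketchGradeResult {
  correct: boolean;
  partial: number;
  scoreDelta: number;
  detail: string;
  fields: FieldResult[];
}

// Tolerant comparison: strings ignore case, surrounding whitespace, and
// runs of internal whitespace. Everything else must be strictly equal.
function looseEq(a: unknown, b: unknown): boolean {
  if (typeof a === 'string' && typeof b === 'string') {
    const norm = (s: string): string => s.trim().replace(/\s+/g, ' ').toLowerCase();
    return norm(a) === norm(b);
  }
  return a === b;
}

// Pure function: compares each ground-truth field against the output.
function gradeFields(
  output: Record<string, unknown>,
  truth: Record<string, unknown>,
  difficulty: Difficulty,
): SketchGradeResult {
  const fields: FieldResult[] = Object.keys(truth).map((key) => ({
    key,
    ok: looseEq(output[key], truth[key]),
    expected: truth[key],
    got: output[key],
  }));
  const hits = fields.filter((f) => f.ok).length;
  const partial = fields.length === 0 ? 0 : hits / fields.length;
  return {
    correct: partial === 1, // assumption: "correct" means all fields matched
    partial,
    scoreDelta: partial * difficulty * BASE_POINTS,
    detail: `${hits}/${fields.length} fields correct`,
    fields,
  };
}

// Two of three fields match (the string differs only in case and padding).
const result = gradeFields(
  { damageType: ' Crushed ', severity: 4, action: 'accept' },
  { damageType: 'crushed', severity: 4, action: 'refuse' },
  2,
);
console.log(result.detail, result.scoreDelta);
```

Because the function depends only on its arguments, the same output and truth always produce the same GradeResult, which is what makes the two lanes comparable.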
A subtle correctness rule: boolEq
Boolean fields use a dedicated boolEq() comparator, not !!a === !!b. Why? A
missing boolean field must NOT count as "correct false"; otherwise an empty
or garbage output would score free points on every false-valued truth field. The
test suite caught this bug, and the fix is:
function boolEq(a: unknown, b: unknown): boolean {
  if (typeof a !== 'boolean') return false; // missing ≠ correct
  return a === !!b;
}
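To make the failure mode concrete, the sketch below grades an empty output against two false-valued truth fields with both comparators. The truth object and the count helper are made up for this illustration; boolEq is repeated from above so the snippet runs on its own.

```typescript
// Naive comparator: coerces both sides, so undefined is treated as false.
function naiveBoolEq(a: unknown, b: unknown): boolean {
  return !!a === !!b;
}

// Strict comparator, as defined above: a non-boolean output never matches.
function boolEq(a: unknown, b: unknown): boolean {
  if (typeof a !== 'boolean') return false;
  return a === !!b;
}

// Hypothetical ground truth with two false-valued boolean fields.
const truth: Record<string, boolean> = { damaged: false, hazmat: false };

// An empty object stands in for a missing or unparseable model output.
const emptyOutput: Record<string, unknown> = {};

// Counts how many truth fields the given comparator accepts.
const count = (eq: (a: unknown, b: unknown) => boolean): number =>
  Object.keys(truth).filter((k) => eq(emptyOutput[k], truth[k])).length;

const naiveHits = count(naiveBoolEq); // both missing fields "match" false
const strictHits = count(boolEq);     // missing fields earn nothing

console.log({ naiveHits, strictHits });
```

With the naive comparator, an output that says nothing at all is graded as fully correct on these fields. The strict version only rewards a field the output actually asserted.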
Some fields get tolerance bands instead of an exact match: numeric fields (e.g.
dims ±1 cm, weight ±0.5 kg) and ETA (±2 h). See src/tasks/graders.ts for each
task's rules.
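A tolerance band can be sketched as a pair of small helpers. The names withinBand and etaWithin are hypothetical, and two details are assumptions rather than confirmed behavior: the band is inclusive, and ETAs are ISO 8601 strings. The actual rules live in src/tasks/graders.ts.

```typescript
// True when `got` is a finite number within ±band of `expected` (inclusive).
function withinBand(got: unknown, expected: number, band: number): boolean {
  if (typeof got !== 'number' || !Number.isFinite(got)) return false;
  return Math.abs(got - expected) <= band;
}

// True when two timestamps parse successfully and are at most `hours` apart.
// Assumes ISO 8601 strings, which Date.parse handles.
function etaWithin(got: unknown, expected: string, hours: number): boolean {
  if (typeof got !== 'string') return false;
  const g = Date.parse(got);
  const e = Date.parse(expected);
  if (Number.isNaN(g) || Number.isNaN(e)) return false;
  return Math.abs(g - e) <= hours * 60 * 60 * 1000;
}

const lengthOk = withinBand(40.8, 40, 1);   // 0.8 cm off, inside ±1 cm
const weightOk = withinBand(12.9, 12, 0.5); // 0.9 kg off, outside ±0.5 kg
const etaOk = etaWithin('2024-05-01T14:30:00Z', '2024-05-01T13:00:00Z', 2); // 1.5 h apart

console.log({ lengthOk, weightOk, etaOk });
```

Both helpers reject missing or wrongly typed values outright, for the same reason boolEq does: an absent field should never fall inside a band by accident.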
The scenario pool
Scenarios are data, separate from code — one JSON file per task in
data/scenarios/<task-id>.json, 111 total. Each scenario is truth-first: the
ground-truth object was decided first, then the input written to match.
{
  "id": "dm-2",
  "taskTypeId": "damage-assessment",
  "difficulty": 2,
  "adversarial": true,
  "input": { "imageUrl": "/data/assets/damage/dm-2.png", "text": "..." },
  "groundTruth": { "damaged": true, "damageType": "crushed", "severity": 4, "action": "refuse" },
  "correctOutcome": { "verdict": "refuse", "pass": false, "summary": "Badly crushed corner — refuse." }
}
The loader (src/data/scenarios.ts) imports every *.json via
import.meta.glob and Zod-validates each scenario against its task schema at
load time — a malformed scenario fails the build, never reaching the demo.
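The fail-fast idea can be sketched without the real loader. The example below swaps in a hand-written structural check for the Zod validation and takes a plain record of path to parsed entries in place of the import.meta.glob result, so it runs outside a bundler. It checks only the scenario envelope, not groundTruth against the task's schema. The names checkScenario and loadScenarios, the assumption that each file holds an array of scenarios, and the optional adversarial flag are all illustrative guesses.

```typescript
interface Scenario {
  id: string;
  taskTypeId: string;
  difficulty: 1 | 2 | 3;
  adversarial?: boolean; // assumed optional
  input: Record<string, unknown>;
  groundTruth: Record<string, unknown>;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

// Hand-written stand-in for the Zod check. Returns a reason on failure,
// or null when the envelope looks well-formed.
function checkScenario(raw: unknown): string | null {
  if (!isRecord(raw)) return 'not an object';
  if (typeof raw['id'] !== 'string') return 'missing id';
  if (typeof raw['taskTypeId'] !== 'string') return 'missing taskTypeId';
  const d = raw['difficulty'];
  if (d !== 1 && d !== 2 && d !== 3) return 'difficulty must be 1, 2, or 3';
  if (!isRecord(raw['input'])) return 'missing input';
  if (!isRecord(raw['groundTruth'])) return 'missing groundTruth';
  return null;
}

// `files` maps a file path to the parsed entries of that file (assumed to
// be an array). Throws on the first malformed entry, so a bad scenario
// stops loading instead of reaching the demo.
function loadScenarios(files: Record<string, unknown[]>): Scenario[] {
  const loaded: Scenario[] = [];
  for (const [path, entries] of Object.entries(files)) {
    for (const raw of entries) {
      const problem = checkScenario(raw);
      if (problem !== null) throw new Error(`${path}: ${problem}`);
      loaded.push(raw as Scenario);
    }
  }
  return loaded;
}

const sample = {
  id: 'dm-2',
  taskTypeId: 'damage-assessment',
  difficulty: 2,
  adversarial: true,
  input: { text: '...' },
  groundTruth: { damaged: true },
};
console.log(loadScenarios({ 'damage-assessment.json': [sample] }).length);
```

Throwing with the file path in the message keeps the failure actionable: the author of a broken scenario sees which file to fix.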
Adversarial cases (tamper, hazmat mislabel, customs undervaluation, counterfeit goods) are salted in at a draw rate of roughly 22% and double as the natural triggers for the Exceptions Specialist in the multi-agent pipeline.
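One way such a draw could work is a two-bucket pick, sketched below. How the project actually samples is not described here, so treat the whole mechanism as an assumption: drawScenario, PoolItem, the injectable rng parameter, and the fallback to the other bucket are all hypothetical.

```typescript
interface PoolItem {
  id: string;
  adversarial: boolean;
}

// From the text: adversarial cases are drawn roughly 22% of the time.
const ADVERSARIAL_RATE = 0.22;

// Picks one scenario. `rng` returns a number in [0, 1) and is injectable
// so tests can make the draw deterministic.
function drawScenario<T extends PoolItem>(pool: T[], rng: () => number = Math.random): T {
  const adversarial = pool.filter((s) => s.adversarial);
  const regular = pool.filter((s) => !s.adversarial);
  const wantAdversarial = rng() < ADVERSARIAL_RATE;
  // Fall back to the other bucket when the preferred one is empty.
  const bucket = wantAdversarial
    ? (adversarial.length > 0 ? adversarial : regular)
    : (regular.length > 0 ? regular : adversarial);
  if (bucket.length === 0) throw new Error('empty scenario pool');
  return bucket[Math.floor(rng() * bucket.length)] as T;
}

const pool: PoolItem[] = [
  { id: 'dm-1', adversarial: false },
  { id: 'dm-2', adversarial: true },
];

// A fixed rng below the rate selects the adversarial bucket.
const forcedAdversarial = drawScenario(pool, () => 0.1);
// A fixed rng above the rate selects the regular bucket.
const forcedRegular = drawScenario(pool, () => 0.9);

console.log(forcedAdversarial.id, forcedRegular.id);
```

Drawing the bucket first, then the item, keeps the adversarial rate independent of how many adversarial scenarios a task happens to have in its pool.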
See Authoring Scenario Data for how to add more.