Schemas & formalizations
Updated 16 Aug 2025
From prompts to contracts
Early assistants answered free‑form questions and improvised actions. That style delighted people until tasks touched money, records, or safety. The remedy is to turn prose into contracts: intents mapped to schemas, confirmations that restate actions clearly, and tools that receive typed payloads. Contracts make automation legible to humans and predictable to software.
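To make this concrete, here is a minimal sketch of what one such contract might look like for a single intent; the intent name, fields, confirmation text, and tool name are all hypothetical.

```python
# A sketch of one intent contract, with hypothetical names. The same vocabulary
# should appear in the UI, the schema, the tool arguments, and the receipt.
REFUND_ORDER_INTENT = {
    "intent": "refund_order",
    "schema": {                      # typed payload the tool receives
        "type": "object",
        "required": ["order_id", "amount", "reason"],
        "properties": {
            "order_id": {"type": "string"},
            "amount": {"type": "number", "minimum": 0.01},
            "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
        },
    },
    "confirmation": "Refund {amount} for order {order_id} ({reason})?",
    "tool": "payments.refund",       # tool bound to this intent
}
```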
A contract begins with vocabulary. Teams must name the things the assistant can change and agree on how those names appear in code, logs, and receipts. The words in the interface should match fields in the schema, arguments in the tool, and labels in the audit trail. When naming drifts, trust erodes and debugging becomes folklore.
Contracts also include fallbacks. When a plan cannot execute because data or permission is missing, the assistant should deliver an intelligible answer rather than silence. Citations, comparisons, or next‑best actions keep users moving. By designing graceful exits up front, we reduce escalations and teach people what information unlocks the path ahead.
Finally, contracts need stewardship. Someone owns each intent, its schema, and its review process for risky scopes. That ownership is visible in documentation and in the code that binds UI to tools. Ownership avoids “orphaned surfaces” that accumulate bugs and erode confidence precisely where automation could add the most value.
With contracts in place, prompts become orchestration rather than magic spells. The model still predicts tokens, but it does so inside a scaffold that clarifies inputs, desired outcomes, and success criteria. That scaffold turns unpredictability into bounded variation, which is the difference between a charming demo and a reliable product.
Versioning strategy
Every artifact that shapes assistant behavior must be versioned: schemas, manifests, prompts, and tool definitions. Tying versions to deployments lets teams reproduce historical behavior, compare cohorts, and roll back safely. Version numbers belong in receipts so support can see which contract governed a decision without spelunking build logs.
Semantic versioning works when changes are clear: additive fields are minor versions; removals and incompatible shifts in meaning are major versions. When semantics blur, introduce migration guides and codemods that rewrite old payloads into new shapes. By treating breaking changes as rare and deliberate, we keep plan archives replayable and make upgrades predictable.
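A codemod can be as small as a function that rewrites stored payloads. The sketch below assumes a hypothetical v2 schema that split a single name field into two; the field names and version strings are illustrative.

```python
# Hypothetical migration: rewrite a v1 payload into the v2 shape,
# assuming v2 replaced "name" with "first_name" and "last_name".
def migrate_v1_to_v2(payload: dict) -> dict:
    migrated = dict(payload)
    if "name" in migrated:
        first, _, last = migrated.pop("name").partition(" ")
        migrated["first_name"] = first
        migrated["last_name"] = last          # empty string if only one word was given
    migrated["schema_version"] = "2.0.0"
    return migrated
```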
Branch policies matter. Draft intents live behind flags on per‑environment branches; stable ones merge to main only after validation. The CI pipeline enforces invariants: schemas must compile, examples must validate, and tool signatures must match. Releases bundle spec snapshots with binaries so incidents can rehydrate the exact behavior that ran.
Version negotiation is part of resilience. When a client submits an older plan, services should translate or reject with actionable errors rather than failing ambiguously. Clear negotiation keeps distributed systems aligned even when parts roll forward at different speeds, a reality for any organization with multiple teams and environments.
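One way to express that negotiation, building on the migration sketch above; the supported version and error wording are illustrative.

```python
SUPPORTED = "2.0.0"
MIGRATIONS = {"1.0.0": migrate_v1_to_v2}      # from the migration sketch above

def negotiate(plan: dict) -> dict:
    """Translate older plans when possible; otherwise fail with an actionable error."""
    version = plan.get("schema_version", "unknown")
    if version == SUPPORTED:
        return plan
    if version in MIGRATIONS:
        return MIGRATIONS[version](plan)
    raise ValueError(
        f"Plan version {version} is unsupported; expected {SUPPORTED}. "
        "Re-generate the plan or apply the published migration."
    )
```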
Finally, versions should be visible to humans. Dashboards and docs include sidebars showing the active contract for each surface. Engineers, support, and legal read the same page, reducing miscommunication and shortening investigations when outcomes surprise someone who expected an earlier behavior.
JSON Schema design
JSON Schema is the workhorse for assistant contracts because it is portable, expressive, and friendly to tooling. Good schemas prefer explicit enums to strings that drift, minimums and maximums to prose constraints, and formats for dates, emails, and currency. The schema should describe the world precisely enough that invalid plans fail validation before they ever reach a tool.
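As an illustration, a payment schema in this style might look like the following; the field names and bounds are invented for the example.

```python
# Sketch of a payment schema using enums, numeric bounds, and string formats.
PAYMENT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,
    "required": ["amount", "currency", "due_date", "recipient_email"],
    "properties": {
        "amount": {"type": "number", "minimum": 0.01, "maximum": 10000},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "due_date": {"type": "string", "format": "date"},
        "recipient_email": {"type": "string", "format": "email"},
    },
}
```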
Defaults deserve care. They reduce friction but must never smuggle irreversible behavior. A default for pagination is harmless; a default for deleting related records is not. When defaults change meaning over time, version them. Treat defaults as public API, not as helpers tucked into code paths no one remembers reviewing.
Validation belongs in two places: at generation time and at execution time. The UI validates as the plan forms so people see errors early; the tool validates again to defend the boundary. Duplicate checks may feel redundant until a network partition or stale client shows why defense in depth keeps systems predictable under stress.
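A sketch of the two checks, assuming the widely used jsonschema package; the split between a forgiving UI-side check and a strict tool-side check is the point, not the exact function names.

```python
from jsonschema import ValidationError, validate

def validate_at_generation(plan: dict, schema: dict) -> list[str]:
    """UI-side check: surface readable errors while the plan is still being formed."""
    try:
        validate(instance=plan, schema=schema)
        return []
    except ValidationError as err:
        return [err.message]

def validate_at_execution(plan: dict, schema: dict) -> None:
    """Tool-side check: re-validate at the boundary; a failure aborts the call."""
    validate(instance=plan, schema=schema)   # raises ValidationError on bad payloads
```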
Examples and property descriptions are part of the spec. They teach humans how to populate structures and give assistants richer few‑shot material for synthesis. Keep examples short, realistic, and representative of edge cases. Docs engines should compile directly from schemas so stale prose cannot drift out of sync with the contract.
Finally, schemas should mirror business rules. If a field requires manager approval above a threshold, encode that rule or expose an explicit state that tools can check. A schema that pretends policy does not exist invites silent failures and ad‑hoc exceptions that erode both compliance and user confidence.
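JSON Schema's conditional keywords can carry that kind of rule directly; the threshold and field names below are hypothetical.

```python
# Sketch: encode "amounts above 1000 require manager approval" in the schema itself.
EXPENSE_SCHEMA = {
    "type": "object",
    "required": ["amount"],
    "properties": {
        "amount": {"type": "number", "minimum": 0},
        "approval_state": {"type": "string", "enum": ["pending", "approved"]},
    },
    "if": {"properties": {"amount": {"exclusiveMinimum": 1000}}},
    "then": {
        "required": ["approval_state"],
        "properties": {"approval_state": {"const": "approved"}},
    },
}
```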
Tool definitions and MCP
Model Context Protocol (MCP) turns plans into action by exposing tools with typed inputs, explicit permissions, and observable results. Tools are small programs with narrow responsibilities: fetch a page, post a message, write a file, submit a form. Narrowness increases safety, debuggability, and the chance that partial failures degrade gracefully.
Each tool publishes a JSON Schema for its arguments and an output shape for telemetry. The assistant receives both. When a plan reaches the tool boundary, the server validates the payload, executes, and emits events with timing and status. Logs become structured stories: intent, plan, calls, results. That structure is the difference between sticky notes and an operations manual.
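A sketch of what that looks like in practice: a tool description with a typed argument schema, in the spirit of MCP's tool definitions, plus a helper that emits a structured event for each call. The tool, field names, and event shape are illustrative.

```python
import time

# Hypothetical read-only tool with a JSON Schema for its arguments.
FETCH_PAGE_TOOL = {
    "name": "fetch_page",
    "description": "Fetch a web page and return its text content.",
    "inputSchema": {
        "type": "object",
        "required": ["url"],
        "properties": {"url": {"type": "string", "format": "uri"}},
    },
}

def emit_tool_event(tool: str, status: str, started: float, **extra) -> dict:
    """Structured telemetry for one tool call: timing, status, and context."""
    return {
        "tool": tool,
        "status": status,                                  # "ok", "error", "timeout", ...
        "duration_ms": round((time.time() - started) * 1000),
        **extra,
    }
```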
Be explicit about side‑effects. Tools are classified as read or write. Write tools require explicit scopes that humans can grant, revoke, and audit. Plans list the scopes they need, and the UI reflects that need in the confirmation step. The system refuses to escalate silently; users are never surprised by irreversible work.
MCP also standardizes resources like file stores or caches so assistants can reference artifacts by URI rather than inventing ad‑hoc channels. When tools, schemas, and resources share conventions, ecosystems grow around them: debuggers, recorders, and replay tools that make assistant work reproducible by anyone in the organization.
Finally, treat tool servers as products. Version them, test them, and publish changelogs. A tool that behaves predictably across releases enables teams to compose plans confidently, because the agent’s capabilities act like a stable library rather than a shifting black box.
Permissions and boundaries
Boundaries make automation humane. Read and write scopes are separate; irreversible actions require explicit confirmation with a clear summary of effects. Plans declare which scopes they need before execution. Users can grant once, grant for a session, or refuse. Every decision is visible in receipts so trust does not depend on memory.
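One way to make grants auditable is to record them as data the receipt can reference; this sketch and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ScopeGrant:
    """A single user decision about a scope, recorded so receipts can cite it."""
    scope: str                     # e.g. "calendar:write"
    granted_by: str                # person who confirmed
    duration: str                  # "once", "session", or "denied"
    granted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def grants_cover(plan_scopes: set[str], grants: list[ScopeGrant]) -> bool:
    """A plan may execute only if every scope it declares has a live grant."""
    allowed = {g.scope for g in grants if g.duration != "denied"}
    return plan_scopes <= allowed
```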
Least privilege applies to assistants as it does to services. Tools request the minimal scope that satisfies the plan; the platform enforces quotas and limits. When something exceeds its budget, the assistant pauses with an explanation and a path to escalate. Surprises sting; clarity slows people down far less than invisible power eventually does.
Organizational boundaries matter too. Teams own their data and decide which tools can touch it. Cross‑team plans require approvals that travel with the payload, not side channels that vanish during incidents. By making ownership part of the contract, we convert tribal knowledge into explicit, testable policy.
Boundaries also help with debugging. When a plan cannot cross a line, the system says why and who can move it. That message shortens the distance between a frustrated user and the person who can help. Boundaries are not walls; they are guides that keep the system legible at scale.
Finally, dignity is a requirement. The interface treats consent, undo, and attribution as first‑class features. People should feel that the assistant respects their autonomy and effort. Systems that demonstrate respect earn patience when mistakes happen, and patience is the oxygen that improvements need to reach production.
Prompt structure and canonicalization
Prompts age better when they follow a template that reads like an engineering brief: task, constraints, resources, tests. That structure prevents drift into wishful prose. When the assistant fills a schema from text, the system canonicalizes: it normalizes dates, currencies, and entity names so the plan becomes unambiguous before tools execute.
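A canonicalization pass can be a plain function applied before validation and execution; the date format, alias table, and field names below are assumptions for the example.

```python
from datetime import datetime

# Hypothetical alias table mapping informal entity names to canonical ones.
ENTITY_ALIASES = {"acme inc": "ACME Corporation", "acme": "ACME Corporation"}

def canonicalize(plan: dict) -> dict:
    """Normalize dates to ISO 8601, currency codes to upper case, and entity names."""
    canonical = dict(plan)
    if "due_date" in canonical:
        canonical["due_date"] = datetime.strptime(
            canonical["due_date"], "%d %b %Y"   # e.g. "16 Aug 2025"
        ).date().isoformat()
    if "currency" in canonical:
        canonical["currency"] = canonical["currency"].strip().upper()
    if "vendor" in canonical:
        key = canonical["vendor"].strip().lower()
        canonical["vendor"] = ENTITY_ALIASES.get(key, canonical["vendor"])
    return canonical
```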
Few‑shot examples remain useful, but they belong in files under version control rather than inlined inside code. Treat them as fixtures. Rotate them periodically to avoid overfitting to accidental patterns. Keep them short, realistic, and diverse across edge cases so the model learns the shape of good answers rather than anchors to a single template.
System prompts carry tone and policy, not every rule. Stable rules live in the spec and in permission lists that the runtime loads on demand. When a rule changes, the spec changes, not the prose. That separation prevents context windows from bloating and keeps enforcement grounded in code, not memory.
Canonicalization extends to UI copy. Labels, hints, and confirmation text draw from the same dictionary that schemas use. When wording changes, both layers update together. This discipline removes a whole class of mismatches where UI promises one thing while tools expect another.
Cross‑links help non‑specialists. When concepts feel abstract, we reference explanatory essays that build intuition, such as accidental Turing completeness. People remember stories. Pairing stories with contracts keeps teams aligned even when roles, vendors, or models evolve.
Testing and replay
Plan generation and tool execution deserve separate test suites. Generation tests feed inputs and compare the emitted plan to a fixture. Execution tests run tools against sandboxes and verify side‑effects. By testing both halves independently, we isolate failures, shorten feedback cycles, and reduce the blast radius when something regresses.
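Sketched with pytest, the two suites might look like this; generate_plan and sandbox are hypothetical fixtures, and the intent, fixture file, and ledger helper are invented for the example.

```python
import json

def test_plan_generation_matches_fixture(generate_plan):
    plan = generate_plan("Refund order 123 because it arrived damaged")
    with open("fixtures/refund_order_123.json") as fh:
        expected = json.load(fh)
    assert plan == expected                    # emitted plan matches the recorded fixture

def test_refund_tool_side_effects(sandbox):
    result = sandbox.call("payments.refund", {"order_id": "123", "amount": 12.50})
    assert result["status"] == "ok"
    assert sandbox.ledger_entry("123")["refunded"] is True   # side-effect checked in the sandbox
```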
Replay turns incidents into learning. Receipts contain everything needed to re‑run a plan offline: version, inputs, scopes, and tool outcomes. Engineers reproduce the issue, propose a fix, and attach the receipt to the pull request. Reviewers see the same evidence, the same plan, and the improved result, which accelerates approval without hand‑waving.
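A replay harness can be small. This sketch assumes a receipt stored as a dict with illustrative keys, the negotiate helper from the versioning section, and a registry of sandboxed tool implementations.

```python
def replay(receipt: dict, tools: dict) -> dict:
    """Re-run a recorded plan offline against sandboxed tools."""
    plan = negotiate(receipt["plan"])                 # translate older plan versions if needed
    outcomes = []
    for call in receipt["tool_calls"]:
        tool = tools[call["tool"]]                    # sandboxed implementation, not production
        outcomes.append(tool(**call["arguments"]))
    return {"plan": plan, "outcomes": outcomes}
```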
Mocks and stubs belong at the boundary. We avoid mocking the model because it destroys the signal we are trying to test. Instead we fix seeds and constrain prompts so outputs remain stable while still exercising the system. Determinism matters at the confirmation boundary; beyond it, controlled variation is acceptable if the contract holds.
Load tests represent real behavior: bursts of plan creation, retries after transient failures, and long‑running tool calls. Observability couples with testing here; synthetic traffic feeds dashboards so the team sees how backpressure and timeouts behave before real users feel them. Sharp edges found in rehearsal rarely escalate into production incidents.
Finally, test scaffolds should be boring. A one‑line helper that loads a spec and generates a plan is better than a labyrinth of factories. Boring scaffolds invite contributions from every engineer, not only the ones who wrote the framework. Inclusivity in tooling is a strategic advantage.
Telemetry and receipts
Receipts are the connective tissue of assistant features. Each combines the human intent, the plan version, the tool calls, and the outcomes with timestamps. With that packet, support can help, finance can reconcile, and legal can audit. Without it, the system devolves into folklore and screenshots.
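As a sketch, a receipt can be an ordinary record; the field names here are illustrative rather than a fixed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Receipt:
    intent: str                    # what the person asked for
    plan_version: str              # the contract that governed the decision
    tool_calls: list[dict]         # each call with arguments, status, and duration
    outcome: str                   # "completed", "cancelled", or "failed"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```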
Telemetry completes the picture. We record latency for generation, confirmation, and each tool call, plus error classes and retry counts. Aggregations roll up by intent and cohort so product decisions rely on evidence rather than anecdotes. Teams spot weak validators, confusing copy, or brittle tools and choose fixes with confidence.
Privacy is non‑negotiable. Sensitive fields are redacted or tokenized; access to receipts is gated by roles; retention follows policy. Telemetry exists to improve the system, not to amass data for its own sake. When privacy is visible, organizations feel comfortable letting assistants touch more valuable workflows.
Visualization matters as much as collection. Dashboards tie receipts to UI screenshots and error traces so issues tell coherent stories. Engineers move from incident to fix with fewer context switches, and managers see progress clearly. Well‑told stories build trust faster than raw metrics ever could.
Receipts also power education. Product teams publish sanitized examples in documentation so customers learn what a good plan looks like. New hires study them like case law. The practice creates a culture where clarity is valued and success is defined in terms users recognize.
Operations and governance
Governance sounds heavy until a release goes wrong. Clear owners, change logs, and backouts prevent minor mistakes from becoming crises. When automation touches external systems, incident response must be rehearsed: who pauses tools, who communicates with customers, and which receipts anchor the public narrative.
Review boards exist to weigh risk, not to slow progress. They approve new write scopes, verify migrations, and ensure that intents align with policy. The board’s calendar is published; decisions and rationales become part of the spec. Transparency converts “security says no” into “the organization understands why and when it will say yes.”
Runbooks live next to code. They include commands for rolling forward or back, dashboards to check first, and contacts for dependent systems. During an incident, people do not need prose; they need a recipe. By writing that recipe while calm, we buy speed when adrenaline clouds judgment.
Governance also covers vendors and models. When a provider changes behavior, we capture it in the spec with a version bump and update fixtures. Contracts with vendors reference our requirements for observability and redress, so escalation paths are contractual rather than improvised emails.
Above all, governance should scale with maturity. Early teams ship behind flags and accept more manual review. Mature teams automate checks and grant broader autonomy. The spec evolves from rulebook to nervous system that keeps the organization aligned as people and priorities change.
Roadmap and pitfalls
Start narrow. Choose a handful of intents tied to measurable outcomes and build contracts around them. Instrument from day one so you learn where edits cluster, which validators frustrate, and which tools dominate latency. Evidence will tell you which surfaces to expand next and which to retire with dignity.
Expect model churn. New capabilities arrive; old behaviors shift. Maintain adapters that shield contracts from provider changes. Keep few‑shot examples crisp and rotate them regularly. Treat models as replaceable engines that obey your spec rather than as oracles whose quirks dictate product shape.
Beware silent drift. UI copy, schema fields, and tool arguments can diverge quickly across teams. Automated checks compare them in CI. Docs compile from the same sources the runtime reads. If something cannot be verified automatically, consider whether it belongs in the contract at all.
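One such check can be a short script run in CI; the file paths and structures below are hypothetical.

```python
import json
import sys

def check_labels(schema_path: str, copy_path: str) -> int:
    """Fail the build when the UI copy dictionary lacks a label for a schema field."""
    with open(schema_path) as fh:
        schema = json.load(fh)
    with open(copy_path) as fh:
        copy = json.load(fh)
    missing = set(schema["properties"]) - set(copy["labels"])
    if missing:
        print(f"UI copy is missing labels for schema fields: {sorted(missing)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_labels("specs/payment.schema.json", "ui/copy/payment.json"))
```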
Avoid one‑way doors. Plans that cannot be canceled, scopes that cannot be revoked, and migrations without rollback seed headlines. Design the exit before the entrance: undo paths, safe defaults, and circuit breakers. Small acts of humility save weeks of damage control later.
Finally, keep your eye on dignity. Assistants exist to remove drudgery, not to replace judgment. When systems are clear about what they can and cannot do, people relax and collaborate. That is the real promise of schema‑driven agents: not just correctness, but a working relationship between humans and software.