Specifications for coding LLM agents
Updated 01 Aug 2025
From chatbots to tool‑calling agents
ChatGPT’s public debut in 2022 popularised conversation‑only prompting: a human writes a question, the model replies. Engineers quickly discovered two scaling pain‑points. First, prose instructions mutate—“add ESLint”, “no console logs”, “use pnpm”—until the system message balloons past 5 000 tokens. Second, models hallucinate actions (“rm ‑rf /”) because nothing enforces a contract.
Today’s agent stacks add structured layers on top of free‑form text. We still begin with a system prompt, but augment it with few‑shot examples and, when available, a function‑calling schema so the model returns JSON rather than shell code. Below are three canonical patterns you’ll see inside real products at OpenAI, Anthropic, Cursor and Google:
# System prompt (Cursor IDE)
You are Kai, a TypeScript expert.
* ALWAYS return complete files.
* NEVER write console.log in production code.
* Follow Airbnb lint rules.
Fig 1 – minimalist system prompt: short, declarative, zero examples.
# Few‑shot prompt (smol‑dev)
<example>
User: "Add dark‑mode toggle"
Assistant:
  files:
    - path: src/theme.ts
      contents: |
        export const isDark = window.matchMedia("(prefers-color-scheme: dark)").matches;
</example>

<task>
Add TypeScript types for an EventBus class.
</task>
Fig 2 – few‑shot: a solved task primes the agent before the new task.
# Function‑calling example (OpenAI Agents SDK)
{
  "name": "writeFile",
  "arguments": {
    "path": "src/EventBus.ts",
    "contents": "export class EventBus { /* … */ }"
  }
}
Fig 3 – function‑call JSON: host validates and executes; model never touches disk.
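The host-side validate-then-execute step that Fig 3 implies can be sketched without any schema library. Everything here is illustrative: `validateWriteFile`, the `ToolCall` shape and the path pattern are this sketch's assumptions, and a production runner would use a real JSON Schema validator such as AJV.

```typescript
// Hypothetical host-side guard: validate a writeFile call before
// executing it. A real runner would use a JSON Schema validator;
// this hand-rolled check just shows the fail-closed idea.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function validateWriteFile(call: ToolCall): string[] {
  const errors: string[] = [];
  if (call.name !== "writeFile") errors.push(`unknown tool: ${call.name}`);
  const { path, contents } = call.arguments;
  if (typeof path !== "string" || !/^src\/.*\.tsx?$/.test(path))
    errors.push("path must point at a TypeScript file under src/");
  if (typeof contents !== "string") errors.push("contents must be a string");
  return errors; // empty array => safe to execute
}

// Fail closed: only execute when validation returns no errors.
const call: ToolCall = {
  name: "writeFile",
  arguments: { path: "src/EventBus.ts", contents: "export class EventBus {}" },
};
if (validateWriteFile(call).length === 0) {
  // the host would perform the disk write here -- never the model
}
```

Because the model only ever emits JSON, the worst a hallucinated call can do is fail validation and be rejected before anything touches disk.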
Each technique buys something: a system prompt encodes policy; few‑shot embeds style; tool‑calls enforce safety. The cost is tokens. GPT‑4o ships a 128 k context window, but that vanishes when you pack five example pull‑requests plus a 1 000‑line diff. Worse, the model’s effective attention is uneven: content buried in the middle of the window is the least influential.
// Pseudocode: what happens when prompt > contextWindow
const max = 128000; // tokens (GPT‑4o 128 k)
let promptTokens = 131000; // oops
if (promptTokens > max) {
  truncate(systemPrompt);
  /* or worse: drop few‑shot examples silently */
}
Fig 4 – if you overflow, some client stacks silently truncate the oldest tokens.
Engineers mitigate this by layering specs (see Chapter 3) and by windowing examples: keep one golden example per archetype, store the rest in an external vector DB and inject only the closest matches per request. Another emerging trick is prompt compression: summarise stale conversation turns into a shorter embedding‑aware note, freeing budget for new diffs and stack‑traces.
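The example-windowing idea can be sketched with plain cosine similarity. The three-dimensional vectors below are toy stand-ins for real embedding vectors, and every name here is hypothetical; in practice the pool would live in a vector DB and the query vector would come from an embeddings API.

```typescript
// Sketch of example windowing: store few-shot examples with
// precomputed embeddings and inject only the nearest match.
type Example = { id: string; prompt: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function closestExample(query: number[], pool: Example[]): Example {
  // keep whichever example scores highest against the query
  return pool.reduce((best, ex) =>
    cosine(query, ex.embedding) > cosine(query, best.embedding) ? ex : best
  );
}

const pool: Example[] = [
  { id: "dark-mode", prompt: "Add dark-mode toggle", embedding: [1, 0, 0] },
  { id: "event-bus", prompt: "Type an EventBus class", embedding: [0, 1, 0] },
];
// A query vector near the second axis retrieves the EventBus example.
const hit = closestExample([0.1, 0.9, 0], pool);
```

Only `hit` is injected into the prompt; the rest of the pool costs zero tokens per request.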
The takeaway: start simple—system prompt ≤ 300 tokens, one or two high‑leverage few‑shot examples—then graduate to JSON tool‑calls once you need determinism. Every extra token competes with your user’s code diff, and the diff is usually what you really want the model to see.
Motivation for formal specifications
Chapter 1 ended with three classic prompt styles. They work—until they don’t. As a repo grows, so does its system prompt: new lint rules, library bans, onboarding notes, footnotes from security. Before long the prompt reaches 25 k tokens and the diff of your pull request alone is clipped from the context window. The model starts forgetting why console.log is banned and quietly slips one into src/utils.ts.
The first patch most teams attempt is a “blended” prompt: keep prose, tack a JSON Schema underneath so the agent can emit tool calls. It buys validation but still packs everything into one message — fragile, un‑lintable and invisible to CI.
# System prompt v0.1
You are an expert TypeScript generator.
- Always write complete files
- Use strictNullChecks
- No console.log in production

Write a React hook called useSessionStorage.
Fig 5 – prompt‑only v0.1. Works fine until the 17th rule and the 3rd example diff.
# System prompt v0.2 (OpenAI function calling)
You are an expert TypeScript generator.
Follow the provided JSON schema when suggesting file writes.

<JSON-SCHEMA>
{ "$id": "writeFile", "type": "object",
  "required": ["path","contents"],
  "properties": { "path": {"type":"string"}, "contents": {"type":"string"} } }
</JSON-SCHEMA>

Write a React hook called useSessionStorage.
Fig 6 – prompt + inline schema v0.2. Validation arrives, but the blob is now 6 k tokens before any user input.
At this stage every additional example or rule is a trade‑off against user context. OpenAI staff call this “prompt rent”: static tokens the user must pay before any fresh content fits. The solution is to move immovable pieces out of the prompt and into versioned files the runtime can inject or reference on‑demand.
spec/
├── AGENTS.md    # persona & style guide
├── schema.json  # tool contracts
└── tasks.yaml   # curated task cards
Fig 7 – the SpecBundle directory. Files are diff‑able, reviewable and load only when needed.
The migration is mechanical: prose lines become AGENTS.md; the inline schema relocates to schema.json; repeated tasks turn into YAML cards the model can reference by id. Below is an excerpt of the real diff from a production repo at Cursor showing that move. CI green‑lights only when schema.json validates and every card id is unique.
diff --git a/prompt/system.txt b/spec/AGENTS.md
@@
- # System prompt v0.2
- You are an expert TS generator...
- Follow the provided JSON schema...
+ ## Persona
+ Kai is an expert TypeScript generator.
+ - Write complete files
+ - No console.log in prod

diff --git a/prompt/system.txt b/spec/schema.json
- <JSON-SCHEMA> ... </JSON-SCHEMA>
+ {
+   "$id": "https://example.com/tools/writeFile",
+   "type": "object",
+   "required": ["path","contents"],
+   "properties": {
+     "path": { "type": "string", "pattern": "^src/.*\\.tsx?$" },
+     "contents": { "type": "string", "maxLength": 12000 }
+   }
+ }

diff --git a/dev/null b/spec/tasks.yaml
+ --- # initial task catalogue
+ version: 1
+ cards:
+   - id: use_session_storage
+     goal: "Create hook useSessionStorage"
+     tools: [writeFile]
+     hints: |
+       Accept <T = unknown>(key: string, initial: T).
+       SSR‑safe: guard window.
Fig 8 – shrinking the system prompt to 0 tokens; enforcing rules via schema & bundle files.
Payoff comes fast: the agent’s dynamic prompt is now mostly the repository diff and the immediate task card—often under 3 k tokens total—leaving ample headroom for multi‑file patches. Meanwhile, humans review rule changes like code; CI lints schemas; OpenAI function calls reject invalid JSON before a single shell command runs.
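That per-request assembly can be sketched in a few lines. The `buildPrompt` helper and the `TaskCard` shape are this sketch's inventions, assuming the bundle files have already been parsed; only the card and the diff vary between requests.

```typescript
// Hypothetical runtime assembly: static bundle pieces are loaded
// once, then each request injects one task card plus the user's diff.
type TaskCard = { id: string; goal: string; tools: string[]; hints: string };

function buildPrompt(persona: string, card: TaskCard, diff: string): string {
  return [
    persona,                                   // from spec/AGENTS.md
    `## Task: ${card.goal}`,                   // from spec/tasks.yaml
    `Allowed tools: ${card.tools.join(", ")}`, // contracts live in schema.json
    card.hints,
    "## Current diff",
    diff,                                      // the user's actual change
  ].join("\n\n");
}

const card: TaskCard = {
  id: "use_session_storage",
  goal: "Create hook useSessionStorage",
  tools: ["writeFile"],
  hints: "SSR-safe: guard window.",
};
const prompt = buildPrompt(
  "Kai is an expert TypeScript generator.",
  card,
  "+ const stored = sessionStorage.getItem(key);"
);
```

Everything static stays on disk under version control; the prompt itself is rebuilt fresh and small on every call.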
That, in short, is the motivation for formal specifications: freeze the parts that never change, validate the parts that must change, and gift the remaining context window to the user’s actual problem. The next chapter maps each bundle file in detail.
Specification paradigms
Early agent projects packed every rule, example and safety rail into a single monster prompt. This mono‑spec worked—until it didn’t. The file grew past 3 000 tokens, diffs turned to noise, and CI couldn’t tell whether a change was harmless copy‑editing or a policy break. Developers copy‑pasted snippets between prompts, models ran out of context, and subtle contradictions crept in.
The industry’s response was layering. Instead of one opaque blob, we treat a spec like a software artifact with separable concerns:
- AGENTS.md – long‑form narrative: persona, coding style, examples.
- schema.json – machine contract: tool names, parameter shapes, structured outputs.
- tasks.yaml – curated task “cards” that map goals to tool sequences.
- permissions.yaml – explicit allow‑lists for files, URLs, shell ops.
- llm.hints.toml – model tunables (temperature, bias, stop words).
Layering lets each concern move at its own cadence: narrative evolves weekly, schemas maybe monthly, permissions on incident day. Each file owns its own version field so linters can enforce stability. A breaking change in schema.json (say, renaming oldPath → path) need not disturb the prose or tasks—CI simply pins the model to the new schema once tests pass.
diff --git a/spec/AGENTS.md b/spec/AGENTS.md
@@
-## Version 1.3
+## Version 1.4 ← narrative tweak (typo fix)

diff --git a/spec/schema.json b/spec/schema.json
@@
-"$id": "tool.schema:1.1",
-"version": "1.1",
+"$id": "tool.schema:2.0",
+"version": "2.0", ← breaking param rename

diff --git a/spec/tasks.yaml b/spec/tasks.yaml
@@
-version: 1.1
+version: 1.2 ← new “refactor” card added
Listing C – a single PR bumps three layers at their own cadence; risk review is now surgical instead of holistic.
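The stability checks described above are cheap to automate. This sketch assumes the YAML and JSON files are already parsed into objects; `lintBundle` and the `Bundle` shape are hypothetical, not part of any published tool.

```typescript
// Hypothetical CI lint over an already-parsed SpecBundle: every
// layer must declare a version, and task-card ids must be unique.
type Bundle = {
  schema: { version?: string };
  tasks: { version?: string; cards: { id: string }[] };
};

function lintBundle(b: Bundle): string[] {
  const errors: string[] = [];
  if (!b.schema.version) errors.push("schema.json: missing version");
  if (!b.tasks.version) errors.push("tasks.yaml: missing version");
  const seen = new Set<string>();
  for (const card of b.tasks.cards) {
    if (seen.has(card.id))
      errors.push(`tasks.yaml: duplicate card id "${card.id}"`);
    seen.add(card.id);
  }
  return errors; // non-empty => fail the build
}

const report = lintBundle({
  schema: { version: "2.0" },
  tasks: { version: "1.2", cards: [{ id: "use_session_storage" }] },
});
// report is empty, so this bundle would pass CI
```

Run it in the same pipeline stage as your schema validation so a bad bump never merges.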
Layered specs also unlock polyglot tooling. JSON Schema can be validated by AJV or Pydantic, YAML cards may feed TUI dashboards, TOML hints tune the OpenAI SDK—and none of these tools need to parse Markdown. This composability keeps specs alive: they fail CI when wrong, surface docs when right, and scale gracefully with both humans and models.
Core files & formats
A SpecBundle distills every rule, contract and preference an LLM agent needs into five plainly‑named files. They live together under spec/, version independently and lint in CI. The stack below is opinionated but battle‑tested across Copilot, Cursor IDE and smol‑dev:
AGENTS.md carries the story—persona, style guide and worked examples. Markdown renders nicely in GitHub and IDE panels, and its free‑form nature encourages rich commentary. Keep critical “MUST / SHOULD” rules in bullet lists so both humans and models parse them deterministically.
## Coding style
* Prefer TypeScript strict mode.
* No console logs in production code.
* Tests live beside source files as `*.test.ts`.

### Example commit
```diff
+ export async function fetchUser(id: string): Promise<User> { … }
```
Listing 3 – excerpt from AGENTS.md.
schema.json is the machine contract. Each tool call the model may request must validate against this schema before execution. Fail closed; refuse any payload that breaks the rules. A short excerpt:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/tools/writeFile",
  "type": "object",
  "required": ["path", "contents"],
  "properties": {
    "path": { "type": "string", "pattern": "^src/.*\\.tsx$" },
    "contents": { "type": "string", "maxLength": 10000 }
  }
}
Listing 4 – schema fragment for the writeFile tool.
While schema enforces shape, tasks.yaml offers context. Each “card” ties a repository goal—“bump deps”, “add tests”—to preferred tools and hints. Agents can pick a card, execute its tools, then mark it done. YAML’s literal block scalars (|) keep multi‑line guidance readable:
---
version: 1.2
cards:
  - id: refactor_component
    goal: "Modernise a legacy React class component to hooks"
    tools: ["writeFile", "runTests"]
    hints: |
      * Preserve existing snapshot outputs
      * Component must remain SSR‑safe
...
Listing 5 – a task card guiding a refactor.
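Because cards name tools while schema.json defines them, the two layers can drift apart. A simple cross-check, hypothetical but cheap to run in CI, refuses any card that references an undeclared tool:

```typescript
// Hypothetical consistency check: a card is only runnable when every
// tool it references has a contract in schema.json.
type Card = { id: string; tools: string[] };

function undeclaredTools(card: Card, declared: Set<string>): string[] {
  return card.tools.filter((tool) => !declared.has(tool));
}

// Tool names mirror the listings above; the check itself is the point.
const declared = new Set(["writeFile", "runTests"]);
const card: Card = { id: "refactor_component", tools: ["writeFile", "runTests"] };
// undeclaredTools(card, declared) is empty, so the card is executable
```

An agent asked to run a card with undeclared tools should refuse up front rather than fail mid-task.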
Security lives in permissions.yaml: explicit allow‑lists for paths and network domains. The runner denies any file write or HTTP call outside this set—no exceptions, no surprises:
allowedPaths:
  - "src/**"
  - "!src/secrets/**"
network:
  domains:
    - "api.github.com"
    - "registry.npmjs.org"
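The fail-closed path rule can be sketched as follows. Only the two pattern forms in the listing (`prefix/**` and its `!`-negated form) are handled; a real runner would use a full glob matcher, and `pathAllowed` is this sketch's own name.

```typescript
// Minimal sketch of the fail-closed file check: no matching allow
// rule means the write is denied, and later rules override earlier
// ones so "!src/secrets/**" can carve a hole out of "src/**".
function matches(pattern: string, path: string): boolean {
  const p = pattern.replace(/^!/, "");
  return p.endsWith("/**")
    ? path.startsWith(p.slice(0, -2)) // "src/**" -> prefix "src/"
    : path === p;
}

function pathAllowed(path: string, rules: string[]): boolean {
  let allowed = false;
  for (const rule of rules) {
    if (!matches(rule, path)) continue;
    allowed = !rule.startsWith("!"); // "!" rules deny
  }
  return allowed; // no match at all => denied (fail closed)
}

const rules = ["src/**", "!src/secrets/**"];
```

Note the default: a path nobody thought to list is rejected, which is exactly the “no exceptions, no surprises” posture the runner needs.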
Finally, llm.hints.toml tweaks model temperature, sampling and persona. TOML’s strict syntax beats YAML ambiguity for small config blobs:
[model]
temperature = 0.2
top_p = 0.9

[persona]
prefix = "Kai"
voice = "concise"
Listing 6 – model & persona knobs.
Together these files replace the monolithic prompt with a layered, diff‑friendly bundle. Need to add a new tool? Touch only schema.json. Tighten network rules? PR to permissions.yaml. Narrative evolves daily, contracts less often—and CI keeps every layer in sync.
UI & presentation layer
Specifications live longer when they are readable. A coding agent may parse JSON, but the humans who design, review and debug that agent benefit from typography, colour and spacing just like any design system. By treating “spec UI” as first‑class we reduce onboarding friction and surface critical rules before they are missed.
We recommend shipping a small ui.tokens.yaml alongside your schema. It declares the colour palette, font stack and vertical rhythm. Build‑time scripts convert these tokens into CSS variables so both docs and live agent portals stay consistent.
Fig 9 – minimal five‑colour palette: high‑contrast ink on pastel surfaces.
Fig 10 – explicit font mapping: body, UI sans and monospaced code.
# spec/ui.tokens.yaml
version: 0.3
palette:
  ink: "#131343"
  cloud: "#eaf6ff"
  blush: "#ffeaea"
  accent: "#0066ff"
  surface: "#f5f7fa"
typography:
  body: "EB Garamond, serif"
  ui: "Inter Tight, sans-serif"
  code: "Courier Prime, monospace"
layout:
  maxWidth: "72ch"
  rhythm: 4 # px grid unit
codeBlock:
  background: "surface"
  border: "ink"
Listing 7 – unified token file consumed by both docs renderer and Storybook.
A build step (or a tiny Node script) transforms those tokens into CSS variables. Docs, playgrounds and even the agent’s web console import the same stylesheet, guaranteeing parity with screenshots in the Markdown spec.
:root {
  --ink: #131343;
  --cloud: #eaf6ff;
  --blush: #ffeaea;
  --accent: #0066ff;
  --surface: #f5f7fa;

  --font-body: "EB Garamond", serif;
  --font-ui: "Inter Tight", sans-serif;
  --font-code: "Courier Prime", monospace;

  --rhythm: 4px;
}
code {
  background: var(--surface);
  border: 1px solid var(--ink);
  padding: calc(var(--rhythm) * 0.5) calc(var(--rhythm));
  border-radius: 4px;
}
Listing 8 – generated spec/ui.css; no hand‑editing required.
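The generator itself can be a few lines of TypeScript. This sketch assumes the token file is already parsed into an object; the flattening rules (the `--font-` prefix, the `px` suffix on rhythm) are inferred from the generated CSS above, not prescribed by any tool.

```typescript
// Hypothetical build step: flatten a parsed ui.tokens.yaml object
// into CSS custom properties. Output order mirrors the token file.
type Tokens = {
  palette: Record<string, string>;
  typography: Record<string, string>;
  layout: { rhythm: number };
};

function tokensToCss(t: Tokens): string {
  const lines: string[] = [":root {"];
  for (const [name, hex] of Object.entries(t.palette))
    lines.push(`  --${name}: ${hex};`);
  for (const [role, stack] of Object.entries(t.typography))
    lines.push(`  --font-${role}: ${stack};`);
  lines.push(`  --rhythm: ${t.layout.rhythm}px;`, "}");
  return lines.join("\n");
}

const css = tokensToCss({
  palette: { ink: "#131343", accent: "#0066ff" },
  typography: { body: "EB Garamond, serif" },
  layout: { rhythm: 4 },
});
```

Wire it into the docs build so the stylesheet is always regenerated, never hand-edited.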
Visual clarity also helps the model. Empirical tests show that enumerated, well‑spaced bullet lists improve rule adherence compared with dense paragraphs. Use heading levels consistently—H2 for sections, H3 for rules—and keep lines under 80 characters where possible. Code blocks should declare their language to cue both syntax highlighters and the model.
Finally, cap individual Markdown files at roughly 64 kB. Larger files risk exceeding the model’s context window, forcing awkward truncation or chunking. If your spec grows beyond that, split it: keep the AGENTS.md narrative short and move verbose API references into appendices or inline footnotes.
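That size cap is easy to enforce in CI. The limit constant and the `specTooLarge` helper below are this sketch's assumptions, not rules imposed by any model vendor.

```typescript
// Hypothetical CI guard for the ~64 kB Markdown budget: measure the
// UTF-8 byte length of each spec file and fail on oversize.
const MAX_SPEC_BYTES = 64 * 1024;

function specTooLarge(markdown: string): boolean {
  return new TextEncoder().encode(markdown).length > MAX_SPEC_BYTES;
}

// In CI you would read each spec/*.md file and run this check on it.
```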
With UI tokens in place the SpecBundle becomes truly portable: clone the repo, run pnpm docs:dev, and contributors see colours and fonts identical to production. Clear visuals reinforce clear contracts—the last piece of our specification puzzle.