---
title: "What Is a Harness?"
subtitle: "A visual guide to the infrastructure that turns language models into agents."
date: "2026-04-22"
tags: ["blog", "harness-design", "fundamentals"]
---

An LLM, on its own, is a function. You hand it text. It hands you back text.

That's not what people mean when they say "AI agent." An agent books flights. It reads files, runs tests, opens pull requests, drives a browser. It does things.

So how do you get from the first to the second? That's what this post is about.

---

## Part 1: Building the harness

### The model, alone

The obvious place to start is to hand the task straight to the model and see how far that gets.

<!-- component:StepModelOnly -->
**[StepModelOnly component]**

Demo panel labelled **"model only"**. Shows a single inference step with no tools and no loop. The model receives a prompt and replies with text only.

Internal trace shown to the reader:

- *Thinking*: "The user wants me to book flights. I don't have any tools for that — I can only reply with text."
- *Final reply*: "I can help you think through it, but I can't actually book flights myself. From Copenhagen (CPH) to Amsterdam (AMS) you'll typically find KLM, SAS, and easyJet operating this route…"

Code shown alongside the trace:

```ts
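import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: userInput },
];

const reply = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages,
});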

console.log(reply.content);
```
<!-- /component:StepModelOnly -->

The model knows which airlines fly the route and can ballpark prices. What it can't do is find specific flights or book anything. There's no airline API on the other side of the text box. There's just more text coming back. The request arrives, a reply goes out, and nothing in the real world moves.

For that to change, the model needs a way to trigger code outside the text box.

### Giving it tools

The fix is to let the model ask for help. We describe a few things it's allowed to do (searching flights, booking them) and pass those descriptions in with the user's message. Each description has a name, a short explanation, and the shape of the arguments it takes. From the model's point of view, they're just options it can pick.
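
To make that concrete, here is a rough sketch of what one description might look like. The `name`, `description`, and `input_schema` fields match the code later in this post. The schema body and the `ToolDescription` type are assumptions added for illustration, not a definitive format.

```ts
// Hypothetical type for a tool description; the field layout is assumed.
interface ToolDescription {
  name: string;
  description: string;
  input_schema: {
    type: 'object';
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// The argument descriptions below are illustrative, not from a real airline API.
const searchFlights: ToolDescription = {
  name: 'search_flights',
  description: 'Find flights between two airports',
  input_schema: {
    type: 'object',
    properties: {
      from: { type: 'string', description: 'Departure airport code, e.g. "CPH"' },
      to: { type: 'string', description: 'Arrival airport code, e.g. "AMS"' },
      date: { type: 'string', description: 'Departure date as YYYY-MM-DD' },
    },
    required: ['from', 'to', 'date'],
  },
};

console.log(Object.keys(searchFlights.input_schema.properties));
```

The model never sees the function itself, only this description, so the wording of each `description` field is most of what it has to go on when choosing arguments.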

There's no function-calling magic here. The model still only does one thing: generate text. What changes is what we do with that text. We tell the model: if you want one of these tools, write it out in this exact format.

```
<search_flights>
{
  "from": "CPH",
  "to": "AMS",
  "date": "2026-05-24"
}
</search_flights>
```

As the reply streams back, we watch for those tags. The moment we see a `<search_flights>` opening, we stop treating the output as conversation and start treating it as a request. We parse the JSON, call the real `search_flights` function, and carry on. The model wrote text in a particular shape. Our code noticed the shape and did something with it. That's all tool use really is.<!-- component:Footnote -->
**[Footnote component]**

Inline footnote marker.
<!-- /component:Footnote -->
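
The watch, parse, and dispatch step described above might look roughly like this. It's a sketch under assumptions: `parseToolRequest`, the `registry` object, and the fake `search_flights` handler are hypothetical names, and it works on a complete string rather than a live stream.

```ts
// A parsed tool request: which tool, and with what arguments.
interface ToolRequest {
  name: string;
  args: Record<string, unknown>;
}

// Checks each known tool name in turn for a complete <name>...</name> pair.
// Returns null when the text contains no complete tool request.
// JSON.parse throws on a malformed body; a real harness would catch that.
function parseToolRequest(text: string, toolNames: string[]): ToolRequest | null {
  for (const name of toolNames) {
    const open = `<${name}>`;
    const close = `</${name}>`;
    const start = text.indexOf(open);
    if (start === -1) continue;
    const end = text.indexOf(close, start + open.length);
    if (end === -1) continue; // closing tag has not arrived yet
    const body = text.slice(start + open.length, end);
    return { name, args: JSON.parse(body) as Record<string, unknown> };
  }
  return null;
}

// Hypothetical registry mapping tool names to the code that really runs.
const registry: Record<string, (args: Record<string, unknown>) => string> = {
  search_flights: (args) =>
    `searching ${String(args['from'])} -> ${String(args['to'])} on ${String(args['date'])}`,
};

const modelOutput = [
  'Let me look that up.',
  '<search_flights>',
  '{ "from": "CPH", "to": "AMS", "date": "2026-05-24" }',
  '</search_flights>',
].join('\n');

const request = parseToolRequest(modelOutput, Object.keys(registry));
const handler = request ? registry[request.name] : undefined;
const toolOutput = request && handler ? handler(request.args) : 'no tool requested';
console.log(toolOutput);
```

Plain conversation and half-finished tags both come back as `null`, so the caller can keep treating the text as ordinary output until a full request shows up.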

<!-- component:StepWithTools -->
**[StepWithTools component]**

Demo panel labelled **"model + tools"**. Same task as the previous step but the model now has tools (`search_flights`, `book_flight`).

Internal trace shown to the reader:

- *Thinking*: "I should use my search_flights tool to find flights from CPH to AMS on 2026-05-24."
- *Tool call*: `search_flights({ from: "CPH", to: "AMS", date: "2026-05-24" })`
- *Tool result*:
  ```
  KL1124  08:05 → 09:30  €132
  SK1551  11:20 → 12:45  €148
  U21802  14:40 → 16:05  €119
  ```

Code shown as a diff from the previous step:

```ts
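import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const tools = [
  {
    name: 'search_flights',
    description: 'Find flights between two airports',
    input_schema: { /* from, to, date */ },
  },
  { name: 'book_flight', /* ... */ },
];

const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: userInput },
];

const reply = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  tools,
  messages,
});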

if (reply.stop_reason === 'tool_use') {
  const call = reply.content.find(c => c.type === 'tool_use');
  const result = await runTool(call.name, call.input);
  console.log(result);
}
```
<!-- /component:StepWithTools -->

The model picked the right tool, filled in the arguments, and we got flight data back. Open the **Code** tab to see how little code this takes.

Then it stops. The search ran, but nobody passed the result back to the model, so it never got to react to what it found. The return flight never got looked up. Nothing got booked.

### The agent loop

What's missing is a way to keep going. Once a tool runs, the model has to see the result and decide what to do next. And it has to be able to do that as many times as the task takes.<!-- component:Footnote -->
**[Footnote component]**

Inline footnote marker.
<!-- /component:Footnote -->

The fix is small. Wrap the whole thing in a loop. Call the model. If it asked for a tool, run the tool, append the result to the conversation, and call the model again. Exit when it stops asking for tools and gives a normal reply.

<!-- component:StepAgentLoop -->
**[StepAgentLoop component]**

Demo panel labelled **"model + tools + loop"**. Same tools as the previous step, but the harness now loops: tool result → model → tool call → tool result, until the model emits a final reply.

Internal trace shown to the reader:

- *Thinking*: "First let me search for the outbound flight from Copenhagen to Amsterdam."
- *Tool call*: `search_flights({ from: "CPH", to: "AMS", date: "2026-05-24" })`
- *Tool result*:
  ```
  KL1124  08:05 → 09:30  €132
  SK1551  11:20 → 12:45  €148
  U21802  14:40 → 16:05  €119
  ```
- *Thinking*: "Good, I have the outbound options. Now I need the return on 2026-05-30."
- *Tool call*: `search_flights({ from: "AMS", to: "CPH", date: "2026-05-30" })`
- *Tool result*:
  ```
  KL1133  17:15 → 18:40  €141
  SK1558  19:55 → 21:20  €156
  ```
- *Thinking*: "I have both legs. I'll suggest the cheapest KLM pairing and ask before booking."
- *Final reply*: "Found two solid options:

  - Outbound — KL1124, 24/05 at 08:05 (€132)
  - Return — KL1133, 30/05 at 17:15 (€141)

  Total €273. Want me to book them?"

Code shown as a diff from the previous step:

```ts
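import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const tools = [
  {
    name: 'search_flights',
    description: 'Find flights between two airports',
    input_schema: { /* from, to, date */ },
  },
  { name: 'book_flight', /* ... */ },
];

const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: userInput },
];

while (true) {
  const reply = await anthropic.messages.create({
    model: 'claude-opus-4-7',
    max_tokens: 1024, // required by the Messages API
    tools,
    messages,
  });

  messages.push({ role: 'assistant', content: reply.content });

  // Exit on anything other than a tool request (a normal reply, a token limit, ...).
  if (reply.stop_reason !== 'tool_use') break;

  // Run every requested tool, then send all results back in one user message.
  const results: Anthropic.ToolResultBlockParam[] = [];
  for (const call of reply.content.filter(c => c.type === 'tool_use')) {
    const result = await runTool(call.name, call.input);
    results.push({ type: 'tool_result', tool_use_id: call.id, content: result });
  }
  messages.push({ role: 'user', content: results });
}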
```
<!-- /component:StepAgentLoop -->

Same model, same tools. The only thing we added was a `while` and an append. That loop is the agent.
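
If you want to watch the loop turn without an API key, here is a sketch that swaps the model for a scripted stand-in. Everything in it is hypothetical: `fakeModel`, `searchCall`, and this `runTool` are made up for the demo, and the calls are synchronous for brevity where a real model call would be awaited.

```ts
type Block =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: Record<string, string> }
  | { type: 'tool_result'; tool_use_id: string; content: string };

interface Message {
  role: 'user' | 'assistant';
  content: Block[];
}

interface Reply {
  stop_reason: 'tool_use' | 'end_turn';
  content: Block[];
}

// Build a scripted search_flights request.
function searchCall(id: string, from: string, to: string, date: string): Reply {
  return {
    stop_reason: 'tool_use',
    content: [{ type: 'tool_use', id, name: 'search_flights', input: { from, to, date } }],
  };
}

// Scripted stand-in for the model: it picks its next move from how many
// tool results are already in the transcript.
function fakeModel(messages: Message[]): Reply {
  const resultsSeen = messages.filter((m) =>
    m.content.some((b) => b.type === 'tool_result'),
  ).length;
  if (resultsSeen === 0) return searchCall('call_1', 'CPH', 'AMS', '2026-05-24');
  if (resultsSeen === 1) return searchCall('call_2', 'AMS', 'CPH', '2026-05-30');
  return {
    stop_reason: 'end_turn',
    content: [{ type: 'text', text: 'Found both legs. Want me to book them?' }],
  };
}

// Stand-in for real tool code.
function runTool(name: string, input: Record<string, string>): string {
  return `${name}: ${input['from']} -> ${input['to']} on ${input['date']}`;
}

function runAgent(userInput: string): Message[] {
  const messages: Message[] = [
    { role: 'user', content: [{ type: 'text', text: userInput }] },
  ];
  while (true) {
    const reply = fakeModel(messages);
    messages.push({ role: 'assistant', content: reply.content });
    if (reply.stop_reason !== 'tool_use') break;

    // Run each requested tool and hand all results back in one user message.
    const results: Block[] = [];
    for (const block of reply.content) {
      if (block.type === 'tool_use') {
        results.push({
          type: 'tool_result',
          tool_use_id: block.id,
          content: runTool(block.name, block.input),
        });
      }
    }
    messages.push({ role: 'user', content: results });
  }
  return messages;
}

const transcript = runAgent('Book me a round trip from Copenhagen to Amsterdam.');
console.log(transcript.map((m) => m.role).join(' > '));
```

The returned transcript alternates user and assistant turns, with each tool result riding back to the model inside a user message.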

You could stop here and ship something useful. The rest of this post is about what goes on inside that growing conversation. Once the loop works, most of the hard problems in agent design are about the list of messages you keep handing back to the model.

---

## Part 2: Looking inside the harness

### What the model sees on every call

The model is stateless. Each call starts from nothing. If you want it to know what happened on the previous turn, you include that turn in the next call. Same for the turn before, and the one before that. What looks like memory is just us re-sending the whole history every time.
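
A toy sketch makes the point. `statelessModel` below is a hypothetical stand-in, not a real model: it can only answer from the messages it is handed on that one call, which is the property that matters here.

```ts
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Toy stand-in for a stateless model: it can only "know" what appears
// in the messages handed to it on this particular call.
function statelessModel(messages: ChatMessage[]): string {
  const visible = messages.map((m) => m.content).join(' ');
  return visible.includes('Copenhagen')
    ? 'You are flying from Copenhagen.'
    : 'I do not know where you are flying from.';
}

const pastTurns: ChatMessage[] = [
  { role: 'user', content: 'I am flying from Copenhagen.' },
  { role: 'assistant', content: 'Noted.' },
];
const followUp: ChatMessage = { role: 'user', content: 'Where am I flying from?' };

// Replaying the earlier turns is what makes the follow-up answerable.
const withHistory = statelessModel([...pastTurns, followUp]);
const withoutHistory = statelessModel([followUp]);

console.log(withHistory);
console.log(withoutHistory);
```

Nothing carries over between the two calls. The only difference is what was in the array.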

What we've been loosely calling "the conversation" is just a `messages` array we keep in a variable and replay. Below is the full transcript of the flight-booking task. Send a follow-up to see how it grows when the loop keeps going.

<!-- component:ContextStack -->
**[ContextStack component]**

Animated visualization of the model's context window during a flight-booking agent run. Shows the messages stacked vertically (system / user / assistant / tool-result), with a chat input at the bottom that lets the reader send a follow-up.

Initial messages already in context:

1. **System prompt**.
2. **User**: the original flight-booking request.
3. **Assistant**: thinking — "The user wants flights in both directions. Let me start with the outbound from Copenhagen to Amsterdam on May 24th." Followed by tool call `search_flights({ from: "CPH", to: "AMS", date: "2026-05-24" })`.
4. **Tool result**:
   ```
   3 flights found
   KL1124   KLM       08:05 → 09:30   €132   direct
   SK1551   SAS       11:20 → 12:45   €148   direct
   U21802   easyJet   14:40 → 16:05   €119   direct
   ```
5. **Assistant**: thinking — "Got the outbound options. Now the return from Amsterdam on May 30th." Followed by tool call `search_flights({ from: "AMS", to: "CPH", date: "2026-05-30" })`.
6. **Tool result**:
   ```
   2 flights found
   KL1133   KLM   17:15 → 18:40   €141   direct
   SK1558   SAS   19:55 → 21:20   €156   direct
   ```
7. **Assistant**: "Found two solid options: Outbound — KL1124 / Return — KL1133. Total €273. Want me to book them?"

When the reader clicks send, follow-up messages stream in:

- **User**: "Yes, book them both"
- **Assistant**: thinking — "Confirmed. Booking the outbound leg first." Followed by tool call `book_flight({ flight_id: "KL1124", date: "2026-05-24" })`, etc.

The whole point of the visualization: every previous message stays in context. Nothing is dropped.
<!-- /component:ContextStack -->

The system prompt sits at the top of the context. (In Anthropic's Messages API it is passed as a separate `system` parameter rather than as an entry in `messages`.) It's where you tell the model what it's for, what tools it has, and how to behave. Models are trained to weight what's in the system prompt above what users say, so this is also your main lever for steering behavior. When a user tries to jailbreak their way past the rules, the system prompt usually wins. Not always, but usually. It stays at the top of every call, unchanged. Everything below it (user messages, assistant replies, tool calls, tool results) just piles up. Nothing gets removed automatically.

That pile is the `messages` array from the loop we wrote in Part 1. It's the only memory the agent has. Whatever is in there is what the model can reason over on the next call.

### The cost of remembering

Because the full transcript goes out on every call, the input the model has to read grows with every turn.

<!-- component:ContextGrowthChart -->
**[ContextGrowthChart component]**

Bar chart titled **"context size per LLM call (tokens)"** showing how context grows across a 6-call agent run:

| Call    | Tokens |
| ------- | ------ |
| Call 1  | 180    |
| Call 2  | 520    |
| Call 3  | 880    |
| Call 4  | 1,640  |
| Call 5  | 2,900  |
| Call 6  | 5,100  |

Footnote shown below the chart: "Every tool result sticks around. Every assistant reply sticks around. By the sixth call the model is re-reading everything it has already done — which is why managing this context is the real engineering problem."
<!-- /component:ContextGrowthChart -->
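
The chart shows the size of each call, but the model reads every one of those inputs, so the total for the run is their sum. A quick sketch of that arithmetic, using the numbers from the chart:

```ts
// Context size per call, in tokens, taken from the chart above.
const tokensPerCall = [180, 520, 880, 1640, 2900, 5100];

// Total input tokens the model reads across the whole run.
const totalRead = tokensPerCall.reduce((sum, n) => sum + n, 0);

// The single largest call, for comparison.
const largestCall = Math.max(...tokensPerCall);

console.log(`total input tokens across the run: ${totalRead}`); // 11220
console.log(`largest single call: ${largestCall}`); // 5100
```

Across six calls the model reads 11,220 tokens of input, about 2.2 times the size of the final call alone.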

Every noisy tool call keeps costing you. Think of a test runner that dumps 200 lines of output, or a file read that returns a 4,000-line source file. That output sits in the transcript for the rest of the task and gets re-read on every call after it first appears. One careless tool call at step 2 can double the input the model has to wade through at step 10.
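
One simple way to keep that noise out is to clip long output before it is appended to the transcript. The sketch below is one possible approach, not a recommendation: `clipToolOutput` and its 40-line default are hypothetical, and the right cutoff (or whether to clip at all) depends on the tool.

```ts
// Hypothetical helper: keep the first and last few lines of a long tool
// output and replace the middle with a one-line marker.
function clipToolOutput(output: string, maxLines: number = 40): string {
  const lines = output.split('\n');
  if (lines.length <= maxLines) return output;

  const head = Math.ceil(maxLines / 2);
  const tail = maxLines - head;
  const omitted = lines.length - maxLines;

  return lines
    .slice(0, head)
    .concat([`[... ${omitted} lines omitted ...]`], lines.slice(lines.length - tail))
    .join('\n');
}

// A fake 200-line test-runner dump.
const noisy = Array.from({ length: 200 }, (_, i) => `line ${i + 1}`).join('\n');
const clipped = clipToolOutput(noisy, 10);

console.log(clipped);
```

The marker line tells the model that something was cut, so it can ask for the rest if it turns out to matter.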

That's where most of the design work happens. Once the loop works, what's left is figuring out what goes into the transcript and what stays out. Compaction, sub-agents, skills, memory hierarchies. All different answers to the same question: **what belongs in the context, and what doesn't?**

---

## Where this leads

That's the whole thing. An agent is:

- A model
- A set of tools the model can call
- A loop that keeps calling the model with the growing transcript

Everything else (system prompts, sub-agents, context compaction, skills, permission gates, verification loops, memory hierarchies, etc.) is a refinement of one of those three pieces.

Reshaping the tools starts with which tools you hand the agent, how many, and how well they're designed. Too few and the agent can't do the job. Too many and it gets lost picking between them. A tool that dumps too much output, takes too many arguments, or just does the wrong thing will wreck a run on its own. The rest of the harness can't save you from it.

Reshaping the loop usually means injecting something between model calls: a verifier, a compaction pass, a handoff to a sub-agent.

When you read about a new agent framework or a clever prompt technique, you can usually locate it on this map: is it poking at the tools, the loop, or the transcript?

That's the map. Most of what follows in this series is just working out the details.

---

*This is the first in a series on harness design. Next post: experiments on progressive disclosure. How agents best navigate large context, and what shape that context should take. Reach out at [noah@schenktechnology.com](mailto:noah@schenktechnology.com) if you're building agents and want to compare notes.*