AI Agents Need Context, Not Just Prompts
What back-to-back drawing experiments, grounding, and the curse of knowledge teach us about prompts, specifications, tool definitions, and agent handoffs.
Two people sit back to back.
One has a picture. The other has a blank sheet of paper. The job sounds simple enough: describe the picture so the other person can draw it.
After thirty seconds, the simple task is already not simple.
“Draw a circle near the top.”
How near?
“Put a line under it.”
Straight down? Centered? Touching the circle?
“Now add a little shape on the left.”
Little compared to what? Left from whose point of view? Is it inside the circle, below it, next to it?
I like this setup because it is almost insultingly small, and still it contains the whole problem of working with large language models. The model is the person with the blank sheet. It has language, world knowledge, and a frightening ability to keep going. The picture in your head is the missing part.
The hallway conversation is missing. So is the staging failure from last Tuesday. So is the reason the weird code path exists: a customer integration depends on it, and nobody wants to reopen that very expensive wound.
The model starts with zero private context.
That distinction matters. An LLM starts with broad statistical knowledge and whatever you put into the context window. But for your exact problem, your exact company, your exact system, your exact “obvious” exception, it is still sitting on the other side of the chair, waiting for a picture it cannot see.
We keep calling this prompting. That word is too small.
The real work is context transfer under uncertainty.
Humans Need Grounding Too
Psychologists have studied this kind of problem for decades. Herbert Clark and Deanna Wilkes-Gibbs used tangram figures to study how people refer to strange shapes together. One participant described abstract figures. The other tried to arrange matching figures in the same order. The useful part was how much negotiation happened before success: false starts, repairs, confirmations, shorter names after repeated use, and tiny agreements that only made sense to that pair. Their paper, Referring as a Collaborative Process, is a good antidote to the idea that meaning is simply packed into words and shipped across the room.
Meaning is negotiated until both people have enough shared understanding to continue. Annoying, yes. Also how communication works.
Clark and Susan Brennan later called this process grounding in communication. Grounding is the process by which participants establish that they understand each other well enough for the current purpose.
Well enough is the part that software teams should underline.
If I ask you to pass the salt, we do not need a shared ontology of tableware. If I ask an agent to migrate billing data, “well enough” gets expensive quickly.
The HCRC Map Task Corpus made this visible in another form. One person had a route on a map. Another had a similar map and had to reproduce the route from spoken instructions. The researchers varied things like familiarity between speakers, eye contact, and whether the landmarks on both maps matched.
That last detail is where engineering teams should start sweating.
Communication becomes harder when both sides think they are looking at the same world, but the worlds differ just enough. The map looks shared until the mismatch matters. Words play the same trick.
“User” means one thing to product, another to auth, and a third thing to billing. “Account” can mean a legal entity, a login, a tenant, or a table that should have been renamed in 2017. “Safe” can mean no data loss, no downtime, no visible UI change, no compliance issue, or no angry message from the one enterprise customer whose internal workflow somehow became your architecture.
That makes it a grounding problem, not a vocabulary cleanup exercise.
The Curse of Knowledge Is in Every Ticket
There is another reason this is hard: once you know something, it becomes difficult to model not knowing it.
Colin Camerer, George Loewenstein, and Martin Weber called this the curse of knowledge. Their point was not just that informed people know more. It was that informed people struggle to ignore their private information when predicting how less-informed people will judge a situation. More information can make you worse at explaining, because the missing steps stop feeling missing.
In communication research, Fussell and Krauss found a related pattern. Speakers do take the listener’s knowledge into account, but their estimates are biased toward what they themselves know.
That is almost every Jira ticket I have ever read.
The person writing the ticket can hear the melody. The person reading the ticket gets taps on a table.
Now replace the reader with an LLM. The model is very good at making the taps sound like music again. That is useful. It is also where the danger starts, because fluent reconstruction feels like understanding from the outside.
The model fills in the blank page with confident strokes. Some of those strokes came from your instructions. Some came from general training data. Some came from local files. Some came from the model choosing the most plausible continuation because the real missing context was never provided.
From the outside, all of these strokes look equally smooth.
Common Ground Is Built Locally
Humans solve this less by perfect mind reading than by building local agreements.
Susan Brennan and Herbert Clark studied how people settle on terms for objects during conversation. In their work on conceptual pacts and lexical choice, speakers and listeners developed partner-specific ways to refer to things. Once they had agreed on a description, they tended to keep using it with that partner, even if another description might have been clearer to a stranger.
This is why teams develop strange internal words. They are not always beautiful. Usually they are not. But they carry history.
That word in your codebase that makes no sense to a new hire may be a small fossil of a production incident, a product pivot, or an acquisition nobody had the energy to fully integrate. The term works because the team has a pact around it.
An LLM has to reconstruct that pact from whatever evidence you give it.
It may infer it from code. It may infer it from documentation. It may infer it from commit messages if the gods are feeling generous. But inference is not the same as shared history. When the model guesses the pact correctly, it looks smart. When it guesses wrong, it may still look smart for a while.
That is the annoying part. Bad context transfer rarely fails with a syntax error. It fails as plausible work on the wrong picture.
Even Humans Leak Their Own Perspective
The neat comparison would be: humans have theory of mind, models lack it, so humans transfer context and models need prompts. I do not think it is that clean.
Humans also fail at perspective taking. In experiments by Boaz Keysar and colleagues, listeners sometimes considered objects the speaker could not see when following instructions. The shared perspective should have ruled those objects out. The listeners still leaked their own view into interpretation and corrected later.
That matters because it stops us from treating human communication as the gold standard and AI communication as the broken copy. The human version is already messy. We rely on feedback, repair, confirmation, repeated exposure, and shared artifacts.
The difference with LLMs is that they remove some of the friction that used to reveal missing context.
A human engineer confronted with a vague ticket may ask three questions, open the code, grumble, and slow down. That delay is useful. It forces ambiguity into the open. An agent can produce a patch before anyone has noticed that “make the import flow safer” meant four different things.
Machine speed turns vague intent into concrete output faster than our grounding process can catch up.
That is why prompt quality feels more important now. The old problem did not disappear. The buffer got smaller.
A Prompt Is the First Move, Not the Whole Game
The usual prompt advice is still useful: be specific, provide context, show examples, define the output format, separate instructions from source material. OpenAI’s own prompting guidance says roughly this in practical terms: clear instructions, relevant context, examples, and explicit output expectations help models produce better results.
But the psychology research gives us a stronger frame.
A prompt is not a command.
A prompt is the first contribution in a grounding process.
If the task is small, one contribution may be enough. “Summarize this paragraph in three bullets” does not need a long ritual. The picture is in the paragraph. The success criteria are cheap.
For agentic work, one contribution is often not enough. The model needs to inspect files, restate assumptions, ask or infer what matters, propose a plan, make a change, run checks, and report evidence. Each step either improves common ground or hides a drift.
This is where I think many teams take the wrong lesson from better models. Stronger models reduce the cost of missing context: they can infer more, recover from worse instructions, and read more files. Good. I want all of that. But a lower cost does not make missing context disappear as a category.
Still, the model cannot know which private constraints are sacred unless the system exposes them somehow.
“Refactor this service” is not a task. It is a dare.
“Refactor InvoiceImportService so parsing and persistence are separate. Keep the current CSV quirks because customer exports depend on them. Do not change database schema. Existing import tests must pass. Add coverage for blank optional fields and duplicate invoice IDs. Stop and explain before touching the retry logic.”
That is already a different conversation. Not perfect, but it gives the person with the blank sheet some edges.
Specifications Are Grounding Artifacts
This is the practical conclusion: specifications are not paperwork. Good specs are shared context made durable.
A useful spec does several jobs at once.
It defines the target picture, names the private context, separates goals from non-goals, fixes local meanings for overloaded words, and gives examples and counterexamples. It also states what must stay invariant across valid solutions, and where the agent may explore or must stay boring.
That sounds heavier than “write a prompt,” because it is. But it is not bureaucracy if the alternative is three correction loops and a human trying to explain why a plausible patch is subtly wrong.
For LLMs and agents, I want specs to include at least these pieces when the task matters:
Goal - What should be true after the work?
Context - What local history, architecture, domain language, or business constraint would a new teammate miss?
Scope - Which files, APIs, modules, or user flows are in play?
Non-goals - Which tempting changes should stay out?
Terms - Which words have local meanings?
Examples - What does good look like?
Counterexamples - What would look plausible but be wrong?
Invariants - What must stay true even if the implementation changes?
Checks - Which tests, commands, screenshots, logs, or review steps prove the work?
Stop conditions - When should the agent pause instead of guessing?
That last one matters more than people think. Humans interrupt themselves when reality no longer matches the plan. Agents need explicit permission, and sometimes explicit obligation, to stop.
Here is the shape I like for a serious agent brief:
## Goal
Change the invoice import flow so malformed optional fields are reported per row
instead of aborting the whole file.
## Local Context
Enterprise customers upload CSV exports from three ERP systems. Empty optional
fields are common and must not fail the import. Duplicate invoice IDs must still
fail because downstream reconciliation depends on uniqueness.
## Scope
Work in `billing/import`. Do not change the database schema or public API
response shape.
## Non-Goals
Do not redesign retry behavior. Do not replace the CSV parser. Do not add async
processing.
## Acceptance Criteria
- Blank optional fields produce row warnings and continue.
- Duplicate invoice IDs still fail the import.
- Existing import tests pass.
- Add tests for blank optional fields and mixed valid/invalid rows.
## Stop Conditions
Stop and explain if the parser cannot distinguish blank optional fields from
missing required fields with the current data model.

This is not poetry. Good.
The goal is not to impress the agent. The goal is to transfer the picture.
Agents Make the Blank Page Operational
The back-to-back drawing task is about communication. Agents add action.
They do not just draw; they act. They can edit files, call APIs, run migrations, send emails, update tickets, trigger workflows, and use tools whose side effects are very real. A misunderstanding is no longer just a bad sketch. It can become a bad pull request, a bad database change, or a bad message to a customer. Efficient, at least.
That means every agent boundary needs grounding too.
Anthropic’s guidance on building effective agents makes a useful point about tools: tool definitions deserve the same care as prompts. That is exactly right. A tool is a sentence with permissions attached.
If a tool is vague, the agent has to infer what it does, when to use it, what inputs are safe, and what failure means. That is a lot of invisible context around an operation that may mutate the world.
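Here is a minimal sketch of what a grounded tool definition can look like, written in Python. The structure loosely follows the JSON-style tool definitions most current LLM APIs accept; the tool name, fields, and wording are invented for illustration.

# A hypothetical tool definition, written with the same care as a prompt.
# The shape (name, description, input schema) mirrors the JSON-style tool
# definitions current LLM APIs accept; every detail here is illustrative.
run_invoice_dry_run = {
    "name": "run_invoice_import_dry_run",
    "description": (
        "Parse an uploaded CSV through the invoice import pipeline without "
        "persisting anything. Use this to validate a file before a real import. "
        "Safe to call repeatedly; it does not send notifications or mutate "
        "billing data."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Path to the CSV inside the current workspace.",
            },
            "max_rows": {
                "type": "integer",
                "description": "Stop after this many rows to keep dry runs cheap.",
            },
        },
        "required": ["file_path"],
    },
}

Most of that description is plain context transfer: what the tool does, when to reach for it, and what it will not touch.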
For agent systems, I want three layers of grounding:
Task grounding - What are we trying to do, and what does done mean?
Domain grounding - What local concepts, constraints, and history shape the work?
Tool grounding - What actions are available, what do they do, what are their limits, and when should the agent avoid them?
In a multi-agent pipeline, there is a fourth layer:
Handoff grounding - What exactly does Agent A pass to Agent B, and how does Agent B know whether the input is complete?
Loose conversational handoffs are where drift compounds. Agent A interprets the vague request one way. Agent B treats Agent A’s output as ground truth. Agent C optimizes the result. By the time a human sees the final artifact, the original misunderstanding has three layers of polish on it.
This is why schemas, acceptance criteria, tests, and validators matter more in agent pipelines than in casual chat. They are not process theater. They are ways to keep the picture from mutating as it crosses the room.
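As a sketch, assume a pipeline where a triage agent hands work to an implementation agent. The handoff can be a small typed object with an explicit completeness check instead of a free-form paragraph; the field names here are illustrative, not a standard.

# Hypothetical handoff contract between a triage agent and an implementation
# agent. The point is not this exact shape; it is that the handoff is a
# checked structure, not prose that quietly loses context at each hop.
from dataclasses import dataclass, field

@dataclass
class TaskHandoff:
    goal: str                     # what should be true after the work
    in_scope_paths: list[str]     # files or modules the next agent may touch
    hard_constraints: list[str]   # invariants that must survive any valid solution
    open_questions: list[str]     # ambiguities the next agent must not guess at
    evidence: list[str] = field(default_factory=list)  # tickets, logs, test names

def validate_handoff(handoff: TaskHandoff) -> list[str]:
    """Return a list of problems; an empty list means the handoff is usable."""
    problems = []
    if not handoff.goal.strip():
        problems.append("goal is empty")
    if not handoff.in_scope_paths:
        problems.append("no scope given; the next agent will invent one")
    if handoff.open_questions:
        problems.append("open questions remain; resolve or escalate before handing off")
    return problems

A check like this will not catch a wrong goal, but it catches the cheap failures: the handoff that never said what was in scope, or that carried unresolved questions across the boundary as if they were settled.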
Context Is Not a Blob
There is also a trap in the phrase “give it more context.”
More context can still be the wrong context. A bigger pile of documents can make the model work harder while still missing the one fact that matters. Humans also do this. We forward a thread with eighteen messages and say “see below.” Somewhere inside is the reason the project is blocked. A classic enterprise treasure hunt, only without the treasure.
Context has structure.
Some context defines vocabulary. Some defines constraints. Some defines current state. Some defines taste. Some defines trust boundaries. Some defines what went wrong last time. Some is just noise that happened to be nearby.
For LLMs, context quality matters because the model has to decide what is relevant. If you give it a repository, a design doc, three tickets, and a vague goal, you mostly handed over a sorting problem.
The better pattern is to make the role of each context block explicit:
“These are hard constraints.”
“These are examples of the desired style.”
“These are historical reasons, not current requirements.”
“This file is authoritative.”
“This ticket is background only.”
“These tests define the behavior.”
This is the written equivalent of turning around, pointing at the picture, and saying: this line matters, this shadow does not.
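A minimal sketch of the same idea in code, assuming you assemble the agent's context programmatically. The labels, helper, file paths, and ticket number are all invented for illustration; the point is that each block arrives with its role attached.

# Hypothetical context assembly: each block is labeled with its role, so the
# model does not have to guess what is binding, what is style, and what is noise.
def labeled_block(role: str, body: str) -> str:
    # Wrap a piece of context in an explicit role marker.
    return f"[{role}]\n{body.strip()}\n[/{role}]"

context = "\n\n".join([
    labeled_block("HARD_CONSTRAINTS",
                  "Do not change the database schema or the public API response shape."),
    labeled_block("AUTHORITATIVE_SOURCE",
                  "billing/import/importer.py defines the current import behavior."),
    labeled_block("STYLE_EXAMPLE",
                  "Match the error-reporting style used in billing/export."),
    labeled_block("BACKGROUND_ONLY",
                  "Ticket BILL-2107 explains why the retry logic exists. Do not act on it."),
    labeled_block("CHECKS",
                  "The existing import tests must pass; add tests for blank optional fields."),
])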
Ask for the Reconstruction
One of the simplest habits is also one of the most useful: ask the model to describe the picture back before it draws.
For small tasks, this may be overkill. For risky work, it is cheap insurance.
Before implementation, ask for:
the model’s understanding of the goal
assumptions it is making
ambiguous terms it noticed
files or systems it believes are in scope
constraints it thinks are hard
tests or checks it plans to use
cases where it should stop
The point is not to make the model perform thoughtfulness for our comfort. It is a grounding check. You are looking for mismatches while they are still sentences, not after they become code.
The same applies after the work. A good agent report should not only say “done.” It should say what changed, why it believes the change satisfies the goal, what evidence it collected, and what risk remains. That report is part of the shared context for the next step.
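One way to make that report durable is to give it a fixed shape instead of trusting a free-form "done" message. A sketch, with illustrative field names:

# Hypothetical shape for an agent's completion report. "Done" is not a field;
# the report has to carry the evidence and the remaining risk to the next step.
from dataclasses import dataclass

@dataclass
class AgentReport:
    what_changed: list[str]        # files touched, behavior changed
    why_it_satisfies_goal: str     # the agent's own argument, in plain language
    evidence: list[str]            # test runs, commands, logs, screenshots
    remaining_risk: list[str]      # untested paths, assumptions, known gaps
    stopped_early: bool = False    # True if a stop condition was hit instead

Whether the shape is a dataclass, a JSON object, or a pull request template matters less than the fact that the next reader gets the same fields every time.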
In human teams, this is why good engineers narrate trade-offs in pull requests. The diff shows what changed. The explanation shows which picture they were drawing.
The Real Skill Is Seeing the Gap
I doubt the future belongs to people who can write magical prompts. That phrase already feels tired.
The useful skill is noticing the gap between your mental picture and the agent’s blank page.
Sometimes the gap is small. Give the model a paragraph and ask for a shorter paragraph. Fine.
Sometimes the gap is huge. Ask an agent to modify a production workflow, and the missing context includes domain rules, failure history, naming conventions, customer promises, deployment constraints, test reliability, and the quiet preferences of the team that has to maintain the result.
The back-to-back drawing experiment is useful because it removes the mysticism. We already know this problem. We have lived it with teammates, product managers, customers, consultants, and our own future selves reading code from six months ago.
LLMs did not invent the context transfer problem.
They made it faster.
That speed is valuable. I want agents that can inspect, draft, refactor, test, and explain. But the better they get at drawing, the more important it becomes to give them the right picture, or at least to make them show their sketch early enough that we can correct it.
When the output is wrong, the first question should not be “why was the model stupid?”
Often the better question is more uncomfortable:
What part of the picture did I assume had already crossed the gap?


