AI Agent Reliability Starts With Better Task Framing
Why clear goals, reusable workflows, and stop conditions reduce drift and review risk when teams use IBM Bob and similar tools.
I did not expect parenting teenagers to make me better at prompting AI coding agents. But after spending enough time with two teenagers at home and IBM Bob in real codebases, the pattern became hard to ignore. The worst interactions happen when I confuse control with clarity. The better ones happen when I explain the goal, define the boundaries, provide context, and give enough room for the other side to act intelligently.
That is true with teenagers. It is also true with AI coding tools.
This is not because large language models are children. They are not. They do not have emotions, intentions, self-awareness, or lived experience. But they do process language through attention, association, and context. That means some of the same communication patterns that help humans understand expectations also help LLM-based tools produce better work.
The important lesson is simple: “Don’t do X” is often weaker than “Do Y, in this context, for this reason, with these constraints.”
That sounds like parenting advice. It is also good prompt engineering.
Oh, before you leave because this seems to be only about IBM Bob and smells like a well-crafted product pitch: this pattern is not specific to IBM Bob. The same principle applies to Cursor, GitHub Copilot, Claude Code, Gemini CLI, OpenAI Codex-style agents, and any other AI coding assistant that works from natural-language intent, repository context, tool access, and generated code changes. The product names, integrations, and workflows differ, but the underlying interaction model is similar enough to matter: the agent can only act on the context it can see, the instructions it can follow, and the constraints you make explicit.
The problem with “don’t”
Every parent knows the trap.
“Don’t slam the door.”
“Don’t leave your shoes in the hallway.”
“Don’t forget your homework.”
“Don’t be on your phone all evening.”
Sometimes it works. Often it does not. The instruction may be technically clear, but it still points attention at the forbidden behavior. The child hears the thing you do not want, not always the behavior you actually expect.
The classic psychological reference here is Daniel Wegner’s work on thought suppression, often remembered as the “white bear” experiment. Participants who were told not to think about a white bear often found the thought more difficult to suppress. The act of suppression kept the target concept active. Wegner called this an ironic process of mental control.
LLMs have a related problem, but for different reasons. They are not fighting intrusive thoughts. They are processing tokens. When you write “Do not mention Kubernetes,” the model still has to attend to “Kubernetes” to understand the instruction. You have placed the concept into the active context window. Recent research on negation in foundation models shows that negation remains a hard problem, especially where models need to distinguish absence, contradiction, or exclusion from ordinary semantic similarity.
A 2025 paper, Don’t Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load, makes this connection explicit. The authors tested negation prompts such as “do not mention X” and found rebound behavior, especially when distractor text appears between the instruction and the output. Their ReboundBench dataset contains 5,000 varied negation prompts designed to probe this effect.
So the parenting lesson is not just cute. It maps surprisingly well to prompt engineering.
When I tell one of my teenagers, “Don’t leave the kitchen a mess,” I have technically expressed a constraint. But I have not defined the desired end state. A better instruction is: “Before you go back upstairs, put the plates in the dishwasher, wipe the counter, and throw away the packaging.”
That second version works better because it describes success.
The same applies to Bob.
Weak prompt:
Don’t create messy code. Don’t ignore tests. Don’t change unrelated files.
Better prompt:
Implement the feature with the smallest safe change.
Keep the existing package structure.
Update or add tests that prove the new behavior.
Limit edits to the service, DTO, and test files required for this task.
Before changing code, list the files you plan to inspect and explain why.
The second prompt does not rely on suppression. It creates a path.
Teenagers, agents, and the myth of obedience
There is another parenting trap: confusing obedience with understanding.
Authoritarian communication sounds efficient. It is direct, strict, and often emotionally satisfying in the moment. “Because I said so” has a certain runtime performance advantage. Unfortunately, it does not scale well.
Developmental psychology has studied this for decades. Diana Baumrind’s model of parenting styles distinguishes between authoritarian, permissive, and authoritative parenting. Later research often frames these styles around two dimensions: demandingness and responsiveness. Authoritarian parenting is high demand with low responsiveness. Permissive parenting is high responsiveness with low demand. Authoritative parenting combines clear expectations with explanation, warmth, and support.
That distinction is useful for prompt engineering.
A permissive prompt gives Bob freedom without direction:
Improve this application.
An authoritarian prompt gives rules without context:
Fix this now. Do not break anything. Do not ask questions. Do not make mistakes.
An authoritative prompt gives direction, context, and boundaries:
We need to reduce startup failures in this Quarkus service.
Goal:
Identify why the application fails when PostgreSQL is unavailable during startup.
Context:
This service uses Quarkus, Hibernate ORM with Panache, and Dev Services in local development.
Production uses an external PostgreSQL instance.
Expected outcome:
1. Explain the likely failure path.
2. Propose the smallest code or configuration change.
3. Add a test or verification step.
4. Do not change unrelated modules.
5. Ask before running commands that modify files outside this module.
This is the style I want with my teenagers too. Not soft. Not vague. Not controlling for the sake of control. Clear expectations plus enough context to make a good decision.
That matters more with teenagers because they are not small children anymore. They are practicing judgment. They need boundaries, but they also need to understand the “why” behind those boundaries. Research on adolescence repeatedly points to the continuing development of executive function, self-regulation, and attention control during these years. The environment around the teenager still matters because the brain is still learning how to plan, inhibit, prioritize, and recover from distraction.
Bob is not developing a brain. But it is operating under a similar practical constraint from the user’s perspective: it needs the environment structured well enough to make the next decision less ambiguous.
Ambiguity turns into assumptions
This is the sentence I keep coming back to:
Ambiguity turns into assumptions.
At home, this is painfully obvious.
“Please clean your room” can mean at least ten different things. To me it might mean clothes in the laundry, desk cleared, trash removed, bed made, and no half-empty glasses hiding behind a monitor. To a teenager it might mean the visible floor area has improved by 30 percent.
Both interpretations are internally reasonable. The conflict was created by an underspecified prompt.
AI agents behave the same way. When we say “modernize this code,” Bob has to infer what modernization means. Java version? Framework conventions? Dependency updates? Test coverage? Performance? Security? Package layout? Naming? Architectural boundaries?
The agent will fill the gap. That is not a character flaw. That is the job we gave it.
IBM Bob’s own documentation says prompt quality directly affects response quality, and it recommends being specific and clear and providing examples of the desired format or style.
For real projects, this is where Bob becomes more interesting than a chat box. The /init command can generate AGENTS.md files that provide persistent project context, including project purpose, directory structure, technology stack, architectural patterns, and development workflows. IBM’s documentation explicitly frames AGENTS.md as onboarding documentation for Bob because each new conversation otherwise starts without project memory.
That is scaffolding.
Instead of repeating “don’t mess this up” in every conversation, you give Bob durable context:
This repository uses Quarkus REST, not RESTEasy Classic.
Use Jakarta packages, not javax.
Tests must run with ./mvnw test.
Prefer Dev Services for local infrastructure.
Keep generated code out of src/main/java.
Do not introduce Lombok.
Even here, the strongest instructions are positive. “Use Quarkus REST” is more useful than “Don’t use the wrong REST stack.” “Use Jakarta packages” is more useful than “Don’t use javax.”
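To make those rules concrete, here is a minimal sketch of code that follows them; the package, path, and names are invented for illustration:
package org.acme.customer;

import java.util.List;

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

// Quarkus REST resource using jakarta.* imports, as the AGENTS.md rules require.
@Path("/customers")
public class CustomerResource {

    // A plain record instead of a Lombok-generated class.
    public record CustomerDto(Long id, String name) {}

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<CustomerDto> list() {
        return List.of(new CustomerDto(1L, "Ada"));
    }
}
An agent that has read those durable rules can produce this shape without being told “don’t” anything.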
The better parenting pattern: say what good looks like
One thing I have learned with my teenagers is that correction works better when it is tied to an observable standard.
Not:
Don’t be irresponsible with your phone.
Better:
Put your phone downstairs by 9:30 on school nights, so you can sleep without notifications.
Not:
Don’t be rude.
Better:
When you disagree, lower your voice, say what you want, and do not attack the person.
Not:
Don’t make me remind you again.
Better:
Set an alarm now, then show me the reminder before you leave the room.
This is not magic. It is a shift from moral judgment to executable behavior.
For Bob, the equivalent is replacing vague quality demands with operational criteria.
Not:
Don’t hallucinate.
Better:
Use only information from the repository and the linked documentation.
When you are uncertain, say what you checked and what remains unknown.
Do not invent APIs. Verify unfamiliar classes against the project dependencies or official docs before using them.
Not:
Don’t create insecure code.
Better:
For this endpoint, require authentication.
Validate all request fields.
Return Problem Details style errors for invalid input.
Do not log secrets or tokens.
Add tests for authorized, unauthorized, and invalid requests.
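A hedged sketch of what a compliant endpoint could look like; the path, role name, and request fields are assumptions for illustration, not repository facts:
package org.acme.payment;

import jakarta.annotation.security.RolesAllowed;
import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/payments")
public class PaymentResource {

    // Every request field carries a validation constraint.
    public record PaymentRequest(@NotBlank String accountId,
                                 @Positive long amountCents) {}

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    @RolesAllowed("payments-user") // authentication is required, not optional
    public Response create(@Valid PaymentRequest request) {
        // With quarkus-hibernate-validator present, constraint violations are
        // rejected before this method runs; an ExceptionMapper can shape them
        // into Problem Details style error responses.
        // Nothing from the request is logged, so no secrets or tokens leak.
        return Response.accepted().build();
    }
}
Every line of the prompt maps to something checkable in the diff.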
Not:
Don’t over-engineer it.
Better:
Implement this as a single Quarkus service method and one REST resource method.
Avoid new abstractions unless two existing call sites need the same behavior.
Prefer readable code over configurability.
This is where prompting becomes engineering again. You define the target state, the constraints, the verification path, and the stop condition.
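And for the over-engineering case, the entire implementation under that prompt might legitimately be this small; a sketch with made-up names:
package org.acme.greeting;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.ws.rs.DefaultValue;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.QueryParam;

// One service method, no interfaces, no factories, no configuration knobs.
@ApplicationScoped
class GreetingService {
    String greet(String name) {
        return "Hello, " + name;
    }
}

// One REST resource method delegating to the single service method.
@Path("/greeting")
class GreetingResource {
    @Inject
    GreetingService service;

    @GET
    public String greet(@QueryParam("name") @DefaultValue("world") String name) {
        return service.greet(name);
    }
}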
Parent Management Training and prompt design
There is a useful connection here to Parent Management Training, or PMT. PMT is an evidence-based family of interventions used for disruptive behavior problems. The practical idea is not “be nicer.” It is to change the environment, reinforce desired behavior, and reduce the need for constant reactive correction. Reviews have found PMT effective for disruptive behavior problems, while also noting variation across studies and implementations.
Translated to AI work, this means we should stop treating every bad output as a prompt failure in isolation. Often, the environment is wrong.
The agent lacks project context.
The task is too large.
The acceptance criteria are missing.
The repository conventions are implicit.
The tests are weak.
The tool permissions are too broad.
The user expects architectural judgment but provides only a ticket title.
This is exactly where IBM Bob’s modes, rules, and skills become useful. Skills give Bob task-specific instructions and supporting files. IBM’s documentation states that when a skill is activated, Bob receives the skill’s instructions and gains access to files in the skill directory, then follows that workflow for the task.
That is not just a productivity feature. It is behavioral design.
A good Bob skill is the AI equivalent of a household routine. It removes repeated negotiation. It says, “When this kind of task happens, this is how we do it here.”
For example, a Quarkus REST skill should not say:
Don’t use outdated APIs.
Don’t forget tests.
Don’t choose bad package names.
It should say:
Purpose:
Create and maintain Quarkus REST endpoints that follow this repository’s conventions.
Workflow:
1. Inspect existing REST resources before creating new ones.
2. Use jakarta.ws.rs imports.
3. Use quarkus-rest and quarkus-rest-jackson unless the repository already uses another approved stack.
4. Place request and response DTOs in the existing DTO package.
5. Add tests for success, validation failure, and authorization behavior.
6. Run or recommend ./mvnw test for the affected module.
7. Report changed files and remaining risks.
That is authoritative prompting. High expectations. High support.
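Step 5 of that workflow, applied to the hypothetical payment endpoint sketched earlier, might come out as a test like this; the status codes and payloads are assumptions:
package org.acme.payment;

import static io.restassured.RestAssured.given;

import io.quarkus.test.junit.QuarkusTest;
import io.quarkus.test.security.TestSecurity;
import org.junit.jupiter.api.Test;

@QuarkusTest
class PaymentResourceTest {

    @Test
    @TestSecurity(user = "alice", roles = "payments-user")
    void acceptsAuthorizedValidRequest() {
        given().contentType("application/json")
               .body("{\"accountId\":\"acc-1\",\"amountCents\":500}")
               .when().post("/payments")
               .then().statusCode(202);
    }

    @Test
    @TestSecurity(user = "alice", roles = "payments-user")
    void rejectsValidationFailure() {
        given().contentType("application/json")
               .body("{\"accountId\":\"\",\"amountCents\":-1}")
               .when().post("/payments")
               .then().statusCode(400);
    }

    @Test
    void rejectsUnauthenticatedRequest() {
        given().contentType("application/json")
               .body("{\"accountId\":\"acc-1\",\"amountCents\":500}")
               .when().post("/payments")
               .then().statusCode(401);
    }
}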
Cognitive apprenticeship for agent work
Another useful model is cognitive apprenticeship. Collins, Brown, and Newman described cognitive apprenticeship as a way to make expert thinking visible through modeling, coaching, scaffolding, articulation, reflection, and exploration.
That maps directly to how I want to work with Bob.
When I use Bob well, I do not just throw a task over the wall. I model what good looks like. I show examples. I give the reasoning frame. I ask for a plan. I let it act in a constrained space. Then I review the result.
That sounds slower than “just code it.” It is not. It is faster than cleaning up a confident mess.
A cognitive apprenticeship style prompt for Bob looks like this:
You are helping me modify an existing Quarkus service.
First, inspect the current implementation and summarize the design in five sentences.
Then identify the smallest safe change for the requested behavior.
Before editing, show a short plan with affected files.
After editing, explain how the change follows the existing patterns.
Finally, suggest the most relevant test command.
This prompt does several things. It asks Bob to observe before acting. It makes reasoning visible. It creates a checkpoint before file modification. It ties the solution back to existing project patterns. It ends with verification.
This is the same pattern I want with teenagers learning independence. First we do it together. Then I ask them to explain the plan. Then they do the next step. Then we review what happened. Eventually, I back off.
The goal is not control. The goal is reliable autonomy.
Where negative constraints still belong
None of this means negative constraints are useless.
There are cases where “do not” is appropriate. Security rules, legal boundaries, data handling requirements, and hard architectural exclusions often need explicit negative constraints.
The mistake is relying on negative constraints as the primary way to steer behavior.
For example, this is reasonable:
Do not print secrets.
Do not commit generated credentials.
Do not modify files outside the payment-service module.
But those constraints should be paired with positive instructions:
Read secrets only from the configured environment variables.
Use placeholders in examples.
Limit all edits to payment-service.
If another module appears necessary, stop and explain why before editing.
This matters because negation understanding remains a known weakness in current models. Multimodal and language-model research continues to show that negation is not handled as robustly as ordinary affirmative meaning. NegBench, for example, evaluates vision-language models across retrieval and multiple-choice tasks with negated captions, using 79,000 examples across image, video, and medical datasets.
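Back to the secrets rules for a moment: the positive instruction “Read secrets only from the configured environment variables” has a direct Quarkus translation. A minimal sketch, assuming MicroProfile Config and a hypothetical property name:
package org.acme.payment;

import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@ApplicationScoped
public class PaymentGatewayClient {

    // Resolved from configuration, for example the PAYMENT_API_KEY
    // environment variable. Never hard-coded, never printed.
    @ConfigProperty(name = "payment.api-key")
    String apiKey;

    @Override
    public String toString() {
        // A placeholder instead of the secret, so accidental logging stays safe.
        return "PaymentGatewayClient[apiKey=****]";
    }
}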
So the practical rule is not “never say don’t.”
The rule is:
Do not make “don’t” carry the whole instruction.
A practical Bob prompting framework
For my own Bob work, especially in Java and Quarkus repositories, I would reduce this whole article to one operating model:
Context:
What project, module, framework, and convention are we working inside?
Goal:
What should be true when the task is complete?
Boundaries:
What files, APIs, technologies, or behaviors are in scope?
Workflow:
What should Bob inspect, plan, change, and verify?
Examples:
What does good output or good code look like?
Stop condition:
When should Bob pause and ask before continuing?
Here is a complete example:
Context:
You are working in a Quarkus application that uses Quarkus REST, Hibernate ORM with Panache, PostgreSQL Dev Services, and JUnit tests.
Goal:
Add pagination to the customer search endpoint.
Boundaries:
Keep the existing endpoint path.
Do not introduce GraphQL or a new persistence library.
Use the existing Customer entity and repository patterns.
Limit edits to the customer resource, service, DTOs, and tests.
Workflow:
1. Inspect the current customer search implementation.
2. Summarize the existing flow.
3. Propose the smallest safe change.
4. Add page and size parameters with sensible validation.
5. Return a response that includes items, page, size, and total count.
6. Add tests for default pagination, custom pagination, invalid size, and empty results.
Examples:
Use the existing DTO style in this module.
Use Jakarta validation annotations where appropriate.
Stop condition:
If the current repository pattern cannot support total count cleanly, stop and explain the trade-off before editing.
This is the difference between asking Bob to guess and giving Bob a runnable mental model.
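For orientation, here is one plausible shape of the result, assuming the repository uses the active-record Panache pattern; every name in this sketch is an assumption, and the workflow’s inspection steps exist precisely to replace such guesses with the repository’s real conventions:
package org.acme.customer;

import java.util.List;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import io.quarkus.hibernate.orm.panache.PanacheQuery;
import io.quarkus.panache.common.Page;
import jakarta.persistence.Entity;
import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;
import jakarta.ws.rs.DefaultValue;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.QueryParam;

// Stands in for the repository's existing Customer entity.
@Entity
class Customer extends PanacheEntity {
    public String name;
}

@Path("/customers")
class CustomerSearchResource {

    // items, page, size, and total count, matching the expected outcome above.
    record PageResponse(List<Customer> items, int page, int size, long total) {}

    @GET
    public PageResponse search(@QueryParam("page") @DefaultValue("0") @Min(0) int page,
                               @QueryParam("size") @DefaultValue("20") @Min(1) @Max(100) int size) {
        // @Min and @Max need quarkus-hibernate-validator to be enforced.
        PanacheQuery<Customer> query = Customer.findAll();
        List<Customer> items = query.page(Page.of(page, size)).list();
        return new PageResponse(items, page, size, query.count());
    }
}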
The uncomfortable conclusion
The uncomfortable part is that better prompting often exposes lazy thinking.
When Bob produces generic code, it is tempting to blame the model. Sometimes the model is the problem. But often I did not provide enough intent. I did not define success. I did not share the architectural boundary. I did not give examples. I expected the agent to infer the context I had in my head.
That is exactly the same mistake I make at home.
When I tell my teenagers, “Be ready on time,” I know what I mean. Shoes. Jacket. Bag. Water bottle. Phone charged. Actually standing near the door, not theoretically ready in another room.
But I did not say that.
Then I get annoyed when reality fails to match the invisible checklist in my head.
AI coding agents punish the same weakness. They turn missing context into plausible output. They turn vague goals into broad edits. They turn “improve this” into whatever the training distribution thinks improvement usually looks like.
IBM Bob gives us better places to put that context: AGENTS.md, project rules, modes, skills, examples, and explicit workflows. But the tool cannot externalize intent we never wrote down.
That is the real lesson from parenting teenagers and prompting AI agents.
Clarity is not control. Structure is not micromanagement. Positive direction is not softness.
It is how you help a capable system act well when the next step is ambiguous.