How to Review Agent System Prompts Like Production Infrastructure
Turn prompt reviews into a repeatable engineering process with scorecards, calibration thresholds, failure-mode analysis, and paste-ready hardening blocks.
Most teams still write system prompts the way they write onboarding docs for humans: friendly tone, implicit context, and a belief that the reader will “figure it out.” An autonomous coding agent does not figure it out the same way. It optimizes for the next token under pressure from tools, context limits, and whatever ambiguity you left in the text. In production, a weak prompt does not fail politely. It invents files, skips discovery, overwrites working code, or burns the entire window on a single oversized task. The failure shows up as bad diffs, silent wrong assumptions, or a session that cannot resume after a reset.
The Bob Meta-Scorecard is a rubric and workflow for grading system prompts before you treat them as infrastructure. It is built around five pillars: grounding in the real tree, continuity across session loss, safety around destructive work, decomposition so the model does not try to ship Rome in one reply, and efficiency so the instructions leave room for actual repository work. This article turns that rubric into something you can run repeatedly: a workspace layout, complete template blocks, calibration defaults, hardening concerns for real teams, and a verification pass you can execute on a candidate prompt in under an hour.
The methodology assumes tool-using agents with long context (on the order of hundreds of thousands of tokens) and episodic resets. It is tuned for “Bob”-style agents in coding products, not for one-shot chat UIs where a human pastes context every turn. If your stack differs, keep the pillars and replace tool names with whatever your runtime actually exposes.
Prerequisites
You need a place to store prompts, scorecards, and diffs, plus a habit of reading prompts as operational specs rather than copy. You do not need a particular IDE beyond what you already use to review Markdown. (You can, of course, try IBM Bob for free if you want a concrete agent to score.)
A text editor and shell (or equivalent) for creating the folder layout below
Access to the system prompts you want to evaluate (or realistic redacted copies)
Permission to store synthetic examples next to real prompts without mixing them into production bundles
Familiarity with how your agent surfaces tools (file read, search, apply patch, terminal, and so on)
Project Setup
Create a small evaluation kit so every review produces comparable artifacts. One layout is enough; the names matter less than the discipline of always writing the same outputs.
From an empty parent directory:
mkdir -p bob-scorecard-kit/{prompts,incoming,scorecards,power-ups,calibration}
touch bob-scorecard-kit/calibration/thresholds.properties
touch bob-scorecard-kit/scorecards/.gitkeep
printf '%s\n' "# Candidate prompts (read-only inputs)" > bob-scorecard-kit/prompts/README.md
printf '%s\n' "# Pasted prompts awaiting triage" > bob-scorecard-kit/incoming/README.md
What each path is for
prompts/: frozen copies of prompts you intend to ship or compare (versioned by filename, not by memory)
incoming/: messy drafts you are not ready to score yet
scorecards/: one Markdown file per evaluation run, named after the prompt and date
power-ups/: the three rewrite injections you would actually merge
calibration/thresholds.properties: numeric bands and token budgets your team agrees on (see Configuration)
If you use Git, add incoming/ to .gitignore when that folder might hold customer-specific text. Keep prompts/ and scorecards/ under the same review rules as code.
Implementing the Five Pillars
Each pillar below follows the same shape: why it exists, a bad prompt fragment you should be able to recognize, and a strong template you can paste or adapt. After the templates, a short analysis ties the pillar to failure modes you see under stress.
Grounding: force the codebase radar
Context. Grounding is the difference between “sounds plausible” and “matches this repository.” Agents are rewarded for fluency. Without mandatory discovery steps, fluency wins over evidence.
Bad example (scores 1 of 5).
Given the authentication system in our service, propose concrete security improvements.
There is no requirement to list auth-related paths, read implementations, or cite evidence. The model can invent a generic OAuth checklist that never touches your code.
Strong example (scores 5 of 5).
Before you recommend any change:
1. Use the repository file listing tool to enumerate paths under `src/main/java` (or the language-appropriate root) and identify every file that participates in authentication, authorization, or session handling. List those paths explicitly in your reply.
2. Read each identified file in full unless it is larger than 400 lines; if larger, read the class header, public API surface, and any security-sensitive branches first, then summarize what remains unseen.
3. If the project declares dependencies for auth (for example Maven `pom.xml`, Gradle files, or lockfiles), read the relevant coordinates and versions for auth-related libraries.
4. Reply with a short evidence table in prose (not code): for each claim you plan to make later, cite `path` and, where possible, a line range or symbol name you observed.
5. Stop and ask for confirmation before proposing redesign work.
If a required file is missing, say so explicitly. Do not invent layout.
Analysis. This pattern works because it makes absence of evidence visible. The model cannot satisfy step four with hand-waving unless it breaks the instructions outright. The cost is length in the system prompt and friction in the happy path. That friction is the point: you are buying insurance against template-shaped answers. Under stress (large trees, generated noise), narrow the glob roots and raise the line-read threshold instead of deleting the grounding block.
Red flags to grep for
Placeholder brackets such as [insert service name here] with no discovery path
Phrases like “based on the codebase” with no tool verbs
Assumptions that the model already “sees” private hosts or CI secrets
Scoring anchors for grounding
5: Analysis is mandatory before generation; no escape hatch that says “if short on time, skip”
4: Strong tool guidance with rare exceptions you can name
3: Encourages analysis; model can still plausibly skip
2: Mentions context but not mechanics
1: Treats the model as omniscient about your tree
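During review, you can approximate the mandatory enumeration step yourself and compare the result against the agent's evidence table. A minimal sketch, assuming name-based matching (the `AUTH_HINTS` fragments are illustrative; a real pass should also search file contents):

```python
from pathlib import Path

# Illustrative name fragments; a serious review would also grep file contents.
AUTH_HINTS = ("auth", "session", "login", "token", "credential")

def auth_related_paths(root: str) -> list[str]:
    """Enumerate files whose path suggests involvement in authentication,
    authorization, or session handling, mimicking the mandatory listing step."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        name = path.as_posix().lower()
        if path.is_file() and any(hint in name for hint in AUTH_HINTS):
            hits.append(path.as_posix())
    return hits
```

Running this against the repository before reading the agent's reply gives you a ground-truth list to hold against the evidence table in step 4.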
Continuity: design for session amnesia
Context. A reset wipes working memory. Anything not written to disk is gone. “Remember to update the todo list” is not continuity; it is folklore.
Bad example (scores 1 of 5).
Refactor the payment module for clarity. Keep track of what you finished as you go.
Strong example (scores 5 of 5).
Create or update `agent-work/payment-refactor.md` after every phase. The file must contain these headings in order:
## Completed work
- Bullets with file paths and what changed (one fact per bullet)
## Decisions
- Bullets with decision, rationale, and alternatives rejected
## Current phase
- A single integer `phase` and a single sentence describing the active task
## Next action
- One imperative sentence a new session can execute without chat history
## Resume protocol
- Exact text: "On startup, read this file, trust `Current phase` and `Next action`, verify repository state matches `Completed work`, then continue."
Update the file before you run tests that mutate disk state. If the file and the tree disagree, stop and reconcile with the user.
Analysis. Continuity is really a contract with your future self. The resume protocol line is load-bearing: it tells a cold start what “continue” means. The weak version of this pattern is a progress file without verification steps; then the agent cheerfully appends fiction after a partial revert. Pair continuity with grounding: the next session should re-read touched files, not only the markdown log.
Red flags
“Remember…” or “keep in mind…” as the only persistence mechanism
Progress files that log intent but not paths
No instruction for what to do when the log is stale
Scoring anchors for continuity
5: State file schema, update cadence, and cold-start resume text
4: Solid logging, weak or missing reconciliation rule
3: “Save progress” without schema
2: Vague mention of tracking
1: No durable state
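A reviewer can check a candidate progress file mechanically rather than by eye. A minimal sketch, assuming the heading schema from the strong example above:

```python
# Headings from the continuity template, in the order the schema requires.
REQUIRED_HEADINGS = [
    "## Completed work",
    "## Decisions",
    "## Current phase",
    "## Next action",
    "## Resume protocol",
]

def headings_in_order(progress_text: str) -> bool:
    """True if the progress file contains every required heading in the
    specified order; other lines may appear between them."""
    pos = -1
    for heading in REQUIRED_HEADINGS:
        nxt = progress_text.find(heading, pos + 1)
        if nxt < 0:
            return False
        pos = nxt
    return True
```

A check like this belongs in the same review pass that reads the prompt: if the schema cannot be validated, the resume protocol is folklore again.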
Safety: shrink the blast radius
Context. Agents batch work. Batching plus filesystem tools equals destructive capability. Safety is not morality in the prompt; it is gating and reversibility.
Bad example (scores 1 of 5).
Improve performance across the service. Apply the changes you judge necessary.Strong example (scores 5 of 5).
## Destructive and high-impact actions
Treat these as destructive: deleting files or directories, renaming public packages, rewriting build files, changing dependency major versions, editing migration SQL that already shipped, running commands that touch cloud resources.
Before any destructive action:
1. State the exact paths or resource identifiers affected.
2. State the smallest reversible backup you will create (for example a copy under `.agent-backup/<timestamp>/...` mirroring the original path).
3. Ask for explicit confirmation with a **Yes** or **No** question. Default to **No** if the user reply is ambiguous.
After changes:
- If the user says **undo** for a given step, restore from the backup you named, then verify with read-only tools.
Never run production database migrations or secret rotation unless the user pastes a literal token phrase you define out of band for this environment.
Analysis. I have seen teams lose a day to an agent that “cleaned up” unused files that were still wired by reflection. Explicit classification of what counts as destructive beats a vague “be careful.” Backups must be concrete enough that undo is a procedure, not a mood. The limit: users suffer confirmation fatigue if every touch asks twice. Calibrate the destructive list to your org; keep confirmations for deletes, dependency jumps, and infra commands.
Red flags
Single phrase “be careful”
Auto-approve rules hidden in examples (“unless trivial”)
Shell commands with wildcards and no working directory guard
Scoring anchors for safety
5: Classification, backup, confirm, undo path
4: Strong gates, thin recovery story
3: Warnings without procedure
2: Mentions risk only
1: Unbounded change authority
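The backup step only deters damage if it is mechanical. A minimal sketch of the mirror-copy procedure from the template, assuming file paths relative to the repository root (the timestamp format and the `relative_to` guard are illustrative choices):

```python
import shutil
import time
from pathlib import Path

def backup_before_destructive(path: str, backup_root: str = ".agent-backup") -> str:
    """Mirror a file under .agent-backup/<timestamp>/ before a destructive
    action, so "undo" is a restore from a path the agent already named."""
    src = Path(path)
    if src.is_absolute():
        # Mirror repository-relative structure; raises if outside the repo.
        src = src.relative_to(Path.cwd())
    dest = Path(backup_root) / time.strftime("%Y%m%dT%H%M%S") / src
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return str(dest)
```

The returned path is exactly what the prompt requires the agent to state before acting, which makes the later undo a lookup, not an improvisation.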
Decomposition: phases, deliverables, and gates
Context. Large asks encourage outline-level hallucination: APIs that sound right, files that never existed, tests that were never run. Phasing moves validation earlier.
Bad example (scores 1 of 5).
Implement full authentication: login, registration, password reset, two-factor authentication, session refresh, and audit logging. Include tests and documentation.Strong example (scores 5 of 5).
## Delivery plan (do not skip phases)
**Phase 1: Login only**
- Deliverable: smallest slice that proves username and password verification against existing user storage, plus one integration test that fails on bad password.
- Gate: run the test command your build uses; paste the command and exit code in the progress file; wait for user confirmation before Phase 2.
**Phase 2: Registration**
- Deliverable: create-user path with validation; tests for happy path and duplicate user.
- Gate: same as Phase 1.
**Phase 3: Password reset**
- Deliverable: token issuance and consumption with time bounds; tests for expired and reused tokens.
- Gate: same pattern.
**Phase 4: Two-factor and session refresh**
- Deliverable: TOTP enrollment and refresh rotation if your stack already has patterns for them; if not, stop after documenting the gap instead of inventing crypto.
Rules: no phase may add a new external service without user confirmation. Each phase touches at most eight source files unless the user expands the limit.
Analysis. Token budgets in prompts are a coarse knob; what actually limits damage is the gate after each deliverable. The phase cap on touched files is artificial but effective against drive-by refactors. If your build is slow, say which subset of tests counts as “green” for the gate so the agent does not pretend a full suite ran.
Red flags
Single monolithic deliverable
“Implement everything” without ordering
No user or automated confirmation between slices
Scoring anchors for decomposition
5: Ordered phases, deliverables, explicit gates, scope caps
4: Phases without numeric limits or test discipline
3: Suggests steps only
2: “Step by step” with no structure
1: Single-shot epic
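The gate itself can be modeled as a record the agent writes to the progress file at each phase boundary. A minimal sketch, assuming the eight-file cap from the delivery-plan rules (the `PhaseGate` name is invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class PhaseGate:
    """One gate entry per phase: what was touched, what ran, how it exited."""
    phase: int
    touched_files: list[str] = field(default_factory=list)
    test_command: str = ""
    exit_code: int = 1  # pessimistic default: a gate never passes by omission

    def passes(self, file_cap: int = 8) -> bool:
        """Green only if tests exited 0 and the phase stayed within scope."""
        return self.exit_code == 0 and len(set(self.touched_files)) <= file_cap
```

A reviewer grading decomposition can ask: could the agent fill in this record honestly at every gate? If the prompt never demands a command and exit code, the answer is no.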
Efficiency: token weight versus completeness
Context. The system prompt competes with retrieved code, tool outputs, and the user’s messages. Efficiency is not minimalism for its own sake; it is signal per token plus deliberate outsourcing to files the agent reads once.
Bad pattern (scores 1 of 5).
Three thousand words that repeat the same rules in three sections, embed ten full XML examples, and restate generic security advice the model already encodes.
Strong pattern (scores 5 of 5).
## Operating loop
1. Discovery: follow `docs/agent-discovery.md` in this repository for search order.
2. Implementation: follow `docs/agent-edit-policy.md` for allowed directories and patch style.
3. Verification: run commands listed under `## Verify` in the active task file only.
## Task file
The user will name a task file. Treat that file as the single source of truth for scope and acceptance checks.
## Non-goals
Do not refactor unrelated modules. Do not add dependencies unless the task file’s **Dependencies** section is non-empty.
Analysis. This is the efficiency paradox handled correctly: short operational core, long detail moved to versioned docs the agent must read when working. A 50-word prompt with no grounding is not efficient; it is incomplete. Judge efficiency relative to task complexity: a read-only audit prompt should stay under a few hundred words of unique instruction; a full coding agent may justify a few thousand if every line changes behavior.
Red flags
Copy-paste duplication across “policy,” “reminder,” and “examples” sections
Giant static corpora in the prompt that should live in repo docs
No pointers, only prose
Scoring anchors for efficiency
5: Tight core, references for detail, little duplication
4: Slight redundancy, still leaves headroom
3: Moderate repetition
2: Verbose, recoverable only on very large windows
1: Bloated; crowds out evidence
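The word-count proxy for this pillar is easy to automate as a first scan. A minimal sketch, assuming the default `efficiency.*.words.max` bands from the calibration file (tokens remain the better measure, and payload-heavy prompts deserve manual adjustment):

```python
# Upper word limits per score, mirroring the efficiency.*.words.max defaults.
EFFICIENCY_BANDS = [(500, 5), (1000, 4), (2000, 3), (3000, 2)]

def efficiency_score(prompt_text: str) -> int:
    """Rough efficiency pillar score from word count alone. A proxy only:
    prefer token counts, and separate instruction from payload tokens
    when a prompt quotes large user schemas."""
    words = len(prompt_text.split())
    for limit, score in EFFICIENCY_BANDS:
        if words <= limit:
            return score
    return 1
```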
Configuration
Store shared numeric bands in bob-scorecard-kit/calibration/thresholds.properties so two reviewers do not use different cutoffs. These values are defaults for human scoring, not model-parseable truth.
# Sum of five pillars, each 1 to 5
scorecard.pillars.count=5
scorecard.total.max=25
# Grade bands (inclusive lower bound, exclusive upper for next, except last)
grade.production.ready.min=23
grade.good.min=20
grade.needs.work.min=15
grade.not.ready.min=10
# Prompt length guidance for the efficiency pillar (word counts, approximate)
efficiency.excellent.words.max=500
efficiency.good.words.max=1000
efficiency.acceptable.words.max=2000
efficiency.poor.words.max=3000
# Context budget assumptions for commentary in scorecards (tokens, approximate)
context.assumed.total.tokens=200000
context.prompt.budget.simple.percent=1
context.prompt.budget.complex.percent=5
Each setting explained
scorecard.pillars.count and scorecard.total.max: Fix the denominator when you extend the framework. If you add a sixth pillar later, bump both and reprint historical percentages with a footnote.
grade.*.min: Production readiness is a policy call. These lines match the original methodology: 23 to 25 as “ship,” 20 to 22 as “minor fixes,” 15 to 19 as “substantial rework,” 10 to 14 as “not ready,” below 10 as “rewrite from skeleton.” If your org never ships agents above 21, lower the bands and document the change in Git blame.
efficiency.*.words.max: Word counts are a proxy. Prefer counting tokens for serious runs. When counts disagree, trust tokens for the efficiency pillar and use words only as a quick scan.
context.*: Explains why a 10,000-token system prompt is a strategic choice for a complex workflow but heavy for a linter wrapper. If your deployment uses a different window, update these three lines so scorecards do not cite obsolete math.
Failure modes
Missing grade.* keys: reviewers invent cutoffs mid-quarter and you cannot compare runs
Stale context.*: arguments about efficiency that do not match your vendor limits
Over-tuning word limits: good prompts with long quoted user schemas look obese when they are mostly data; separate “instruction tokens” from “payload tokens” in commentary when that happens
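Mapping a five-pillar total onto the agreed bands is a few lines of code, which removes one source of reviewer drift. A minimal sketch (the band names and the tiny `.properties` parser are illustrative, not a standard library API):

```python
def load_thresholds(path: str) -> dict[str, int]:
    """Minimal .properties reader: skips comments and blank lines,
    assumes integer values as in thresholds.properties."""
    out: dict[str, int] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                out[key.strip()] = int(value.strip())
    return out

def grade_band(total: int, t: dict[str, int]) -> str:
    """Map a five-pillar total (5 to 25) onto the agreed grade bands."""
    if total >= t["grade.production.ready.min"]:
        return "production ready"
    if total >= t["grade.good.min"]:
        return "good, minor fixes"
    if total >= t["grade.needs.work.min"]:
        return "needs significant work"
    if total >= t["grade.not.ready.min"]:
        return "not ready"
    return "rewrite from skeleton"
```

Pinning the parsed values into each scorecard header then becomes a copy of what the code read, not what a reviewer remembered.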
Production Hardening
Operational failure modes in review
Scoring is a human process. Under time pressure, reviewers anchor on the pillar they personally care about (often safety) and underweight grounding. Mitigation: rotate reviewers, require evidence quotes in the scorecard for any pillar scored 4 or 5, and spot-check one file read log from a live run when possible.
Security and data exposure
Prompts often embed sample stack traces, SQL, or class names from real systems. The scorecard workspace must not become a second leak channel. Mitigation: redact before incoming/, forbid pasting production secrets into power-ups/ (injections should describe behavior, not values), and treat scorecards/ like code review material.
Concurrency and ordering guarantees
Two people scoring the same prompt revision on the same day should reach the same total within one point if they follow the same evidence rules. Mitigation: freeze the prompt under a hash-based filename, pin thresholds.properties in the scorecard header, and record the date. If scores diverge, the disagreement is usually grounding or decomposition, not math.
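Freezing under a hash-based filename can be scripted so nobody scores the wrong revision. A minimal sketch (the 12-character digest truncation is an arbitrary choice, not part of the methodology):

```python
import hashlib
from pathlib import Path

def frozen_name(prompt_path: str) -> str:
    """Derive a hash-based filename for a frozen prompt revision, so two
    reviewers can prove they scored the same bytes on the same day."""
    data = Path(prompt_path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()[:12]
    p = Path(prompt_path)
    return f"{p.stem}-{digest}{p.suffix}"
```

Copy the candidate into prompts/ under this name before scoring; any later edit produces a visibly different filename.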
Abuse and gaming
Teams under metric pressure sometimes “teach to the test”: prompts bloated with rubric keywords but no real gates. Mitigation: run a live session against a small repository with a planted bug; the score is not the document, it is whether the agent finds the bug without inventing files.
Verification
You verify the methodology by producing a complete scorecard for a real candidate and checking internal consistency. Pick one prompt file from prompts/ and complete the steps.
Step 1: score each pillar
Assign integers 1 to 5 using the anchors in each pillar section. Write one paragraph of rationale per pillar that cites exact phrases from the candidate prompt.
Step 2: compute the rollup
Use the grade bands from thresholds.properties. Example of the output shape:
## Bob Meta-Scorecard: `payments-agent-v3.md` (2026-04-12)
**Grounding (3 of 5):** Suggests reading `src` but allows skipping when “timeboxed.”
**Continuity (2 of 5):** Mentions a todo list, no file path or resume protocol.
**Safety (5 of 5):** Deletes and dependency bumps gated with explicit confirmation.
**Decomposition (4 of 5):** Phases exist; tests not required between phases.
**Efficiency (4 of 5):** About 900 words with some duplicated warnings.
**Total (18 of 25, 72 percent):** Band: needs significant work per calibration file dated 2026-04-12.
Step 3: name the dominant failure mode
Answer one question in writing: “If this agent goes wrong in the first thirty minutes, what is the most likely story?” Use this skeleton:
## Critical failure mode: Template explosion
**Scenario:** The agent lists two directories, assumes the rest of the layout, generates a new package parallel to the real one, and imports compile until runtime wiring fails.
**Root cause:** Grounding allows skipping reads when the tree is “familiar.” Decomposition does not cap new files per phase.
**Likelihood:** High for repositories with multiple modules.
**Impact:** User merges green CI, then discovers dead code paths or duplicate beans.
Step 4: write exactly three power-ups
Each power-up is a paste-ready block tied to a pillar. Example:
## Power-up 1: Hard grounding gate (Grounding 3 to 5)
**Insert after:** the “Discovery” heading.
**Injection text:**
"Skipping file reads is not permitted. If you believe the tree is too large, stop and ask for a narrowed root path instead of proceeding."
**Expected movement:** Grounding 3 of 5 to 5 of 5 if the rest of the prompt already names tools.
Repeat for the second and third weakest pillars.
Step 5: re-score on paper
Apply the three injections mentally (or in a branch). Recompute totals. You should see at least two pillars move if you chose real weaknesses; if nothing moves, your power-ups were cosmetic.
What this proves
The workspace layout produces comparable artifacts
The failure mode story connects to specific prompt gaps
The power-ups are concrete enough to merge
Common Prompt Anti-Patterns
Placeholder trap: Brackets without discovery. Fix by naming tools and stopping conditions.
Single-shot fallacy: Epics in one answer. Fix with phases and gates.
Amnesia assumption: “Remember” without files. Fix with a structured progress file and resume text.
Efficiency paradox: Too short to be complete. Fix by referencing repo docs instead of omitting rules.
Safety omission: “Refactor as needed.” Fix with destructive classification and confirmation.
When to Use and When to Stop
Use this methodology when a system prompt will drive autonomous edits, when you compare prompt candidates for the same product, or when you train reviewers on agent constraints.
Do not use it as the only signal for human-pair programming modes, creative writing assistants, or cross-model comparisons without retuning anchors. Different models fail in different shapes; the pillars still help, but the numbers are not portable.
Calibration Stories
High score example (24 of 25). Mandatory discovery, progress file with resume text, explicit destructive gates, phased delivery with tests at gates, under about eight hundred words with minor duplication. Failure risk shifts to execution bugs, not spec holes.
Medium score example (14 of 25). Read-only analyst: safety is perfect because the prompt cannot touch disk, but grounding and continuity score 1 each because the human must paste all context every time. Fine for chat, poor for overnight agents.
Low score example (8 of 25). One line: “Build authentication with login, registration, and reset.” No tree contact, no state, no safety, single phase, only efficiency looks acceptable because the text is short. Expect generic framework soup.
Conclusion
We turned an informal rubric into a repeatable kit: same folders, same thresholds, same scorecard shape, and three paste-ready improvements per review, so the worst failure modes surface before you ship rather than after merge.