Deterministic Islands in Probabilistic Workflows

Reliable agent systems do not make the LLM deterministic; they put model judgment around parsers, schemas, tests, CI gates, security scans, and approval points.

Jul 04, 2026

The first time you wire an agent workflow with hard checks, it can feel less autonomous. The model reads a messy request, guesses what the user means, writes code, edits prose, calls tools, and explains errors. Then, right in the middle of all that flexible reasoning, the workflow stops and asks a much simpler question:

Did the parser read the file?

Did the schema validate?

Did the tests pass?

Did the rendered artifact contain the expected objects?

Did the human approve the decision?

That pause turns a model claim into something the workflow can inspect. Reliable agent systems do not try to make the language model deterministic. They put the language model around deterministic islands: scripts, schemas, parsers, type systems, linters, tests, CI gates, and approval points. The model can reason, draft, route, and repair. The islands own the facts.

A large language model is good at working with incomplete context. It can infer intent from a vague issue, connect a failing test to the likely source file, or translate a human comment into a patch plan. That flexibility is why we use it. It can also talk itself into a clean answer while the change is still wrong, so it should not be the final judge of correctness.

Better prompts help, but they do not create contracts

A better prompt can make the model more consistent. It can name the goal, define style, list allowed tools, and remind the agent to run checks before it claims success. All of that helps.

But a prompt is still an instruction to a probabilistic system. It changes the distribution of likely answers. It does not create a hard guarantee.

If the model says “I validated the deck,” that sentence is cheap. If a script opened the generated PPTX, inspected the slide XML, rendered each slide to PNG, counted unresolved placeholders, checked overlap warnings, and returned a structured report, that is a different thing. The first one is a claim. The second one is evidence.

This is the mistake I see in many early agent setups. The team keeps adding instructions:

be careful
verify the result
do not hallucinate
make sure the output is valid
double-check before responding

Those lines help a little. They can reduce sloppy behavior. They still leave the core problem open: no definition of “valid,” no validator run, and no gate that blocks a bad artifact from moving forward.

Reliability needs something outside the model to say yes or no.

What a deterministic island is

A deterministic island is a step in the workflow where truth is owned by something more constrained than the language model.

This is an engineering contract, not mathematics. A test can be flaky. A linter can have a bug. A parser can change behavior between versions. But the step still has a contract: same input, same environment, same version, same result. When it fails, it should fail in a way another program can read.

Examples:

A JSON parser either parses the file or reports a line and column.
A schema validator accepts or rejects a payload against explicit fields and types.
A type checker says whether the code satisfies the language rules.
A unit test checks a behavior and returns pass or fail.
A linter catches formatting, unsafe patterns, or broken references.
A renderer exports an artifact and exposes layout warnings.
A CI gate runs the checks in a clean environment.
A human approval step decides whether the change should ship.

A human approval point can be an island too, as long as the workflow is clear about what the human owns. Humans are good at deciding intent, taste, risk, policy, and business meaning. They are not the right checksum algorithm for 27 generated files.

The model moves between these islands. It reads the evidence from one step, decides what to try next, and hands the next concrete task to another constrained system.

The simple loop

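The loop is simple: the model proposes a change, a deterministic check runs against the result, and the structured outcome goes back to the model, which repairs and retries or moves on.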

The model sits between checkpoints. It handles ambiguity, chooses a path, and converts failure output into the next attempt. Every factual claim goes through a narrower system before the workflow treats it as true.
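A minimal sketch of that loop, assuming the model call and the check are both plain functions. The names `run_with_checks`, `propose`, `check`, and `CheckResult` are hypothetical and only illustrate the shape; a real workflow would wrap an actual model client and an actual validator.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Structured output of one deterministic check."""
    passed: bool
    issues: list[str] = field(default_factory=list)


def run_with_checks(
    propose: Callable[[list[str]], str],
    check: Callable[[str], CheckResult],
    max_attempts: int = 3,
) -> tuple[Optional[str], list[CheckResult]]:
    """Alternate between a model proposal and a deterministic check.

    `propose` stands in for the model call and receives the issues from the
    previous failed check. `check` is the island: it alone decides pass or fail.
    """
    history: list[CheckResult] = []
    issues: list[str] = []
    for _ in range(max_attempts):
        candidate = propose(issues)   # model judgment
        result = check(candidate)     # deterministic verdict
        history.append(result)
        if result.passed:
            return candidate, history
        issues = result.issues        # evidence for the next attempt
    # Out of attempts: return no candidate so the caller can escalate.
    return None, history
```

The loop never asks the model whether the candidate is good. It only feeds the check's issues back in, and it stops with no result when the attempts run out, so a human or another step can take over.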

A concrete example: generated presentation artifacts

Presentation generation is a good stress test because it mixes content, layout, binary files, brand rules, and visual judgment. It is also easy to fake success. A model can say “the slides look clean” without ever opening the final deck. That is not validation.

A stronger workflow starts by inspecting the source deck with a real parser. It reads every slide, extracts text boxes, images, tables, placeholders, fonts, dimensions, and object identifiers. The model can reason over that inventory, but it should not invent the inventory.

Then the workflow creates a structured map:

output slide 1 reuses source slide 3
this text box is rewritten
this table is filled
this placeholder is deleted
this logo stays untouched
no new objects are allowed on this slide

That map is a deterministic island if it is schema-validated. Required fields are present. Slide numbers are sequential. Source slide references exist. Edit targets resolve to real objects. If the map says to edit shape-17 and there is no shape-17, the workflow stops early.
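Those four rules can be sketched as one function. This is an illustration under assumptions, not a real library: the field names `outputSlide`, `sourceSlide`, `edits`, and `targetId` are borrowed from the JSON sample later in this post, and `inventory` is a hypothetical dictionary built by the parser that maps each source slide number to the object IDs found on it.

```python
def validate_slide_map(slide_map: list[dict], inventory: dict[int, set[str]]) -> list[dict]:
    """Return structured issues; an empty list means the map may proceed."""
    issues = []
    required = ("outputSlide", "sourceSlide", "edits")
    for index, entry in enumerate(slide_map):
        # Required fields are present.
        missing = [key for key in required if key not in entry]
        if missing:
            issues.append({"code": "missing-field", "entry": index, "fields": missing})
            continue
        # Output slides must be numbered 1, 2, 3, ... in order.
        expected = index + 1
        if entry["outputSlide"] != expected:
            issues.append({
                "code": "non-sequential-slide",
                "entry": index,
                "expected": expected,
                "found": entry["outputSlide"],
            })
        # Source slide references exist in the parsed inventory.
        source = entry["sourceSlide"]
        if source not in inventory:
            issues.append({"code": "unknown-source-slide", "entry": index, "sourceSlide": source})
            continue
        # Edit targets resolve to real objects on that source slide.
        for edit in entry["edits"]:
            target = edit.get("targetId")
            if target not in inventory[source]:
                issues.append({"code": "unresolved-target", "entry": index, "targetId": target})
    return issues
```

A map that asks to edit shape-17 on a slide without a shape-17 comes back with an `unresolved-target` issue, and the workflow can stop before any file is touched.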

After that, a generator duplicates the selected source slides and edits inherited objects in place. Again, the model can choose the content and explain why a source slide fits. The script owns the actual mutation.

The final check renders every slide and inspects the exported file. It looks for unresolved placeholders, broken page markers, missing required assets, text clipping, overlap warnings, and layout drift. A good check does not say “looks good.” It says something like:

{
  "status": "fail",
  "checks": {
    "renderedSlides": 12,
    "unresolvedPlaceholders": 2,
    "overlapWarnings": 0,
    "missingRequiredAssets": 0
  },
  "issues": [
    {
      "code": "unresolved-placeholder",
      "slide": 7,
      "objectId": "title-placeholder",
      "message": "Inherited title placeholder is empty."
    }
  ]
}

Now the agent has something it can use. It can go back to slide 7, locate the inherited title placeholder, and fix the real object. It does not have to stare at a screenshot and guess where the presentation format might hide a structural placeholder. That detail belongs to the checker.
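On the agent side, consuming that report can be just as plain. The sketch below assumes the report shape shown above; `repair_targets` is a hypothetical helper name, and it does nothing more than turn issue records into a sorted work list.

```python
import json


def repair_targets(report_json: str) -> list[tuple[int, str, str]]:
    """Turn a checker report into (slide, objectId, code) repair targets.

    An empty list means the report passed and there is nothing to fix.
    """
    report = json.loads(report_json)
    if report["status"] == "success":
        return []
    # Sort so repairs happen in slide order and the run is repeatable.
    return sorted(
        (issue["slide"], issue["objectId"], issue["code"])
        for issue in report["issues"]
    )
```

The agent gets a slide number and an object ID for each problem. Where that object lives inside the file format stays the checker's concern.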

The same pattern works outside slides.

Parsing files: let parsers own structure

Agents are good at reading text. They are much less reliable when you ask them to infer structure that a parser can provide exactly.

If the input is JSON, parse JSON. If it is XML, use an XML parser. If it is Java, use the compiler, an AST parser, or the IDE refactoring engine when available. If it is a zip-based document format, inspect the archive and the internal manifests. If it is Markdown with front matter, parse the front matter as data before editing prose.

The model can decide that a publishing pack probably contains social copy, images, tags, and scheduling notes. A parser should say which fields exist and where they live. The model can suggest that a Java class needs a rename. The refactoring tool should find symbol references. The model can guess that a deck has a title placeholder. The presentation inspector should return the placeholder ID.

This is beginner-friendly in practice because it removes mystery. The agent is no longer “looking at the project.” It is calling a tool that returns named facts.
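As a small illustration, here is a JSON parsing tool that returns named facts instead of an impression. It uses Python's standard `json` module; the function name `parse_json_document` and the response fields are assumptions made for this sketch.

```python
import json


def parse_json_document(text: str) -> dict:
    """Parse JSON and report either the top-level fields or the exact failure position."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        # The parser, not the model, owns the line and column of the failure.
        return {
            "status": "fail",
            "code": "json-parse-error",
            "line": err.lineno,
            "column": err.colno,
            "message": err.msg,
        }
    fields = sorted(data) if isinstance(data, dict) else []
    return {"status": "success", "fields": fields, "data": data}
```

The model can still guess what a publishing pack probably contains. The `fields` list says what it actually contains.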

Schemas and typed outputs make the next step safer

Structured output is one of the simplest ways to make agent workflows less fragile.

Free text makes the next step interpret intent again:

I think slide 3 should become the opening slide and we should update the headline.

A typed shape gives the workflow something to validate:

{
  "outputSlide": 1,
  "sourceSlide": 3,
  "role": "opening",
  "edits": [
    {
      "targetId": "shape-title",
      "action": "rewrite",
      "text": "Quarterly rollout plan"
    }
  ]
}

The schema can require outputSlide, sourceSlide, role, and edits. It can restrict action to known values. It can reject an edit without a target. Paired with the parsed inventory, the validator can also reject a slide number that does not exist.

A schema cannot know whether “Quarterly rollout plan” is the right headline. It can still keep the workflow out of impossible states. That removes a large class of failures before the model gets another chance to improvise.
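One way to sketch that contract without any dependencies is a pair of frozen dataclasses and a parsing function. The set of allowed actions and the error codes are assumptions for illustration; a real project might express the same rules in JSON Schema or a validation library instead.

```python
from dataclasses import dataclass

# Assumed action vocabulary, loosely following the slide map described earlier.
ALLOWED_ACTIONS = {"rewrite", "fill", "delete", "keep"}


@dataclass(frozen=True)
class Edit:
    target_id: str
    action: str
    text: str = ""


@dataclass(frozen=True)
class SlidePlan:
    output_slide: int
    source_slide: int
    role: str
    edits: tuple[Edit, ...]


def parse_slide_plan(payload: dict) -> SlidePlan:
    """Build a typed plan or raise ValueError with a stable error code."""
    for key in ("outputSlide", "sourceSlide", "role", "edits"):
        if key not in payload:
            raise ValueError(f"missing-field:{key}")
    for key in ("outputSlide", "sourceSlide"):
        value = payload[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            raise ValueError(f"invalid-slide-number:{key}")
    edits = []
    for raw in payload["edits"]:
        if not raw.get("targetId"):
            raise ValueError("edit-without-target")
        if raw.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown-action:{raw.get('action')}")
        edits.append(Edit(raw["targetId"], raw["action"], raw.get("text", "")))
    return SlidePlan(payload["outputSlide"], payload["sourceSlide"], payload["role"], tuple(edits))
```

Nothing here judges the headline. It only guarantees that whatever reaches the generator has the fields, types, and actions the generator expects.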

Typed outputs also make review easier. Instead of reading a long agent explanation and trying to infer what will happen, the human can inspect a concise plan:

which files change
which artifacts are generated
which channels are targeted
which checks will run
which decision needs approval

The model can still write the explanation. The machine path gets data.

Good tool responses for agents

Tools built for humans often return a nice sentence and a wall of logs. Tools built for agents need a different shape.

A good response should include:

status: success, fail, or blocked
stable error codes plus a short prose message
exact file paths, line numbers, object IDs, slide numbers, test names, or rule IDs
counts that describe the scope of the check
the command, version, or rule set used
a short human summary
structured issue records the model can act on
retry guidance when the failure is environmental
a path to full logs or raw output when the payload is too large

For example, a security scan needs the rule ID, file, line, matched snippet, severity, and remediation hint. A publishing tool needs the channel ID, post ID, scheduled timestamp, timezone, and preview URL if one exists. A test runner should separate compilation failure, test failure, timeout, and infrastructure failure. Those are different next steps.

Vague errors make agents guess. Guessing is exactly what the deterministic island was supposed to reduce.
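Here is a rough sketch of a command wrapper that keeps those outcomes apart. It relies on Python's standard `subprocess.run`; the response fields, the error codes, and the choice to treat timeouts and missing tools as "blocked" are assumptions made for this example.

```python
import subprocess


def run_check(command: list[str], timeout_seconds: float = 60.0) -> dict:
    """Run one check command and classify the outcome for an agent."""
    base = {"command": command}
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except FileNotFoundError:
        # The tool is not installed: environmental, and a retry will not help.
        return {**base, "status": "blocked", "code": "tool-not-found", "retry": False,
                "summary": f"Command not found: {command[0]}"}
    except subprocess.TimeoutExpired:
        # Possibly transient, so tell the agent a retry is reasonable.
        return {**base, "status": "blocked", "code": "timeout", "retry": True,
                "summary": f"No result after {timeout_seconds} seconds."}
    if completed.returncode == 0:
        return {**base, "status": "success", "code": "ok", "retry": False,
                "summary": "Check passed."}
    return {**base, "status": "fail", "code": "check-failed", "retry": False,
            "exitCode": completed.returncode,
            "summary": "Check ran and reported a failure.",
            "logTail": completed.stderr[-2000:]}  # keep the payload small
```

A failed check and a missing tool now lead to different next steps: the first sends the agent back to the code, the second sends it to a human or an environment fix.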

Refactoring: let the model propose, let the tools verify

Refactoring shows the boundary well.

An agent can read a codebase and decide that CustomerManager should become CustomerService, that a method should move, or that repeated code should become a helper. That judgment combines naming, intent, local style, and trade-offs. The toolchain should handle the mechanical proof.

The mechanical change should lean on deterministic tools:

symbol-aware rename when the language server supports it
AST-based transforms for repeated code shapes
compiler checks for type errors
unit tests for behavior
formatting and import sorting
dependency analysis when public APIs move

The agent can write the first patch, run the checks, read failures, and repair. The compiler owns type truth. The tests own behavior truth inside their scope. The linter owns the style rules it knows. CI owns the clean-room replay.

This is also why “the agent said it reviewed the code” is a weak signal. A review without a diff, tests, static checks, and clear findings is just commentary.
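To make the difference between text search and structure concrete, here is a small sketch that finds references to a class name through a syntax tree. It is shown on Python source only because the standard library ships a parser for it; for Java the same job belongs to the compiler or IDE tooling. The helper `find_name_references` is hypothetical and not scope-aware, so treat it as an inventory step, not a rename engine.

```python
import ast


def find_name_references(source: str, name: str) -> list[dict]:
    """List where `name` is defined as a class or used as an identifier.

    String literals and comments that merely contain the text are ignored,
    which a plain text search cannot promise.
    """
    tree = ast.parse(source)  # raises SyntaxError if the file does not parse
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == name:
            hits.append({"kind": "definition", "line": node.lineno, "column": node.col_offset})
        elif isinstance(node, ast.Name) and node.id == name:
            hits.append({"kind": "reference", "line": node.lineno, "column": node.col_offset})
    return sorted(hits, key=lambda hit: (hit["line"], hit["column"]))
```

The agent can propose renaming CustomerManager to CustomerService. A list like this, followed by the compiler and the tests, says whether the rename reached every real use.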

Publishing: approval should be about intent

Publishing workflows are another good example because they combine mechanical details and human judgment.

An agent can draft social copy from an article. It can create variants for different channels. It can select a short hook, pull a canonical URL, and suggest a schedule. Those are good jobs for a model.

The mechanical checks should be deterministic:

parse the article metadata
extract the canonical URL
verify the image path exists
inspect the Open Graph preview
look up channel IDs from the provider because prose names drift
check for duplicate scheduled posts
validate that scheduled times are in the future and in the right timezone
enforce character limits
return the exact post IDs after scheduling

The human should approve the message, channel choice, timing, and risk. The human should not have to manually re-check whether the Mastodon post exceeded a character limit, whether the LinkedIn channel ID was copied correctly, or whether the URL resolves.

That is the point of the workflow. Human attention is expensive. Spend it on decisions.
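A few of the mechanical checks from the list above, sketched as one function. The post fields, the error codes, and the character limits in `CHARACTER_LIMITS` are illustrative assumptions; in a real workflow the limits and channel IDs should come from the provider, not from a constant in the code.

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative limits only; look up the real values from each provider.
CHARACTER_LIMITS = {"mastodon": 500, "linkedin": 3000}


def validate_scheduled_post(
    post: dict, existing: list[dict], now: Optional[datetime] = None
) -> list[dict]:
    """Return issues for one draft post; an empty list means it may go to approval."""
    now = now or datetime.now(timezone.utc)
    issues = []

    limit = CHARACTER_LIMITS.get(post["channel"])
    if limit is None:
        issues.append({"code": "unknown-channel", "channel": post["channel"]})
    elif len(post["text"]) > limit:
        issues.append({"code": "over-character-limit", "limit": limit, "length": len(post["text"])})

    when = post["scheduledAt"]
    if when.tzinfo is None:
        # A timestamp without a timezone cannot be compared safely.
        issues.append({"code": "naive-timestamp"})
    elif when <= now:
        issues.append({"code": "scheduled-in-past"})

    # Treat the same URL on the same channel as a duplicate.
    for other in existing:
        if (other["channel"], other["url"]) == (post["channel"], post["url"]):
            issues.append({"code": "duplicate-post", "postId": other["postId"]})
    return issues
```

When this returns an empty list, the human sees a draft that is already mechanically sound and can spend the approval on message, channel, and timing.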

Tests, linters, hooks, and CI turn guesses into evidence

The model can say “this should fix it.” Tests decide whether the behavior changed the way you expected.

Local checks catch fast mistakes. A linter catches a broken link, a missing field, a forbidden API, or an unsafe pattern before review. A pre-commit hook catches problems before they leave the workstation. CI runs the same checks in a clean environment, which catches missing files, hidden local state, version drift, and the usual “works on my machine” problem.

None of these checks prove the system is correct. They prove narrower claims:

this file parses
this generated artifact can be opened
this schema is valid
this code compiles
this test still passes
this forbidden pattern is absent
this package has no known critical vulnerability under the configured scanner

That list is still worth a lot. Reliable systems are built from many narrow claims with clear owners.

Security checks need their own islands

Security is a bad place to trust a model’s confidence.

An agent can help explain a security finding, suggest a safer API, or update code to pass a rule. But the detection step should be deterministic where possible:

secret scanners for tokens and credentials
static analysis for risky code patterns
dependency scanners for vulnerable packages
policy checks for container images or infrastructure changes
permission checks before tools touch external systems
approval gates for deploys, publishes, and destructive actions

The reviewer still owns exceptions and policy calls. The scanner should still find the pattern every time.
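A toy secret scanner shows the shape of such a finding. The two rules below cover a widely documented AWS access key ID prefix and PEM private key headers; they are a sketch, not a replacement for a maintained scanner, and the rule IDs, severities, and hints are made up for this example.

```python
import re

# Hypothetical rule set for illustration; real scanners ship far more rules.
RULES = [
    {
        "id": "aws-access-key-id",
        "severity": "high",
        "pattern": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
        "hint": "Revoke the key and load credentials from the environment.",
    },
    {
        "id": "private-key-block",
        "severity": "critical",
        "pattern": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
        "hint": "Remove the key from the repository and rotate it.",
    },
]


def scan_text(path: str, text: str) -> list[dict]:
    """Return one finding per rule match, with rule ID, file, line, and severity."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule in RULES:
            match = rule["pattern"].search(line)
            if match:
                findings.append({
                    "ruleId": rule["id"],
                    "file": path,
                    "line": line_number,
                    # Redact the match so the report does not leak the secret again.
                    "snippet": match.group(0)[:4] + "***",
                    "severity": rule["severity"],
                    "remediation": rule["hint"],
                })
    return findings
```

The same input produces the same findings on every run. The model's job starts after that: explaining the finding and proposing the safer code.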

Common failure modes

The pattern is simple. The failures are also simple.

Fake validation

The agent says it validated the result, but no validator ran. Or the validator only checked that a file exists. Or the check looked at the generated source but not the exported artifact. This is common with binary outputs, generated docs, and UI screenshots.

The fix is to validate the thing the user will receive.
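For a generated PPTX, a first step in that direction is to open the exported file itself. A PPTX is a zip package, so the sketch below uses Python's standard `zipfile` module. The part names in `REQUIRED_PARTS` are the conventional ones; strictly speaking, the package relationships decide where the main part lives, so treat this as an assumption. It is also only a floor: it proves the package opens, not that the slides render correctly.

```python
import zipfile

# Conventional part names; an assumption, not a full package validation.
REQUIRED_PARTS = ("[Content_Types].xml", "ppt/presentation.xml")


def check_pptx_package(path_or_file) -> dict:
    """Open the exported deck as a zip package and report basic structural facts."""
    try:
        with zipfile.ZipFile(path_or_file) as archive:
            corrupt_member = archive.testzip()  # first member with a bad CRC, or None
            names = set(archive.namelist())
    except (zipfile.BadZipFile, FileNotFoundError) as err:
        return {"status": "fail", "code": "cannot-open", "message": str(err)}
    if corrupt_member is not None:
        return {"status": "fail", "code": "corrupt-member", "member": corrupt_member}
    missing = [part for part in REQUIRED_PARTS if part not in names]
    slides = [n for n in names if n.startswith("ppt/slides/slide") and n.endswith(".xml")]
    return {
        "status": "fail" if missing else "success",
        "code": "missing-parts" if missing else "ok",
        "missingParts": missing,
        "slideCount": len(slides),
    }
```

A file that merely exists fails this check if it is not a readable package. Rendering and placeholder checks then build on top of it.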

Hand-edited generated output

Someone fixes the generated PPTX, PDF, OpenAPI file, or code output by hand after the generator runs. The artifact looks right once, but the workflow is now split. The next run overwrites the fix or produces a different result.

If a manual edit is necessary, make it an input, update the generator, or record it as an explicit patch step with a check after it.

Skipped checks

Small changes are where people skip checks because the change feels obvious. Agents learn the same bad habit if the workflow lets them. A one-line publishing change can post to the wrong channel. A tiny refactor can break serialization. A harmless deck edit can leave an invisible placeholder behind.

Make the cheap checks automatic. Save human exceptions for the expensive ones.

Too many tools

A giant tool menu can make an agent worse. It chooses the wrong tool, calls tools in the wrong order, or spends time exploring options that should not exist for the task.

Good workflows expose fewer tools with clearer contracts. A tool named inspect_presentation that returns slide objects is easier to use than a general “run any office command” tool with no clear contract.

Vague errors

“Validation failed” is barely an error. The agent now has to infer the cause, and it may repair the wrong thing.

Errors should name the failed check, the target, the evidence, and the next action. An error like “unresolved-placeholder on slide 7, object title-placeholder” is something an agent can fix.

The real pattern

The model is the reasoning layer. It handles ambiguity, proposes changes, explains failures, and decides what to try next. Deterministic islands are the truth layer. They parse, validate, generate, test, scan, and gate.

Good workflows move between the two deliberately.

Keep the model away from jobs we already have tools for: compiling, parsing, linting, scanning, rendering, and replaying checks in CI. The engineering task is to make those tools available at the right points, with outputs the model can read and humans can trust.

That is how agent systems become reliable enough for real projects. The probabilistic reasoning stays. It gets a map, a set of checkpoints, and a few hard places where guessing stops.
