Stop Measuring AI by How Much You Use It
A practical way to tell the difference between useful AI adoption and expensive prompt activity.
A colleague dropped a line in Slack that has been stuck in my head:
Some people are trying to burn as many coins as possible because it is measurable, and they think that counts as productivity.
Coins, credits, tokens, messages, seats, prompts. Pick your unit. Every enterprise AI platform eventually produces a dashboard that looks reassuringly precise. Someone can show that usage went up. Someone else can show that more teams activated their accounts. A third person can point to token consumption and say, with the straight face corporate dashboards seem to require, that adoption is accelerating.
The measurement is real. The interpretation is where things get slippery.
I understand why management wants adoption numbers. AI budgets are large, executive pressure is real, and nobody wants to explain that the company spent a year “exploring” while competitors shipped. IBM’s 2025 CEO study puts the pressure plainly: only 25% of surveyed AI initiatives delivered expected ROI, only 16% scaled enterprise-wide, and 64% of CEOs said fear of falling behind drives investment in some technologies before value is clear.
That is a perfect environment for adoption theater.
The danger is not that people use AI too much. I want developers, architects, platform engineers, support teams, and product people to experiment. I want agents inside the software delivery lifecycle. I want boring work to get automated and hard questions to get better support.
The danger is that we confuse measurable AI activity with productive change.
The Dashboard Is Not The Work
Usage metrics are useful. They tell you whether people have access, whether a rollout reached the intended audience, whether capacity planning is sane, whether cost is drifting, whether some teams need help, and whether governance controls are being bypassed.
They are bad productivity metrics.
Seat activation means someone can use the tool. Prompt count means someone did use the tool. Token spend means the system processed text. Reasoning-token consumption may mean people are asking the model to do harder work. It may also mean the agent got stuck in a loop, the prompt was sloppy, or the team turned a vague task into a very expensive conversation.
This is the old metrics problem with a new billable unit. Goodhart’s law is usually summarized as: when a measure becomes a target, it stops being a good measure. If prompt volume becomes a visible success metric, people will create prompt volume. If token consumption becomes evidence of transformation, people will burn tokens. Efficiently, even. We are very good at optimizing for the thing on the dashboard.
Developer productivity research has been warning about this for years. The SPACE framework exists partly because productivity is too complex to collapse into one number. It includes satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. The whole point is that activity is only one dimension. It is the easiest to count and one of the easiest to misread.
AI does not repeal that lesson. It makes the trap shinier.
Adoption Is High. Value Is Uneven.
The broad adoption story is real. McKinsey’s 2025 State of AI survey says 88% of respondents report that their organizations regularly use AI in at least one business function. It also says nearly two-thirds of organizations have not begun scaling AI across the enterprise, and only 39% report EBIT impact at the enterprise level.
That is the gap.
The tools are inside the building. The value is not automatically inside the operating model.
OpenAI’s own enterprise AI report shows how fast usage can scale: weekly ChatGPT Enterprise messages grew about 8x year over year, and average API reasoning-token consumption per organization grew about 320x. The report also argues that deeper workflow integration is where value appears. I agree with that framing. The important word is workflow. The value is not in the message count. The value is in the work that changed because of it.
Gartner’s supply-chain research makes the same point from another angle. In a 2025 survey, desk-based workers saved 4.11 hours per week with GenAI, and that time saved correlated with better individual output and quality. At team level, the time savings dropped to 1.5 hours per team member per week, with no correlation to improved output or higher quality.
That is not an anti-AI result. It is an anti-magic result.
Individual acceleration does not automatically become organizational throughput. A developer may finish a local task faster while the team spends more time reviewing a larger diff. A support engineer may draft a better answer faster while the knowledge base remains stale. A product manager may summarize customer feedback faster while prioritization still happens through politics, habit, and whoever had the louder slide.
AI can improve all of this. But the improvement has to land in the system, not just in an individual’s session history.
The Inner Loop Got Cheaper
For software delivery, the obvious AI story lives in the developer inner loop.
A developer states an intent. An agent reads context, edits files, runs tests, explains failures, adjusts the patch, drafts a pull request, and maybe writes the release notes. This is real. It is useful. It changes the economics of implementation.
Code, tests, docs, configuration, summaries, and analysis are now cheaper to produce. That is a big deal.
It also moves the bottleneck.
When implementation gets cheaper, specification and verification get more important. The hard part becomes deciding what should be true, constraining the work enough that the agent does not wander, proving that the result behaves correctly, and leaving enough evidence that another human can review the change without reconstructing the whole conversation from the diff.
This is where zero-spec agent work gets expensive. A vague prompt may produce a plausible patch quickly. That patch still has to be reviewed, tested, integrated, secured, deployed, operated, and explained. If the agent solved the wrong problem, the cost did not disappear. It moved downstream into the team.
I keep coming back to this sentence:
The unit of work is not the prompt. The unit of work is the change packet.
A useful change packet includes the intent, assumptions, non-goals, plan, diff, test results, validation results, open questions, rollback notes, and operational watchpoints. A pull request can carry that packet. A release artifact can carry it. The exact format matters less than the evidence.
Counting prompts ignores the packet. Counting tokens ignores the packet. Counting activated users ignores the packet.
The packet is where AI work becomes reviewable engineering work.
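To make the packet concrete, here is a minimal sketch of what it might look like as structured data. The field names simply mirror the list above; nothing here is a standard format, and a pull request template or release manifest could carry the same information just as well.

```python
from dataclasses import dataclass, field

@dataclass
class ChangePacket:
    """Sketch of the evidence a change should carry.
    Field names mirror the list above; illustration, not a standard."""
    intent: str                          # what should be true after the change
    assumptions: list[str]               # what the author or agent took for granted
    non_goals: list[str]                 # explicitly out of scope
    plan: str                            # how the change was approached
    diff_ref: str                        # link to the actual diff or pull request
    test_results: dict[str, str]         # suite name -> outcome
    validation_results: dict[str, str]   # gate name -> outcome (security, policy, ...)
    open_questions: list[str] = field(default_factory=list)
    rollback_notes: str = ""
    operational_watchpoints: list[str] = field(default_factory=list)

def reviewable(packet: ChangePacket) -> bool:
    """A reviewer can start without reconstructing the chat session."""
    return bool(packet.intent and packet.plan
                and packet.test_results and packet.rollback_notes)
```

The shape is trivial. What matters is that the evidence travels with the change instead of staying behind in a session transcript.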
The Outer Loop Is Where Enterprise Value Appears
The narrow story says AI transforms the developer inner loop. True, but too small.
The serious enterprise story is the outer loop.
The inner loop turns intent into one change. The outer loop decides whether the stream of changes is sustainable. It includes portfolio choices, product priority, architecture ownership, risk classification, CI/CD, release controls, operations, compliance evidence, incident response, and learning from production.
Agents can help here too, but the work looks different. They can summarize incidents, draft runbook updates, compare release evidence, explain telemetry, prepare compliance notes, find repeated review comments, propose tests for recurring defects, and turn operational pain into better repository guidance.
That is where agentic development becomes an operating-model change.
If the outer loop does not change, faster implementation can make the organization worse. Bigger diffs arrive faster. Review queues grow. Test suites become trust bottlenecks. Security teams see more generated code with less visible intent. Platform teams answer the same questions again because the agent did not know local conventions. Operations inherits changes that looked technically valid but did not match ownership, deployment policy, or production reality.
The work looked cheap at the prompt boundary. The cost arrived in everyone else’s calendar.
This is why I like thinking in terms of an agentic SDLC rather than a coding assistant rollout. The goal is not just more AI in editors. The goal is a delivery system where agents can participate safely across the lifecycle:
Shared intent before execution
Small validated specs instead of vague prompts
Repo-local guidance such as AGENTS.md
Reusable skills with ownership and review
Behavior tests that act as executable oracles
Validation gates for security, policy, provenance, and release readiness
Risk-based autonomy instead of one permission model for every task
A controlled production boundary
Evidence that survives the chat session
Operational feedback that improves the next round of agent work
That list is not a plea for more process. Please, no. Enterprises have enough process sediment to form attractive geological layers.
The point is to move judgment into durable places. If a reviewer leaves the same comment on five agent-generated pull requests, that comment should probably become repo guidance, a skill, a test, or a validation gate. If an incident reveals that the agent did not understand an operational constraint, that learning should feed the readiness layer. If a team keeps burning tokens rediscovering the same build command, the problem is not enthusiasm. The problem is missing context.
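One item on that list, risk-based autonomy, is easy to gesture at and harder to pin down. Here is a rough sketch of what I mean, with made-up risk classes, gate names, and example tasks rather than any particular platform’s policy model:

```python
# Sketch of risk-based autonomy: what an agent may do depends on the risk
# class of the task, not on one org-wide permission switch.
# Risk classes, gate names, and example tasks are illustrative assumptions.
AUTONOMY_POLICY = {
    "low": {        # dependency patch bumps, docs, mechanical refactors
        "agent_may_merge": True,   # only behind passing gates
        "required_gates": ["tests", "license_scan"],
        "human_review": "optional",
    },
    "medium": {     # feature changes inside clearly owned services
        "agent_may_merge": False,
        "required_gates": ["tests", "security_scan", "provenance"],
        "human_review": "one reviewer",
    },
    "high": {       # auth, payments, data migrations, shared platforms
        "agent_may_merge": False,
        "required_gates": ["tests", "security_scan", "provenance", "architecture_review"],
        "human_review": "two reviewers plus the owning team",
    },
}

def permissions_for(risk_class: str) -> dict:
    """Look up what the agent may do for a given risk class."""
    return AUTONOMY_POLICY[risk_class]
```

The table itself is boring. The point is that it lives somewhere durable and reviewable instead of being re-decided, implicitly, on every agent run.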
Measure Flow, Not Heat
Token burn is heat. It tells you energy was consumed.
Flow tells you whether the system got better at turning intent into safe production change.
For software teams, we already have a better starting point than AI usage dashboards. DORA’s software delivery metrics track change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate. DORA groups these into throughput and instability, which is exactly the kind of balance AI programs need. Speed alone is not a win if the speed mostly creates rework.
DORA’s 2025 AI-assisted software development report also uses a phrase I like: AI is an amplifier. It magnifies an organization’s existing strengths and weaknesses. That feels right. A team with strong tests, clear ownership, small batches, good CI, and healthy review habits gets more leverage. A team with weak specs, flaky tests, unclear architecture, and overloaded reviewers gets faster confusion.
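As a toy illustration of flow rather than heat, here is how those throughput and instability signals might be computed from delivery records. The records and field names are made up; in practice this data comes out of the deployment pipeline, not a spreadsheet.

```python
from datetime import datetime
from statistics import median

# Toy delivery records; field names and values are made up for illustration.
deployments = [
    {"committed": datetime(2025, 6, 2, 9),  "deployed": datetime(2025, 6, 3, 14), "failed": False, "rework": False},
    {"committed": datetime(2025, 6, 4, 11), "deployed": datetime(2025, 6, 4, 16), "failed": True,  "rework": True},
    {"committed": datetime(2025, 6, 5, 10), "deployed": datetime(2025, 6, 6, 9),  "failed": False, "rework": False},
]

# Throughput: how quickly intent becomes a production change.
lead_times_h = [(d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deployments]
print(f"median change lead time: {median(lead_times_h):.1f} hours")
print(f"deployments in window: {len(deployments)}")

# Instability: how often that speed turns into failure and rework.
print(f"change fail rate: {sum(d['failed'] for d in deployments) / len(deployments):.0%}")
print(f"deployment rework rate: {sum(d['rework'] for d in deployments) / len(deployments):.0%}")
```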
So measure the system.
At the inner-loop level, useful questions look like this:
Did change lead time improve for comparable work?
Did agent-assisted changes stay small enough to review?
Did the team add or improve behavior tests with the change?
How often did reviewers have to reconstruct intent from the diff?
How often were agent-generated changes accepted, substantially rewritten, or abandoned?
Which task categories produced reliable gains, and which produced rework?
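A sketch of how a team might track a couple of those signals for agent-assisted changes. The outcome categories and the 400-line review threshold are assumptions, not a standard taxonomy:

```python
from collections import Counter

# Toy records of agent-assisted changes; tasks, sizes, and outcomes are made up.
agent_changes = [
    {"task": "dependency upgrade", "lines_changed": 40,  "outcome": "accepted"},
    {"task": "dependency upgrade", "lines_changed": 35,  "outcome": "accepted"},
    {"task": "new API endpoint",   "lines_changed": 900, "outcome": "substantially rewritten"},
    {"task": "refactor",           "lines_changed": 250, "outcome": "abandoned"},
]

total = len(agent_changes)

# How agent-generated changes actually fared in review.
for outcome, count in Counter(c["outcome"] for c in agent_changes).items():
    print(f"{outcome}: {count / total:.0%}")

# Whether the diffs stayed small enough to review.
small_enough = sum(1 for c in agent_changes if c["lines_changed"] <= 400)
print(f"small enough to review (<= 400 lines changed): {small_enough / total:.0%}")

# Which task categories produce reliable gains versus rework.
print(Counter((c["task"], c["outcome"]) for c in agent_changes))
```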
At the outer-loop level, the questions get more interesting:
Did deployment rework go down?
Did change fail rate stay stable or improve?
Did recovery get faster because rollback notes and operational watchpoints improved?
Did repeated review comments turn into guidance, tests, or skills?
Did incidents feed back into AGENTS.md, runbooks, validation gates, or architecture rules?
Did security and compliance get better evidence earlier in the lifecycle?
Did platform teams see fewer repeated questions because context became reusable?
That is a different adoption story.
“We consumed 40 million tokens this quarter” is a cost sentence pretending to be a transformation sentence.
“We reduced review rework for dependency upgrades because the agent now follows a maintained skill, opens smaller pull requests, attaches test evidence, and updates rollback notes” is a transformation sentence.
Less shiny. Much better.
Exploration Still Counts
There is one caveat worth saying clearly: exploration is real work.
Early AI adoption needs space for experiments that do not immediately produce ROI. People need to learn the tools. Teams need to find where agents help and where they are mostly expensive autocomplete with confidence issues. Some use cases will die, and they should. Killing a bad AI use case is not resistance. It is governance with a pulse.
The problem starts when exploration metrics get promoted into productivity metrics.
During exploration, usage can be a healthy signal. You want to know whether people are trying the tools, where they get stuck, which workflows attract attention, and where cost spikes. But an exploration phase should produce learning artifacts:
A list of useful and rejected use cases
Prompt and workflow patterns that survived contact with real work
Team guidance for when to use agents and when not to
New or improved behavior tests
Updated AGENTS.md files
Reusable skills
Safer tool permissions
Better validation gates
Evidence templates for pull requests and releases
If exploration produces only consumption, it was consumption.
If exploration improves the outer loop, it was investment.
What I Would Ask Before Celebrating Adoption
The next time someone shows an AI adoption dashboard, I would not start by arguing with the chart. The chart probably tells a useful operational story. I would ask what story it cannot tell.
Five questions usually get there:
Which workflow changed?
What got better for the customer, developer, operator, or business?
What human work disappeared, and what new human work appeared?
What risk did we introduce or reduce?
Which durable artifact improved the next run: test, skill, guidance, validation gate, runbook, evidence trail, or architecture rule?
Those questions are harder than “how many people used AI last month?” That is why they are useful.
They also protect teams from the worst version of adoption pressure. If leadership asks only for usage, rational teams will produce usage. If leadership asks for workflow outcomes and evidence, teams have permission to do the slower, more valuable work of changing the system.
The Better Adoption Story
I am not interested in an enterprise where everyone gets a chatbot and a quarterly dashboard declares victory.
I am interested in an enterprise where agentic development changes how software moves from intent to production. The developer inner loop gets faster, yes. But the outer loop gets smarter too. Specifications become smaller and more testable. Pull requests carry better evidence. Review friction becomes reusable guidance. Incidents update future agent context. Production boundaries stay controlled. Low-risk work gets lighter paths. High-risk work gets better analysis without handing production keys to an enthusiastic model.
That is adoption worth measuring.
Burning coins may be necessary while teams learn. It is also a terrible definition of progress. Ashes are measurable. So is heat. Neither tells you whether the organization learned how to deliver safer, better software.
The useful AI question is:
What changed in the system because we used it?
If the answer is only “the usage graph went up,” we did not transform the SDLC. We decorated the burn rate.