
ZDNET's key takeaways
- AI's cost in terms of tokens soars when using agents.
- Agents are inconsistent and can't predict their total token usage.
- Users must demand price transparency and performance guarantees.
Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. The providers of AI, such as OpenAI, Google, and Anthropic, have price lists, but none of those listed prices tell users what the final bill will be to actually solve a problem.
The result, according to a new study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring and unpredictable agent costs.
The study, by lead author Longju Bai of Michigan and collaborators at Stanford University, All Hands AI, Google's DeepMind unit, Microsoft, and MIT, titled "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks," is, according to the authors, "the first systematic study on AI agent token consumption."
The study was posted on the arXiv preprint server.
It is noteworthy for having as one of its authors a prominent Stanford economist who has commented extensively on AI's impact on productivity, Erik Brynjolfsson.
The top-level finding is that agents consume orders of magnitude more tokens than turn-by-turn, simple, prompt-based chats -- think 3,500 times the number of tokens for an agent as for a round of prompts with ChatGPT.
Also: AI agents are fast, loose, and out of control, MIT study finds
A token is the fundamental unit of information processed by an AI model. It could be a piece of a word, a whole word, or just a punctuation mark, depending on how a model chops data into pieces.
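Because billing is per token, even a rough sense of how text breaks into tokens translates directly into dollars. The sketch below is illustrative only: real tokenizers (such as OpenAI's tiktoken) split text model by model, so the naive whitespace split and the $3-per-million rate here are stand-in assumptions.

```python
# Illustrative only: a naive whitespace split stands in for a real
# model tokenizer, which splits text differently per model.
def naive_token_count(text: str) -> int:
    """Rough proxy for a tokenizer: count whitespace-separated chunks."""
    return len(text.split())

def token_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Providers bill per million tokens; cost scales linearly."""
    return tokens / 1_000_000 * price_per_million_usd

prompt = "Fix the failing comparison function in utils.py"
tokens = naive_token_count(prompt)  # 7 chunks with this naive splitter
print(tokens, token_cost_usd(tokens, 3.00))
```

The linear arithmetic is the easy part; as the study shows, the hard part is knowing how many tokens an agent will burn in the first place.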
You might expect agents to cost more in tokens, but the study reveals more alarming facts. Two different models can have wildly different token costs for the same task. And the same model can have different costs each time it works on the same problem, using as many as twice the number of tokens on one occasion compared to another.
The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task.
"Agentic tasks are uniquely expensive," they wrote, while more tokens don't necessarily improve results. "Simply scaling token usage may not lead to higher execution performance," they wrote, and, "[AI] models systematically underestimate the tokens they need."
The rising cost and the uncertainty of success are in no way accounted for in today's price lists from OpenAI and others. The work suggests there is no easy fix to the matter. The best users can do is to set hard limits on agentic compute use, possibly causing agents to stop before completing tasks.
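A hard limit of that kind can be sketched as a budget guard around the agent loop. The agent API here is hypothetical: `steps` simply yields the token count each step consumes, where a real agent would read usage figures from the provider's API response.

```python
# Sketch of a hard token budget for an agent loop (hypothetical agent API).
class BudgetExceeded(Exception):
    pass

def run_agent_with_cap(steps, token_budget: int) -> int:
    """Run agent steps until done or the token budget would be exceeded.

    `steps` yields the token count consumed by each agent step.
    """
    spent = 0
    for used in steps:
        if spent + used > token_budget:
            # Stop early rather than overrun -- the task may be incomplete.
            raise BudgetExceeded(f"spent {spent}, next step needs {used}")
        spent += used
    return spent

print(run_agent_with_cap([1000, 2000, 1500], token_budget=10_000))  # 4500
```

The trade-off is exactly the one the article names: the cap protects the bill, but the agent may be cut off mid-task.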
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
The big picture is that users collectively will have to push back on OpenAI and the other vendors and demand some form of reliable cost estimation and guarantees of task performance.
We reached out to OpenAI, Google, and Anthropic for comment.
Counting token costs
To study costs, Bai and team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and collaborating institutions. They used OpenHands to build agents, which they then tested on the open-source coding benchmark SWE-Bench. The SWE-Bench tasks are taken from real GitHub issues.
Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast
They first assessed the relative strengths of models. OpenAI's ChatGPT 5 and 5.2 "achieve strong accuracy at low cost," though they are not the most accurate. Anthropic's Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google's Gemini-3-Pro was somewhere in the middle. And the Kimi-K2 model from Chinese AI lab Moonshot may have the worst relative mix: the most tokens to achieve the lowest accuracy.
The authors suggested the difference in tokens is based on unique properties of how models are architected: "The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more expensive for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem."
But the issue is not one of better or worse models, because even the same model can take twice as many tokens to solve the same problem from one "run" of the task to the next.
"The most expensive runs double the token and monetary cost of the least expensive runs," they observed, "suggesting that the agent's token consumption has large variances even when running on exactly the same problem."
The lesson is that more tokens don't necessarily get you better results. "Simply scaling token usage may not lead to higher execution performance," they wrote.
In fact, the authors found that work can often get worse the longer an agent spends on a task. "Accuracy often peaks at intermediate cost and saturates at higher costs," they observed. "Agent behavior becomes increasingly unstable on more complex tasks."
Many models seem to search and search for a solution even when it's fruitless. "Models lack a reliable mechanism to recognize when a task is unsolvable and stop early," wrote Bai and team. "Instead, they continue exploring, retrying, and re-reading context, accumulating cost without progress."
Unable to predict costs
Those factors make "token usage prediction and agent pricing a fundamentally challenging task," wrote Bai and team. And, in fact, the bot itself cannot predict when asked to "introspect," they found.
Bai and team asked each AI agent to predict its tokens using the prompt: "I've uploaded a python code repository in the directory example repo. You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description," followed by the problem description, such as fixing a bug in a comparison function that fails.
What they found is that agents can approximate to a small degree how many tokens will be used, but their predictions tend to be too low.
"Models consistently underestimate the tokens they need," wrote Bai and team. "The bias is especially pronounced for input tokens, whose predictions stay compressed even as actual values grow into the millions."
Watch those inputs
That last point, about input tokens, has a particular prominence in the report. Bai and team found that input tokens, such as what's typed by the human user and what is retrieved via tools such as database searches, dominate the cost in tokens. The other two types of tokens, the output, which is generated, and the cached tokens held in memory from prior stages, are far less demanding.
"Strikingly, input tokens, not output tokens, dominate the overall cost in agentic coding."
The reason is that "agentic workflows accumulate information from different sources and the same context gets fed into the models repeatedly." As a result, there is a "dramatically higher input/output ratio" for agentic AI than for single-prompt or multi-prompt AI sessions with a bot.
And, drilling down even further, the most expensive input token source is when the agent retrieves prior information from memory. "We find that cache reads dominate both raw token volume and dollar cost," Bai and team wrote. "In each phase, cache-read input tokens are the largest category by a wide margin (Figure 8a), reflecting the cumulative reuse of prior context."
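That breakdown by token class can be sketched as simple arithmetic. The per-million rates and token counts below are made up for illustration (real provider prices differ, though cached-input reads are typically discounted relative to fresh input); the counts mimic the study's pattern, with cache reads dwarfing everything else because the same context is re-fed to the model at every step.

```python
# Hypothetical per-million-token rates; real provider prices differ.
RATES = {"input": 3.00, "cached_input": 0.30, "output": 15.00}

def run_cost(usage: dict) -> dict:
    """Break an agent run's bill down by token class (tokens -> dollars)."""
    return {k: usage[k] / 1_000_000 * RATES[k] for k in usage}

# Made-up counts mimicking the study's pattern: cache reads dominate volume,
# so even at a discounted rate they can dominate the dollar cost too.
usage = {"input": 2_000_000, "cached_input": 40_000_000, "output": 300_000}
print(run_cost(usage))
```

Note how the input-side classes swamp the output cost despite output's much higher per-token rate, which is the study's "dramatically higher input/output ratio" in miniature.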
There will be a reckoning
Overall, the study results confirm my anecdotal experience with coding agents such as Replit and Lovable, where the meter was constantly running to use the underlying AI models, and I had no sense of what the total cost would be.
What can be done? The authors don't have many suggestions. One suggestion is that even if agents can't predict the exact number of tokens, they can make some guesses at a high level, a "coarse-grained" estimate of token cost. "This suggests that agent-driven estimation can potentially support early budget alerts before launching expensive runs, improving cost transparency without overpromising precise token-level accuracy," they wrote.
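A coarse-grained alert of that kind might look like the sketch below. Here `estimated_tokens` stands in for an agent's self-estimate, which the study found skews low, so a safety multiplier is applied before comparing against the alert threshold; both the multiplier and the rates are assumptions, not anything the paper prescribes.

```python
# Sketch of a coarse-grained pre-launch budget alert. The safety factor
# compensates for the systematic underestimation the study reports.
def budget_alert(estimated_tokens: int, price_per_million: float,
                 alert_threshold_usd: float, safety_factor: float = 3.0) -> bool:
    """Return True if the inflated cost estimate warrants a warning."""
    projected = estimated_tokens * safety_factor / 1_000_000 * price_per_million
    return projected > alert_threshold_usd

# An agent guesses 2M tokens; with a 3x margin at $3/M, that's ~$18.
print(budget_alert(2_000_000, 3.00, alert_threshold_usd=10.0))  # True
```

This is deliberately crude: it promises only an order-of-magnitude warning before launch, not token-level accuracy.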
I can think of a few other sensible guidelines.
Since input tokens are the biggest cost element, one should think carefully about what can be controlled at input. The size of prompts is one factor that drives input tokens higher. The context window used with an agent, wider or narrower, affects token count at input. And the number of tools called by the agent, such as databases, will bring lots more input tokens into play.
Also: Can a newbie truly vibe codification an app? I tried Cursor and Replit to find out
There's only so much you can do as a user, however. Something more will have to be done on an industry-wide basis. The problems outlined are clearly those of a young industry, and one where vendors will have to be pushed by users to change practices.
The lack of transparency as to what an agent might cost to do a task is way too vague for enterprises that need to be able to plan investments in software. The burden is pushed onto the individual to run agentic tasks experimentally over and over in order to get something like an average cost to use as an estimate for planning purposes.
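That experimental workaround amounts to basic statistics: run the same task several times and plan around the mean and spread of the observed bills. The per-run costs below are made up, chosen to show the roughly 2x run-to-run variance the study reports.

```python
# Sketch of the planning workaround: average repeated runs of the same task.
from statistics import mean, pstdev

def planning_estimate(observed_costs_usd: list[float]) -> tuple[float, float]:
    """Return the mean cost and mean-plus-one-sigma as a conservative budget."""
    m, s = mean(observed_costs_usd), pstdev(observed_costs_usd)
    return m, m + s

# Made-up per-run bills spanning roughly the 2x variance the study found.
runs = [4.10, 5.75, 8.20, 4.90, 7.40]
avg, budget = planning_estimate(runs)
print(round(avg, 2), round(budget, 2))
```

Even this only yields a planning number, not a guarantee; the spread itself is the cost of the unpredictability the paper documents.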
And the lack of guarantees of success -- even after the agent burns through tokens -- is the most glaring problem. That means enterprises could waste huge amounts of money just running tokens.
Users collectively are going to have to push back on vendors such as OpenAI, Google, and Anthropic and demand price transparency and some form of guarantee that a task will be completed, or else the whole exercise of agentic AI may be dominated by cost overruns and failed implementations.
Such deep problems are most likely already being encountered by early adopters. They may be content to pay such a high cost to be among the first to gain an agentic edge. It's not a situation, however, that can lead to stable, steady use of agentic AI.
