Is the era of subsidized AI over?

May 20, 2026

What Gemini 3.5 Flash's pricing tells us about the economic reality behind frontier AI.

Yesterday, at Google I/O 2026, Google launched Gemini 3.5 Flash. The marketing focused on "frontier intelligence with action" — a model that handles coding, agentic workflows, and multimodal tasks at the level of much larger flagship models. The benchmarks look strong: 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, sitting in the top-right quadrant of the Artificial Analysis index.

But the most interesting storyline isn't in the benchmarks. It's in a row of numbers that Google, tellingly, chose not to include in their announcement post.

The numbers

Price per million tokens (input / output):

Model	Input	Output
Gemini 2.5 Flash	$0.30	$2.50
Gemini 3.0 Flash (preview)	$0.50	$3.00
Gemini 3.5 Flash	$1.50	$9.00

That's a 3x price increase on both input and output between two consecutive generations of the same tier. For context:

Gemini 2.5 Pro was priced at $1.25 / $10 — so 3.5 Flash sits above the input price of the previous flagship.
Claude Sonnet is at $3 / $15 in its current generation, against $9 output for Gemini 3.5 Flash. The gap between "Flash" and "Sonnet," historically a budget-versus-midtier distinction, has shrunk to roughly 1.7x on output and 2x on input.
Gemini 3.1 Flash Lite sits at $0.25 / $1.50. That makes 3.5 Flash 6x more expensive on both input and output.
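
If you want to check the multiples yourself, here is a minimal sketch that recomputes them from the list prices quoted above. The dictionary keys and the `price_ratio` helper are my own naming, not anything from a provider SDK, and the prices are simply the ones cited in this post.

```python
# Prices in USD per million tokens (input, output), as quoted in this post.
PRICES = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.0 Flash (preview)": (0.50, 3.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude Sonnet": (3.00, 15.00),
    "Gemini 3.1 Flash Lite": (0.25, 1.50),
}


def price_ratio(model: str, baseline: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of `model` relative to `baseline`."""
    m_in, m_out = PRICES[model]
    b_in, b_out = PRICES[baseline]
    return m_in / b_in, m_out / b_out


if __name__ == "__main__":
    target = "Gemini 3.5 Flash"
    for baseline in PRICES:
        if baseline == target:
            continue
        r_in, r_out = price_ratio(target, baseline)
        print(f"{target} vs {baseline}: input {r_in:.2f}x, output {r_out:.2f}x")
```

Ratios below 1.0 mean 3.5 Flash is the cheaper of the two on that side, which is the case against Claude Sonnet and on output against Gemini 2.5 Pro.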

The Flash positioning, which stood for years as "fast and cheap," has structurally shifted. An HN commenter put it bluntly: "I think flash just means 'fast' now."

The real cost runs further still

The raw token price is one thing. The actual cost per workload is another. Gemini 3.5 Flash has thinking tokens built in and consumes measurably more tokens per task than its predecessors.

Artificial Analysis published the cost of running their full evaluation suite on different models. The numbers are brutal:

Model	Intelligence score	Total eval cost
Gemini 2.5 Flash	27	$172 (1.0x)
Gemini 3.0 Flash	46	$278 (1.6x)
Gemini 2.5 Pro	35	$649 (3.8x)
Gemini 3.5 Flash	55	$1,552 (9.0x)

Nine times more expensive than 2.5 Flash to run the same benchmarks. 5.6x more expensive than 3.0 Flash, which is barely six months old. And the most striking detail: 3.5 Flash cost 74% more than Gemini 3.1 Pro to run the entire suite — while scoring lower on some benchmarks.
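
The multiples in that table are easy to reproduce. The sketch below recomputes them from the four rows above. It also derives a crude "dollars per index point" figure, which is my own illustrative metric and not something Artificial Analysis publishes. The index is unlikely to be linear, so treat that number as a rough signal only. Gemini 3.1 Pro is left out because its eval cost isn't in the table.

```python
# Total eval cost (USD) and intelligence score, copied from the table above.
EVAL_COST_USD = {
    "Gemini 2.5 Flash": 172,
    "Gemini 3.0 Flash": 278,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.5 Flash": 1552,
}
INTELLIGENCE = {
    "Gemini 2.5 Flash": 27,
    "Gemini 3.0 Flash": 46,
    "Gemini 2.5 Pro": 35,
    "Gemini 3.5 Flash": 55,
}


def cost_multiple(model: str, baseline: str) -> float:
    """How many times more the full eval suite costs on `model` than on `baseline`."""
    return EVAL_COST_USD[model] / EVAL_COST_USD[baseline]


def cost_per_point(model: str) -> float:
    """Illustrative only: eval cost divided by intelligence score."""
    return EVAL_COST_USD[model] / INTELLIGENCE[model]


if __name__ == "__main__":
    for name in EVAL_COST_USD:
        multiple = cost_multiple(name, "Gemini 2.5 Flash")
        print(f"{name}: {multiple:.1f}x baseline, ${cost_per_point(name):.2f} per point")
```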

You're paying Pro-tier prices (or more) for what is, in name, still a Flash model.

Why is this happening?

Roughly four explanations are circulating, and they aren't mutually exclusive:

1. Generative AI just isn't profitable at the old prices

The simplest explanation is also the most uncomfortable one: frontier AI was, for years, offered below cost to capture market share. Capex investments in datacenters across the sector run into the hundreds of billions. At some point those expenses need to flow back through the P&L. We've seen this pattern before — Uber, AWS in its early years, streaming services — and the endgame is always a price correction once the market has been "educated."

A widely-quoted reaction on Hacker News: "Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. Expect more increases in the future."

2. A deliberate squeeze after lock-in

A second reading: providers wait until developers have built production workloads on their API, then turn the prices up. It's a classic platform strategy. Cheap access to form the habits, then a repositioning once switching costs are high enough.

Anyone who built agentic systems around Gemini 2.5 Flash now has three options: pay up, re-architect, or fall back to the inferior Flash-Lite tier. None of those are painless.

3. Flash isn't Flash anymore — a positioning shift

A more charitable interpretation: 3.5 Flash is closer in capability to a Pro model than to a traditional Flash. Google simply didn't update the naming (just as Gemini 3.0 Flash never made it out of preview — a telling detail). The Flash-Lite tier would then occupy the slot that "Flash" historically held.

Possibly. But even then it's a naming failure that puts developers in a tough spot. And it doesn't explain the discrepancy between what 3.5 Flash costs to run and what it delivers against competing models like MiMo-V2.5-Pro, which it roughly matches on benchmarks at 3x the price.

4. It's an agentic-only play

Google positions 3.5 Flash explicitly for long-horizon agentic workloads — coupled with their Antigravity harness and subagent architecture. The use cases they showcase (Shopify forecasting, Macquarie onboarding automation, Salesforce Agentforce) are all enterprise workflows where the value per completed task comfortably exceeds the token cost.

For that market, $9/M output is easy to justify. For the developer building a chatbot or a retrieval layer, it isn't. Google may be deliberately giving up the long-tail segment to cheaper models in order to focus on enterprise agents, where margins run higher.

The counter-narrative

For balance: Google's own framing is that this model shifts the Pareto frontier. Arena.ai notes that eight Google DeepMind models dominate the Text Arena Pareto curve. In other words: for the intelligence you get, this is still competitive.

And there's truth in that. 3.5 Flash is two to four times faster than other frontier models on output tokens per second. For latency-sensitive applications, that speed advantage is real. On top of that, Gemini's caching discount (cached input tokens are billed at 10% of the input price) remains aggressive for anyone running agentic workloads where 90-95% of cost sits in the cached prefix.
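
To see how much the caching discount can matter, here is a small sketch of the blended input price. It assumes the 90-95% figure is the share of input tokens served from cache, which is my interpretation, not something the pricing page states. It also ignores output tokens and any cache storage fees. The function name and parameters are hypothetical.

```python
def effective_input_price(
    base_price: float,
    cached_share: float,
    cached_rate: float = 0.10,
) -> float:
    """Blended USD price per million input tokens.

    base_price:   list price per million uncached input tokens
    cached_share: fraction of input tokens served from cache (0.0 to 1.0)
    cached_rate:  cached tokens cost this fraction of base_price (10% per this post)
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    return base_price * (cached_share * cached_rate + (1.0 - cached_share))


if __name__ == "__main__":
    for share in (0.0, 0.90, 0.95):
        price = effective_input_price(1.50, share)
        print(f"cached share {share:.0%}: ${price:.4f} per million input tokens")
```

Under those assumptions, a 90% cached share brings the $1.50 input price down to about $0.285 per million tokens, and 95% brings it to about $0.2175. The $9 output price is untouched by caching.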

But it doesn't change the core question. The absolute cost has risen significantly, not fallen. And that breaks a pattern of nearly three years in which every generation delivered more intelligence per dollar.

The broader pattern

This isn't an isolated Google phenomenon. Looking across the market:

Claude Opus 4.7 runs at $5/$25. That tier hasn't really gotten cheaper across generations.
GPT-5.5 xhigh modes consume substantially more tokens than predecessors.
Anthropic's Mythos has deliberately not been made publicly available — reportedly because the compute requirements make a viable pricing model difficult.

The common pattern: capability per token continues to rise, but total cost of ownership per use case is rising alongside it. That's fundamentally different economics from what we saw between 2023 and 2025, when new generations were almost axiomatically cheaper at constant or improved capability.

The direction has flipped.

What this means if you're building something

For anyone running production systems on LLM APIs, a few concrete implications:

1. Provider abstraction is no longer a luxury; it's an insurance policy. Hard-coupling to a single API now means taking on the full risk of unilateral price increases. A layer like OpenRouter, LiteLLM, or your own abstraction lets you swap models without touching application logic. This is exactly why BYOK (Bring Your Own Key) models are gaining traction: the end user carries the provider risk, not the builder.

2. Token efficiency is a KPI again. Between 2024 and early 2026, prompt engineering was mostly about quality, not frugality. With these pricing moves, every unnecessary output token is money again. Caching, prompt compression, and deliberately under-prompting models become skills that translate to financial outcomes.

3. Local and open-weight models are a serious hedge. Qwen 3.6, DeepSeek v4, and Gemma 4 have meaningfully lowered the bar for "good enough for production." A hybrid architecture — local models for 80% of queries, frontier models only when you need them — used to be an academic ideal. Now it's a business case.

4. The Flash-Lite tier deserves a second look. Anyone protesting 3.5 Flash's price may be better off on 3.1 Flash Lite ($0.25/$1.50). The capability gap is real, but a 6x price differential often justifies reworking the prompt pipeline.
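
Points 1, 3, and 4 can be combined in a thin routing layer. The sketch below is a toy version under loud assumptions: `ChatModel`, `StubModel`, `TieredRouter`, and `looks_hard` are all names I made up, the model strings are placeholder labels and not verified API identifiers, and the stub returns canned text instead of calling any provider. A real setup would swap the stub for an actual client and use a much better routing signal than prompt length.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ChatModel(Protocol):
    """The only interface the application code depends on (hypothetical)."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Stand-in for a real provider client; replace complete() with an SDK call."""

    name: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] response to: {prompt[:40]}"


@dataclass
class TieredRouter:
    """Sends each prompt to the cheap model unless the predicate says otherwise."""

    cheap: ChatModel
    frontier: ChatModel
    needs_frontier: Callable[[str], bool]

    def route(self, prompt: str) -> ChatModel:
        return self.frontier if self.needs_frontier(prompt) else self.cheap

    def complete(self, prompt: str) -> str:
        return self.route(prompt).complete(prompt)


def looks_hard(prompt: str) -> bool:
    """Toy heuristic (an assumption): long or multi-step prompts go to the frontier tier."""
    return len(prompt) > 2000 or "step by step" in prompt.lower()


# Prices are the ones quoted in this post; the model labels are placeholders.
lite = StubModel("gemini-3.1-flash-lite", 0.25, 1.50)
flash = StubModel("gemini-3.5-flash", 1.50, 9.00)
router = TieredRouter(cheap=lite, frontier=flash, needs_frontier=looks_hard)

if __name__ == "__main__":
    print(router.complete("Summarize this ticket in one line."))
    print(router.complete("Plan the database migration step by step."))
```

Because the application only ever talks to `router.complete`, changing providers or repricing a tier means editing the two `StubModel` lines, not the call sites.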

One-off move or structural shift?

My intuition: this is no isolated accident. I suspect 2026 will be the year AI pricing normalizes toward true cost plus margin, and that we'll retrospectively view 2023-2025 as the "subsidy phase."

Three reasons I think that:

No provider can keep absorbing current capex without a path to break-even per inference call. NVIDIA is making money. Cloud providers are making money. The model labs aren't — not at current token prices.
For the first time, there's a credible alternative in the form of open-weight models that hit production-grade quality. That lowers the strategic necessity for providers to deliver below cost — anyone who wants to leave largely can.
The agentic workloads driving the industry consume token volumes orders of magnitude higher than chat volumes. The old pricing model was calibrated for chat. The new one will be calibrated for autonomous loops.

For anyone building: assume the margins you have on your AI stack today will be thinner next year. Plan accordingly.

What do you think? A one-off outlier, or the start of a new pricing reality? I'd love to hear from people running production workloads who are feeling the impact concretely.

Sources:
- Google's announcement post
- Hacker News discussion
- Artificial Analysis: Gemini 3.5 Flash benchmarks