Back in 2024, the big tech companies were all flexing their muscles over who had the most parameters and the highest benchmark scores. OpenAI rolled out something with a trillion parameters, Google countered with insane context windows. Every AI conversation started and ended with “Scaling Law.”
Fast forward to May 2026, and the vibe has done a complete 180.
In a single week in late April, all the major players dropped new models in a tight cluster — OpenAI launched GPT‑5.5, Anthropic updated Claude Opus, Alibaba released Qwen3.6‑Max, Tencent put out Hunyuan Hy3, and Xiaomi shipped MiMo‑V2.5‑Pro. But this time, nobody was bragging about parameter counts. The whole conversation had shifted. As multiple industry insiders put it, AI competition is moving from “how big is your model” to “how efficient and deployable is it, and what can it actually do in the real world.”
Let me translate that into plain English: Stop telling me how smart or scholarly your AI is. Just tell me — how much does it cost to get a task done? And once I can afford it, can I actually get the job done?
Alright, let’s look at just how brutal this price war has become.
Prices have gone completely insane — some are 100x apart, and one player is even raising prices
Let me throw out some concrete numbers so you can feel what “prices hitting rock bottom” really means.
On April 24, DeepSeek officially released its V4 model family. The V4‑Flash version was priced at 2 RMB per million output tokens. The V4‑Pro version was set at 6 RMB, but with a limited‑time 75% discount, it came out even lower. Two days later, DeepSeek doubled down: they announced that the “input cache hit” price across the entire API family would be permanently cut to one‑tenth of its previous rate. V4‑Flash cache hits dropped to 0.02 RMB per million tokens, and V4‑Pro, with the promotional discount stacked on top, went as low as 0.025 RMB per million tokens.
What does that number actually mean?
0.02 RMB per million tokens. That’s basically two Chinese cents to process a million tokens of text, several full-length novels’ worth.
Now let’s look at the other side of the table.
OpenAI’s GPT‑5.5 Pro: $30 per million input tokens, $180 per million output tokens.
Anthropic’s Claude Opus 4.7: $5 per million input tokens, $25 per million output tokens.
Google’s Gemini 3.1 Pro: $2 per million input tokens, $12 per million output tokens.
The gap is staggering. Every single conversation with GPT‑5.5 Pro costs roughly 32 times more than one with DeepSeek V4. If you specifically look at output cost, the difference is even more extreme — GPT‑5.5 Pro’s output side is nearly 100 times more expensive than V4‑Pro’s.
One developer put it in painfully concrete terms: Saoud Rizwan, the founder of Cline, pointed out that if Uber switched its AI services from Claude to DeepSeek V4, the company’s entire 2026 AI budget could be stretched from running out in 4 months to lasting 7 years. The ironic kicker? On the exact same day, Uber’s CTO confirmed that their annual AI budget had already been burned through — by April.
And here’s the really baffling part. At the very moment DeepSeek was slashing prices, OpenAI was hiking them. GPT‑5.5’s three‑tier pricing — $5 input, $30 output, $0.50 cache hit — is a doubling across the board compared to the previous GPT‑5.4. Go back eight months, and GPT‑5’s input price was just $1.25. By April 2026, it had jumped fourfold.
In the same week, two sets of pricing moved in opposite directions, each by orders of magnitude. As one industry analyst put it, “The phrase ‘price war’ doesn’t even cover it anymore.”
And this still isn’t the bottom. Buried in DeepSeek’s official pricing notes is a tiny line that says: once the Ascend 950 super‑nodes ship in volume in the second half of the year, the Pro version’s price will be cut significantly again. In other words, 0.025 RMB might not even be the floor.
This isn’t burning cash to buy market share — real tech is driving the cost down
A lot of people’s first reaction is: at these prices, is DeepSeek just bleeding money to subsidize usage?
It genuinely isn’t. The price cuts are backed by real architectural breakthroughs.
Let me walk you through a core concept: the Mixture of Experts (MoE) architecture.
A traditional large language model is like a single “all‑knowing brain.” Every time you ask it a question, the entire model has to run, no matter what. It’s like forcing a university professor to solve problems ranging from elementary arithmetic to quantum physics, and every single question requires firing up all their brain cells. That’s obscenely expensive.
The MoE approach is completely different. You take a giant model and break it into many “expert modules,” each specialized in a different domain. Ask a coding question, and the system activates the coding expert. Ask a legal question, and it activates the legal expert. Only a small subset of experts is used for each task; the vast majority stay “asleep.” That slashes the amount of computation needed.
DeepSeek V4‑Pro has a total of 1.6 trillion parameters, but for any given task, it only activates 490 billion of them. The parts that aren’t activated consume no compute. You pay only for what you actually use.
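If the “only a few experts wake up” idea feels abstract, here is a minimal toy sketch in Python of top‑k expert routing. To be clear, this is not DeepSeek’s actual code; the expert count, dimensions, and router below are made up purely to show that only the selected experts ever run.

```python
import numpy as np

def moe_layer(token_vec, experts, router_weights, top_k=2):
    """Minimal mixture-of-experts routing: only top_k experts run for each token."""
    # The router scores every expert for this token, then softmaxes the scores.
    scores = router_weights @ token_vec          # shape: (num_experts,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()

    # Keep only the highest-scoring experts; the rest stay idle and cost nothing.
    chosen = np.argsort(probs)[-top_k:]

    # The layer output is the probability-weighted sum of the chosen experts.
    out = np.zeros_like(token_vec)
    for idx in chosen:
        out += probs[idx] * experts[idx](token_vec)
    return out

# Toy setup: 8 tiny "experts" (random feed-forward blocks), only 2 run per token.
dim, num_experts = 16, 8
rng = np.random.default_rng(0)
experts = [
    (lambda W: (lambda x: np.tanh(W @ x)))(rng.normal(size=(dim, dim)))
    for _ in range(num_experts)
]
router_weights = rng.normal(size=(num_experts, dim))
print(moe_layer(rng.normal(size=dim), experts, router_weights).shape)  # -> (16,)
```

The loop over `chosen` is the only place expert weights get touched; the other six experts consume zero compute for this token, which is exactly where the “pay only for what you use” economics comes from.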
On top of this MoE foundation, DeepSeek V4 adds an even deeper layer of optimization. According to the official technical report, when processing a million‑token long context, V4‑Pro needs only 27% of the compute (FLOPs) per token compared to the previous V3.2. The KV cache used to store conversation context has been compressed to just 10% of V3.2’s size. The more extreme V4‑Flash needs only 10% of the compute and 7% of the cache.
Let me translate that: the previous generation, handling a million tokens, basically had to run at near‑full tilt while gobbling up a huge amount of GPU memory. V4 overhauls the attention mechanism with a hybrid sparse attention design called “CSA + HCA,” so context length grows 8x while compute consumption drops by over 70%.
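If you want to feel what those percentages mean at a million‑token context, here is a toy back‑of‑envelope calculation. The 27%/10% and 10%/7% ratios are the ones quoted above; the baseline per‑token numbers are placeholders I invented purely to show the arithmetic, not real measurements of V3.2.

```python
# Toy arithmetic for the reductions quoted above. The ratios come from the
# cited technical report; the baseline per-token costs are invented placeholders.

CONTEXT_TOKENS = 1_000_000

baseline_flops_per_token = 1.0          # normalized compute units (made up)
baseline_kv_bytes_per_token = 100_000   # bytes of KV cache per token (made up)

variants = {
    "V4-Pro":   {"flops": 0.27, "kv": 0.10},  # 27% of the compute, 10% of the KV cache
    "V4-Flash": {"flops": 0.10, "kv": 0.07},  # 10% of the compute, 7% of the KV cache
}

baseline_compute = CONTEXT_TOKENS * baseline_flops_per_token
baseline_kv_gb = CONTEXT_TOKENS * baseline_kv_bytes_per_token / 1e9
print(f"V3.2 baseline: {baseline_compute:,.0f} compute units, ~{baseline_kv_gb:.0f} GB KV cache")

for name, ratio in variants.items():
    compute = baseline_compute * ratio["flops"]
    kv_gb = baseline_kv_gb * ratio["kv"]
    print(f"{name}: {compute:,.0f} compute units, ~{kv_gb:.1f} GB KV cache")
```

Whatever the real baseline turns out to be, the shape of the result is the same: the compute and memory bill for holding a long context shrinks by an order of magnitude, and that is what shows up on the price list.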
This is why DeepSeek can afford to push prices down to a couple of cents: not because of subsidies, but because the physical cost of a single API call is genuinely an order of magnitude lower than its competitors’.
Being a hundred times cheaper than OpenAI isn’t a marketing stunt funded by investor cash; it’s architecture‑level innovation that brings the cost down to a point others simply can’t match.
Now that it’s dirt cheap, who’s actually going to make money first?
With prices driven to this extreme, one unavoidable question hangs in the air: who is going to profit from this price war?
First, a dose of cold water: right now, nobody in this industry is truly making money just from selling API access.
Reports have pointed out that the LLM sector has swung from “price wars” to “price hikes,” with revenue and net losses both climbing — and not a single company is genuinely profitable yet. OpenAI is raising prices, Anthropic is tweaking its billing model, DeepSeek is slashing prices. They look like opposite strategies, but they’re all symptoms of the same underlying problem: no commercially viable path has been proven for large language models.
One senior engineer from a major cloud provider explained it simply: a cache hit means the model has already processed that exact chunk of input before (the system prompt, the shared document, and so on), so it can reuse the stored intermediate state instead of recomputing it, which is cheap. A cache miss means the content is new, the model has to crunch through it from scratch, and that’s expensive.
DeepSeek dares to cut the cache‑hit price to one‑tenth because, first, its architecture is so efficient that the base cost of a single token is already very low, and second, they’ve realized that in long‑running tasks, a huge amount of content is repetitive. “System prompts, role definitions, long documents, and tool descriptions” often make up 80–90% of the input, and all of that can be massively reused through caching.
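Here is a rough sketch of what that means for the bill on a cache‑heavy workload. The cache‑hit and output prices are the V4‑Flash figures quoted above; the fresh‑input price, the request shape, and the 85% reuse rate are my own assumptions for illustration, not numbers from DeepSeek.

```python
# Blended-cost sketch for a cache-heavy workload (prices in RMB per million tokens).
PRICE_CACHE_HIT = 0.02   # cached input: the V4-Flash figure quoted above
PRICE_CACHE_MISS = 0.40  # fresh input: an assumed placeholder, NOT an official price
PRICE_OUTPUT = 2.00      # output tokens: the V4-Flash figure quoted above

def cost_per_request(input_tokens, output_tokens, cache_hit_rate):
    """Cost in RMB for one request, splitting input into cached vs fresh tokens."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (cached * PRICE_CACHE_HIT
            + fresh * PRICE_CACHE_MISS
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# An agent call with a 50k-token prompt (system prompt, tools, documents),
# 85% of which is reused from previous calls, plus 2k tokens of output.
print(f"{cost_per_request(50_000, 2_000, 0.85):.4f} RMB per call")
```

Even with a made‑up miss price, the cached 85% of the prompt barely registers in the total, which is exactly why a ten‑times cut on cache hits changes the economics of long‑running agent workloads.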
So, with prices now “too cheap to matter,” who has the best shot at cashing in first?
The first group: application companies that can ride this low cost to finally push AI into genuine production environments.
Before, a single AI Agent calling a large model could easily rack up tens of thousands of dollars in cost, which meant it was only viable for demos — never real scale. Now that per‑task costs have plummeted, processing long documents of millions of tokens, or running complex multi‑step agent tasks, has hit a cost threshold that makes real‑world deployment feasible. You no longer have to gut your features or shrink your scope because of budget. These companies don’t carry the enormous expense of training foundational models; they can just call the best model at a ridiculously low cost and pour all their energy and money into building their actual business scenarios.
The second group: platforms that can build true differentiation on top of this “cheap intelligence.”
Tokens themselves are becoming a near‑free commodity. The real moat is no longer the model itself, but everything built on top of it. One founder of a top coding agent startup put it bluntly: V4’s tool‑calling stability and hallucination rate still need to be addressed at the engineering level, and real deployment is impossible without the “scaffolding.” Whoever can weave together the model, agent frameworks, industry‑specific data, and real‑world workflows into a reliable production system will own the pricing power.
The third group, ironically, is the users — businesses and developers.
PingCAP co‑founder and CTO Huang Dongxu said it directly: “I’m moving my Hermes workflow from Claude Opus and GPT‑5.4 over to DeepSeek V4. Most day‑to‑day work really doesn’t need that ultra‑strong coding capability.” He’s already switched completely. When capabilities are broadly comparable across models, the savings flow straight to the user: you spend one‑tenth or even one‑hundredth of the money and get the same output.
The least optimistic position belongs to middle‑layer model companies that have neither genuine architectural breakthroughs nor application‑layer competitive advantages, and whose business is simply reselling API access at a margin. When DeepSeek drives margins close to the bone, the room for these players to survive shrinks dramatically.
Final thoughts
This price war isn’t really about “whose model is strongest” anymore. It’s about “who can make AI actually get used” — plugged into customer service systems, embedded in coding workflows, running through contract reviews, dropped into every real business scenario that actually exists.
There’s a well‑known saying in tech: “For any technology to have a deep impact on human society, it must solve the problems of standardization and cost.” AI has finally reached that stage.
It’s also worth paying attention to that tiny line in DeepSeek’s pricing notes — “Due to current high‑end compute constraints, the Pro service has very limited throughput right now. We expect that once the Ascend 950 super‑nodes ship in volume in the second half of the year, the Pro price will be significantly cut further.” This means DeepSeek is already doing full‑stack adaptation to domestic chips and is using a “chip‑model co‑design” approach to keep lowering the long‑term cost. Huawei has already announced that the entire Ascend super‑node lineup fully supports V4.
The price anchor for the whole industry, going forward, might no longer be set in Silicon Valley.
0.025 RMB per million tokens. For that price, you can have the open‑source model with the largest parameter count in the world process the equivalent of three volumes of The Three‑Body Problem.
When it’s cheap to this degree, the real question is no longer “Can we afford to use it?” but “Once we start using it, how much value can it actually create?”
If you’ve already started rolling AI into your business, drop a comment and tell us: now that the cost has come down this much, what’s the first thing you’d want AI to actually do for you?