Wayne Lloyd is the founder of NFT Technologies and Consensus Core Technologies. He has extensive experience in emerging technologies, with a focus on the intersection of AI, blockchain, and high-performance computing.
The AI world was jolted when DeepSeek, a Chinese AI research company, announced breakthroughs in AI training efficiency. The Nasdaq dropped sharply and NVIDIA’s stock sold off 17%. The market’s reaction, however, misses two fundamental truths about AI infrastructure.
First, we are firmly in the era of non-deterministic compute, where every AI interaction has a real, measurable cost that cannot be engineered away. What does this mean? When you use traditional software like Microsoft Word or Gmail, adding another user costs virtually nothing – the software follows a fixed path and simply repeats the same operations. But AI is different because each interaction requires the model to think through a unique response, using real computational power every time. Even asking the same question twice might generate different responses, and each response consumes energy, compute cycles, and GPU time. While DeepSeek’s innovations make this computation more efficient, they don’t change this fundamental nature: every AI interaction has a real cost.
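To put rough numbers on that difference, here is a back-of-the-envelope sketch. Every figure in it (GPU rental rate, throughput, response length, interactions per day) is an illustrative assumption, not a measured value:

```python
# Illustrative back-of-the-envelope math: marginal cost per user interaction.
# All numbers below are assumptions for illustration, not measured figures.

GPU_COST_PER_HOUR = 2.00           # assumed hourly rental cost of one GPU (USD)
TOKENS_PER_SECOND_PER_GPU = 2_000  # assumed aggregate throughput of one GPU
TOKENS_PER_RESPONSE = 500          # assumed average length of a model response

def cost_per_ai_response() -> float:
    """Approximate GPU cost of generating one AI response."""
    seconds_per_response = TOKENS_PER_RESPONSE / TOKENS_PER_SECOND_PER_GPU
    return GPU_COST_PER_HOUR / 3600 * seconds_per_response

def cost_per_traditional_request() -> float:
    """Serving a static page or a document edit is effectively free at the margin."""
    return 0.0  # amortized to near-zero once the software exists

daily_interactions = 1_000  # a heavy user touching AI tools throughout the day
print(f"AI:          ${cost_per_ai_response() * daily_interactions:,.4f} per user per day")
print(f"Traditional: ${cost_per_traditional_request() * daily_interactions:,.4f} per user per day")
```

Even at fractions of a cent per response, that marginal cost never reaches zero, and it compounds with every additional user and every additional interaction.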
The second truth is even more important: the invention of machine intelligence has tapped into the single most unbounded demand known to man because there is virtually no limit to how much intelligence humanity can absorb. Every human interaction, business decision, and creative process can benefit from additional intelligence, and soon people will be using AI tools thousands of times a day – both through intentional workflows and passively through programs they already use. We are in the very early innings of this transition.
DeepSeek claims they trained their model for just $5.6 million with a team of fewer than 200 people. Even taking that figure at face value, it covers only the final training run. Once experimentation, ablation studies, and architectural testing across multiple scales are included, the true cost is likely $30-50M at minimum.
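The headline number is usually reconstructed as GPU-hours for the final run multiplied by an assumed rental rate (the commonly cited inputs are roughly 2.79 million H800 GPU-hours at about $2 per GPU-hour). The sketch below shows that arithmetic and why R&D overhead inflates the real bill; the overhead multipliers are assumptions for illustration, not reported figures:

```python
# Reconstructing the headline training-cost figure and why it understates total spend.
# GPU-hour count and rental rate are the commonly cited inputs behind the ~$5.6M number;
# the overhead multipliers below are illustrative assumptions, not reported figures.

H800_GPU_HOURS = 2_788_000       # GPU-hours attributed to the final training run
RENTAL_RATE_PER_GPU_HOUR = 2.00  # assumed market rental rate (USD)

final_run_cost = H800_GPU_HOURS * RENTAL_RATE_PER_GPU_HOUR
print(f"Final training run: ${final_run_cost / 1e6:.1f}M")

# Experimentation, ablations, and architectural testing typically consume several
# times the compute of the final run (assumed range).
for overhead_multiplier in (5, 8):
    total = final_run_cost * (1 + overhead_multiplier)
    print(f"With {overhead_multiplier}x extra R&D compute: ${total / 1e6:.0f}M")
```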
We have also seen public skepticism within the industry about DeepSeek’s infrastructure claims. Respected semiconductor analyst Dylan Patel stated back in November that DeepSeek possessed over 50,000 Hopper GPUs – far more than publicly acknowledged.
While this claim remains unverified, it highlights a broader truth: companies often have strategic reasons to be less than fully transparent about their capabilities and resources. The $5.6 million figure, while perhaps technically accurate for the final training run, likely understates their total investment in infrastructure and research.
I get it: ‘200 people and $5M to catch OpenAI’ is a great meme, but it’s a disingenuous way to frame what are genuinely impressive technical breakthroughs by a highly talented and well-funded team. These efficiency gains are real and significant, and because they have been open-sourced, they represent fundamental advances that will likely shape how all future LLMs are developed and trained. Let’s examine each of their key innovations:
First, they created a new way to handle numbers in AI training (their FP8 Mixed-Precision Framework). Traditional AI models use large, precise numbers that take up a lot of memory. DeepSeek figured out how to use smaller numbers without losing accuracy – like creating a new kind of digital compression that works specifically for AI.
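As a simplified illustration of that precision-versus-memory trade-off (using 8-bit integer quantization as a stand-in, not DeepSeek’s actual FP8 framework):

```python
# Simplified illustration of low-precision storage: quantize 32-bit weights down to
# 8 bits with a per-tensor scale, then measure the memory saved and the accuracy lost.
# This is a generic stand-in for the idea, not DeepSeek's FP8 mixed-precision scheme.
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1_000_000).astype(np.float32)

# Per-tensor scaling so the full value range fits into 8 bits.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
weights_restored = weights_int8.astype(np.float32) * scale

print(f"Memory (32-bit): {weights_fp32.nbytes / 1e6:.1f} MB")
print(f"Memory (8-bit):  {weights_int8.nbytes / 1e6:.1f} MB")  # 4x smaller
print(f"Mean abs error:  {np.abs(weights_fp32 - weights_restored).mean():.5f}")
```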
Second, they developed a system that can predict multiple words at once, rather than one at a time (Multi-Token Prediction). Imagine the difference between reading word-by-word versus taking in whole phrases at a glance. This makes the AI respond nearly twice as fast in many cases.
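A toy calculation shows where that speedup comes from; in practice speculatively predicted tokens still have to be verified, so the real-world gain is ‘nearly’ rather than exactly 2x. The response length here is an assumption:

```python
# Toy illustration: predicting several tokens per step cuts the number of sequential
# model calls needed. A simplified sketch of the idea, not DeepSeek's implementation.

def steps_needed(total_tokens: int, tokens_per_step: int) -> int:
    """Number of sequential forward passes to produce a response."""
    return -(-total_tokens // tokens_per_step)  # ceiling division

RESPONSE_LENGTH = 500  # tokens in a typical answer (assumed)

one_at_a_time = steps_needed(RESPONSE_LENGTH, tokens_per_step=1)
two_at_a_time = steps_needed(RESPONSE_LENGTH, tokens_per_step=2)

print(f"One token per step:  {one_at_a_time} sequential passes")
print(f"Two tokens per step: {two_at_a_time} sequential passes "
      f"({one_at_a_time / two_at_a_time:.1f}x fewer)")
```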
Third, they revolutionized how AI stores and accesses information (Multi-Head Latent Attention). Instead of keeping everything in active memory – like having every book in a library open at once – they created a smart filing system that only pulls out what’s needed when it’s needed.
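Here is rough arithmetic on why that matters, using illustrative model dimensions rather than DeepSeek’s exact configuration:

```python
# Rough arithmetic on attention-cache memory (all dimensions are illustrative
# assumptions): compare a standard per-head key/value cache with a compressed
# latent representation that is expanded only when needed.

BYTES_PER_VALUE = 2      # 16-bit storage (assumed)
CONTEXT_TOKENS = 32_768  # conversation length held in memory (assumed)
NUM_LAYERS = 60          # assumed layer count
NUM_HEADS = 128          # assumed attention heads
HEAD_DIM = 128           # assumed per-head dimension
LATENT_DIM = 512         # assumed size of the compressed latent per token

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

# Standard cache: keys AND values for every head, every layer, every token.
standard = CONTEXT_TOKENS * NUM_LAYERS * NUM_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE

# Latent cache: one small compressed vector per token per layer.
latent = CONTEXT_TOKENS * NUM_LAYERS * LATENT_DIM * BYTES_PER_VALUE

print(f"Standard KV cache: {gib(standard):.1f} GiB")
print(f"Latent cache:      {gib(latent):.1f} GiB ({standard / latent:.0f}x smaller)")
```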
Fourth, they improved how multiple computer chips work together (DualPipe Algorithm). It’s like conducting an orchestra where every instrument knows exactly when to play and when to prepare for the next piece, eliminating wasted time and effort.
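A toy timeline makes the principle concrete; the timings are assumptions, and this illustrates the general idea of overlapping communication with computation rather than the DualPipe schedule itself:

```python
# Toy timeline: why overlapping inter-GPU communication with computation matters.
# Timings are illustrative assumptions, not benchmarks of any real system.

COMPUTE_MS_PER_STEP = 10  # assumed time a GPU spends computing one micro-batch
COMM_MS_PER_STEP = 8      # assumed time spent exchanging results between GPUs
STEPS = 1_000

# Naive schedule: compute, then wait for communication, then compute again.
serial_total = STEPS * (COMPUTE_MS_PER_STEP + COMM_MS_PER_STEP)

# Overlapped schedule: communication for step N happens while step N+1 computes,
# so only the longer of the two stages limits throughput.
overlapped_total = STEPS * max(COMPUTE_MS_PER_STEP, COMM_MS_PER_STEP)

print(f"Serialized: {serial_total / 1000:.0f} s")
print(f"Overlapped: {overlapped_total / 1000:.0f} s "
      f"({serial_total / overlapped_total:.1f}x faster)")
```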
Finally, they built an AI that works like a panel of experts (Mixture-of-Experts Architecture). It’s like having a team of specialists where only the relevant experts are consulted for each task, instead of having one massive system trying to know and do everything. Their system has 671 billion parameters (think of these as pieces of knowledge), but only uses about 37 billion at once, dramatically reducing the computing power needed.
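The compute savings follow directly from those two numbers. The sketch below uses the standard rule of thumb of roughly two floating-point operations per active parameter per token, which is an approximation rather than a reported figure:

```python
# Arithmetic behind sparse activation, using the parameter counts cited above
# (671B total, ~37B active per token). The FLOP estimate uses the rough rule of
# thumb of ~2 operations per active parameter per token, an approximation.

TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
TOKENS_PER_RESPONSE = 500  # assumed

dense_flops = 2 * TOTAL_PARAMS * TOKENS_PER_RESPONSE    # if every parameter fired
sparse_flops = 2 * ACTIVE_PARAMS * TOKENS_PER_RESPONSE  # only the routed experts fire

print(f"Dense equivalent:  {dense_flops:.2e} FLOPs per response")
print(f"Sparse (MoE):      {sparse_flops:.2e} FLOPs per response")
print(f"Compute reduction: {dense_flops / sparse_flops:.0f}x fewer operations per token")
```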
Before we discuss why these innovations are bullish for GPU demand, it’s crucial to understand a vulnerability that all frontier AI models face, one that DeepSeek admittedly exploited and others will certainly follow. This vulnerability stems from a basic AI development technique called distillation. Distillation is straightforward: you feed inputs to an existing model (the teacher), record its outputs, and then use those input-output pairs to train your new model (the student). While this might sound like simple copying, it’s actually more like learning from worked examples: the student model develops its own understanding based on the teacher’s demonstrations.
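In code, the core loop is almost embarrassingly simple. The sketch below is a minimal illustration with a stubbed-out teacher; the function names and file path are placeholders, not any real model’s API:

```python
# Minimal sketch of the distillation loop described above: query a teacher model,
# record input/output pairs, and use them as supervised training data for a student.
# The teacher here is a stub; in practice it would be a call to an existing model.
import json

def ask_teacher(prompt: str) -> str:
    """Stand-in for a call to an existing model (the 'teacher')."""
    return f"[teacher's worked answer to: {prompt}]"

def build_distillation_set(prompts: list[str], path: str) -> None:
    """Record (input, output) pairs from the teacher as training data for the student."""
    with open(path, "w") as f:
        for prompt in prompts:
            record = {"input": prompt, "target": ask_teacher(prompt)}
            f.write(json.dumps(record) + "\n")

# The resulting JSONL file is then used as ordinary fine-tuning data: the student
# learns from the teacher's worked examples rather than copying its weights.
build_distillation_set(
    ["Explain compound interest.", "Summarize the key risks in this contract."],
    "distillation_pairs.jsonl",
)
```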
This leads us to a powerful dynamic that I’m going to call ‘Frontier Caching’: the more widely used an advanced AI model becomes, the more vulnerable it is to having its capabilities extracted through distillation. Every interaction with models like GPT-4, even a mundane one, produces well-formed, logically structured outputs that become valuable training data for newer models. This is particularly powerful for reinforcement learning, where newer models can begin their training process with access to millions of examples of coherent reasoning and problem solving.
By systematically capturing knowledge from widely available models through distillation, DeepSeek likely created a foundation of clean, structured data that made their reinforcement learning process far more efficient. They talk about this in their technical papers, but the broader implication is profound: the more accessible a model is, the more thoroughly its knowledge can be extracted and repurposed. This creates an interesting dynamic where open source models become particularly valuable targets, not because they represent the absolute frontier of capability, but because they provide unrestricted access to large volumes of clean, structured training data. Any organization could potentially employ this technique, perhaps even more aggressively than DeepSeek did, making widespread adoption a double-edged sword for leading AI models.
This dynamic challenges traditional assumptions about maintaining competitive advantages in AI. Companies pushing the boundaries of AI capability may find their innovations quickly absorbed and repurposed by the broader ecosystem through frontier caching. The result is an acceleration of progress that benefits fast followers perhaps even more than frontier leaders.
These factors – potential infrastructure understatement and sophisticated use of distillation – don’t diminish DeepSeek’s technical achievements. Rather, they help explain how their innovations fit into the broader landscape of AI development. And crucially, they point toward why this news is actually bullish for GPU demand. To those outside the technology industry, this seems like it would reduce infrastructure demand. After all, if something becomes more efficient, shouldn’t we need less of it?
This intuition gets it exactly backward. The history of technology shows us a consistent pattern: when transformative technologies become more efficient, their use explodes. Cloud computing is the perfect example – making computing more efficient and accessible didn’t reduce server demand, it created an explosion of new use cases that drove infrastructure growth to unprecedented (and still growing) levels. Every efficiency improvement unlocked new applications that weren’t previously economical.
The same pattern is emerging with AI, but at an even larger scale. Lower training costs don’t mean fewer models – they mean more organizations training more models. More efficient inference doesn’t mean less computation – it means AI gets embedded into more applications, products, and services. Every efficiency improvement expands the universe of economically viable AI applications. And remember: unlike traditional software, every single one of these AI interactions carries a real computational cost that drives infrastructure investment in datacenter and GPU compute.
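A simple break-even framing shows why cheaper inference expands rather than shrinks total demand. All of the applications and values below are illustrative assumptions:

```python
# Illustrative break-even math: an AI feature is economically viable only when the
# value it creates per query exceeds the compute cost per query. All values assumed.

candidate_applications = {            # hypothetical applications and value per query (USD)
    "legal document review": 0.50,
    "customer support draft": 0.05,
    "inline writing suggestions": 0.005,
    "ambient UI personalization": 0.0005,
}

def viable(cost_per_query: float) -> list[str]:
    """Applications worth running at a given compute cost per query."""
    return [name for name, value in candidate_applications.items() if value > cost_per_query]

for cost in (0.05, 0.005, 0.0005):    # each step = roughly a 10x efficiency improvement
    apps = viable(cost)
    print(f"Cost ${cost:.4f}/query -> {len(apps)} viable apps: {apps}")
```

Each order-of-magnitude drop in cost unlocks a new tier of applications, and because each of those applications generates its own stream of queries, total compute consumed rises even as the cost per query falls.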
Some believe AI will eventually become as cheap to serve as traditional software. Others swing to the opposite extreme, treating every model like it requires Manhattan Project-level resources. Both miss the fundamental reality: no matter how many efficiency breakthroughs emerge, AI computation will never be ‘cheap’ in the traditional software sense. The reason is simple: humans are becoming increasingly dependent on non-deterministic intelligence, and this form of intelligence has real, unavoidable costs. While geopolitical tensions and regulatory frameworks will inevitably shape how this technology develops across different regions, the economic fundamentals remain clear: more participants means more computation, more data centers, and more GPUs.