7 min read

The $600 Billion issue

In this issue, continuing the essay format of the newsletter, we will discuss the Nvidia-DeepSeek news in depth, highlight some noteworthy malicious open source packages, and cover updates from the Linux community. We'll also analyze some high-profile near-acquisitions with an uncertain future.

Billion with a B

First, a brief recap of events since January 20, 2025.

Chinese startup DeepSeek launched its R1 model, open-sourcing it to showcase their willingness to be a first-class citizen of this open-source-friendly ML landscape.

Along with the many technological innovations behind the creation of the model, DeepSeek also employed some financial innovations crucial to their success.

The Tech first

R1 is a "reasoning" model, which means that it doesn't function like a conventional Large Language Model (LLM). Unlike typical LLMs, which provide answers without revealing how they arrived at them, DeepSeek R1 and other reasoning models expose some intermediate steps, allowing users to peek behind the curtain and trust the output. The speed of cloud-hosted LLM output has been a big critique of these models - that they do not think, only regurgitate.

Moreover, reasoning models act as their own little devil's advocates, posing questions like "What about the alternative...?" and "Ok, but what if my assumption is wrong? How would that change things?" Basically, a reasoning model debates against itself. It's a much better approach to LLM-based knowledge, both because it shows the process of arriving at that knowledge and because it does a better job of fighting hallucinations.
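To make that concrete, here's a minimal Python sketch of what consuming a reasoning model's output looks like. The `<think>` delimiters match R1's published output format, but treat them as an assumption if you're targeting another model; the sample transcript itself is invented for illustration.

```python
import re

# R1 emits its intermediate reasoning before the final answer,
# wrapped in <think> tags. Sample transcript below is invented.
raw_output = """<think>
The user asks for the capital of Australia. Sydney is the largest city,
but wait - what if that assumption is wrong? Canberra was purpose-built
as the capital. Canberra it is.
</think>
The capital of Australia is Canberra."""

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the visible chain of thought from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning(raw_output)
print("What the model debated with itself:\n", reasoning)
print("What the user ultimately gets:\n", answer)
```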

That doesn't mean such models are free of bias or misinformation. DeepSeek R1 has very deeply embedded cultural biases, including a reluctance to detail embarrassing or incriminating information about China, and inaccuracies about its neighbors and adversaries. This reflects the GIGO (Garbage In, Garbage Out) principle often observed in LLMs.

What is new is the training process. It seems that to reduce time to market and to achieve high-quality models with minimal data, DeepSeek took an interesting shortcut. Quite possibly, they just went to Microsoft Copilot and asked it to generate millions of lines of "good data", which they then used to train their models.

Data munging is one of the most expensive steps of an ML pipeline. Let's say you are building a safe-for-work, English-language "common parlance" LLM. You begin with a hundred million lines of Reddit comments. First, you need to remove everything that includes other languages, a manageable task with Unicode. Next, you must remove all emojis, which is simple too. Then comes the challenge of removing abusive language and double entendres. While the former is somewhat straightforward, the latter is not. And once you get over that hump, you'll still need to eliminate uncommon terms, among other tasks.
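As a rough illustration, here's a toy version of that pipeline in Python. Every filter and the blocklist are placeholder assumptions; the later steps are dramatically harder than anything a few lines of regex can do.

```python
import re
import unicodedata

# Illustrative placeholders, not a production pipeline.
EMOJI = re.compile(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]")
ABUSIVE = {"badword1", "badword2"}  # stand-in for a real blocklist

def is_mostly_english(line: str) -> bool:
    # Crude Unicode-based check: drop lines dominated by non-Latin scripts.
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return False
    latin = sum("LATIN" in unicodedata.name(c, "") for c in letters)
    return latin / len(letters) > 0.9

def clean(lines):
    for line in lines:
        if not is_mostly_english(line):
            continue                      # step 1: other languages
        line = EMOJI.sub("", line)        # step 2: emojis
        if any(w in line.lower().split() for w in ABUSIVE):
            continue                      # step 3: abusive language
        # step 4, double entendres and rare terms, is where the real
        # expense lives - there's no three-line regex for innuendo
        yield line.strip()

comments = ["Nice build! 🚀", "これは素晴らしい", "Totally agree with this"]
print(list(clean(comments)))  # ['Nice build!', 'Totally agree with this']
```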

Months could pass before you have a good data set. Now, you could have started by standing on the shoulders of others - as in, gone to Hugging Face and downloaded one of the thousands of open source datasets that people have painstakingly built. However, as the LibGen controversy showed us, all the text on the internet is still not enough to train today's LLMs, which demand ever-larger datasets.

A great solution for generating large quantities of sanitized text is to use an LLM itself. Asking an existing LLM to generate good input for your own ML model is called creating "synthetic data". Synthetic data is great because instead of wrangling a large corpus of unstructured data, you're asking a structured-data machine to generate a large set of responses that satisfy your request while each appearing unique.
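In practice, it can be as simple as looping over a prompt. A minimal sketch, assuming an OpenAI-compatible chat endpoint and the `openai` Python client; the model name and prompt are placeholders, not what DeepSeek actually ran:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Generate 20 distinct, safe-for-work, casual English sentences that "
    "sound like everyday forum comments. One sentence per line, no numbering."
)

def synthesize(batches: int = 3) -> list[str]:
    lines: list[str] = []
    for _ in range(batches):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable chat model works here
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # high temperature keeps the batches varied
        )
        lines.extend(resp.choices[0].message.content.splitlines())
    # Dedupe: synthetic data tends to repeat itself across batches.
    return sorted(set(filter(None, (l.strip() for l in lines))))

print(f"collected {len(synthesize())} unique lines")
```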

But it all depends on how good your query is. This is where DeepSeek seems to have faltered. Ask R1 who it is, and it seems to respond that it is Microsoft Copilot, built in collaboration with OpenAI. It does so even after being told that it is DeepSeek R1. Not so good at reasoning now, is it?

The Finance, next

But let's come to the financial innovations. The synthetic data issue combined with a geopolitical one, creating additional drama. Some time ago, the US banned China from buying the latest Nvidia ML-training GPUs.

Consequently, China (and thus DeepSeek) could not get their hands on the Nvidia A100s, which OpenAI and xAI have been using extensively to train their biggest models. DeepSeek had to depend on the H800 and H20 GPUs. The former they would have had to purchase legally before 2024; the latter are still legal but much less powerful. Though if some folks are to be believed, DeepSeek is lying about which chips it used and bought black-market Nvidia chips for its model training.

But with the higher-quality data, the chip setback seems to have mattered less. DeepSeek deployed two other innovations to build R1. First, it used much cheaper human labor to do reinforcement learning on its models. To quote MIT Technology Review -

Reinforcement Learning with Human Feedback (RLHF) is what makes chatbots like ChatGPT so slick.

The human component is very important, and labor is more affordable in China than in the US. But even that wasn't enough. DeepSeek implemented a technique demonstrated by Google DeepMind in 2016, automating its reinforcement-learning loop so that no humans were involved in the second training phase. This sped up model training and made it much cheaper.
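The core idea is easy to caricature in a few lines of Python: replace the human rater with a programmatic verifier, and every feedback step becomes nearly free. DeepSeek's actual pipeline (group-based policy optimization over rule-based rewards) is far more sophisticated; the question, candidate answers, and epsilon-greedy "policy" below are toy stand-ins.

```python
import random

QUESTION, CORRECT = "What is 17 * 23?", "391"
candidates = ["391", "381", "401", "i think it's 391?"]
scores = {c: 0.0 for c in candidates}

def verifier(answer: str) -> float:
    # Rule-based reward: exact, well-formatted answers score highest.
    if answer == CORRECT:
        return 1.0
    return 0.5 if CORRECT in answer else 0.0  # partial credit for content

def pick(eps: float = 0.2) -> str:
    # Epsilon-greedy stand-in for sampling from the model's policy.
    if random.random() < eps:
        return random.choice(candidates)
    return max(candidates, key=scores.get)

for step in range(200):
    answer = pick()
    reward = verifier(answer)       # no human rater anywhere in this loop
    scores[answer] += 0.1 * (reward - scores[answer])  # running estimate

print(f"{QUESTION} -> {max(scores, key=scores.get)}")  # settles on "391"
```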

These innovations (and more, which I think you should read up on in the MIT article I've linked to) helped DeepSeek train a model for millions of dollars that's similar in performance to models that cost US companies tens or hundreds of millions to build. I'm not attaching real figures because they are deeply disputed. DeepSeek doesn't need to reveal all of its research, infrastructure, and training costs, so it isn't doing so. Much like with an LLM, we just have to take their word for it.

But there's another side to the story. The real target of DeepSeek's showstopper seems to be Nvidia. Hey, if you can train models on the cheaper H800 GPUs (the export-compliant variant, thirty grand a pop) instead of the more expensive H100 GPUs (forty grand a pop), then why put the extra money in Nvidia's pocket?

But why not target other LLM companies instead? After all, if DeepSeek succeeds, it would be the market leader in LLMs. Developers would line up to use their chatbots and APIs instead of OpenAI's or xAI's.

The answer is that most of the competition is private. There's no way to drive down their market relevance financially: no stock to short, no public earnings to hinder. Also, since DeepSeek intended to release their models as open source, anyone can just download the models and run them (as many have, including Perplexity and Jeff Geerling on his RasPi). So the only real targets of the financial story are the source of the GPUs (Nvidia) and the source of the infrastructure (Microsoft). Yes, while everyone was talking about Nvidia's meteoric fall of almost 17%, Microsoft lost a whopping 12% of its stock value. And while Nvidia's $600 Billion wipeout is the largest single-day loss in world stock market history, Microsoft also shed $320 Billion in market cap due to this reasoning model.

That's where I'm going to pause for now. There's a lot happening in this space, including all the new innovation that DeepSeek has brought forth. We've long expected that a completely new paradigm will change how LLMs are trained, making them much cheaper. The R1 model is not perfect, but it's shaken up the entire AI industry and its ancillary industries too.

Elsewhere

There are at least two other things you need to know.

First, there are already malicious open source packages related to DeepSeek floating around. According to research done by Positive Technologies, PyPI packages deepseeek (notice the extra e in seek) and deepseekai are malicious packages that steal developer credentials and environment variables, shipping them to a Command and Control server elsewhere. This could lead to hackers accessing Cloud accounts and private git repos, as they did in the EmeraldWhale attack.

While typosquatting can be mitigated through developer awareness and approved package solutions, package registries are never quick enough to remove threats. In this case, the packages were downloaded 222 times before removal.
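One cheap mitigation is a pre-install name check against the packages you actually depend on. A minimal sketch; the "known good" list and the similarity threshold are assumptions you'd tune for your own stack.

```python
from difflib import SequenceMatcher

KNOWN_GOOD = {"requests", "numpy", "deepseek"}  # packages you actually use

def looks_like_typosquat(name: str, threshold: float = 0.85) -> str | None:
    """Return the legitimate package this name suspiciously resembles."""
    for good in KNOWN_GOOD:
        if name == good:
            return None  # exact match: fine
        if SequenceMatcher(None, name, good).ratio() >= threshold:
            return good  # near-miss: flag it for review
    return None

for pkg in ["deepseeek", "deepseekai", "requests"]:
    hit = looks_like_typosquat(pkg)
    print(f"{pkg}: {'suspicious, resembles ' + hit if hit else 'ok'}")
```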

Second, Linux has been losing driver maintainers and developers for various reasons. The latest is a Qualcomm Atheros engineer who, for much of the last decade, has been single-handedly maintaining WiFi drivers for Linux. This is a pretty big hit, but not an isolated incident. Recently, the sole developer working on the drivers for some types of displays stepped down due to their health.

These departures create vulnerabilities in the Linux ecosystem, potentially allowing malicious actors to distribute fake driver updates containing harmful code. There have been plenty of scenarios where the software supply chain has been disrupted by malicious packages either due to account hijacking or developers simply unpublishing critical dependencies.

In the end, the only way to prevent such scenarios is to keep developers from getting their hands on bleeding-edge open source packages. Until the open source community has had time to look for zero-days, enterprises should not trust open source packages or updates.

M&AOps

Let's talk about Mergers and Acquisitions for a second. It's been a while since IBM declared that their love for HashiCorp is going to cost them $7 Billion. The UK's Competition and Markets Authority has thoughts on that. Also, the US DOJ has filed a lawsuit to prevent HPE from acquiring Juniper Networks for $14 Billion.

In general, the next few years are going to be M&A-happy. Some restrictions will come from external sources like the UK and the EU. But largely, the US will have a massive few years for M&A. This means more tech movement, more consolidation, and more innovation a few years down the line as folks split off from the acquired companies to do their own work.

Of course, when HashiCorp's acquisition goes through, folks will be looking for alternatives. Thanks to the Business Source License "situation" of 2023, we've already got OpenTofu. Many folks will leverage that, while enterprises that need support will seek it out from IBM.

Aside

I feel like we need a name for the companies that are at the forefront of consumer-focused LLMs. Let's see, the companies in this cohort are -

Microsoft

Apple

Google

OpenAI

Meta

and now...

DeepSeek

MAGOMD?

GAMMOD?

DOGMMA?

Yeah, I think DOGMMA it is.

Fin

That's all I have for this newsletter folks. Hope you have a great weekend ahead. If you liked what you read, share it with someone. If you have any feedback, just hit reply!