DeepSeek: Smaller, Cheaper, Better?

OpenAI, Google, Meta, and Anthropic have dominated the large language model space.

With GPT-4 reportedly costing over $100 million to train, few enterprises can compete at that budget level. But what if there were a model that cost under $6 million to train and matched GPT-4o's performance?

DeepSeek-V3, an open-source large language model, was released on December 26, 2024 by DeepSeek, an AI company backed by the Chinese hedge fund High-Flyer. The model has 671B parameters. For comparison, Meta's largest open-source Llama model has 405B parameters, and the proprietary GPT-4 is rumored to have over 1 trillion parameters.

/Performance and Cost

On many benchmarks, DeepSeek-V3 matches or outperforms GPT-4o, many people's go-to model.

Another intuitive way to compare model performance is to ask users to vote blindly on outputs from two models shown side by side. Here are the top eight LLMs on the Chatbot Arena leaderboard. Notice that DeepSeek-V3 is the only open-source model among them; it is also smaller than many of them and the cheapest of all.

(Fig. 2) A screencap of the Chatbot Arena Leaderboard as of January 2025

The price to use these models through an API is measured in tokens. As of January 2025, the 8th-ranked Gemini-1.5 Pro costs $1.25 per 1 million input tokens and $5.00 per 1 million output tokens.

For GPT-4o, input and output cost $2.50 and $10.00 per 1 million tokens. DeepSeek-V3 comes in at only $0.014 and $0.28 per 1 million input and output tokens! That makes DeepSeek-V3 roughly 0.5% and 3% of GPT-4o's price, respectively.
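To see what that difference means in dollars, here is a small Python sketch that plugs the published January 2025 rates into a hypothetical workload (the 10M/2M token counts are made-up example numbers, not figures from either provider):

```python
# Published API prices (USD per 1 million tokens) as of January 2025.
PRICES = {
    "gpt-4o":      {"input": 2.50,  "output": 10.00},
    "deepseek-v3": {"input": 0.014, "output": 0.28},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for a given number of tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 10M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${cost(model, 10_000_000, 2_000_000):,.2f}")

# gpt-4o:      $45.00
# deepseek-v3: $0.70
```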

/How Did DeepSeek Do This?

DeepSeek used a few techniques to improve the architecture and lower the training cost.

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model. That means the model is made of many smaller "expert" sub-networks that specialize in different kinds of content. When training on text from the internet, not all 671B parameters are updated for every token; only the experts relevant to that token are activated and updated.

According to the technical report, each token activates only about 5.5% of the model (roughly 37B of the 671B parameters). Think of it this way: when the model is training on cooking recipes, the "cooking" experts are updated while the experts holding knowledge about building space rockets might go untouched.
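To make the routing idea concrete, here is a toy PyTorch sketch of an MoE layer. It is not DeepSeek-V3's actual implementation (the real model uses far more experts, shared experts, and a more sophisticated load-balancing scheme); it only shows how a router sends each token to a small subset of experts, so only those experts' parameters do work for that token:

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """A minimal top-k Mixture-of-Experts layer (illustration only)."""

    def __init__(self, dim: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        # The router scores every token against every expert.
        self.router = nn.Linear(dim, n_experts)
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim)
        scores = self.router(x)                            # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights.softmax(dim=-1)                  # mixing weights for chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)        # 16 token embeddings
layer = ToyMoELayer()
print(layer(tokens).shape)          # torch.Size([16, 64])
```

Because only the chosen experts run for a given token, only those experts' parameters take part in the computation and receive gradient updates, which is where the ~5.5% figure comes from.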

Another technique the researchers used is reducing numerical precision during parts of training to cut memory consumption. In everyday terms, instead of carrying many digits through every calculation, they truncate numbers to fewer digits; for example, 1.2468 x 1.3579 -> 1.2 x 1.3. Normally LLMs are trained with higher-precision formats such as FP32 to retain information, but DeepSeek ran much of the computation in FP8, which uses a quarter of FP32's memory. This is one of the first times FP8 training has been validated on a model of this size.
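To get a feel for what lower precision means, here is a small PyTorch sketch (assuming PyTorch 2.1 or newer, which ships 8-bit float dtypes) that round-trips a few numbers through FP8 and compares memory per value. It is only a demonstration of precision loss, not DeepSeek's FP8 training recipe, which involves careful scaling and keeping sensitive operations in higher precision:

```python
import torch

# A handful of FP32 values.
x = torch.tensor([1.2468, 1.3579, 3.14159, 0.001234])

# Round-trip through an 8-bit float format.
x_fp8 = x.to(torch.float8_e4m3fn)      # 1 byte per value instead of 4
x_back = x_fp8.to(torch.float32)       # cast back so we can print and compare

print("fp32 bytes per value:", x.element_size())      # 4
print("fp8  bytes per value:", x_fp8.element_size())  # 1
print("original :", x.tolist())
print("after fp8:", x_back.tolist())   # slightly rounded values
```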

Evidently, the result still holds up.

/DeepSeek: A Disruptor?

The cost of training large language models has been rising steadily, pricing most of the competition out. Or maybe that is just what we thought.

DeepSeek-V3 cost under $6 million to train and used only about 2,000 H800 GPUs, a deliberately downgraded version of the H100 that Nvidia makes for the Chinese market to comply with US export controls. Meanwhile, Llama's 405B model was trained on roughly 16,000 H100s at an estimated cost of over $60 million, and it shows visibly worse performance.
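As a back-of-the-envelope check on that figure, the arithmetic works out roughly as follows (the GPU-hour total and the $2-per-GPU-hour rental rate are rounded assumptions taken from DeepSeek's technical report, not numbers stated above):

```python
# Rough sanity check of the "under $6 million" training cost claim.
gpu_hours = 2_790_000            # ~2.79M H800 GPU-hours (rounded, per the report)
rate_per_gpu_hour = 2.0          # USD per GPU-hour, the report's assumed rental rate

total_cost = gpu_hours * rate_per_gpu_hour
print(f"Estimated training cost: ${total_cost:,.0f}")   # ~$5.6M, i.e. under $6M

# Spread across ~2,000 GPUs, that is roughly two months of wall-clock training.
gpus = 2_000
days = gpu_hours / gpus / 24
print(f"Approximate training time: {days:.0f} days")
```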

This is a sign that, when it comes to training LLMs, creative engineering can still offset, to an extent, the hardware and resource advantages that big tech companies hold.

Zhenbo Yan

Zhenbo is a software developer with experience in full-stack mobile development and networking, currently working on high-performance computing software at a BioTech institute. He is interested in exploring the application of LLMs and emerging AI tools in medical research.
