How DeepSeek-R1 was created?

Bibliographic Details
Published in	Shenzhen da xue xue bao. Li gong ban Vol. 42; no. 2; pp. 226 - 232
Main Author	ZHANG Huimin
Format	Journal Article
Language	English
Published	Science Press (China Science Publishing & Media Ltd.) 01.03.2025
Subjects	artificial intelligence; DeepSeek; group relative policy optimization; large language model; mixed-precision training; mixture of experts architecture; multi-head latent attention mechanism; multi-token prediction

Summary:	This article summarizes the innovations and optimizations in the DeepSeek series of models for large-scale training. DeepSeek's breakthroughs lie primarily in model and algorithm innovation, hardware-software co-optimization, and improved overall training efficiency. DeepSeek-V3 adopts a mixture of experts (MoE) architecture that uses computing resources efficiently through fine-grained expert design and a shared-expert strategy. The sparse activation mechanism and lossless load balancing strategy in the MoE architecture significantly improve training efficiency and model performance, especially on large-scale data and complex tasks. The multi-head latent attention (MLA) mechanism reduces memory usage and accelerates inference, lowering training and inference costs. In DeepSeek-V3's training, multi-token prediction (MTP) and 8-bit floating-point (FP8) mixed-precision training improve the model's contextual understanding and training efficiency, while optimized parallel thread execution (PTX) code significantly improves the computational efficiency of graphics processing units (GPUs). DeepSeek-R1-Zero is trained with pure reinforcement learning using group relative policy optimization (GRPO), bypassing the traditional supervised fine-tuning and human feedback stages, which significantly improves its reasoning capabilities. Overall, the DeepSeek series of models has achieved significant advantages in artificial intelligence through multiple innovations, setting a new industry benchmark.
ISSN:	1000-2618
DOI:	10.3724/SP.J.1249.2025.02226
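
Note on GRPO: the summary names group relative policy optimization without giving its formula. As a rough illustration only, the sketch below assumes the commonly described form of the group-relative advantage, in which each sampled answer's reward is normalized by the mean and standard deviation of the rewards in its own group, so no separate value model is needed. The function name, the example rewards, the choice of population standard deviation, and the `eps` term are assumptions made for this sketch and are not taken from the article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other samples in the same group.

    Assumed form: (r_i - mean(group)) / (std(group) + eps).
    eps is a hypothetical guard against division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above their group's average receive a positive advantage and those below receive a negative one, which is what lets the policy be updated from sampled groups alone.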