How DeepSeek-R1 was created?

Bibliographic Details
Published in Shenzhen da xue xue bao. Li gong ban, Vol. 42; no. 2; pp. 226-232
Main Author ZHANG Huimin
Format Journal Article
Language English
Published Science Press (China Science Publishing & Media Ltd.) 01.03.2025
Subjects
Online Access Get full text

Abstract This article summarizes the innovations and optimizations in the DeepSeek series of models for large-scale training. The breakthroughs of DeepSeek are primarily reflected in model and algorithm innovations, software-hardware collaborative optimization, and the improvement of overall training efficiency. DeepSeek-V3 adopts a mixture-of-experts (MoE) architecture, achieving efficient utilization of computing resources through fine-grained expert design and a shared-expert strategy. The sparse activation mechanism and auxiliary-loss-free load balancing strategy in the MoE architecture significantly enhance the efficiency and performance of model training, especially when handling large-scale data and complex tasks. The innovative multi-head latent attention (MLA) mechanism reduces memory usage and accelerates inference, thus lowering training and inference costs. In DeepSeek-V3's training, the introduction of multi-token prediction (MTP) and 8-bit floating-point (FP8) mixed-precision training improves the model's contextual understanding and training efficiency, while optimization of parallel thread execution (PTX) code significantly enhances the computational efficiency of graphics processing units (GPUs). In training the DeepSeek-R1-Zero model, group relative policy optimization (GRPO) is used for pure reinforcement learning, bypassing the traditional supervised fine-tuning and human-feedback stages and leading to a significant improvement in reasoning capabilities. Overall, the DeepSeek series models have achieved significant advantages in the field of artificial intelligence through multiple innovations, setting a new industry benchmark.
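To make the sparse-activation idea behind the MoE architecture concrete, the following PyTorch snippet is a minimal illustrative sketch, not DeepSeek-V3's implementation: the layer sizes, the single shared expert, and the softmax-then-top-k gate are assumptions chosen for brevity, and the fine-grained expert segmentation and auxiliary-loss-free load balancing of the actual model are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy MoE feed-forward layer: a small set of always-active shared experts
    plus a pool of routed experts, of which only top_k fire per token."""

    def __init__(self, d_model=64, d_ff=128, n_routed=8, n_shared=1, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (num_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)       # shared experts see every token
        scores = F.softmax(self.gate(x), dim=-1)             # (num_tokens, n_routed)
        weights, indices = scores.topk(self.top_k, dim=-1)   # sparse activation: keep top_k experts
        for expert_id, expert in enumerate(self.routed):
            chosen = indices == expert_id                     # (num_tokens, top_k) hit mask
            token_mask = chosen.any(dim=-1)
            if token_mask.any():                              # run the expert only on routed tokens
                gate_w = (weights * chosen).sum(-1, keepdim=True)[token_mask]
                expert_out = torch.zeros_like(x)
                expert_out[token_mask] = gate_w * expert(x[token_mask])
                out = out + expert_out
        return out

# Usage: route 10 token embeddings through the layer.
layer = SparseMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```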
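Similarly, the group-relative advantage at the core of GRPO can be sketched in a few lines. This is a simplified illustration under assumed tensor shapes and a toy rule-based reward, not the training code used for DeepSeek-R1-Zero; the KL penalty toward a reference policy and token-level credit assignment are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled answer's reward is normalized
    against the mean/std of its own group, so no learned value critic is needed.
    rewards: (num_prompts, group_size), one row of sampled answers per prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective weighted by group-relative
    advantages (per-response log-probabilities under new and old policies)."""
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Usage: 2 prompts, 4 sampled answers each, scored by a rule-based 0/1 reward.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
logp_old = torch.randn(2, 4)
logp_new = logp_old + 0.05 * torch.randn(2, 4)
print(grpo_surrogate_loss(logp_new, logp_old, adv))
```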
DOI 10.3724/SP.J.1249.2025.02226
ISSN 1000-2618
OpenAccessLink https://doaj.org/article/04ec3b5b55254ca98d5fc27d953867af
SubjectTerms artificial intelligence
deepseek
group relative policy optimization
large language model
mixed-precision training
mixture of experts architecture
multi-head latent attention mechanism
multi-token prediction