How DeepSeek-R1 was created?
Published in | Shenzhen da xue xue bao. Li gong ban, Vol. 42, No. 2, pp. 226-232 |
---|---|
Main Author | ZHANG Huimin |
Format | Journal Article |
Language | English |
Published | Science Press (China Science Publishing & Media Ltd.), 2025-03-01 |
Subjects | artificial intelligence; DeepSeek; group relative policy optimization; large language model; mixed-precision training; mixture of experts architecture; multi-head latent attention mechanism; multi-token prediction |
DOI | 10.3724/SP.J.1249.2025.02226 |
ISSN | 1000-2618 |
Online Access | https://doaj.org/article/04ec3b5b55254ca98d5fc27d953867af (DOAJ, open access) |
Abstract | This article summarizes the innovations and optimizations in the DeepSeek series models for large-scale training. The breakthroughs of DeepSeek are primarily reflected in model and algorithm innovations, software and hardware co-optimization, and the improvement of overall training efficiency. DeepSeek-V3 adopts a mixture of experts (MoE) architecture, achieving efficient utilization of computing resources through fine-grained expert design and a shared-expert strategy. The sparse activation mechanism and lossless load-balancing strategy in the MoE architecture significantly enhance the efficiency and performance of model training, especially when handling large-scale data and complex tasks. The innovative multi-head latent attention (MLA) mechanism reduces memory usage and accelerates the inference process, thus lowering training and inference costs. In DeepSeek-V3's training, the introduction of multi-token prediction (MTP) and 8-bit floating-point (FP8) mixed-precision training improves the model's contextual understanding and training efficiency, while optimized parallel thread execution (PTX) code significantly enhances the computational efficiency of graphics processing units (GPUs). In training the DeepSeek-R1-Zero model, group relative policy optimization (GRPO) is used for pure reinforcement learning, bypassing the traditional supervised fine-tuning and human-feedback stages and leading to a significant improvement in reasoning capability. Overall, the DeepSeek series models have achieved significant advantages in the field of artificial intelligence through multiple innovations, setting a new industry benchmark. |
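The sparse activation described in the abstract can be illustrated with a short sketch: a router scores every expert for each token, but only the top-k experts are actually evaluated. The following is a minimal, illustrative top-k gating routine in Python/NumPy; the layer sizes, the choice of `top_k = 2`, and the softmax renormalization over the selected experts are assumptions for demonstration, not details taken from DeepSeek-V3's actual router.

```python
import numpy as np

def topk_gate(hidden, gate_weights, top_k=2):
    """Minimal sparse-MoE gating sketch: route each token to its top-k experts.

    hidden:       (tokens, d_model) token representations
    gate_weights: (d_model, n_experts) router projection
    Returns, per token, the chosen expert indices and their normalized weights.
    """
    logits = hidden @ gate_weights                      # (tokens, n_experts) router scores
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of the k highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Renormalize over the selected experts only: all other experts get zero weight,
    # so only k expert FFNs need to be evaluated per token (sparse activation).
    top_weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    top_weights /= top_weights.sum(axis=-1, keepdims=True)
    return top_idx, top_weights

# Toy usage (hypothetical sizes): 4 tokens, 8-dim hidden states, 16 experts, 2 active per token.
rng = np.random.default_rng(0)
idx, w = topk_gate(rng.normal(size=(4, 8)), rng.normal(size=(8, 16)))
print(idx.shape, w.sum(axis=-1))  # (4, 2) and per-token weights summing to 1
```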
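The GRPO step mentioned for DeepSeek-R1-Zero can likewise be sketched at the level of its group-relative advantage: rewards for a group of responses sampled from the same prompt are standardized against the group's own mean and standard deviation, so no separate critic (value) model is needed. The snippet below covers only that normalization, not the full clipped policy-gradient objective; the group size and the rule-based rewards in the usage example are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward against its own group.

    rewards: scalar rewards r_1..r_G for G responses sampled from one prompt.
    Returns A_i = (r_i - mean(r)) / (std(r) + eps), used to weight the policy
    gradient of each response in place of a learned critic's value estimate.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy example: a group of 4 sampled answers scored by a rule-based reward
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
```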