HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts

Bibliographic Details
Main Authors: Do, Giang; Le, Khiem; Pham, Quang; Nguyen, TrungTin; Doan, Thanh-Nam; Nguyen, Binh T.; Liu, Chenghao; Ramasamy, Savitha; Li, Xiaoli; Hoi, Steven
Format Journal Article
Language English
Published 12.12.2023
Subjects
Online Access Get full text

Abstract By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter compared to existing routing methods. Our implementation is publicly available at https://github.com/giangdip2410/HyperRouter.
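The core idea in the abstract — a frozen hypernetwork that maps small trainable per-expert embeddings to the router's weight matrix, so only the embeddings are updated during training — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; all names, dimensions, and the top-k selection details are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_emb, k = 16, 4, 8, 2

# Fixed hypernetwork (frozen after random init): maps an expert
# embedding to that expert's row of the router weight matrix.
W_hyper = rng.standard_normal((d_emb, d_model)) / np.sqrt(d_emb)

# Trainable per-expert embeddings: the only routing parameters
# that would receive gradient updates in training.
expert_emb = rng.standard_normal((n_experts, d_emb))

def router_weights(emb):
    # Hypernetwork output: one d_model-dim router row per expert.
    return emb @ W_hyper                      # (n_experts, d_model)

def route(x, emb, k=2):
    # Score each token against every expert, then pick the top-k
    # experts per token (a common sparse-MoE dispatch rule).
    logits = x @ router_weights(emb).T        # (batch, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]
    return topk, logits

x = rng.standard_normal((3, d_model))         # 3 toy tokens
topk, logits = route(x, expert_emb, k)
print(topk.shape)  # (3, 2): k expert indices per token
```

Because `W_hyper` stays fixed, the routing policy can still improve (through `expert_emb`) while avoiding the fully-trainable router that drives representation collapse — the trade-off the abstract describes.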
Copyright http://creativecommons.org/licenses/by/4.0
DOI 10.48550/arxiv.2312.07035
OpenAccessLink https://arxiv.org/abs/2312.07035
SecondaryResourceType preprint
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Learning