HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts

Bibliographic Details
Main Authors: Do, Giang; Le, Khiem; Pham, Quang; Nguyen, TrungTin; Doan, Thanh-Nam; Nguyen, Binh T.; Liu, Chenghao; Ramasamy, Savitha; Li, Xiaoli; Hoi, Steven
Format Journal Article
Language English
Published 12.12.2023
Subjects
Online Access Get full text

Abstract By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter compared to existing routing methods. Our implementation is publicly available at https://github.com/giangdip2410/HyperRouter.
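The core idea in the abstract — a frozen hypernetwork that maps small trainable per-expert embeddings to the router's weight matrix, so only the embeddings are updated during training — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; all names, dimensions, and the top-k selection details are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_emb, k = 16, 4, 8, 2

# Fixed hypernetwork (frozen after random init): maps an expert
# embedding to that expert's row of the router weight matrix.
W_hyper = rng.standard_normal((d_emb, d_model)) / np.sqrt(d_emb)

# Trainable per-expert embeddings: the only routing parameters
# that would receive gradient updates in training.
expert_emb = rng.standard_normal((n_experts, d_emb))

def router_weights(emb):
    # Hypernetwork output: one d_model-dim router row per expert.
    return emb @ W_hyper                      # (n_experts, d_model)

def route(x, emb, k=2):
    # Score each token against every expert, then pick the top-k
    # experts per token (a common sparse-MoE dispatch rule).
    logits = x @ router_weights(emb).T        # (batch, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]
    return topk, logits

x = rng.standard_normal((3, d_model))         # 3 toy tokens
topk, logits = route(x, expert_emb, k)
print(topk.shape)  # (3, 2): k expert indices per token
```

Because `W_hyper` stays fixed, the routing policy can still improve (through `expert_emb`) while avoiding the fully-trainable router that drives representation collapse — the trade-off the abstract describes.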
Copyright http://creativecommons.org/licenses/by/4.0
DOI 10.48550/arxiv.2312.07035
OpenAccessLink https://arxiv.org/abs/2312.07035
SecondaryResourceType preprint
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Learning