The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models
Published in | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 40 - 52 |
---|---|
Main Authors | Zhou, Xin; Kim, Kisub; Xu, Bowen; Liu, Jiakun; Han, DongGyun; Lo, David |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 11.09.2023 |
Subjects | Automation; Behavioral sciences; Codes; Data models; Software engineering; Tail; Task analysis |
ISSN | 2643-1572 |
DOI | 10.1109/ASE56229.2023.00157 |
Abstract | Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation. |
---|---|
Author | Zhou, Xin; Xu, Bowen; Lo, David; Han, DongGyun; Kim, Kisub; Liu, Jiakun |
Author_xml | – sequence: 1 givenname: Xin surname: Zhou fullname: Zhou, Xin email: xinzhou.2020@phdcs.smu.edu.sg organization: Singapore Management University, Singapore – sequence: 2 givenname: Kisub surname: Kim fullname: Kim, Kisub email: kisubkim@smu.edu.sg organization: Singapore Management University, Singapore – sequence: 3 givenname: Bowen surname: Xu fullname: Xu, Bowen email: bowenxu.2017@phdcs.smu.edu.sg organization: Singapore Management University, Singapore – sequence: 4 givenname: Jiakun surname: Liu fullname: Liu, Jiakun email: jkliu@smu.edu.sg organization: Singapore Management University, Singapore – sequence: 5 givenname: DongGyun surname: Han fullname: Han, DongGyun email: donggyun.han@rhul.ac.uk organization: University of London, Royal Holloway, UK – sequence: 6 givenname: David surname: Lo fullname: Lo, David email: davidlo@smu.edu.sg organization: Singapore Management University, Singapore |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/ASE56229.2023.00157 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9798350329964 |
EISSN | 2643-1572 |
EndPage | 52 |
ExternalDocumentID | 10298393 |
Genre | orig-research |
GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IM 6IN 6J9 AAJGR AAWTH ABLEC ACREN ADYOE ADZIZ AFYQB ALMA_UNASSIGNED_HOLDINGS AMTXH BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IPLJI M43 OCL RIE RIL |
ID | FETCH-LOGICAL-a284t-94e3aaee2178ea330d4214e95bb49c574bdc6c6271e5976940052ed4905e5193 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:32:41 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-a284t-94e3aaee2178ea330d4214e95bb49c574bdc6c6271e5976940052ed4905e5193 |
PageCount | 13 |
ParticipantIDs | ieee_primary_10298393 |
PublicationCentury | 2000 |
PublicationDate | 2023-Sept.-11 |
PublicationDateYYYYMMDD | 2023-09-11 |
PublicationDate_xml | – month: 09 year: 2023 text: 2023-Sept.-11 day: 11 |
PublicationDecade | 2020 |
PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
PublicationTitleAbbrev | ASE |
PublicationYear | 2023 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0051577 ssib057256115 |
Score | 2.3235104 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 40 |
SubjectTerms | Automation; Behavioral sciences; Codes; Data models; Software engineering; Tail; Task analysis |
Title | The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models |
URI | https://ieeexplore.ieee.org/document/10298393 |
hasFullText | 1 |
inHoldings | 1 |