The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models

Bibliographic Details
Published in IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 40-52
Main Authors Zhou, Xin; Kim, Kisub; Xu, Bowen; Liu, Jiakun; Han, DongGyun; Lo, David
Format Conference Proceeding
Language English
Published IEEE 11.09.2023
ISSN 2643-1572
DOI 10.1109/ASE56229.2023.00157

Abstract Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.
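
The head-versus-tail comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that the "head" is the set of most frequent labels that together cover about half of the samples (the `head_share` parameter is an assumption, as are the function names `split_head_tail` and `accuracy_by_group`), and it uses plain exact-match accuracy as a stand-in for whatever metric a given task uses.

```python
from collections import Counter


def split_head_tail(labels, head_share=0.5):
    """Split label values into a frequent 'head' set and an infrequent 'tail' set.

    Assumption: the head is the smallest group of most frequent labels
    whose samples cover at least `head_share` of the dataset.
    """
    counts = Counter(labels)
    total = len(labels)
    head, covered = set(), 0
    for label, n in counts.most_common():
        if covered >= head_share * total:
            break
        head.add(label)
        covered += n
    tail = set(counts) - head
    return head, tail


def accuracy_by_group(labels, predictions, head):
    """Exact-match accuracy computed separately for head and tail samples."""
    hits = {"head": 0, "tail": 0}
    sizes = {"head": 0, "tail": 0}
    for truth, pred in zip(labels, predictions):
        group = "head" if truth in head else "tail"
        sizes[group] += 1
        hits[group] += int(truth == pred)
    return {
        g: (hits[g] / sizes[g]) if sizes[g] else float("nan")
        for g in hits
    }


if __name__ == "__main__":
    # Toy long-tailed data: one dominant label and several rare ones.
    labels = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
    # A hypothetical model that mostly predicts the dominant label.
    predictions = ["a"] * 6 + ["b", "a", "a", "a"]
    head, tail = split_head_tail(labels)
    print("head labels:", sorted(head))
    print("tail labels:", sorted(tail))
    print(accuracy_by_group(labels, predictions, head))
```

Reporting the two accuracies side by side, rather than a single aggregate score, is what exposes the kind of gap between frequent and infrequent labels that the abstract reports.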
Author Zhou, Xin
Xu, Bowen
Lo, David
Han, DongGyun
Kim, Kisub
Liu, Jiakun
Author_xml – sequence: 1
  givenname: Xin
  surname: Zhou
  fullname: Zhou, Xin
  email: xinzhou.2020@phdcs.smu.edu.sg
  organization: Singapore Management University,Singapore
– sequence: 2
  givenname: Kisub
  surname: Kim
  fullname: Kim, Kisub
  email: kisubkim@smu.edu.sg
  organization: Singapore Management University,Singapore
– sequence: 3
  givenname: Bowen
  surname: Xu
  fullname: Xu, Bowen
  email: bowenxu.2017@phdcs.smu.edu.sg
  organization: Singapore Management University,Singapore
– sequence: 4
  givenname: Jiakun
  surname: Liu
  fullname: Liu, Jiakun
  email: jkliu@smu.edu.sg
  organization: Singapore Management University,Singapore
– sequence: 5
  givenname: DongGyun
  surname: Han
  fullname: Han, DongGyun
  email: donggyun.han@rhul.ac.uk
  organization: University of London,Royal Holloway,UK
– sequence: 6
  givenname: David
  surname: Lo
  fullname: Lo, David
  email: davidlo@smu.edu.sg
  organization: Singapore Management University,Singapore
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ASE56229.2023.00157
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Computer Science
EISBN 9798350329964
EISSN 2643-1572
EndPage 52
ExternalDocumentID 10298393
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 13
ParticipantIDs ieee_primary_10298393
PublicationCentury 2000
PublicationDate 2023-Sept.-11
PublicationDateYYYYMMDD 2023-09-11
PublicationDate_xml – month: 09
  year: 2023
  text: 2023-Sept.-11
  day: 11
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 40
SubjectTerms Automation
Behavioral sciences
Codes
Data models
Software engineering
Tail
Task analysis
Title The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models
URI https://ieeexplore.ieee.org/document/10298393