SPAW-SMOTE: Space Partitioning Adaptive Weighted Synthetic Minority Oversampling Technique For Imbalanced Data Set Learning
Published in | Computer journal Vol. 67; no. 5; pp. 1747 - 1762 |
---|---|
Main Authors | Zhang, Qiang; He, Junjiang; Li, Tao; Lan, Xiaolong; Fang, Wenbo; Li, Yihong |
Format | Journal Article |
Language | English |
Published | Oxford University Press, 22.06.2024 |
Abstract | Data imbalance is common in real-world problems and can substantially degrade classifier performance. Most existing solutions balance the data set by generating new minority class samples, but they face three problems: selecting an appropriate area in which to generate samples, blurring of the classification boundary, and uneven distribution of samples. To address these problems, we propose a novel oversampling algorithm named space partitioning adaptive weighted synthetic minority oversampling technique (SPAW-SMOTE). We first divide the data space into boundary space and non-boundary space using spatial partitioning techniques. A purpose-designed adaptive weighting algorithm then assigns the number of samples to be generated to the different spaces, which addresses the uneven distribution of samples and the tendency to blur the classification boundary. Finally, we develop a new generation algorithm that reduces the probability of generating overlapping samples during synthesis and ensures the diversity of the new samples. Experimental results on 18 real-world data sets show that the average performance (G-mean, F1-measure and Area Under Curve) of SPAW-SMOTE is significantly better than that of other existing oversampling techniques. |
---|---|
Author | Fang, Wenbo; Lan, Xiaolong; He, Junjiang; Li, Tao; Li, Yihong; Zhang, Qiang |
Author_xml | – sequence: 1 givenname: Qiang surname: Zhang fullname: Zhang, Qiang – sequence: 2 givenname: Junjiang surname: He fullname: He, Junjiang email: hejunjiang@scu.edu.cn – sequence: 3 givenname: Tao surname: Li fullname: Li, Tao – sequence: 4 givenname: Xiaolong surname: Lan fullname: Lan, Xiaolong – sequence: 5 givenname: Wenbo surname: Fang fullname: Fang, Wenbo – sequence: 6 givenname: Yihong surname: Li fullname: Li, Yihong |
Cites_doi | 10.1002/9781118548387 |
ContentType | Journal Article |
Copyright | The British Computer Society 2023. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 2023 |
DBID | AAYXX CITATION |
DOI | 10.1093/comjnl/bxad098 |
DatabaseName | CrossRef |
DatabaseTitle | CrossRef |
DatabaseTitleList | CrossRef |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISSN | 1460-2067 |
EndPage | 1762 |
ExternalDocumentID | 10_1093_comjnl_bxad098 10.1093/comjnl/bxad098 |
ISSN | 0010-4620 |
IngestDate | Tue Jul 01 02:55:11 EDT 2025 Mon Jun 30 08:34:52 EDT 2025 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 5 |
Keywords | imbalance data adaptive spatial weight classification oversampling |
Language | English |
License | This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights) |
LinkModel | OpenURL |
PageCount | 16 |
ParticipantIDs | crossref_primary_10_1093_comjnl_bxad098 oup_primary_10_1093_comjnl_bxad098 |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2024-06-22 |
PublicationDateYYYYMMDD | 2024-06-22 |
PublicationDate_xml | – month: 06 year: 2024 text: 2024-06-22 day: 22 |
PublicationDecade | 2020 |
PublicationTitle | Computer journal |
PublicationYear | 2024 |
Publisher | Oxford University Press |
Publisher_xml | – name: Oxford University Press |
SSID | ssj0002096 |
Score | 2.3704062 |
SourceID | crossref oup |
SourceType | Index Database Publisher |
StartPage | 1747 |
Title | SPAW-SMOTE: Space Partitioning Adaptive Weighted Synthetic Minority Oversampling Technique For Imbalanced Data Set Learning |
Volume | 67 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | EBSCOhost |
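
The abstract names three stages: partitioning the data space into boundary and non-boundary space, distributing the number of samples to generate across those spaces by weight, and synthesizing the new minority samples. The sketch below illustrates that general pipeline for a binary problem only; it is not the SPAW-SMOTE algorithm itself, whose details this record does not give. The k-nearest-neighbour boundary rule, the fixed `boundary_weight` (standing in for the paper's adaptive weighting), the plain SMOTE-style interpolation, and all function names are assumptions made for illustration. A small `g_mean` helper shows the standard definition of the G-mean metric mentioned in the abstract (the square root of sensitivity times specificity).

```python
import math
import random


def nearest(points, i, k):
    """Indices of the k points closest (Euclidean) to points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def partition_minority(X, y, minority=1, k=3):
    """Assumed rule: a minority sample is 'boundary' if any of its k nearest
    neighbours belongs to the other class, otherwise 'non-boundary'."""
    boundary, interior = [], []
    for i, label in enumerate(y):
        if label != minority:
            continue
        mixed = any(y[j] != minority for j in nearest(X, i, k))
        (boundary if mixed else interior).append(i)
    return boundary, interior


def allocate(n_new, boundary, interior, boundary_weight=0.3):
    """Split the budget of n_new synthetic samples between the two regions.
    A fixed weight is a placeholder for the paper's adaptive weighting."""
    if not boundary or not interior:
        return (n_new if boundary else 0), (n_new if interior else 0)
    n_boundary = round(n_new * boundary_weight)
    return n_boundary, n_new - n_boundary


def interpolate(X, seeds, n, rng):
    """SMOTE-style generation: random points on the segment between two
    seeds drawn from the same region."""
    new_points = []
    for _ in range(n):
        a, b = X[rng.choice(seeds)], X[rng.choice(seeds)]
        t = rng.random()
        new_points.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
    return new_points


def oversample(X, y, minority=1, k=3, boundary_weight=0.3, seed=0):
    """Return synthetic minority points that bring two classes to parity."""
    rng = random.Random(seed)
    n_min = sum(label == minority for label in y)
    n_new = max(len(y) - 2 * n_min, 0)
    boundary, interior = partition_minority(X, y, minority, k)
    n_b, n_i = allocate(n_new, boundary, interior, boundary_weight)
    return interpolate(X, boundary, n_b, rng) + interpolate(X, interior, n_i, rng)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0
    return math.sqrt((tp / n_pos) * (tn / n_neg))
```

With a region that holds a single seed, the interpolation step can only reproduce that seed, which is exactly the kind of overlapping output the abstract says the paper's own generation algorithm is designed to reduce.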