Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features
Published in: Computational Statistics, Vol. 37, No. 5, pp. 2671–2692
Main Authors: Florian Pargent, Florian Pfisterer, Janek Thomas, Bernd Bischl
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.11.2022
Abstract: Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high cardinality features, i.e., unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary-, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e., using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g., integer encoding) or that reduce the number of levels (possibly based on target information, e.g., leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were less effective in comparison.
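The best-performing approach in the study, regularized target encoding, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function and parameter names (`target_encode`, `m`, `n_folds`) are ours, and it combines two common regularization ideas for target encoding — shrinking level means toward the global mean, and out-of-fold estimation so a row's encoding never sees its own target value.

```python
import numpy as np
import pandas as pd

def target_encode(train, col, target, m=10.0, n_folds=5, seed=0):
    """Replace each level of `col` with a regularized mean of `target`.

    Regularization: (1) level means are shrunk toward the global mean,
    with `m` controlling the strength (rare levels are pulled harder);
    (2) each row is encoded using statistics from the *other* folds only,
    so the encoding never sees that row's own target value.
    """
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(train))
    global_mean = train[target].mean()
    encoded = np.full(len(train), global_mean, dtype=float)
    for k in range(n_folds):
        fit = train[folds != k]  # estimate level statistics out-of-fold
        stats = fit.groupby(col)[target].agg(["sum", "count"])
        # additive smoothing: shrink the level mean toward the global mean
        smoothed = (stats["sum"] + m * global_mean) / (stats["count"] + m)
        mask = folds == k
        encoded[mask] = (
            train.loc[mask, col].map(smoothed).fillna(global_mean).to_numpy()
        )
    return encoded
```

At prediction time one would compute the smoothed level means once on the full training set and map unseen levels to the global mean, as the `fillna` fallback does here.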
Authors:
Florian Pargent (ORCID: 0000-0002-2388-553X, florian.pargent@psy.lmu.de), Department of Psychology, Psychological Methods and Assessment, LMU Munich
Florian Pfisterer (ORCID: 0000-0001-8867-762X), Department of Statistics, Statistical Learning and Data Science, LMU Munich
Janek Thomas (ORCID: 0000-0003-4511-6245), Department of Statistics, Statistical Learning and Data Science, LMU Munich
Bernd Bischl (ORCID: 0000-0001-6002-6980), Department of Statistics, Statistical Learning and Data Science, LMU Munich
Copyright: The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI: 10.1007/s00180-022-01207-6
Discipline: Statistics; Mathematics
EISSN: 1613-9658
End Page: 2692
Grant Information: Bayerisches Staatsministerium für Wirtschaft und Medien, Energie und Technologie (grant 20-3410-2-9-8); Bundesministerium für Bildung, Wissenschaft und Kultur (grant 01IS18036A)
ISSN: 0943-4062
Open Access: yes
Peer Reviewed: yes
Issue: 5
Keywords: Benchmark; Dummy encoding; Supervised machine learning; Generalized linear mixed models; Target encoding; High-cardinality categorical features
Open Access Link: https://doi.org/10.1007/s00180-022-01207-6
Page Count: 22
Publication Date: 2022-11-01
1207_CR45 publication-title: Stat Model doi: 10.1177/1471082X16652780 – ident: 1207_CR25 doi: 10.1007/978-3-540-70981-7_19 – start-page: 359 volume-title: Artificial neural networks and neural information processing – ICANN/ICONIP 2003 year: 2003 ident: 1207_CR21 doi: 10.1007/3-540-44989-2_43 – ident: 1207_CR36 – volume: 67 start-page: 1 year: 2015 ident: 1207_CR2 publication-title: J Stat Softw doi: 10.18637/jss.v067.i01 – ident: 1207_CR13 – volume: 30 start-page: 27 year: 2009 ident: 1207_CR17 publication-title: Pattern Recogn Lett doi: 10.1016/j.patrec.2008.08.010 – ident: 1207_CR35 – ident: 1207_CR41 – year: 2017 ident: 1207_CR28 publication-title: J Open Source Softw doi: 10.21105/joss.00135 – volume: 15 start-page: 3133 year: 2014 ident: 1207_CR16 publication-title: J Mach Learn Res – volume: 32 start-page: 1 year: 2010 ident: 1207_CR29 publication-title: J Stat Softw doi: 10.18637/jss.v032.i09 – volume: 14 start-page: 675 year: 2005 ident: 1207_CR26 publication-title: J Comput Graph Stat doi: 10.1198/106186005X59630 – ident: 1207_CR39 – start-page: 146 volume-title: Trends and applications in information systems and technologies year: 2021 ident: 1207_CR40 doi: 10.1007/978-3-030-72657-7_14 – volume: 107 start-page: 1477 year: 2018 ident: 1207_CR9 publication-title: Mach Learn doi: 10.1007/s10994-018-5724-2 – ident: 1207_CR47 doi: 10.1145/1553374.1553516 – volume: 3 start-page: 27 year: 2001 ident: 1207_CR31 publication-title: SIGKDD Explor Newsl doi: 10.1145/507533.507538 – ident: 1207_CR49 doi: 10.18637/jss.v077.i01 – volume: 41 start-page: 471 year: 1976 ident: 1207_CR14 publication-title: Psychometrika doi: 10.1007/BF02296971 – ident: 1207_CR48 doi: 10.7717/peerj.6339 – volume: 13 start-page: 27 year: 2012 ident: 1207_CR7 publication-title: J Mach Learn Res – ident: 1207_CR15 – volume: 75 start-page: 21 year: 2018 ident: 1207_CR38 publication-title: Image Vis Comput doi: 10.1016/j.imavis.2018.04.004 – ident: 1207_CR11 – ident: 1207_CR30 – volume: 60 
start-page: 216 year: 2017 ident: 1207_CR6 publication-title: Biomet J Biomet Zeitschrift doi: 10.1002/bimj.201700129 – volume-title: Feature engineering and selection: a practical approach for predictive models year: 2019 ident: 1207_CR27 doi: 10.1201/9781315108230 – volume: 7 start-page: 1 year: 2020 ident: 1207_CR23 publication-title: J Big Data doi: 10.1186/s40537-020-00305-w – ident: 1207_CR33 doi: 10.1002/widm.1441 – ident: 1207_CR43 – ident: 1207_CR44 doi: 10.1145/2487575.2487629 – volume: 15 start-page: 49 year: 2013 ident: 1207_CR46 publication-title: SIGKDD Explor doi: 10.1145/2641190.2641198 – volume: 17 start-page: 1 year: 2016 ident: 1207_CR4 publication-title: J Mach Learn Res – ident: 1207_CR34 doi: 10.1002/widm.1301 – volume-title: Data analysis using regression and multilevel/hierarchical models year: 2006 ident: 1207_CR20 doi: 10.1017/CBO9780511790942 – year: 2020 ident: 1207_CR5 publication-title: Comput Stat Data Anal doi: 10.1016/j.csda.2019.106839 – volume: 41 start-page: 505 year: 1976 ident: 1207_CR50 publication-title: Psychometrika doi: 10.1007/BF02296972 – ident: 1207_CR37 – ident: 1207_CR3 doi: 10.32614/CRAN.package.mlrCPO – ident: 1207_CR10 doi: 10.1201/9780203738535 – ident: 1207_CR1 |
Snippet | Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis.... |
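The snippet describes regularized target encoding: replacing each categorical level with a target prediction derived from the training data. As a hedged illustration of the idea only (this is not code from the paper; the function name, the `smoothing` parameter, and the toy data are invented here), a smoothed target encoder for a regression target can be sketched as:

```python
from collections import defaultdict

def target_encode(levels, targets, smoothing=10.0):
    """Smoothed target encoding: map each categorical level to a
    shrunken mean of the target, pulled toward the global mean.

    `smoothing` controls the regularization strength: higher values
    shrink rare levels more strongly toward the global mean.
    Returns the per-level encoding and the global-mean fallback
    for levels unseen at prediction time.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lvl, y in zip(levels, targets):
        sums[lvl] += y
        counts[lvl] += 1
    encoding = {}
    for lvl, n in counts.items():
        level_mean = sums[lvl] / n
        # Shrinkage: count-weighted average of level mean and global mean.
        encoding[lvl] = (n * level_mean + smoothing * global_mean) / (n + smoothing)
    return encoding, global_mean

# Toy example: a high-cardinality-style feature with one rare level "c".
levels = ["a", "a", "a", "b", "b", "c"]
targets = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
enc, fallback = target_encode(levels, targets, smoothing=2.0)
```

With `smoothing=2.0`, the rare level `"c"` (observed mean 1.0) is pulled toward the global mean 0.5, ending at 2/3; in practice the paper's "regularized" variants additionally fit such encodings out-of-fold so a row's own target does not leak into its encoding.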
SourceID | proquest crossref springer |
SourceType | Aggregation Database; Enrichment Source; Index Database; Publisher |
StartPage | 2671 |
SubjectTerms | Algorithms; Best practice; Data analysis; Economic Theory/Quantitative Economics/Mathematical Methods; Machine learning; Mathematics and Statistics; Original Paper; Performance prediction; Probability and Statistics in Computer Science; Probability Theory and Stochastic Processes; Statistics; Support vector machines |
Title | Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features |
URI | https://link.springer.com/article/10.1007/s00180-022-01207-6 https://www.proquest.com/docview/2719231219 |
Volume | 37 |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Regularized+target+encoding+outperforms+traditional+methods+in+supervised+machine+learning+with+high+cardinality+features&rft.jtitle=Computational+statistics&rft.au=Pargent%2C+Florian&rft.au=Pfisterer%2C+Florian&rft.au=Thomas%2C+Janek&rft.au=Bischl%2C+Bernd&rft.date=2022-11-01&rft.pub=Springer+Nature+B.V&rft.issn=0943-4062&rft.eissn=1613-9658&rft.volume=37&rft.issue=5&rft.spage=2671&rft.epage=2692&rft_id=info:doi/10.1007%2Fs00180-022-01207-6&rft.externalDBID=HAS_PDF_LINK |