Windowing as a Sub-Sampling Method for Distributed Data Mining
Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the hi...
Published in | Mathematical and computational applications Vol. 25; no. 3; p. 39 |
---|---|
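Main Authors | Martínez-Galicia, David; Guerra-Hernández, Alejandro; Cruz-Ramírez, Nicandro; Limón, Xavier; Grimaldo, Francisco |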
Format | Journal Article |
Language | English |
Published | MDPI AG, 30.06.2020 |
Subjects | distributed data mining; sub-sampling; windowing |
Abstract | Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer examples used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. For this, the generalization of the behavior of windowing beyond decision trees is established, by corroborating the observed negative correlation when adopting inductive algorithms of different nature. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1; and compared to those obtained when using traditional methods: random, balanced, and stratified samplings. It is shown that the aggressive sampling performed by windowing, up to 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only the balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to enhance such understanding and the performance of the method, i.e., studying the evolution of the windows over time. |
---|---|
Author | Martínez-Galicia, David; Guerra-Hernández, Alejandro (ORCID 0000-0002-4856-4011); Cruz-Ramírez, Nicandro (ORCID 0000-0002-0708-9875); Limón, Xavier; Grimaldo, Francisco (ORCID 0000-0002-1357-7170) |
DOI | 10.3390/mca25030039 |
Discipline | Mathematics |
EISSN | 2297-8747 |
ISSN | 2297-8747 1300-686X |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
ORCID | 0000-0002-0708-9875 0000-0002-4856-4011 0000-0002-1357-7170 |
OpenAccessLink | https://doaj.org/article/6debc5ba54534b8280fea34c9e79a7e0 |
PublicationDate | 2020-06-30 |
PublicationTitle | Mathematical and computational applications |
PublicationYear | 2020 |
Publisher | MDPI AG |
StartPage | 39 |
SubjectTerms | distributed data mining; sub-sampling; windowing |
Title | Windowing as a Sub-Sampling Method for Distributed Data Mining |
Volume | 25 |