Windowing as a Sub-Sampling Method for Distributed Data Mining

Bibliographic Details
Published in Mathematical and computational applications Vol. 25; no. 3; p. 39
Main Authors Martínez-Galicia, David, Guerra-Hernández, Alejandro, Cruz-Ramírez, Nicandro, Limón, Xavier, Grimaldo, Francisco
Format Journal Article
Language English
Published MDPI AG 30.06.2020

Abstract Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer examples used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. For this, the generalization of the behavior of windowing beyond decision trees is established, by corroborating the observed negative correlation when adopting inductive algorithms of different nature. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1; and compared to those obtained when using traditional methods: random, balanced, and stratified samplings. It is shown that the aggressive sampling performed by windowing, up to 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only the balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to enhance such understanding and the performance of the method, i.e., studying the evolution of the windows over time.
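The abstract does not spell out the windowing procedure itself. As a rough illustration only, the sketch below follows the commonly described form of the technique: induce a model from a small random window, test it on the remaining examples, move the misclassified ones into the window, and repeat until none are left. Everything named here is an assumption made for illustration: `fit_stump` is a hypothetical stand-in learner (a single-threshold rule, not ID3 or C4.5), and `windowing`, `initial_fraction`, `max_rounds` and the toy dataset are not taken from the paper.

```python
import random


def fit_stump(examples):
    """Hypothetical stand-in learner (not ID3/C4.5): best single-threshold rule."""
    labels = sorted({y for _, y in examples})
    thresholds = sorted({x for x, _ in examples})
    rules = [(t, lo, hi) for t in thresholds for lo in labels for hi in labels]

    def errors(rule):
        t, lo, hi = rule
        return sum((lo if x <= t else hi) != y for x, y in examples)

    # Pick the rule with the fewest training errors on the current window.
    t, lo, hi = min(rules, key=errors)
    return lambda x: lo if x <= t else hi


def windowing(data, fit, initial_fraction=0.05, max_rounds=20, seed=0):
    """Grow a window with the counterexamples of the current model."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    k = max(1, int(initial_fraction * len(pool)))
    window, rest = pool[:k], pool[k:]
    model = fit(window)
    for _ in range(max_rounds):
        wrong = [ex for ex in rest if model(ex[0]) != ex[1]]
        if not wrong:
            break  # the model is consistent with every remaining example
        window.extend(wrong)
        rest = [ex for ex in rest if model(ex[0]) == ex[1]]
        model = fit(window)
    return model, window


# Toy, linearly separable data: label depends on a single threshold.
data = [(x, "pos" if x >= 120 else "neg") for x in range(200)]
model, window = windowing(data, fit_stump)
accuracy = sum(model(x) == y for x, y in data) / len(data)
print(f"window size: {len(window)} of {len(data)}, accuracy: {accuracy:.2f}")
```

On this toy data the final window is smaller than the full dataset while the model still classifies every example correctly, which is the kind of behavior the abstract reports; the sketch says nothing about the paper's actual datasets, learners, or measured sample sizes.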
Author_xml – sequence: 1
  givenname: David
  surname: Martínez-Galicia
  fullname: Martínez-Galicia, David
– sequence: 2
  givenname: Alejandro
  orcidid: 0000-0002-4856-4011
  surname: Guerra-Hernández
  fullname: Guerra-Hernández, Alejandro
– sequence: 3
  givenname: Nicandro
  orcidid: 0000-0002-0708-9875
  surname: Cruz-Ramírez
  fullname: Cruz-Ramírez, Nicandro
– sequence: 4
  givenname: Xavier
  surname: Limón
  fullname: Limón, Xavier
– sequence: 5
  givenname: Francisco
  orcidid: 0000-0002-1357-7170
  surname: Grimaldo
  fullname: Grimaldo, Francisco
CitedBy_id crossref_primary_10_3390_math9222917
Cites_doi 10.1007/BF00117105
10.1016/0890-5401(89)90010-2
10.1080/03610928008827904
10.1613/jair.279
10.1613/jair.487
10.1080/01621459.1937.10503522
10.1007/s10115-018-1222-x
10.1214/aoms/1177729694
10.1016/j.ipm.2009.03.002
10.1214/aos/1176350051
10.1007/978-3-030-29349-9
10.1109/ACCESS.2020.2991800
10.1007/978-0-85729-388-6
10.1214/aoms/1177731944
10.1016/j.patrec.2016.11.006
10.1007/BF00116251
ContentType Journal Article
DBID AAYXX
CITATION
DOA
DOI 10.3390/mca25030039
DatabaseName CrossRef
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
DatabaseTitleList CrossRef

Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
DeliveryMethod fulltext_linktorsrc
Discipline Mathematics
EISSN 2297-8747
ExternalDocumentID oai_doaj_org_article_6debc5ba54534b8280fea34c9e79a7e0
10_3390_mca25030039
IEDL.DBID DOA
ISSN 2297-8747
1300-686X
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
LinkModel DirectLink
ORCID 0000-0002-0708-9875
0000-0002-4856-4011
0000-0002-1357-7170
OpenAccessLink https://doaj.org/article/6debc5ba54534b8280fea34c9e79a7e0
ParticipantIDs doaj_primary_oai_doaj_org_article_6debc5ba54534b8280fea34c9e79a7e0
crossref_primary_10_3390_mca25030039
PublicationCentury 2000
PublicationDate 2020-06-30
PublicationDateYYYYMMDD 2020-06-30
PublicationDate_xml – month: 06
  year: 2020
  text: 2020-06-30
  day: 30
PublicationDecade 2020
PublicationTitle Mathematical and computational applications
PublicationYear 2020
Publisher MDPI AG
Publisher_xml – name: MDPI AG
SourceID doaj
crossref
SourceType Open Website
Aggregation Database
StartPage 39
SubjectTerms distributed data mining
sub-sampling
windowing
Title Windowing as a Sub-Sampling Method for Distributed Data Mining
URI https://doaj.org/article/6debc5ba54534b8280fea34c9e79a7e0
Volume 25