An Information Theoretic Approach to Detection of Minority Subsets

Bibliographic Details
Published in Transactions of the Japanese Society for Artificial Intelligence Vol. 22; no. 3; pp. 311-321
Main Authors Ando, Shin; Sakuma, Jun; Suzuki, Einoshin; Kobayashi, Shigenobu
Format Journal Article
Language English, Japanese
Published Tokyo: The Japanese Society for Artificial Intelligence; Japan Science and Technology Agency, 2007
Abstract Unsupervised learning techniques such as clustering are useful for obtaining a summary of a dataset. However, applying them to large databases can be computationally expensive. Alternatively, useful information can be retrieved from subsets of the data more efficiently. This paper addresses the problem of finding a small subset of minority instances whose distribution differs significantly from that of the majority. In general, such a subset can overlap substantially with the majority, which is problematic for conventional distribution estimation. This paper proposes a new approach for estimating a minority distribution based on the Information Theoretic Framework, an extension of Rate Distortion Theory to unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters that maximize the divergence between the minority and majority distributions, (b) penalizes redundancy in the data representation based on the mutual information between the observed and hidden variables, and (c) employs a hard-assignment approximation to avoid computing trivial conditional probabilities. The algorithm has no problem-dependent parameters, and its time and space complexities are linear in the size of the minority subset. Experiments on artificial datasets show that the proposed method achieves high precision and sensitivity in detecting minority subsets that overlap substantially with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods on real-world benchmark datasets for text and satellite imagery classification.
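
The abstract names two information-theoretic quantities: a divergence between the minority and majority distributions, and the mutual information between observed and hidden variables. The sketch below is only an illustration of those two quantities for small discrete distributions; it is not the paper's algorithm. It assumes the divergence is the Kullback-Leibler divergence (the abstract does not say which divergence is used), and the function names `kl_divergence` and `mutual_information` as well as the example numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Z) in nats from a joint probability table joint[x][z]."""
    px = [sum(row) for row in joint]        # marginal over observed values x
    pz = [sum(col) for col in zip(*joint)]  # marginal over hidden values z
    return sum(
        pxz * math.log(pxz / (px[i] * pz[j]))
        for i, row in enumerate(joint)
        for j, pxz in enumerate(row)
        if pxz > 0
    )


# Hypothetical minority and majority distributions over three bins.
minority = [0.7, 0.2, 0.1]
majority = [0.1, 0.2, 0.7]
print(kl_divergence(minority, majority))

# A hard assignment ties each x to exactly one z, so every conditional
# probability p(z | x) is 0 or 1.
hard_joint = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(hard_joint))
```

With a hard assignment, the conditional probabilities are all 0 or 1, which is one way to read the abstract's remark about avoiding the computation of trivial conditional probabilities.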
Author Affiliations
Ando, Shin: Graduate School of Engineering, Yokohama National University
Sakuma, Jun: Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
Suzuki, Einoshin: Graduate School of Information Science and Electrical Engineering, Kyushu University
Kobayashi, Shigenobu: Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
Copyright 2007 JSAI (The Japanese Society for Artificial Intelligence)
Copyright Japan Science and Technology Agency 2007
DOI 10.1527/tjsai.22.311
Discipline Computer Science
EISSN 1346-8030
ISSN 1346-0714
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
OpenAccessLink https://www.jstage.jst.go.jp/article/tjsai/22/3/22_3_311/_article/-char/en
Subject Terms information theoretic framework; minority detection; mixture estimation; text classification
URI https://www.jstage.jst.go.jp/article/tjsai/22/3/22_3_311/_article/-char/en
https://www.proquest.com/docview/1476944160