An Information Theoretic Approach to Detection of Minority Subsets
Unsupervised learning techniques, e.g. clustering, are useful for obtaining a summary of a dataset. However, their application to large databases can be computationally expensive. Alternatively, useful information can be retrieved from subsets of the data more efficiently yet still effectively. This pap...
Published in | Transactions of the Japanese Society for Artificial Intelligence Vol. 22; no. 3; pp. 311 - 321 |
---|---|
Main Authors | Ando, Shin; Sakuma, Jun; Suzuki, Einoshin; Kobayashi, Shigenobu |
Format | Journal Article |
Language | English; Japanese |
Published | Tokyo: The Japanese Society for Artificial Intelligence, 2007; Japan Science and Technology Agency |
Subjects | information theoretic framework; minority detection; mixture estimation; text classification |
Abstract | Unsupervised learning techniques, e.g. clustering, are useful for obtaining a summary of a dataset. However, their application to large databases can be computationally expensive. Alternatively, useful information can be retrieved from subsets of the data more efficiently yet still effectively. This paper addresses the problem of finding a small subset of minority instances whose distribution differs significantly from that of the majority. Generally, such a subset can substantially overlap with the majority, which is problematic for conventional distribution estimation. This paper proposes a new approach for estimating a minority distribution based on the Information Theoretic Framework, an extension of Rate Distortion Theory for unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters that maximize the divergence between the minority and majority distributions, (b) penalizes the redundancy of data expression based on the mutual information between the observed and hidden variables, and (c) employs a hard assignment approximation to avoid computation of trivial conditional probabilities. The proposed algorithm has no problem-dependent parameters, and its time and space complexities are linear in the size of the minority subset. Experiments using artificial datasets show that the proposed method yields significantly high precision and sensitivity in detecting minority subsets that substantially overlap with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods on real-world benchmark datasets for text and satellite imagery classification. |
---|---|
Author | Ando, Shin; Kobayashi, Shigenobu; Suzuki, Einoshin; Sakuma, Jun |
Author_xml | 1. Ando, Shin (Graduate School of Engineering, Yokohama National University); 2. Sakuma, Jun (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology); 3. Suzuki, Einoshin (Graduate School of Information Science and Electrical Engineering, Kyushu University); 4. Kobayashi, Shigenobu (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology) |
CitedBy_id | crossref_primary_10_1527_tjsai_23_344 crossref_primary_10_1527_tjsai_23_163 |
Cites_doi | 10.1145/1081870.1081891 10.1145/1015330.1015431 10.1023/A:1007692713085 10.1137/1.9781611972740.22 10.1145/1015330.1015399 10.1007/s10115-003-0086-9 10.1007/978-94-011-5014-9_12 10.1145/335191.335388 10.1023/B:AIRE.0000045502.10941.a9 10.1007/3-540-47977-5_38 10.1002/047174882X |
ContentType | Journal Article |
Copyright | 2007 JSAI (The Japanese Society for Artificial Intelligence); Copyright Japan Science and Technology Agency 2007 |
Copyright_xml | – notice: 2007 JSAI (The Japanese Society for Artificial Intelligence) – notice: Copyright Japan Science and Technology Agency 2007 |
DBID | AAYXX CITATION 7SC 8FD JQ2 L7M L~C L~D |
DOI | 10.1527/tjsai.22.311 |
DatabaseName | CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Computer and Information Systems Abstracts |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISSN | 1346-8030 |
EndPage | 321 |
ExternalDocumentID | 3180061281 10_1527_tjsai_22_311 article_tjsai_22_3_22_3_311_article_char_en |
ISSN | 1346-0714 |
IngestDate | Thu Oct 10 18:59:59 EDT 2024 Fri Aug 23 00:34:17 EDT 2024 Wed Apr 05 14:07:47 EDT 2023 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Issue | 3 |
Language | English; Japanese |
LinkModel | OpenURL |
OpenAccessLink | https://www.jstage.jst.go.jp/article/tjsai/22/3/22_3_311/_article/-char/en |
PQID | 1476944160 |
PQPubID | 2029095 |
PageCount | 11 |
ParticipantIDs | proquest_journals_1476944160 crossref_primary_10_1527_tjsai_22_311 jstage_primary_article_tjsai_22_3_22_3_311_article_char_en |
PublicationCentury | 2000 |
PublicationDate | 2007-00-00 |
PublicationDateYYYYMMDD | 2007-01-01 |
PublicationDate_xml | – year: 2007 text: 2007-00-00 |
PublicationDecade | 2000 |
PublicationPlace | Tokyo |
PublicationPlace_xml | – name: Tokyo |
PublicationTitle | Transactions of the Japanese Society for Artificial Intelligence |
PublicationYear | 2007 |
Publisher | The Japanese Society for Artificial Intelligence; Japan Science and Technology Agency |
Publisher_xml | – name: The Japanese Society for Artificial Intelligence – name: Japan Science and Technology Agency |
SourceID | proquest crossref jstage |
SourceType | Aggregation Database Publisher |
StartPage | 311 |
SubjectTerms | information theoretic framework; minority detection; mixture estimation; text classification |
Title | An Information Theoretic Approach to Detection of Minority Subsets |
URI | https://www.jstage.jst.go.jp/article/tjsai/22/3/22_3_311/_article/-char/en https://www.proquest.com/docview/1476944160 |
Volume | 22 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
ispartofPNX | Transactions of the Japanese Society for Artificial Intelligence, 2007, Vol.22(3), pp.311-321 |
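
The abstract of this record describes estimating a minority distribution by maximizing its divergence from the majority under a hard assignment approximation. The sketch below is not the authors' algorithm, and the record does not give their objective. It is a simplified stand-in that assumes 1-D Gaussian models, approximates the majority by a Gaussian fitted to all data, and alternates between hard-assigning points whose log-likelihood ratio favors the minority model and refitting that model to the assigned points. It omits the mutual-information penalty mentioned in the abstract. All names (`detect_minority`, `make_toy_data`, `seed_index`) and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def log_density(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two 1-D Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)


def fit_gaussian(values, var_floor=1e-6):
    """Maximum-likelihood mean and variance, with a floor on the variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, var_floor)


def detect_minority(data, seed_index, max_iter=50):
    """Alternate hard assignment and refitting of the minority Gaussian.

    Assumption: the majority is approximated by a Gaussian fitted to all
    data, and the minority model starts at a user-chosen seed point.
    """
    maj_mean, maj_var = fit_gaussian(data)
    min_mean, min_var = data[seed_index], maj_var / 4.0  # hypothetical init
    members = [seed_index]
    for _ in range(max_iter):
        # Hard assignment: keep points the minority model explains better.
        candidates = [i for i, x in enumerate(data)
                      if log_density(x, min_mean, min_var)
                      > log_density(x, maj_mean, maj_var)]
        if len(candidates) < 2 or candidates == members:
            break
        members = candidates
        # Refit the minority model to the currently assigned points.
        min_mean, min_var = fit_gaussian([data[i] for i in members])
    return members, (min_mean, min_var), (maj_mean, maj_var)


def make_toy_data():
    """Deterministic toy data: 200 majority points, then 20 minority points."""
    majority = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
    minority = [1.9 + 0.2 * i / 19 for i in range(20)]
    return majority + minority


data = make_toy_data()
members, (min_mean, min_var), (maj_mean, maj_var) = detect_minority(data, 210)
print("subset size:", len(members))
print("estimated divergence:", gaussian_kl(min_mean, min_var, maj_mean, maj_var))
```

Because this sketch has no redundancy penalty, the detected subset also picks up majority points from the overlapping tail. That is the situation the abstract says is problematic for conventional estimation.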