Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation
Published in | IEEE Access Vol. 11; p. 1 |
---|---|
Main Authors | Makris, N., Mitrou, N. |
Format | Journal Article |
Language | English |
Published | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023 |
Subjects | |
Online Access | Get full text |
Abstract | In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is based on a combination of the LDA method for discovering latent topics in the collection, on the one hand, and the description of subjects by means of a subject term vocabulary, on the other. Books, topics and subjects are all modelled as bags-of-words, with specific distributions over the underlying word vocabulary. The Table of Contents (ToC) was used to describe the books, instead of their entire body, while subject (or standard) documents are produced by a subject term hierarchy of the respective disciplines. Frequency of terms in the documents and word-generative probabilistic models (such as the ones postulated by LDA) were integrated into a consistent statistical framework. Using Bayesian statistics and simple marginalization equations, we were able to transform the expressions of the books from distributions over unlabeled topics (derived by the LDA) to distributions over labeled subjects representing the respective disciplines (Physical sciences, Health sciences, Mathematics, etc.). More specifically, the necessary theoretical basis is first established, with each subject formally defined by the respective branch of a subject term hierarchy (much like a ToC) or the respective bag of words (single words and biwords) produced by flattening the hierarchy branch; flattening is realized by taking all the terms of the nodes and leaves of the branch, with repetitions allowed. Being confined within a closed set of subjects, we are able to invert the frequency of terms in each subject [also interpreted as the probability of generating a term (w_n) when sampling the subject (s_i), denoted by Pr{w_n | s_i}] and express each term as a weighted mixture (or probability distribution) of subjects, denoted by Pr{s_i | w_n}. This is the key idea of the proposed method. 
Then, any document (d_m) can be expressed as a weighted mixture of subjects (or the respective distribution, denoted by Pr{s_i | d_m}) by simply summing up the distributions of the individual terms contained in the document. This is made possible by virtue of some simple formulas that have been formally proven for the union of documents (Pr{s_i | (d_1 ∪ d_2)}) and for the union of subjects (Pr{(s_i ∪ s_j) | d}). Since not all vocabulary terms are found in a particular set of books, nor, conversely, are all corpus words included in the subject vocabulary, two important measures come to the foreground and are calculated with the proposed formulation: the coverage of a book or a corpus by the subject term vocabulary and, conversely, the coverage of the vocabulary by a set of books. These measures are useful for updating/enriching the subject term vocabulary whenever documents with new subjects are included in the corpus under analysis. Following the theoretical formulation, the derived results are combined with the LDA to further facilitate the multisubject analysis task: using the subject term vocabulary, LDA is applied to the corpus under study and expresses each book (b_m) as a probability distribution over hidden topics (denoted by Pr{t_k | b_m}). In the same framework, each topic (t_k) is expressed as a probability distribution over words (Pr{w_n | t_k}). Having estimated each word's probability distribution over subjects (Pr{s_i | w_n}), we can express each discovered topic as a weighted mixture of subjects [Pr{s_i | t_k} = Σ_n Pr{w_n | t_k} Pr{s_i | w_n}] and, using that, express each book in the same manner [Pr{s_i | b_m} = Σ_k Pr{t_k | b_m} Pr{s_i | t_k}]. This is a clear and formal route to the desired result. 
The proposed methodology was applied to a Springer e-book collection of more than 50,000 books, while a subject term hierarchy developed by KALLIPOS, a project creating open-access e-books, was used for the proof of concept. A number of experiments were conducted to showcase the validity and usefulness of the proposed approach. |
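The key inversion step described in the abstract — turning the term distributions Pr{w_n | s_i} of each subject into per-term subject mixtures Pr{s_i | w_n} — can be sketched with Bayes' rule. The toy counts, the two subjects, and the uniform subject prior below are illustrative assumptions, not data from the paper:

```python
import numpy as np

# Hypothetical term frequencies per subject (rows: subjects, cols: vocabulary terms),
# as would be obtained by flattening each branch of a subject term hierarchy.
counts = np.array([
    [4, 1, 0, 2],   # subject s0, e.g. "Mathematics"
    [0, 3, 2, 1],   # subject s1, e.g. "Physical sciences"
], dtype=float)

# Pr{w_n | s_i}: normalize each subject's counts over the vocabulary.
p_w_given_s = counts / counts.sum(axis=1, keepdims=True)

# Invert via Bayes' rule with a uniform subject prior Pr{s_i} = 1/S:
# Pr{s_i | w_n} = Pr{w_n | s_i} Pr{s_i} / sum_j Pr{w_n | s_j} Pr{s_j}.
prior = np.full(counts.shape[0], 1.0 / counts.shape[0])
joint = p_w_given_s * prior[:, None]       # proportional to Pr{w_n, s_i}
p_s_given_w = joint / joint.sum(axis=0)    # each column sums to 1 over subjects

print(p_s_given_w)
```

A term that appears in only one subject (here the first column) maps entirely to that subject, while shared terms split their probability mass across subjects.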
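The two marginalization formulas in the abstract, Pr{s_i | t_k} = Σ_n Pr{w_n | t_k} Pr{s_i | w_n} and Pr{s_i | b_m} = Σ_k Pr{t_k | b_m} Pr{s_i | t_k}, reduce to matrix products. The following is a minimal sketch with random stand-ins for the LDA outputs (the array names and toy dimensions are assumptions, not the paper's notation or data):

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, N, M = 3, 4, 6, 2   # subjects, topics, vocabulary terms, books (toy sizes)

def rows_to_dist(a):
    """Normalize each row of a nonnegative array into a probability distribution."""
    return a / a.sum(axis=1, keepdims=True)

# Stand-ins for the LDA outputs and the inverted subject model:
theta = rows_to_dist(rng.random((M, K)))   # Pr{t_k | b_m}, one row per book
phi = rows_to_dist(rng.random((K, N)))     # Pr{w_n | t_k}, one row per topic
p_s_given_w = rng.random((S, N))
p_s_given_w /= p_s_given_w.sum(axis=0)     # Pr{s_i | w_n}, columns sum to 1

# Pr{s_i | t_k} = sum_n Pr{w_n | t_k} Pr{s_i | w_n}: each topic becomes a
# weighted mixture of labeled subjects.
p_s_given_t = p_s_given_w @ phi.T          # shape (S, K)

# Pr{s_i | b_m} = sum_k Pr{t_k | b_m} Pr{s_i | t_k}: each book becomes a
# distribution over labeled subjects.
p_s_given_b = p_s_given_t @ theta.T        # shape (S, M)

print(p_s_given_b.sum(axis=0))             # each book's subject mixture sums to 1
```

Because every factor in the chain is a proper conditional distribution, each column of the result is automatically normalized, which is a quick sanity check on an implementation.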
---|---|
Author | Mitrou, N. Makris, N. |
Author_xml | – sequence: 1 givenname: N. orcidid: 0009-0003-4550-4472 surname: Makris fullname: Makris, N. organization: School of Electrical & Computer Engineering National Technical University of Athens, Greece – sequence: 2 givenname: N. surname: Mitrou fullname: Mitrou, N. organization: School of Electrical & Computer Engineering National Technical University of Athens, Greece |
CODEN | IAECCG |
CitedBy_id | crossref_primary_10_1002_ep_14549 |
Cites_doi | 10.1109/ACCESS.2020.3041651 10.1016/j.eswa.2022.117215 10.1007/s10994-011-5272-5 10.3115/1699510.1699543 10.1109/IEIT53149.2021.9587387 10.1007/s10994-017-5689-6 10.1016/S0306-4573(01)00045-0 10.1007/s11704-017-7031-7 10.1007/s10489-020-01798-x 10.1145/2684822.2685324 10.1109/ICPR.2016.7899673 10.1016/j.ins.2022.12.022 10.14569/IJACSA.2021.0120352 10.1109/ACCESS.2020.3029429 10.1109/ICCMC53470.2022.9754079 10.1007/s00500-021-06310-2 10.1007/s10994-011-5256-5 |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
DBID | 97E ESBDL RIA RIE AAYXX CITATION 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D DOA |
DOI | 10.1109/ACCESS.2023.3326722 |
DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE Xplore Open Access IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Engineered Materials Abstracts METADEX Technology Research Database Materials Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional DOAJ Open Access Full Text |
DatabaseTitle | CrossRef Materials Research Database Engineered Materials Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace METADEX Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Materials Research Database |
Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: RIE name: IEEE url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering |
EISSN | 2169-3536 |
EndPage | 1 |
ExternalDocumentID | oai_doaj_org_article_995dd370fd414541b3de041acb61eca2 10_1109_ACCESS_2023_3326722 10290887 |
Genre | orig-research |
GroupedDBID | 0R~ 5VS 6IK 97E AAJGR ABAZT ABVLG ACGFS ADBBV ALMA_UNASSIGNED_HOLDINGS BCNDV BEFXN BFFAM BGNUA BKEBE BPEOZ EBS ESBDL GROUPED_DOAJ IPLJI JAVBF KQ8 M43 M~E O9- OCL OK1 RIA RIE RNS 4.4 AAYXX AGSQL CITATION EJD RIG 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D |
IEDL.DBID | RIE |
ISSN | 2169-3536 |
IngestDate | Wed Aug 27 01:31:16 EDT 2025 Mon Jun 30 04:16:47 EDT 2025 Tue Jul 01 04:14:05 EDT 2025 Thu Apr 24 23:08:23 EDT 2025 Wed Aug 27 02:37:46 EDT 2025 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
License | https://creativecommons.org/licenses/by-nc-nd/4.0 |
LinkModel | DirectLink |
Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
ORCID | 0009-0003-4550-4472 0000-0003-4521-1082 |
OpenAccessLink | https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/document/10290887 |
PQID | 2885652380 |
PQPubID | 4845423 |
PageCount | 1 |
ParticipantIDs | doaj_primary_oai_doaj_org_article_995dd370fd414541b3de041acb61eca2 crossref_primary_10_1109_ACCESS_2023_3326722 crossref_citationtrail_10_1109_ACCESS_2023_3326722 proquest_journals_2885652380 ieee_primary_10290887 |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2023-01-01 |
PublicationDateYYYYMMDD | 2023-01-01 |
PublicationDate_xml | – month: 01 year: 2023 text: 2023-01-01 day: 01 |
PublicationDecade | 2020 |
PublicationPlace | Piscataway |
PublicationPlace_xml | – name: Piscataway |
PublicationTitle | IEEE access |
PublicationTitleAbbrev | Access |
PublicationYear | 2023 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
References | ref13 ref12 ref15 (ref23) 2023 ref14 ref11 ref10 mitrou (ref26) 2022 ref16 (ref29) 2015 ref19 řehůřek (ref27) 2023 kingma (ref18) 2013; abs 1312 6114 bekkerman (ref21) 2003 (ref24) 2023 ref20 ref22 makris (ref31) 2023 řehůřek (ref25) 2023 nam (ref9) 2017 ref28 blei (ref17) 2007 ref8 (ref30) 2023 blei (ref2) 2003; 3 ref7 ref4 ref3 (ref1) 2023 ref6 ref5 |
References_xml | – ident: ref11 doi: 10.1109/ACCESS.2020.3041651 – ident: ref4 doi: 10.1016/j.eswa.2022.117215 – ident: ref3 doi: 10.1007/s10994-011-5272-5 – year: 2023 ident: ref30 publication-title: SciPy – ident: ref15 doi: 10.3115/1699510.1699543 – ident: ref6 doi: 10.1109/IEIT53149.2021.9587387 – ident: ref19 doi: 10.1007/s10994-017-5689-6 – ident: ref20 doi: 10.1016/S0306-4573(01)00045-0 – year: 2003 ident: ref21 publication-title: Using Bigrams in Text Categorization – year: 2023 ident: ref31 article-title: Scientific code (GitHub) – ident: ref5 doi: 10.1007/s11704-017-7031-7 – ident: ref14 doi: 10.1007/s10489-020-01798-x – ident: ref28 doi: 10.1145/2684822.2685324 – start-page: 1 year: 2022 ident: ref26 article-title: KALLIPOS: The project that is shaping the OER landscape in Greece publication-title: Proc Innovating Higher Educ Conf (I-HE) – volume: abs 1312 6114 start-page: 1 year: 2013 ident: ref18 article-title: Auto-encoding variational Bayes publication-title: CoRR – year: 2023 ident: ref27 publication-title: Gensim Topic Modelling for Humans – ident: ref12 doi: 10.1109/ICPR.2016.7899673 – year: 2015 ident: ref29 publication-title: KALLIPOS Subject Terms Catalogue – start-page: 1 year: 2017 ident: ref9 article-title: Maximizing subset accuracy with recurrent neural networks in multi-label classification publication-title: Proc 31st Int Conf Neural Inf Process Syst – year: 2023 ident: ref23 publication-title: Snowball Stemmer – volume: 3 start-page: 993 year: 2003 ident: ref2 article-title: Latent Dirichlet allocation publication-title: J Mach Learn Res – ident: ref13 doi: 10.1016/j.ins.2022.12.022 – ident: ref8 doi: 10.14569/IJACSA.2021.0120352 – ident: ref10 doi: 10.1109/ACCESS.2020.3029429 – year: 2023 ident: ref1 publication-title: LCSH – ident: ref22 doi: 10.1109/ICCMC53470.2022.9754079 – ident: ref16 doi: 10.1007/s00500-021-06310-2 – year: 2023 ident: ref24 publication-title: Lancaster Stemmer – year: 2023 ident: ref25 publication-title: Gensim 
Topic Modelling for Humans – ident: ref7 doi: 10.1007/s10994-011-5256-5 – start-page: 1 year: 2007 ident: ref17 article-title: Supervised topic models publication-title: Proc NIPS |
SSID | ssj0000816957 |
Snippet | In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is... |
SourceID | doaj proquest crossref ieee |
SourceType | Open Website Aggregation Database Enrichment Source Index Database Publisher |
StartPage | 1 |
SubjectTerms | Artificial Intelligence Classification Classification Algorithms Digital Libraries Dirichlet problem Documents E-books Ensemble learning Flattening Formulas (mathematics) Latent Dirichlet Allocation Mixtures Multi-subject Classification Physical sciences Probabilistic logic Probabilistic models Probability distribution Recurrent neural networks Resource management Statistical analysis Statistical Natural Language Processing Subject Headings Task analysis University Coursebooks Vocabulary |
Title | Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation |
URI | https://ieeexplore.ieee.org/document/10290887 https://www.proquest.com/docview/2885652380 https://doaj.org/article/995dd370fd414541b3de041acb61eca2 |
Volume | 11 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Multisubject+Analysis+and+Classification+of+Books+and+Book+Collections%2C+Based+on+a+Subject+Term+Vocabulary+and+the+Latent+Dirichlet+Allocation&rft.jtitle=IEEE+access&rft.au=Makris%2C+Nikolaos&rft.au=Mitrou%2C+Nikolaos&rft.date=2023-01-01&rft.issn=2169-3536&rft.eissn=2169-3536&rft.volume=11&rft.spage=120881&rft.epage=120898&rft_id=info:doi/10.1109%2FACCESS.2023.3326722&rft.externalDBID=n%2Fa&rft.externalDocID=10_1109_ACCESS_2023_3326722 |