Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation
Published in | IEEE Access Vol. 11; p. 1 |
---|---|
Main Authors | Makris, N., Mitrou, N. |
Format | Journal Article |
Language | English |
Published | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023 |
Subjects | |
Online Access | Get full text |
Abstract | In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is based on a combination of the LDA method for discovering latent topics in the collection, on the one hand, and the description of subjects by means of a subject term vocabulary, on the other. Books, topics and subjects are all modelled as bags-of-words, with specific distributions over the underlying word vocabulary. The Table of Contents (ToC) was used to describe the books, instead of their entire body, while subject (or standard) documents are produced by a subject term hierarchy of the respective disciplines. Frequency of terms in the documents and word-generative probabilistic models (such as the ones postulated by LDA) were integrated into a consistent statistical framework. Using Bayesian statistics and simple marginalization equations, we were able to transform the expressions of the books from distributions over unlabeled topics (derived by the LDA) to distributions over labeled subjects representing the respective disciplines (Physical sciences, Health sciences, Mathematics, etc.). More specifically, the necessary theoretical basis is first established, with each subject formally defined by the respective branch of a subject term hierarchy (much like a ToC) or the respective bag of words (single words and biwords) produced by flattening the hierarchy branch; flattening is realized by taking all the terms of the nodes and leaves of the branch, with repetitions allowed. Being confined within a closed set of subjects, we are able to invert the frequency of terms in each subject [also interpreted as the probability of generating a term (w_n) when sampling the subject (s_i), denoted by Pr{w_n | s_i}] and express each term as a weighted mixture (or probability distribution) of subjects, denoted by Pr{s_i | w_n}. This is the key idea of the proposed method. 
Then, any document (d_m) can be expressed as a weighted mixture of subjects (or the respective distribution, denoted by Pr{s_i | d_m}) by simply summing up the distributions of the individual terms contained in the document. This is made possible by virtue of some simple formulas that have been formally proven for the union of documents (Pr{s_i | (d_1 ∪ d_2)}) and for the union of subjects (Pr{(s_i ∪ s_j) | d}). Since not all vocabulary terms are found in a particular set of books, nor, conversely, are all corpus words included in the subject vocabulary, two important measures come to the foreground and are calculated with the proposed formulation: the coverage of a book or a corpus by the subject term vocabulary and, conversely, the coverage of the vocabulary by a set of books. These measures are useful for updating/enriching the subject term vocabulary whenever documents with new subjects are included in the corpus under analysis. Following the theoretical formulation, the derived results are combined with the LDA to further facilitate the multisubject analysis task: using the subject term vocabulary, LDA is applied to the corpus under study and expresses each book (b_m) as a probability distribution over hidden topics (denoted by Pr{t_k | b_m}). In the same framework, each topic (t_k) is expressed as a probability distribution over words (Pr{w_n | t_k}). Having estimated each word's probability distribution over subjects (Pr{s_i | w_n}), we can express each discovered topic as a weighted mixture of subjects [Pr{s_i | t_k} = Σ_n Pr{w_n | t_k} Pr{s_i | w_n}] and, using that, express each book in the same manner [Pr{s_i | b_m} = Σ_k Pr{t_k | b_m} Pr{s_i | t_k}]. This is a clear and formal route to the desired result. 
The proposed methodology was applied to a Springer e-book collection of more than 50,000 books, while a subject term hierarchy developed by KALLIPOS, a project creating open-access e-books, was used for the proof of concept. A number of experiments were conducted to showcase the validity and usefulness of the proposed approach. |
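The key inversion step described in the abstract — turning the term distributions Pr{w_n | s_i} of each subject into per-term subject mixtures Pr{s_i | w_n} — can be sketched with Bayes' rule. The toy counts, the two subjects, and the uniform subject prior below are illustrative assumptions, not data from the paper:

```python
import numpy as np

# Hypothetical term frequencies per subject (rows: subjects, cols: vocabulary terms),
# as would be obtained by flattening each branch of a subject term hierarchy.
counts = np.array([
    [4, 1, 0, 2],   # subject s0, e.g. "Mathematics"
    [0, 3, 2, 1],   # subject s1, e.g. "Physical sciences"
], dtype=float)

# Pr{w_n | s_i}: normalize each subject's counts over the vocabulary.
p_w_given_s = counts / counts.sum(axis=1, keepdims=True)

# Invert via Bayes' rule with a uniform subject prior Pr{s_i} = 1/S:
# Pr{s_i | w_n} = Pr{w_n | s_i} Pr{s_i} / sum_j Pr{w_n | s_j} Pr{s_j}.
prior = np.full(counts.shape[0], 1.0 / counts.shape[0])
joint = p_w_given_s * prior[:, None]       # proportional to Pr{w_n, s_i}
p_s_given_w = joint / joint.sum(axis=0)    # each column sums to 1 over subjects

print(p_s_given_w)
```

A term that appears in only one subject (here the first column) maps entirely to that subject, while shared terms split their probability mass across subjects.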
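The two marginalization formulas in the abstract, Pr{s_i | t_k} = Σ_n Pr{w_n | t_k} Pr{s_i | w_n} and Pr{s_i | b_m} = Σ_k Pr{t_k | b_m} Pr{s_i | t_k}, reduce to matrix products. The following is a minimal sketch with random stand-ins for the LDA outputs (the array names and toy dimensions are assumptions, not the paper's notation or data):

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, N, M = 3, 4, 6, 2   # subjects, topics, vocabulary terms, books (toy sizes)

def rows_to_dist(a):
    """Normalize each row of a nonnegative array into a probability distribution."""
    return a / a.sum(axis=1, keepdims=True)

# Stand-ins for the LDA outputs and the inverted subject model:
theta = rows_to_dist(rng.random((M, K)))   # Pr{t_k | b_m}, one row per book
phi = rows_to_dist(rng.random((K, N)))     # Pr{w_n | t_k}, one row per topic
p_s_given_w = rng.random((S, N))
p_s_given_w /= p_s_given_w.sum(axis=0)     # Pr{s_i | w_n}, columns sum to 1

# Pr{s_i | t_k} = sum_n Pr{w_n | t_k} Pr{s_i | w_n}: each topic becomes a
# weighted mixture of labeled subjects.
p_s_given_t = p_s_given_w @ phi.T          # shape (S, K)

# Pr{s_i | b_m} = sum_k Pr{t_k | b_m} Pr{s_i | t_k}: each book becomes a
# distribution over labeled subjects.
p_s_given_b = p_s_given_t @ theta.T        # shape (S, M)

print(p_s_given_b.sum(axis=0))             # each book's subject mixture sums to 1
```

Because every factor in the chain is a proper conditional distribution, each column of the result is automatically normalized, which is a quick sanity check on an implementation.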
---|---|
Author | Mitrou, N. Makris, N. |
Author_xml | – sequence: 1 givenname: N. orcidid: 0009-0003-4550-4472 surname: Makris fullname: Makris, N. organization: School of Electrical & Computer Engineering National Technical University of Athens, Greece – sequence: 2 givenname: N. surname: Mitrou fullname: Mitrou, N. organization: School of Electrical & Computer Engineering National Technical University of Athens, Greece |
CODEN | IAECCG |
CitedBy_id | crossref_primary_10_1002_ep_14549 |
Cites_doi | 10.1109/ACCESS.2020.3041651 10.1016/j.eswa.2022.117215 10.1007/s10994-011-5272-5 10.3115/1699510.1699543 10.1109/IEIT53149.2021.9587387 10.1007/s10994-017-5689-6 10.1016/S0306-4573(01)00045-0 10.1007/s11704-017-7031-7 10.1007/s10489-020-01798-x 10.1145/2684822.2685324 10.1109/ICPR.2016.7899673 10.1016/j.ins.2022.12.022 10.14569/IJACSA.2021.0120352 10.1109/ACCESS.2020.3029429 10.1109/ICCMC53470.2022.9754079 10.1007/s00500-021-06310-2 10.1007/s10994-011-5256-5 |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
DBID | 97E ESBDL RIA RIE AAYXX CITATION 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D DOA |
DOI | 10.1109/ACCESS.2023.3326722 |
DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE Xplore Open Access IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Engineered Materials Abstracts METADEX Technology Research Database Materials Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional DOAJ Open Access Full Text |
DatabaseTitle | CrossRef Materials Research Database Engineered Materials Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace METADEX Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Materials Research Database |
Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: RIE name: IEEE url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering |
EISSN | 2169-3536 |
EndPage | 1 |
ExternalDocumentID | oai_doaj_org_article_995dd370fd414541b3de041acb61eca2 10_1109_ACCESS_2023_3326722 10290887 |
Genre | orig-research |
GroupedDBID | 0R~ 5VS 6IK 97E AAJGR ABAZT ABVLG ACGFS ADBBV ALMA_UNASSIGNED_HOLDINGS BCNDV BEFXN BFFAM BGNUA BKEBE BPEOZ EBS ESBDL GROUPED_DOAJ IPLJI JAVBF KQ8 M43 M~E O9- OCL OK1 RIA RIE RNS 4.4 AAYXX AGSQL CITATION EJD RIG 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D |
IEDL.DBID | RIE |
ISSN | 2169-3536 |
IngestDate | Wed Aug 27 01:31:16 EDT 2025 Mon Jun 30 04:16:47 EDT 2025 Tue Jul 01 04:14:05 EDT 2025 Thu Apr 24 23:08:23 EDT 2025 Wed Aug 27 02:37:46 EDT 2025 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
License | https://creativecommons.org/licenses/by-nc-nd/4.0 |
LinkModel | DirectLink |
Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
ORCID | 0009-0003-4550-4472 0000-0003-4521-1082 |
OpenAccessLink | https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/document/10290887 |
PQID | 2885652380 |
PQPubID | 4845423 |
PageCount | 1 |
ParticipantIDs | doaj_primary_oai_doaj_org_article_995dd370fd414541b3de041acb61eca2 crossref_primary_10_1109_ACCESS_2023_3326722 crossref_citationtrail_10_1109_ACCESS_2023_3326722 proquest_journals_2885652380 ieee_primary_10290887 |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2023-01-01 |
PublicationDateYYYYMMDD | 2023-01-01 |
PublicationDate_xml | – month: 01 year: 2023 text: 2023-01-01 day: 01 |
PublicationDecade | 2020 |
PublicationPlace | Piscataway |
PublicationPlace_xml | – name: Piscataway |
PublicationTitle | IEEE access |
PublicationTitleAbbrev | Access |
PublicationYear | 2023 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
References | ref13 ref12 ref15 (ref23) 2023 ref14 ref11 ref10 mitrou (ref26) 2022 ref16 (ref29) 2015 ref19 řehůřek (ref27) 2023 kingma (ref18) 2013; abs 1312 6114 bekkerman (ref21) 2003 (ref24) 2023 ref20 ref22 makris (ref31) 2023 řehůřek (ref25) 2023 nam (ref9) 2017 ref28 blei (ref17) 2007 ref8 (ref30) 2023 blei (ref2) 2003; 3 ref7 ref4 ref3 (ref1) 2023 ref6 ref5 |
References_xml | – ident: ref11 doi: 10.1109/ACCESS.2020.3041651 – ident: ref4 doi: 10.1016/j.eswa.2022.117215 – ident: ref3 doi: 10.1007/s10994-011-5272-5 – year: 2023 ident: ref30 publication-title: SciPy – ident: ref15 doi: 10.3115/1699510.1699543 – ident: ref6 doi: 10.1109/IEIT53149.2021.9587387 – ident: ref19 doi: 10.1007/s10994-017-5689-6 – ident: ref20 doi: 10.1016/S0306-4573(01)00045-0 – year: 2003 ident: ref21 publication-title: Using Bigrams in Text Categorization – year: 2023 ident: ref31 article-title: Scientific code (GitHub) – ident: ref5 doi: 10.1007/s11704-017-7031-7 – ident: ref14 doi: 10.1007/s10489-020-01798-x – ident: ref28 doi: 10.1145/2684822.2685324 – start-page: 1 year: 2022 ident: ref26 article-title: KALLIPOS: The project that is shaping the OER landscape in Greece publication-title: Proc Innovating Higher Educ Conf (I-HE) – volume: abs 1312 6114 start-page: 1 year: 2013 ident: ref18 article-title: Auto-encoding variational Bayes publication-title: CoRR – year: 2023 ident: ref27 publication-title: Gensim Topic Modelling for Humans – ident: ref12 doi: 10.1109/ICPR.2016.7899673 – year: 2015 ident: ref29 publication-title: KALLIPOS Subject Terms Catalogue – start-page: 1 year: 2017 ident: ref9 article-title: Maximizing subset accuracy with recurrent neural networks in multi-label classification publication-title: Proc 31st Int Conf Neural Inf Process Syst – year: 2023 ident: ref23 publication-title: Snowball Stemmer – volume: 3 start-page: 993 year: 2003 ident: ref2 article-title: Latent Dirichlet allocation publication-title: J Mach Learn Res – ident: ref13 doi: 10.1016/j.ins.2022.12.022 – ident: ref8 doi: 10.14569/IJACSA.2021.0120352 – ident: ref10 doi: 10.1109/ACCESS.2020.3029429 – year: 2023 ident: ref1 publication-title: LCSH – ident: ref22 doi: 10.1109/ICCMC53470.2022.9754079 – ident: ref16 doi: 10.1007/s00500-021-06310-2 – year: 2023 ident: ref24 publication-title: Lancaster Stemmer – year: 2023 ident: ref25 publication-title: Gensim 
Topic Modelling for Humans – ident: ref7 doi: 10.1007/s10994-011-5256-5 – start-page: 1 year: 2007 ident: ref17 article-title: Supervised topic models publication-title: Proc NIPS |
SSID | ssj0000816957 |
Snippet | In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is... |
SourceID | doaj proquest crossref ieee |
SourceType | Open Website Aggregation Database Enrichment Source Index Database Publisher |
StartPage | 1 |
SubjectTerms | Artificial Intelligence Classification Classification Algorithms Digital Libraries Dirichlet problem Documents E-books Ensemble learning Flattening Formulas (mathematics) Latent Dirichlet Allocation Mixtures Multi-subject Classification Physical sciences Probabilistic logic Probabilistic models Probability distribution Recurrent neural networks Resource management Statistical analysis Statistical Natural Language Processing Subject Headings Task analysis University Coursebooks Vocabulary |
Title | Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation |
URI | https://ieeexplore.ieee.org/document/10290887 https://www.proquest.com/docview/2885652380 https://doaj.org/article/995dd370fd414541b3de041acb61eca2 |
Volume | 11 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Multisubject+Analysis+and+Classification+of+Books+and+Book+Collections%2C+Based+on+a+Subject+Term+Vocabulary+and+the+Latent+Dirichlet+Allocation&rft.jtitle=IEEE+access&rft.au=Makris%2C+Nikolaos&rft.au=Mitrou%2C+Nikolaos&rft.date=2023-01-01&rft.issn=2169-3536&rft.eissn=2169-3536&rft.volume=11&rft.spage=120881&rft.epage=120898&rft_id=info:doi/10.1109%2FACCESS.2023.3326722&rft.externalDBID=n%2Fa&rft.externalDocID=10_1109_ACCESS_2023_3326722 |