Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation

Bibliographic Details
Published in: IEEE Access, Vol. 11, p. 1
Main Authors: Makris, N.; Mitrou, N.
Format: Journal Article
Language: English
Published: Piscataway: IEEE, 01.01.2023
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Abstract In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is based on a combination of the LDA method for discovering latent topics in the collection, on the one hand, and the description of subjects by means of a subject term vocabulary, on the other. Books, topics and subjects are all modelled as bags-of-words, with specific distributions over the underlying word vocabulary. The Table of Contents (ToC) was used to describe the books, instead of their entire body, while subject (or standard) documents are produced from a subject term hierarchy of the respective disciplines. Term frequencies in the documents and word-generative probabilistic models (such as the ones postulated by LDA) were integrated into a consistent statistical framework. Using Bayesian statistics and simple marginalization equations, we were able to transform the expressions of the books from distributions over unlabeled topics (derived by the LDA) to distributions over labeled subjects representing the respective disciplines (Physical Sciences, Health Sciences, Mathematics, etc.). More specifically, the necessary theoretical basis is first established, with each subject formally defined by the respective branch of a subject term hierarchy (much like a ToC), or by the respective bag of words (single words and biwords) produced by flattening the hierarchy branch; flattening is realized by taking all the terms of the nodes and leaves of the branch, with repetitions allowed. Being confined within a closed set of subjects, we are able to invert the term frequencies in each subject [also interpreted as the probability of generating a term (w_n) when sampling the subject (s_i), denoted by Pr{w_n|s_i}] and express each term as a weighted mixture (or probability distribution) of subjects, denoted by Pr{s_i|w_n}. This is the key idea of the proposed method.
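The Bayes inversion described above can be sketched numerically. This is a minimal illustration with made-up values, assuming a uniform prior over the closed set of subjects (the record does not state which prior the paper uses):

```python
import numpy as np

# Hypothetical toy setup: 3 subjects, 4 vocabulary terms.
# Row i holds Pr{w_n | s_i}: the probability of generating term w_n
# when sampling subject s_i (row-normalized term frequencies).
P_w_given_s = np.array([
    [0.50, 0.30, 0.10, 0.10],   # subject s_0
    [0.10, 0.10, 0.40, 0.40],   # subject s_1
    [0.25, 0.25, 0.25, 0.25],   # subject s_2
])

# Assumption: uniform prior Pr{s_i} over the closed subject set.
P_s = np.full(3, 1.0 / 3.0)

# Bayes: Pr{s_i | w_n} = Pr{w_n | s_i} Pr{s_i} / Pr{w_n},
# with the marginal Pr{w_n} = sum_i Pr{w_n | s_i} Pr{s_i}.
P_w = P_w_given_s.T @ P_s                              # shape (4,)
P_s_given_w = (P_w_given_s * P_s[:, None]).T / P_w[:, None]

# Each term is now a probability distribution over subjects.
assert np.allclose(P_s_given_w.sum(axis=1), 1.0)
```

Row `n` of `P_s_given_w` is the weighted mixture of subjects for term `w_n`; a non-uniform subject prior would simply reweight the rows of the joint before normalizing.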
Then, any document (d_m) can be expressed as a weighted mixture of subjects (or the respective distribution, denoted by Pr{s_i|d_m}) by simply summing up the distributions of the individual terms contained in the document. This is made possible by virtue of some simple formulas that have been formally proven for the union of documents (Pr{s_i|(d_1 ∪ d_2)}) and for the union of subjects (Pr{(s_i ∪ s_j)|d}). Since not all vocabulary terms are found in a particular set of books, nor, conversely, are all corpus words included in the subject vocabulary, two important measures come to the foreground and are calculated with the proposed formulation: the coverage of a book or a corpus by the subject term vocabulary and, conversely, the coverage of the vocabulary by a set of books. These measures are useful for updating/enriching the subject term vocabulary whenever documents with new subjects are included in the corpus under analysis. Following the theoretical formulation, the derived results are combined with the LDA in order to further facilitate our multisubject analysis task: using the subject term vocabulary, LDA is applied on the corpus under study and results in expressing each book (b_m) as a probability distribution over hidden topics (denoted by Pr{t_k|b_m}). In the same framework, each topic (t_k) is expressed as a probability distribution over words (Pr{w_n|t_k}). Having estimated each word's probability distribution over subjects (Pr{s_i|w_n}), we can express each discovered topic as a weighted mixture of subjects [Pr{s_i|t_k} = Σ_n Pr{w_n|t_k} Pr{s_i|w_n}] and, using that, we express each book in the same manner [Pr{s_i|b_m} = Σ_k Pr{t_k|b_m} Pr{s_i|t_k}]. This is a clear and formal way of obtaining the desired result.
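The two marginalization sums quoted above reduce to matrix products. A minimal sketch with illustrative, made-up matrices (each row of every matrix is a probability distribution; the real Pr{w_n|t_k} and Pr{t_k|b_m} would come from a fitted LDA model):

```python
import numpy as np

# Hypothetical toy dimensions: 4 words, 2 topics, 3 subjects, 2 books.
# Pr{s_i | w_n}: words x subjects (output of the Bayes inversion step).
P_s_given_w = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.4, 0.4, 0.2],
])
# Pr{w_n | t_k}: topics x words (as estimated by LDA).
P_w_given_t = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
# Pr{t_k | b_m}: books x topics (as estimated by LDA).
P_t_given_b = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
])

# Pr{s_i | t_k} = sum_n Pr{w_n | t_k} Pr{s_i | w_n}
P_s_given_t = P_w_given_t @ P_s_given_w    # topics x subjects

# Pr{s_i | b_m} = sum_k Pr{t_k | b_m} Pr{s_i | t_k}
P_s_given_b = P_t_given_b @ P_s_given_t    # books x subjects

# Each book ends up as a distribution over labeled subjects.
assert np.allclose(P_s_given_b.sum(axis=1), 1.0)
```

Because every factor's rows sum to one, the chained products stay properly normalized, which is exactly what lets the unlabeled-topic representation be relabeled as a subject distribution.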
The proposed methodology was applied to a Springer e-book collection of more than 50,000 books, while a subject term hierarchy developed by KALLIPOS, a project creating open-access e-books, was used for the proof of concept. A number of experiments were conducted to showcase the validity and usefulness of the proposed approach.
Author Mitrou, N.
Makris, N.
Author_xml – sequence: 1
  givenname: N.
  orcidid: 0009-0003-4550-4472
  surname: Makris
  fullname: Makris, N.
  organization: School of Electrical & Computer Engineering, National Technical University of Athens, Greece
– sequence: 2
  givenname: N.
  surname: Mitrou
  fullname: Mitrou, N.
  organization: School of Electrical & Computer Engineering, National Technical University of Athens, Greece
CODEN IAECCG
CitedBy_id crossref_primary_10_1002_ep_14549
Cites_doi 10.1109/ACCESS.2020.3041651
10.1016/j.eswa.2022.117215
10.1007/s10994-011-5272-5
10.3115/1699510.1699543
10.1109/IEIT53149.2021.9587387
10.1007/s10994-017-5689-6
10.1016/S0306-4573(01)00045-0
10.1007/s11704-017-7031-7
10.1007/s10489-020-01798-x
10.1145/2684822.2685324
10.1109/ICPR.2016.7899673
10.1016/j.ins.2022.12.022
10.14569/IJACSA.2021.0120352
10.1109/ACCESS.2020.3029429
10.1109/ICCMC53470.2022.9754079
10.1007/s00500-021-06310-2
10.1007/s10994-011-5256-5
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023
DBID 97E
ESBDL
RIA
RIE
AAYXX
CITATION
7SC
7SP
7SR
8BQ
8FD
JG9
JQ2
L7M
L~C
L~D
DOA
DOI 10.1109/ACCESS.2023.3326722
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE Xplore Open Access
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Engineered Materials Abstracts
METADEX
Technology Research Database
Materials Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DOAJ Open Access Full Text
DatabaseTitle CrossRef
Materials Research Database
Engineered Materials Abstracts
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
METADEX
Computer and Information Systems Abstracts Professional
DatabaseTitleList
Materials Research Database

Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: RIE
  name: IEEE
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 2169-3536
EndPage 1
ExternalDocumentID oai_doaj_org_article_995dd370fd414541b3de041acb61eca2
10_1109_ACCESS_2023_3326722
10290887
Genre orig-research
GroupedDBID 0R~
5VS
6IK
97E
AAJGR
ABAZT
ABVLG
ACGFS
ADBBV
ALMA_UNASSIGNED_HOLDINGS
BCNDV
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
EBS
ESBDL
GROUPED_DOAJ
IPLJI
JAVBF
KQ8
M43
M~E
O9-
OCL
OK1
RIA
RIE
RNS
4.4
AAYXX
AGSQL
CITATION
EJD
RIG
7SC
7SP
7SR
8BQ
8FD
JG9
JQ2
L7M
L~C
L~D
ID FETCH-LOGICAL-c409t-dc14515c274a8f053dc55da8aefd8a43ff471775521809016d55e48918661f043
IEDL.DBID RIE
ISSN 2169-3536
IngestDate Wed Aug 27 01:31:16 EDT 2025
Mon Jun 30 04:16:47 EDT 2025
Tue Jul 01 04:14:05 EDT 2025
Thu Apr 24 23:08:23 EDT 2025
Wed Aug 27 02:37:46 EDT 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Language English
License https://creativecommons.org/licenses/by-nc-nd/4.0
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c409t-dc14515c274a8f053dc55da8aefd8a43ff471775521809016d55e48918661f043
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0009-0003-4550-4472
0000-0003-4521-1082
OpenAccessLink https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/document/10290887
PQID 2885652380
PQPubID 4845423
PageCount 1
ParticipantIDs doaj_primary_oai_doaj_org_article_995dd370fd414541b3de041acb61eca2
crossref_primary_10_1109_ACCESS_2023_3326722
crossref_citationtrail_10_1109_ACCESS_2023_3326722
proquest_journals_2885652380
ieee_primary_10290887
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2023-01-01
PublicationDateYYYYMMDD 2023-01-01
PublicationDate_xml – month: 01
  year: 2023
  text: 2023-01-01
  day: 01
PublicationDecade 2020
PublicationPlace Piscataway
PublicationPlace_xml – name: Piscataway
PublicationTitle IEEE access
PublicationTitleAbbrev Access
PublicationYear 2023
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref13
ref12
ref15
(ref23) 2023
ref14
ref11
ref10
mitrou (ref26) 2022
ref16
(ref29) 2015
ref19
řehůřek (ref27) 2023
kingma (ref18) 2013; abs 1312 6114
bekkerman (ref21) 2003
(ref24) 2023
ref20
ref22
makris (ref31) 2023
řehůřek (ref25) 2023
nam (ref9) 2017
ref28
blei (ref17) 2007
ref8
(ref30) 2023
blei (ref2) 2003; 3
ref7
ref4
ref3
(ref1) 2023
ref6
ref5
References_xml – ident: ref11
  doi: 10.1109/ACCESS.2020.3041651
– ident: ref4
  doi: 10.1016/j.eswa.2022.117215
– ident: ref3
  doi: 10.1007/s10994-011-5272-5
– year: 2023
  ident: ref30
  publication-title: SciPy
– ident: ref15
  doi: 10.3115/1699510.1699543
– ident: ref6
  doi: 10.1109/IEIT53149.2021.9587387
– ident: ref19
  doi: 10.1007/s10994-017-5689-6
– ident: ref20
  doi: 10.1016/S0306-4573(01)00045-0
– year: 2003
  ident: ref21
  publication-title: Using Bigrams in Text Categorization
– year: 2023
  ident: ref31
  article-title: Scientific code (GitHub)
– ident: ref5
  doi: 10.1007/s11704-017-7031-7
– ident: ref14
  doi: 10.1007/s10489-020-01798-x
– ident: ref28
  doi: 10.1145/2684822.2685324
– start-page: 1
  year: 2022
  ident: ref26
  article-title: KALLIPOS: The project that is shaping the OER landscape in Greece
  publication-title: Proc Innovating Higher Educ Conf (I-HE)
– volume: abs 1312 6114
  start-page: 1
  year: 2013
  ident: ref18
  article-title: Auto-encoding variational Bayes
  publication-title: CoRR
– year: 2023
  ident: ref27
  publication-title: Gensim Topic Modelling for Humans
– ident: ref12
  doi: 10.1109/ICPR.2016.7899673
– year: 2015
  ident: ref29
  publication-title: KALLIPOS Subject Terms Catalogue
– start-page: 1
  year: 2017
  ident: ref9
  article-title: Maximizing subset accuracy with recurrent neural networks in multi-label classification
  publication-title: Proc 31st Int Conf Neural Inf Process Syst
– year: 2023
  ident: ref23
  publication-title: Snowball Stemmer
– volume: 3
  start-page: 993
  year: 2003
  ident: ref2
  article-title: Latent Dirichlet allocation
  publication-title: J Mach Learn Res
– ident: ref13
  doi: 10.1016/j.ins.2022.12.022
– ident: ref8
  doi: 10.14569/IJACSA.2021.0120352
– ident: ref10
  doi: 10.1109/ACCESS.2020.3029429
– year: 2023
  ident: ref1
  publication-title: LCSH
– ident: ref22
  doi: 10.1109/ICCMC53470.2022.9754079
– ident: ref16
  doi: 10.1007/s00500-021-06310-2
– year: 2023
  ident: ref24
  publication-title: Lancaster Stemmer
– year: 2023
  ident: ref25
  publication-title: Gensim Topic Modelling for Humans
– ident: ref7
  doi: 10.1007/s10994-011-5256-5
– start-page: 1
  year: 2007
  ident: ref17
  article-title: Supervised topic models
  publication-title: Proc NIPS
SSID ssj0000816957
Score 2.2721984
Snippet In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is...
SourceID doaj
proquest
crossref
ieee
SourceType Open Website
Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 1
SubjectTerms Artificial Intelligence
Classification
Classification Algorithms
Digital Libraries
Dirichlet problem
Documents
E-books
Ensemble learning
Flattening
Formulas (mathematics)
Latent Dirichlet Allocation
Mixtures
Multi-subject Classification
Physical sciences
Probabilistic logic
Probabilistic models
Probability distribution
Recurrent neural networks
Resource management
Statistical analysis
Statistical Natural Language Processing
Subject Headings
Task analysis
University Coursebooks
Vocabulary
Title Multisubject analysis and classification of books and book collections, based on a subject term vocabulary and the Latent Dirichlet Allocation
URI https://ieeexplore.ieee.org/document/10290887
https://www.proquest.com/docview/2885652380
https://doaj.org/article/995dd370fd414541b3de041acb61eca2
Volume 11
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Multisubject+Analysis+and+Classification+of+Books+and+Book+Collections%2C+Based+on+a+Subject+Term+Vocabulary+and+the+Latent+Dirichlet+Allocation&rft.jtitle=IEEE+access&rft.au=Makris%2C+Nikolaos&rft.au=Mitrou%2C+Nikolaos&rft.date=2023-01-01&rft.issn=2169-3536&rft.eissn=2169-3536&rft.volume=11&rft.spage=120881&rft.epage=120898&rft_id=info:doi/10.1109%2FACCESS.2023.3326722&rft.externalDBID=n%2Fa&rft.externalDocID=10_1109_ACCESS_2023_3326722