Content-based subject classification at article level in biomedical context

Bibliographic Details
Main Author	Jeangirard, Eric
Format	Journal Article
Language	English
Published	30.04.2021
Subjects	Computer Science - Digital Libraries
DOI	10.48550/arxiv.2104.14800

Summary:	Subject classification is an important task in the analysis of scholarly publications. Two main kinds of approaches are used: classification at the journal level and classification at the article level. We propose a mixed approach, leveraging embedding techniques from NLP to train classifiers on article metadata (in particular title, abstract, and keywords) labelled with the journal-level FoR (Fields of Research) classification, and then apply these classifiers at the article level. We use this approach in the context of biomedical publications, using metadata from PubMed. fastText classifiers are trained with FoR codes and used to classify publications based on their available metadata. Results show that using a stratified sampling strategy for training helps reduce the bias due to the unbalanced distribution of fields. An implementation of the method is available in the repository https://github.com/dataesr/scientific_tagger
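
The stratified sampling step mentioned in the summary can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the helper names (`stratified_sample`, `to_fasttext_line`), the per-label cap, and the placeholder labels `FOR_A` and `FOR_B` are all assumptions made for illustration. The only convention borrowed from fastText is its supervised input format, where each line starts with `__label__<label>` followed by the text.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_label, seed=0):
    """Draw at most `per_label` records for each label (hypothetical helper).

    `records` is an iterable of (label, text) pairs. Capping every label at
    the same size keeps frequent fields from dominating the training set.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, text in records:
        by_label[label].append(text)

    sample = []
    for label in sorted(by_label):
        texts = by_label[label]
        k = min(per_label, len(texts))
        sample.extend((label, t) for t in rng.sample(texts, k))
    return sample


def to_fasttext_line(label, text):
    """Format one record in fastText's supervised input format."""
    clean = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return f"__label__{label} {clean}"


# Toy, made-up data: one over-represented field and one rare field.
# FOR_A and FOR_B are placeholders, not real FoR codes.
records = (
    [("FOR_A", f"Clinical study {i}") for i in range(50)]
    + [("FOR_B", f"Public health survey {i}") for i in range(5)]
)

balanced = stratified_sample(records, per_label=5)
lines = [to_fasttext_line(label, text) for label, text in balanced]
```

Lines in this format could be written to a text file and passed to the `fasttext` Python package via `fasttext.train_supervised(input=...)`. The hyperparameters and the exact sampling scheme used by the author are not stated in this record, so none are assumed here.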