Content-based subject classification at article level in biomedical context
Main Author | |
---|---|
Format | Journal Article |
Language | English |
Published | 30.04.2021 |
Subjects | |
DOI | 10.48550/arxiv.2104.14800 |
Summary: | Subject classification is an important task in the analysis of scholarly
publications. Two kinds of approaches are mainly used:
classification at the journal level and classification at the article level. We
propose a mixed approach that leverages NLP embedding techniques to train
classifiers on article metadata (title, abstract and keywords in particular)
labelled with the journal-level FoR (Fields of Research) classification, and
then applies these classifiers at the article level. We use this approach in the
context of biomedical publications, using metadata from PubMed. fastText
classifiers are trained with FoR codes and used to classify publications based
on their available metadata. Results show that using a stratified sampling
strategy for training helps reduce the bias due to the unbalanced distribution
of fields. An implementation of the method is provided in the repository
https://github.com/dataesr/scientific_tagger |
---|---|
DOI: | 10.48550/arxiv.2104.14800 |
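
The summary mentions stratified sampling of the training data to counter an unbalanced distribution of fields. The record does not describe the exact strategy, so the following is only a minimal sketch under an assumption: each label is capped at a fixed number of training examples. The function names, the `per_label` parameter and the labels `FIELD_A` and `FIELD_B` are hypothetical placeholders, not taken from the paper or its repository. The output lines follow fastText's supervised text format, where each line starts with a `__label__` prefix.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_label, seed=0):
    """Keep at most `per_label` (text, label) pairs for each label.

    Assumption: capping every label at the same size is one simple way to
    balance fields; the paper's actual strategy may differ.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append(text)

    sample = []
    for label in sorted(by_label):
        texts = by_label[label]
        k = min(per_label, len(texts))
        sample.extend((text, label) for text in rng.sample(texts, k))
    rng.shuffle(sample)  # avoid grouping the training file by label
    return sample


def to_fasttext_line(text, label):
    """Format one example as '__label__<label> <text>' on a single line."""
    cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {cleaned}"


# Toy, made-up data: one over-represented field and one rare field.
records = (
    [(f"clinical trial report {i}", "FIELD_A") for i in range(50)]
    + [(f"gene expression study {i}", "FIELD_B") for i in range(5)]
)

balanced = stratified_sample(records, per_label=5)
lines = [to_fasttext_line(text, label) for text, label in balanced]
```

Written to a text file, lines in this format could then be passed to a fastText supervised training call such as `fasttext.train_supervised(input=...)`; the hyperparameters to use are not given in this record.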