A Text Mining Approach to Uncover the Structure of Subject Metadata in the Biodiversity Heritage Library

Bibliographic Details
Published in	Proceedings of the Association for Information Science and Technology Vol. 60; no. 1; pp. 926-928
Main Authors	Cheng, Yi‐Yun, Parulian, Nikolaus Nova, Dinh, Ly
Format	Journal Article
Language	English
Published	Hoboken, USA: John Wiley & Sons, Inc., 01.10.2023
Subjects	Biodiversity Heritage Library; Subject headings; text mining

More Information
Summary:	We propose a bottom‐up, data‐driven pipeline to uncover the structure of biodiversity subject metadata using a combination of text mining approaches. In this study, we analyze 721,035 subject terms in the Biodiversity Heritage Library (BHL). We utilize named entity recognition and word‐embedding methods to systematically label and group terms based on their vector‐space distances. The results show that the subject terms from BHL are clustered into several prominent themes relating to environmental regulations, geographic locations, organisms, and subject access points. We hope that our approach can serve as a first step to group similar subject terms together in large‐scale, constantly growing digital collections with aggregated metadata from multiple sources. Ultimately, we hope the next phases of this project can become a basis for biodiversity digital libraries to standardize their vocabularies.
ISSN:	2373-9231
DOI:	10.1002/pra2.900
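
Illustrative note: the summary describes grouping subject terms by their vector-space distances. The sketch below shows that general idea only and is not the authors' pipeline. It assumes a toy character-trigram count vector as a stand-in for a trained word embedding, and the names `embed`, `cosine_distance`, `group_terms`, and the 0.6 threshold are hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def embed(term, n=3):
    """Toy stand-in for a word embedding: counts of character n-grams."""
    padded = f" {term.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def group_terms(terms, threshold=0.6):
    """Greedy grouping: a term joins the first group whose seed term
    lies within `threshold` cosine distance, else it starts a new group."""
    groups = []  # list of (seed_vector, member_terms)
    for term in terms:
        vec = embed(term)
        for seed, members in groups:
            if cosine_distance(seed, vec) <= threshold:
                members.append(term)
                break
        else:
            groups.append((vec, [term]))
    return [members for _, members in groups]


# Made-up subject terms, used only to exercise the sketch.
terms = ["Birds", "Birds of Europe", "Botany", "Botany, Medical"]
groups = group_terms(terms)
print(groups)
```

A real pipeline would presumably replace `embed` with vectors from a trained embedding model and use a proper clustering algorithm; the greedy pass here is only the simplest way to show distance-based grouping.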