BiodiViz: Leveraging NER and RE for Automated Knowledge Graph Generation in Biodiversity Research

In biodiversity research, the integration of machine learning and data visualization is increasingly important for uncovering valuable insights from academic literature. This study introduces an innovative knowledge graph application, BiodiViz, designed to translate intricate text into intuitive vis...

Full description

Saved in:
Bibliographic Details
Published inBiodiversity Information Science and Standards Vol. 8; p. 842
Main Authors Tan, Angela Shannen, Dimayuga, Paul Michael, Gabud, Roselyn
Format Journal Article
LanguageEnglish
Published Sofia Pensoft Publishers 29.10.2024
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:In biodiversity research, the integration of machine learning and data visualization is increasingly important for uncovering valuable insights from academic literature. This study introduces an innovative knowledge graph application, BiodiViz, designed to translate intricate text into intuitive visual representations, fostering a deeper comprehension of biodiversity relationships. BiodiViz uses the top-performing Named Entity Recognition (NER) and Relation Extraction (RE) models to automatically generate a comprehensive knowledge graph for biodiversity research. The NER model extracts and categorizes entities like organisms, phenomena, and habitats, while the RE model identifies relationships such as "have," "occur in," and "influence" from the BiodivNERE dataset (Abdelmageed et al. 2022). These entities and relationships are organized into nodes and edges within a graph. Researchers input text into BiodiViz, producing a visual knowledge graph that simplifies the analysis of complex biodiversity data, reducing manual effort and enhancing efficiency. Named Entity Recognition & Relation Extraction BiodiViz leverages advanced Bidirectional Encoder Representations from Transformers (BERT)-based Large Language Models (LLMs) (Rogers et al. 2020), fine-tuned specifically for NER and RE tasks using the BiodivNERE dataset. The fine-tuning process involved various models, including BERT (Devlin et al. 2019), ELECTRA (Clark et al. 2020), and BiodivBERT (Abdelmageed et al. 2023). These models were evaluated for performance using the results of their F1-score as the main metric, which is the harmonic mean of precision (the proportion of true positive results among all positive predictions) and recall (the proportion of true positive results among all actual positives), with BiodivBERT achieving an F1-score of 77.16% for the NER task, while BERT excelled in the RE task with an F1-score of 81.28%. Rigorous hyperparameter optimization further enhanced the performance of BiodivBERT in the RE task by 3.38%. The BiodivNERE corpora by Abdelmageed et al. (2022) were used to fine-tune several models for NER and RE tasks in the biodiversity domain. The first corpus from the BiodivNERE corpora is BiodivNER, which is a gold standard dataset (manually labelled test corpora) for evaluating NER tasks. The fine-tuning process employed the token classification method from the Hugging Face library (Hugging Face 2023b), which assigns labels to each token in a sequence. Experiments were conducted with a batch size of four, meaning the model processes four examples/rows of data at a time before making an update to improve its learning. This is due to the constraints of the NVIDIA® GeForce RTX™ 3060 graphics processor. (NVIDIA 2024) Model performance was evaluated using the seqeval library (Nakayama 2018), focusing on accuracy, precision, recall, and F1 scores. For text classification, the second corpus, BiodivRE, was utilized, following previous research recommendations to explore fine-tuning settings for BiodivBERT. Hyperparameter optimization (Feurer and Hutter 2019) was conducted using Hugging Face’s Trainer API with an Optuna backend (Hugging Face 2023a), concentrating on learning rate and the number of training epochs (i.e., the number of complete passes through the entire dataset during model training). The BiodiViz Knowledge Graph Application The fine-tuned NER and RE models with the best F1-scores—BiodivBERT and BERT, respectively—were integrated into the knowledge graph application. Fig. 1 illustrates the flowchart of the application pipeline. Each sentence in the input text will go through the NER model to identify and label the entities within the sentence. Subsequently, these labeled entities, together with the original sentence, will be input into the RE model. The RE model will analyze every pair of entities for a potential relation and output the type of relation they share. The application will then utilize this data to create a graph with appropriate labels and color-coding. An example of the application's user interface with the knowledge graph is shown in Fig. 2. This study highlights the practical application of machine learning and data visualization in advancing biodiversity research, emphasizing the importance of developing user-friendly tools to support scientific exploration and discovery. The BiodiViz application, including the code and resources, is available on GitHub*1, providing an accessible tool for biodiversity researchers to streamline their analyses.
ISSN:2535-0897
2535-0897
DOI:10.3897/biss.8.140428