Using large language models to extract plant functional traits from unstructured text

Bibliographic Details
Published in Applications in Plant Sciences Vol. 13; no. 3; p. e70011
Main Authors Domazetoski, Viktor; Kreft, Holger; Bestova, Helena; Wieder, Philipp; Koynov, Radoslav; Zarei, Alireza; Weigelt, Patrick
Format Journal Article
Language English
Published United States John Wiley & Sons, Inc 01.05.2025
Summary:
Premise: Functional plant ecology seeks to understand how functional traits govern species distributions, community assembly, and ecosystem functions. While global trait datasets have advanced the field, substantial gaps remain, and extracting trait information from text in books, research articles, and online sources via machine learning offers a valuable complement to costly field campaigns.
Methods: We propose a natural language processing pipeline that extracts traits from unstructured species descriptions by using classification models for categorical traits and question‐answering models for numerical traits. The pipeline's performance is evaluated on two large databases with over 50,000 species descriptions, utilizing approaches ranging from a keyword search to large language models.
Results: Our final optimized pipeline used a transformer architecture and obtained a mean precision of 90.8% (range 81.6–97%) and a mean recall of 88.6% (77.4–97%) across five categorical traits, representing a 9.83% increase in precision and 42.35% increase in recall over a regular expression‐based approach. The question‐answering model yielded a normalized mean absolute error of 10.3% averaged across three numerical traits.
Discussion: The natural language processing pipeline we propose has the potential to facilitate the digitization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions.
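To make the evaluation described in the summary concrete, the following is a minimal sketch of a keyword (regular expression) baseline for one categorical trait, together with per-class precision and recall and a normalized mean absolute error for numerical values. It is not the authors' implementation: the trait ("growth form"), its classes, the keyword patterns, the function names, and the example descriptions are all hypothetical, and dividing the mean absolute error by the range of the true values is only one possible normalization, since the summary does not state which one was used.

```python
import re

# Hypothetical keyword patterns for one categorical trait (growth form).
# The trait, classes, and keywords are illustrative, not taken from the paper.
GROWTH_FORM_PATTERNS = {
    "tree": re.compile(r"\btrees?\b", re.IGNORECASE),
    "shrub": re.compile(r"\bshrubs?\b", re.IGNORECASE),
    "herb": re.compile(r"\bherbs?\b|\bherbaceous\b", re.IGNORECASE),
}


def keyword_predict(description):
    """Return the set of classes whose keyword pattern occurs in the text."""
    return {
        label
        for label, pattern in GROWTH_FORM_PATTERNS.items()
        if pattern.search(description)
    }


def precision_recall(predicted, actual, label):
    """Precision and recall for one class over parallel lists of label sets."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if label in p and label in a)
    fp = sum(1 for p, a in pairs if label in p and label not in a)
    fn = sum(1 for p, a in pairs if label not in p and label in a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def normalized_mae(predicted, actual):
    """Mean absolute error divided by the range of the true values.

    Assumption: range normalization; the source does not specify the scheme.
    """
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
    return mae / (max(actual) - min(actual))


# Made-up species descriptions with made-up reference labels.
descriptions = [
    "A tall tree up to 30 m high.",
    "Woody plant, much branched from the base.",
    "Shrubs or small trees with leathery leaves.",
    "Annual herb with erect stems.",
]
actual_labels = [{"tree"}, {"shrub"}, {"shrub"}, {"herb"}]
predicted_labels = [keyword_predict(d) for d in descriptions]

tree_scores = precision_recall(predicted_labels, actual_labels, "tree")
shrub_scores = precision_recall(predicted_labels, actual_labels, "shrub")
height_error = normalized_mae([10.0, 22.0, 28.0], [10.0, 20.0, 30.0])
```

In this toy data the second description names no growth form explicitly, so the keyword search misses it, and the third mentions both shrubs and trees, so the keyword search over-predicts "tree". These are the kinds of recall and precision errors that a learned classifier could plausibly reduce relative to a regular expression approach.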
ISSN:2168-0450
DOI:10.1002/aps3.70011