Using large language models to extract plant functional traits from unstructured text

Bibliographic Details
Published in	Applications in Plant Sciences, Vol. 13, no. 3, e70011
Main Authors	Domazetoski, Viktor; Kreft, Holger; Bestova, Helena; Wieder, Philipp; Koynov, Radoslav; Zarei, Alireza; Weigelt, Patrick
Format	Journal Article
Language	English
Published	United States: John Wiley & Sons, Inc., 01.05.2025
Subjects	Algorithms; Application; Artificial intelligence; Automatic trait extraction; Biodiversity; Classification; Ecology; Flowers & plants; Functional plant ecology; Language; Large language models; Natural language processing; Plant extracts; Probability; Taxonomy; Vascular plants
Summary:	Premise: Functional plant ecology seeks to understand how functional traits govern species distributions, community assembly, and ecosystem functions. While global trait datasets have advanced the field, substantial gaps remain, and extracting trait information from text in books, research articles, and online sources via machine learning offers a valuable complement to costly field campaigns. Methods: We propose a natural language processing pipeline that extracts traits from unstructured species descriptions by using classification models for categorical traits and question-answering models for numerical traits. The pipeline's performance is evaluated on two large databases with over 50,000 species descriptions, utilizing approaches ranging from a keyword search to large language models. Results: Our final optimized pipeline used a transformer architecture and obtained a mean precision of 90.8% (range 81.6–97%) and a mean recall of 88.6% (range 77.4–97%) across five categorical traits, representing a 9.83% increase in precision and a 42.35% increase in recall over a regular expression-based approach. The question-answering model yielded a normalized mean absolute error of 10.3% averaged across three numerical traits. Discussion: The natural language processing pipeline we propose has the potential to facilitate the digitization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions.
ISSN:	2168-0450
DOI:	10.1002/aps3.70011
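
Illustrative note: the Summary refers to a regular expression-based baseline, to precision and recall for categorical traits, and to a normalized mean absolute error for numerical traits. The Python sketch below shows one way such quantities could be computed. It is not the authors' pipeline. The keyword patterns, the function names, and the choice to normalize the error by the range of the observed values are all assumptions made for illustration, since this record does not specify them.

```python
import re

# Hypothetical keyword patterns for one categorical trait (growth form).
# These are illustrative assumptions, not the patterns used in the article.
GROWTH_FORM_PATTERNS = {
    "tree": re.compile(r"\btrees?\b", re.IGNORECASE),
    "shrub": re.compile(r"\bshrubs?\b", re.IGNORECASE),
    "herb": re.compile(r"\b(?:herbs?|herbaceous)\b", re.IGNORECASE),
}


def keyword_baseline(description):
    """Return the set of growth-form labels whose pattern occurs in the text."""
    return {
        label
        for label, pattern in GROWTH_FORM_PATTERNS.items()
        if pattern.search(description)
    }


def precision_recall(predicted, actual, label):
    """Compute precision and recall for one label.

    `predicted` and `actual` are equally long lists of label sets,
    one set per species description.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if label in p and label in a)
    fp = sum(1 for p, a in pairs if label in p and label not in a)
    fn = sum(1 for p, a in pairs if label not in p and label in a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def normalized_mae(predicted, actual):
    """Mean absolute error divided by the range of the true values.

    Range normalization is an assumption; the record does not say
    which normalization the article uses.
    """
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("range of true values is zero")
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
    return mae / value_range
```

In this sketch, a description such as "Evergreen shrub or small tree." matches both the shrub and the tree pattern, while "Woody plant reaching 15 m." matches none. Cases like these show how a keyword search can lose both precision and recall relative to a model that reads the whole description.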