Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures

Bibliographic Details
Published in Scientific Data Vol. 11; no. 1; pp. 347 - 12
Main Authors Chen, Xueqing; Gao, Yang; Wang, Ludi; Cui, Wenjuan; Huang, Jiamin; Du, Yi; Wang, Bin
Format Journal Article
Language English
Published London Nature Publishing Group UK 06.04.2024
Nature Publishing Group
Nature Portfolio
Summary: CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial catalyst-related information from the domain literature can help scientists find new and effective electrocatalysts. Herein, we used various advanced machine learning, natural language processing (NLP), and large language model (LLM) approaches to extract relevant information about the CO2 electrocatalytic reduction process from the scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLM techniques. Extended Corpus I and Extended Corpus II contain 77,016 and 30,283 records, respectively. Furthermore, several LLMs fine-tuned on domain literature were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from the domain literature using cutting-edge computational techniques.
ISSN: 2052-4463
DOI: 10.1038/s41597-024-03180-9