Language agents achieve superhuman synthesis of scientific knowledge

Bibliographic Details
Published in: arXiv.org
Main Authors: Skarlinski, Michael D; Cox, Sam; Laurent, Jon M; Braza, James D; Hinks, Michaela; Hammerling, Michael J; Ponnapati, Manvitha; Rodriques, Samuel G; White, Andrew D
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 26.09.2024

Summary: Language models are known to hallucinate incorrect information, and it is unclear if they are sufficiently accurate and reliable for use in scientific research. We developed a rigorous human-AI comparison methodology to evaluate language model agents on real-world literature search tasks covering information retrieval, summarization, and contradiction detection tasks. We show that PaperQA2, a frontier language model agent optimized for improved factuality, matches or exceeds subject matter expert performance on three realistic literature research tasks without any restrictions on humans (i.e., full access to internet, search tools, and time). PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles. We also introduce a hard benchmark for scientific literature research called LitQA2 that guided design of PaperQA2, leading to it exceeding human performance. Finally, we apply PaperQA2 to identify contradictions within the scientific literature, an important scientific task that is challenging for humans. PaperQA2 identifies 2.34 ± 1.99 contradictions per paper in a random subset of biology papers, of which 70% are validated by human experts. These results demonstrate that language model agents are now capable of exceeding domain experts across meaningful tasks on scientific literature.
ISSN: 2331-8422