Extracting circumstances of Covid-19 transmission from free text with large language models

Bibliographic Details
Published in Nature communications Vol. 16; no. 1; pp. 5836 - 13
Main Authors Bizel-Bizellot, Gaston; Galmiche, Simon; Lelandais, Benoît; Charmet, Tiffany; Coudeville, Laurent; Fontanet, Arnaud; Zimmer, Christophe
Format Journal Article
Language English
Published London Nature Publishing Group UK 01.07.2025
Nature Publishing Group
Nature Portfolio
Summary: Identifying the circumstances of transmission of an emerging infectious disease rapidly is central for mitigation efforts. Here, we explore how large language models (LLMs) can automatically extract such circumstances from free-text descriptions in online surveys, in the context of Covid-19. In a nationwide study conducted online in France, we enrolled 545,958 adults with recent SARS-CoV-2 infection and inquired about the circumstances of transmission in both closed-ended and open-ended questions. First, we trained a classification model based on a pretrained LLM to predict one of seven predefined infection contexts (Work, Family, Friends, Sports, Cultural, Religious, Other) from the free text in answers to open-ended questions. We achieved an unbalanced accuracy of 75%, which increased to 91% when eliminating the 43% highest entropy responses. Second, we used topic modeling to define clusters of transmission circumstances agnostically. This led to 23 clusters, which agreed with the seven predefined infection contexts, but also provided finer details on previously undefined circumstances of transmission. Our study suggests that LLM-based analysis of free text may alleviate the need for closed-ended questions in epidemiological surveys and enable insights into previously unsuspected circumstances of transmission. This approach is poised to accelerate and enrich the acquisition of epidemiological insights in future pandemics. Open-ended survey questions may provide useful detail on possible venues of transmission of infectious diseases, but data are difficult to analyse at scale. Here, the authors use large language models to extract potential transmission venues in ~80,000 responses to an open-ended COVID-19 survey question in France.
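The summary reports that accuracy rose from 75% to 91% once the 43% highest entropy responses were eliminated. The record does not describe how this was implemented, so the following is only a minimal sketch of one common way to do such filtering: compute the Shannon entropy of each predicted probability distribution over the seven contexts and keep the least uncertain responses. The function names, the toy probabilities, and the labels are all hypothetical and are not taken from the study.

```python
import math

# The seven predefined infection contexts named in the summary.
CONTEXTS = ["Work", "Family", "Friends", "Sports", "Cultural", "Religious", "Other"]


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one predicted probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def keep_low_entropy(prob_rows, drop_fraction):
    """Return sorted indices of the responses kept after discarding the
    `drop_fraction` of responses with the highest predictive entropy."""
    n_drop = int(len(prob_rows) * drop_fraction)
    # Rank responses from most confident (low entropy) to least confident.
    ranked = sorted(range(len(prob_rows)), key=lambda i: shannon_entropy(prob_rows[i]))
    return sorted(ranked[: len(prob_rows) - n_drop])


def accuracy(prob_rows, labels, indices):
    """Fraction of the selected responses whose most probable context
    matches the reference label."""
    correct = 0
    for i in indices:
        predicted = max(range(len(CONTEXTS)), key=lambda k: prob_rows[i][k])
        correct += predicted == labels[i]
    return correct / len(indices)


# Made-up classifier outputs for four responses (assumption, illustration only).
toy_probs = [
    [0.94, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],  # confident "Work"
    [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.05],  # uncertain
    [0.02, 0.90, 0.02, 0.02, 0.02, 0.01, 0.01],  # confident "Family"
    [0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14],  # nearly uniform
]
toy_labels = [0, 1, 1, 2]  # made-up reference contexts, as indices into CONTEXTS

all_indices = list(range(len(toy_probs)))
kept = keep_low_entropy(toy_probs, drop_fraction=0.5)

print("accuracy on all responses:", accuracy(toy_probs, toy_labels, all_indices))
print("accuracy on kept responses:", accuracy(toy_probs, toy_labels, kept))
```

In this toy case the two discarded responses are exactly the misclassified ones, which is the idealised version of the trade-off the summary describes: fewer responses are classified automatically, but those that remain are classified more reliably.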
PMCID: PMC12219669
ISSN: 2041-1723
DOI: 10.1038/s41467-025-60762-w