Hidden challenges in evaluating spillover risk of zoonotic viruses using machine learning models


Bibliographic Details
Published in	Communications medicine Vol. 5; no. 1; pp. 187 - 10
Main Authors	Kawasaki, Junna; Suzuki, Tadaki; Hamada, Michiaki
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK, 20.05.2025; Springer Nature B.V; Nature Portfolio
Subjects	45; 631/114/2163; 631/326/596/2564; Datasets; Genomes; Influenza; Large language models; Machine learning; Medicine; Medicine & Public Health; Metadata; Viruses; Zoonoses

More Information
Summary:	Background: Machine learning models have been deployed to assess the zoonotic spillover risk of viruses by identifying their potential for human infectivity. However, the lack of comprehensive datasets for viral infectivity poses a major challenge, limiting the range of viruses for which predictions can be made.
Methods: In this study, we address this limitation through two key strategies: constructing expansive datasets across 26 viral families and developing the BERT-infect model, which leverages large language models pre-trained on extensive nucleotide sequences.
Results: Here we show that our approach substantially boosts model performance. This enhancement is particularly notable in segmented RNA viruses, which are associated with severe zoonoses but have been overlooked due to limited data availability. Our model also exhibits high predictive performance even with partial viral sequences, such as high-throughput sequencing reads or contig sequences from de novo sequence assemblies, indicating the model’s applicability for mining zoonotic viruses from virus metagenomic data. Furthermore, models trained on data up to 2018 demonstrate robust predictive capability for most viruses identified post-2018. Nonetheless, high-resolution evaluation based on phylogenetic analysis reveals a general limitation of current machine learning models: difficulty in flagging the human infection risk of specific zoonotic viral lineages, including SARS-CoV-2.
Conclusions: Our study provides a comprehensive benchmark for viral infectivity prediction models and highlights unresolved issues in fully exploiting machine learning to prepare for future zoonotic threats.
Plain language summary: To prepare for future pandemics caused by animal-derived viruses, there is a growing need for computational models that can predict whether a virus might infect humans. We constructed extensive datasets covering information about different viruses, including key human pathogens. We developed computational models using these datasets, which outperformed existing approaches across many virus types. However, we also revealed that current models share the same unresolved challenges when assessing whether specific viruses, including SARS-CoV-2, will infect humans. These findings suggest that current models may fail to identify animal viruses that can infect humans, which underscores the urgent need for improved predictive models to strengthen pandemic preparedness.
Kawasaki et al. construct a dataset covering 26 viral families and use large language models pre-trained on nucleotide sequences to identify zoonotic viruses with human infectivity potential. High predictive performance was obtained, even with partial viral sequences, but not all zoonotic lineages could be identified.
ISSN:	2730-664X
DOI:	10.1038/s43856-025-00903-w