MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language
Format | Journal Article |
Language | English |
Published | 28.10.2024 |
Summary: | Drug discovery typically consists of multiple steps, including identifying a
target protein key to a disease's etiology, validating that interacting with
this target could prevent symptoms or cure the disease, discovering a small
molecule or biologic therapeutic to interact with it, and optimizing the
candidate molecule through a complex landscape of required properties.
Drug-discovery tasks often involve prediction and generation while
considering multiple entities that potentially interact, which poses a
challenge for typical AI models. For this purpose, we present MAMMAL - Molecular
Aligned Multi-Modal Architecture and Language - a method that we applied to
create a versatile multi-task multi-align foundation model that learns from
large-scale biological datasets (2 billion samples) across diverse modalities,
including proteins, small molecules, and genes. We introduce a prompt syntax
that supports a wide range of classification, regression, and generation tasks.
It allows combining different modalities and entity types as inputs and/or
outputs. Our model handles combinations of tokens and scalars and enables the
generation of small molecules and proteins, property prediction, and
transcriptomic lab test predictions. We evaluated the model on 11 diverse
downstream tasks spanning different steps within a typical drug discovery
pipeline, where it reaches new SOTA in 9 tasks and is comparable to SOTA in 2
tasks. This performance is achieved with a single unified architecture serving
all tasks, in contrast to the previous SOTA results, which were obtained using
task-specific architectures.
The model code and pretrained weights are publicly available at
https://github.com/BiomedSciAI/biomed-multi-alignment and
https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m. |
DOI: | 10.48550/arxiv.2410.22367 |
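The record does not spell out the prompt syntax itself, only that it combines different modalities and entity types as inputs and/or outputs. The sketch below is a purely hypothetical illustration of that idea: the tag names (`<TASK=...>`, `<PROTEIN>`, `<SMILES>`) and the `build_prompt` helper are invented here and do not reflect the model's actual token vocabulary or API.

```python
# Hypothetical sketch of a MAMMAL-style multi-modal task prompt.
# All tags and the helper below are illustrative inventions, not the
# model's real vocabulary (see the repository linked above for that).

def build_prompt(task: str, entities: list[tuple[str, str]]) -> str:
    """Compose one prompt string from a task name and (modality, sequence)
    pairs, e.g. ("protein", "MKTAYIAKQR") or ("smiles", "CC(=O)O...")."""
    parts = [f"<TASK={task}>"]
    for modality, seq in entities:
        tag = modality.upper()
        parts.append(f"<{tag}>{seq}</{tag}>")
    return "".join(parts)

# Example: a binding-affinity-style prompt pairing a protein with a drug.
prompt = build_prompt(
    "binding_affinity",
    [("protein", "MKTAYIAKQR"), ("smiles", "CC(=O)Oc1ccccc1C(=O)O")],
)
print(prompt)
```

The point of the sketch is only the abstract's claim that one flat token sequence can carry several typed entities at once, so a single model can serve classification, regression, and generation tasks over them.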