Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Bibliographic Details
Published in	PLoS computational biology Vol. 7; no. 1; p. e1001047
Main Authors	Melvin, Iain; Weston, Jason; Noble, William Stafford; Leslie, Christina
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 01.01.2011
Subjects	Algorithms; Bioinformatics; Biological Evolution; Computational Biology/Protein Homology Detection; DNA; Markov processes; Methods; Neighborhoods; Physiological aspects; Proteins; Proteins - chemistry; Proteins - genetics; Semantics; Sequence Analysis, DNA; Studies; United States

Abstract	Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
AbstractList	Searching a protein or DNA sequence database to find sequences that are evolutionarily related to a query is one of the foundational problems in computational biology. These database searches rely on pairwise comparisons of sequence similarity between the query and targets, but despite years of method refinements, pairwise comparisons still often fail to detect more distantly related targets. In this study, we adapt recent work from natural language processing to exploit the global structure of the data space in this detection problem. In particular, we borrow the idea of a semantic embedding, whereby training on a large text data set yields an embedding of words into a low-dimensional semantic space such that words embedded close to each other are likely to be semantically related. We present the ProtEmbed algorithm, which learns an embedding of protein sequences into a semantic space where evolutionarily related proteins are embedded in close proximity. The flexible training algorithm allows additional pieces of evidence, such as 3D structural information, to be incorporated in the learning process and enables ProtEmbed to achieve state-of-the-art performance for the task of detecting targets that have remote evolutionary relationships to the query.
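The retrieval idea described above (related proteins lie close together in the embedding space, so a query can be answered by ranking targets by proximity) can be illustrated with a minimal sketch. This is not the ProtEmbed training procedure: the coordinates, the sequence identifiers, the helper `rank_by_distance`, and the choice of Euclidean distance are all assumptions made for illustration only.

```python
import math

# Hypothetical 2-D coordinates standing in for a learned embedding.
# The identifiers and values are invented for illustration only.
EMBEDDING = {
    "query": (0.0, 0.0),
    "target_a": (0.1, 0.2),   # placed near the query: putative homolog
    "target_b": (0.3, -0.1),
    "target_c": (2.5, 2.0),   # placed far away: putatively unrelated
}


def rank_by_distance(query_id, embedding):
    """Return all other sequence ids, nearest to the query first."""
    q = embedding[query_id]
    others = (name for name in embedding if name != query_id)
    # Euclidean distance is an assumption of this sketch.
    return sorted(others, key=lambda name: math.dist(q, embedding[name]))


ranking = rank_by_distance("query", EMBEDDING)
print(ranking)
```

In a real system the coordinates would come from the trained model, and the distance function would be whatever the model was trained to respect; only the ranking step is shown here.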
Audience	Academic
Author	Melvin, Iain; Weston, Jason; Noble, William Stafford; Leslie, Christina
AuthorAffiliation	1 NEC Laboratories America, Princeton, New Jersey, United States of America; 2 Google, New York, New York, United States of America; 3 Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America; 4 Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America; Stanford University, United States of America
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/21298082 (View this record in MEDLINE/PubMed)
CitedBy_id	crossref_primary_10_1038_s42256_022_00457_9 crossref_primary_10_3390_life12020307 crossref_primary_10_1002_prot_25669 crossref_primary_10_1093_bib_bbw108 crossref_primary_10_1093_bioinformatics_btv413 crossref_primary_10_1016_j_ab_2020_114013 crossref_primary_10_1093_bioinformatics_btw271 crossref_primary_10_1016_j_sbi_2011_03_005 crossref_primary_10_1038_srep32333 crossref_primary_10_1073_pnas_1102727108 crossref_primary_10_1109_TCBB_2017_2765331 crossref_primary_10_12720_jomb_3_1_17_22 crossref_primary_10_1093_bioinformatics_btt709 crossref_primary_10_1109_TCBB_2018_2789880 crossref_primary_10_1146_annurev_pharmtox_010611_134630 crossref_primary_10_1038_s41592_019_0511_y crossref_primary_10_1109_ACCESS_2019_2929363 crossref_primary_10_1016_j_sbi_2025_102984 crossref_primary_10_1093_bib_bby104 crossref_primary_10_1093_bioinformatics_btx429
Cites_doi	10.1093/bioinformatics/btq034 10.1093/bioinformatics/btm358 10.1093/bioinformatics/btn567 10.1073/pnas.0308067101 10.1016/S0022-2836(05)80134-2 10.1093/bioinformatics/btp452 10.1016/S0022-2836(05)80360-2 10.1111/1467-9868.00346 10.1111/j.2517-6161.1995.tb02031.x 10.1093/nar/25.17.3389 10.1093/nar/gki096 10.1093/nar/gki408 10.1110/ps.0215902 10.1110/ps.9.2.232 10.1016/0022-2836(81)90087-5 10.1093/nar/28.1.254
ContentType	Journal Article
Copyright	COPYRIGHT 2011 Public Library of Science. 2011 Melvin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Melvin I, Weston J, Noble WS, Leslie C (2011) Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding. PLoS Comput Biol 7(1): e1001047. doi:10.1371/journal.pcbi.1001047
DOI	10.1371/journal.pcbi.1001047
DatabaseName	CrossRef Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed Gale In Context: Canada Gale In Context: Science Biotechnology Research Abstracts Technology Research Database Engineering Research Database Biotechnology and BioEngineering Abstracts MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals
DatabaseTitle	CrossRef MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) Engineering Research Database Biotechnology Research Abstracts Technology Research Database Biotechnology and BioEngineering Abstracts MEDLINE - Academic
DatabaseTitleList	MEDLINE MEDLINE - Academic Engineering Research Database
Discipline	Biology
DocumentTitleAlternate	Detecting Remote Evolutionary Relationships
EISSN	1553-7358
ExternalDocumentID	1313184817 oai_doaj_org_article_2a277e17bdbd4dd5add8156cd8a2073e PMC3029239 A248493554 21298082 10_1371_journal_pcbi_1001047
Genre	Journal Article; Research Support, N.I.H., Extramural
GeographicLocations	United States
GeographicLocations_xml	– name: United States
GrantInformation_xml	– fundername: NIGMS NIH HHS grantid: R01 GM074257
ISSN	1553-734X (print); 1553-7358 (electronic)
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	1
Keywords	Biological Evolution Proteins Algorithms Sequence Analysis, DNA
License	This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
Notes	Conceived and designed the experiments: JW, WSN, CL. Performed the experiments: IM. Analyzed the data: IM, JW. Wrote the paper: IM, JW, WSN, CL.
OpenAccessLink	http://journals.scholarsportal.info/openUrl.xqy?doi=10.1371/journal.pcbi.1001047
PMID	21298082
PublicationCentury	2000
PublicationDate	2011-01-01
PublicationDateYYYYMMDD	2011-01-01
PublicationDate_xml	– month: 01 year: 2011 text: 2011-01-01 day: 01
PublicationDecade	2010
PublicationPlace	United States
PublicationPlace_xml	– name: United States – name: San Francisco, USA
PublicationTitle	PLoS computational biology
PublicationTitleAlternate	PLoS Comput Biol
PublicationYear	2011
Publisher	Public Library of Science (PLoS)
StartPage	e1001047
URI	https://www.ncbi.nlm.nih.gov/pubmed/21298082 https://www.proquest.com/docview/1872839943 https://www.proquest.com/docview/850560910 https://pubmed.ncbi.nlm.nih.gov/PMC3029239 https://doaj.org/article/2a277e17bdbd4dd5add8156cd8a2073e http://dx.doi.org/10.1371/journal.pcbi.1001047
Volume	7