Mining the Structural Genomics Pipeline: Identification of Protein Properties that Affect High-throughput Experimental Analysis

Bibliographic Details
Published in Journal of molecular biology Vol. 336; no. 1; pp. 115-130
Main Authors Goh, Chern-Sing; Lan, Ning; Douglas, Shawn M; Wu, Baolin; Echols, Nathaniel; Smith, Andrew; Milburn, Duncan; Montelione, Gaetano T; Zhao, Hongyu; Gerstein, Mark
Format Journal Article
Language English
Published England Elsevier Ltd 06.02.2004
Summary: Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline: cloning, expression, purification, and ultimately structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation. Based on this, we identify potential bottlenecks in various stages of the structural genomics process through specialized “pipeline schematics”. We find that the most significant properties of a protein are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation. Further information is available from http://mining.nesg.org/.
ISSN: 0022-2836, 1089-8638
DOI: 10.1016/j.jmb.2003.11.053
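
The summary says that tree-based analyses were used to find the protein features that best separate solved targets from the rest. As a rough illustration of the underlying idea only, and not the authors' actual method, the sketch below ranks features by the information gain of their best single-threshold split, which is one common criterion for choosing a decision-tree node. A random forest would aggregate many such trees built on resampled data. Everything here is an assumption made for illustration: the feature names `length` and `percent_charged`, the helper functions, and the toy values in `TOY_TARGETS`, which are invented and are not data from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the labels at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - weighted


def best_split(values, labels):
    """Return (gain, threshold) for the best single-threshold split."""
    # Splitting at the maximum value would leave the right side empty.
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return (0.0, None)
    return max((information_gain(values, labels, t), t) for t in candidates)


def rank_features(rows, names):
    """Rank features by the gain of their best split, highest first."""
    labels = [row[-1] for row in rows]
    ranking = []
    for i, name in enumerate(names):
        values = [row[i] for row in rows]
        gain, threshold = best_split(values, labels)
        ranking.append((name, gain, threshold))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical toy targets: (length, percent_charged, structure_solved).
# These numbers are invented for illustration, not taken from the paper.
TOY_TARGETS = [
    (120, 30.0, True),
    (150, 28.0, True),
    (400, 29.0, False),
    (520, 31.0, False),
    (180, 27.0, True),
    (610, 30.5, False),
]

if __name__ == "__main__":
    names = ["length", "percent_charged"]
    for name, gain, threshold in rank_features(TOY_TARGETS, names):
        print(f"{name}: gain={gain:.3f} at threshold <= {threshold}")
```

In this toy set, length separates the two classes cleanly while percent charged does not, so length ranks first. On real target data the ranking would depend on the features and labels actually recorded for each pipeline stage.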