Mining the Structural Genomics Pipeline: Identification of Protein Properties that Affect High-throughput Experimental Analysis

Bibliographic Details
Published in	Journal of molecular biology Vol. 336; no. 1; pp. 115 - 130
Main Authors	Goh, Chern-Sing; Lan, Ning; Douglas, Shawn M; Wu, Baolin; Echols, Nathaniel; Smith, Andrew; Milburn, Duncan; Montelione, Gaetano T; Zhao, Hongyu; Gerstein, Mark
Format	Journal Article
Language	English
Published	England Elsevier Ltd 06.02.2004
Subjects	Algorithms; charged residues; COGs; Computational Biology; Databases, Protein; Decision Trees; Genomics; hydrophobicity; Protein Conformation; Protein Sorting Signals; Proteins - chemistry; Proteins - genetics; Sequence Analysis, Protein; structural genomics; oob, out-of-bag; NLS, nuclear localization signal; COG, clusters of orthologous groups
Summary:	Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets here, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline, from cloning, expression, and purification to, ultimately, structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation. Based on this, we identify potential bottlenecks in various stages of the structural genomics process through specialized “pipeline schematics”. We find that the properties of a protein that are most significant are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation. Further information is available from http://mining.nesg.org/.
ISSN:	0022-2836 1089-8638
DOI:	10.1016/j.jmb.2003.11.053