Hypergeometric Model of Evolution of Conserved Protein Coding Sequences in the Proteomes
Published in | Fluctuation and noise letters Vol. 3; no. 3; pp. L295 - L324 |
---|---|
Main Author | |
Format | Journal Article |
Language | English |
Published | World Scientific Publishing Company, 01.09.2003 |
Subjects | |
Summary: | The diversity of protein sequences that exists today has probably
evolved from antecedent evolutionarily conserved domain-like
sequences (i.e. motifs, repeats, structural domains) encoded by short
ancient genes. We have studied the statistical distributions of the
occurrences of the domain-like families within proteins in the
proteomes. A generalized hypergeometric stochastic process is
introduced in order to model the evolution dynamics of these conserved
sequences. We found that the limiting probability function associated
with this process fits the empirical distributions for the 90
fully sequenced bacterial, archaeal and eukaryotic organisms. For
eukaryotes, our limiting distribution reduces to Waring's
distribution. However, for many archaeal and bacterial organisms the
empirical distributions degenerate to a Yule-like distribution.
Comparison of these distributions implies critical evolutionary
events, which lead to the proportional growth of the number of new
protein-coding genes and of proteome complexity in eukaryotic
organisms, and suggests that the evolution of many archaeal and
bacterial organisms is subject to external global (ecological)
forces. Best-fit model data predict that (1) there are only about
5500 distinct InterPro domains in a given higher eukaryotic organism
and that (2) a general trend in eukaryotic proteome evolution is an
increase in the frequency of multi-domain proteins composed of
already-existing (older) distinct domains, as opposed to the creation
of new ones. Our model is applicable to the analysis of the evolution
of word distributions in texts and can be used for other large-scale
evolutionary systems such as the Internet, the economy and the
universe. |
---|---|
ISSN: | 0219-4775 1793-6780 |
DOI: | 10.1142/S0219477503001397 |