Models of DNA evolution

A number of different Markov models of DNA sequence evolution have been proposed. This is because evolutionary processes vary between genomes and between different regions of a genome, for example different evolutionary processes apply to coding and noncoding regions. These models mostly differ in the parametrization of the rate matrix and in the modeling of rate variation.

DNA Evolution as a Continuous Time Markov Chain

Continuous Time Markov Chains

Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, $t\$ . Specifically, if $E_{1},\ldots ,E_{4}\$ are the states, then the transition matrix

P(t)={\big (}P_{ij}(t){\big )}

where each individual entry,

P_{ij}(t)\

refers to the probability that state

E_{i}\

will change to state

E_{j}\

in time

t\

.

Example: We would like to model the substitution process in DNA sequences (i.e. Jukes-Cantor, Kimura, etc.) in a continuous time fashion. The corresponding transition matrices will look like:

$P(t)={\begin{pmatrix}p_{AA}(t)&p_{GA}(t)&p_{CA}(t)&p_{TA}(t)\\p_{AG}(t)&p_{GG}(t)&p_{CG}(t)&p_{TG}(t)\\p_{AC}(t)&p_{GC}(t)&p_{CC}(t)&p_{TC}(t)\\p_{AT}(t)&p_{GT}(t)&p_{CT}(t)&p_{TT}(t)\end{pmatrix}}$

where the top-left and bottom-right $2\times 2\$ blocks correspond to transition probabilities and the top-right and bottom-left $2\times 2\$ blocks corresponds to transversion probabilities.

Assumption: If at some time $t_{0}\$ , the Markov chain is in state $E_{i}\$ , then the probability that at time $t_{0}+t\$ , it will be in state $E_{j}\$ depends only upon $i\$ , $j\$ and $t\$ . This then allows us to write that probability as $p_{ij}(t)\$ .

Theorem: Continuous-time transition matrices satisfy:

P(t+\tau )=P(t)P(\tau )\

Deriving the Dynamics of Substitution

Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian independent, identically distributed and constant in time. For a fixed site, let

\mathbf {P} (t)=(p_{A}(t),\ p_{G}(t),\ p_{C}(t),\ p_{T}(t))^{T}

be the column vector of probabilities of states $A,\$ $\ G,\$ $\ C,\$ and $\ T\$ at time $t\$ . Let

{\mathcal {E}}=\{A,\ G,\ C,\ T\}

be the state-space. For two distinct

x,y\in {\mathcal {E}}

, let

\mu _{xy}\

be the transition rate from state $x\$ to state $y\$ . Similarly, for any $x\$ , let:

\mu _{x}=\sum _{y\neq x}\mu _{xy}

The changes in the probability distribution $p_{A}(t)\$ for small increments of time $\Delta t\$ are given by:

p_{A}(t+\Delta t)=p_{A}(t)-p_{A}(t)\mu _{A}\Delta t+\sum _{x\neq A}p_{x}(t)\mu _{xA}\Delta t

In other words (in frequentist language), the frequency of $A\$ 's at time $t+\Delta t\$ is equal to the frequency at time $t\$ minus the frequency of the lost $A\$ 's plus the frequency of the newly created $A\$ 's.

Similarly for the probabilities $p_{G}(t),\ p_{C}(t),\ \mathrm {and} \ p_{T}(t)$ . We can write these compactly as:

\mathbf {P} (t+\Delta t)=\mathbf {P} (t)+Q\mathbf {P} (t)\Delta t

where,

Q={\begin{pmatrix}-\mu _{A}&\mu _{GA}&\mu _{CA}&\mu _{TA}\\\mu _{AG}&-\mu _{G}&\mu _{CG}&\mu _{TG}\\\mu _{AC}&\mu _{GC}&-\mu _{C}&\mu _{TC}\\\mu _{AT}&\mu _{GT}&\mu _{CT}&-\mu _{T}\end{pmatrix}}

or, alternately:

\mathbf {P} '(t)=Q\mathbf {P} (t)

where, $Q\$ is the rate matrix. Note that by definition, the rows of $Q\$ sum to zero.

Ergodicity

If all the transition probabilities, $\mu _{xy}\$ are positive, i.e. if all states $x,y\in {\mathcal {E}}\$ communicate, then the Markov chain has a stationary distribution $\mathbf {\Pi } =\{\pi _{x},\ x\in {\mathcal {E}}\}$ where each $\pi _{x}\$ is the proportion of time spent in state $x\$ after the Markov chain has run for infinite time, and this probability does not depend upon the initial state of the process. Such a Markov chain is called, ergodic. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies, $\pi _{A},\pi _{G},\pi _{C},\pi _{T}\$ correspond to equilibrium base compositions.

Definition A Markov process is stationary if its current distribution is the stationary distribution, i.e. $\mathbf {P} (t)=\Pi \$ Thus, by using the differential equation above,

{\frac {d\Pi }{dt}}=Q\Pi =0

Time Reversibility

Definition: A stationary Markov process is time reversible if (in the steady state) the amount of change from state $x\$ to $y\$ is equal to the amount of change from $y\$ to $x\$ , (although the two states may occur with different frequencies). This means that:

\pi _{x}\mu _{xy}=\pi _{y}\mu _{yx}\

Not all stationary processes are reversible, however, almost all DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.

Under the time reversibility assumption, let $s_{xy}=\mu _{xy}/\pi _{y}\$ , then it is easy to see that:

s_{xy}=s_{yx}\

Definition The symmetric term $s_{xy}\$ is called the exchangeability between states $x\$ and $y\$ . In other words, $s_{xy}\$ is the fraction of the frequency of state $x\$ that results as a result of transitions from state $y\$ to state $x\$ .

Corollary The 12 off-diagonal enteries of the rate matrix, $Q\$ (note the off-diagonal enteries determine the diagonal enteries, since the rows of $Q\$ sum to zero) can be completely determined by 9 numbers; these are: 6 exchangeability terms and 3 stationary frequencies $\pi _{x}\$ , (since the stationary frequencies sum to 1).

Most Common Models of DNA Evolution

JC69 model (Jukes and Cantor, 1969)

JC69 is the simplest substitution model. There are several assumptions. It assumes equal base frequencies ( $\pi _{1}=\pi _{2}=\pi _{3}=\pi _{4}={1 \over 4}$ ) and equal mutation rates. The only parameter of this model is therefore $\mu$ , the overall substitution rate.

Q={\begin{pmatrix}{*}&{\mu  \over 4}&{\mu  \over 4}&{\mu  \over 4}\\{\mu  \over 4}&{*}&{\mu  \over 4}&{\mu  \over 4}\\{\mu  \over 4}&{\mu  \over 4}&{*}&{\mu  \over 4}\\{\mu  \over 4}&{\mu  \over 4}&{\mu  \over 4}&{*}\end{pmatrix}}

P={\begin{pmatrix}{{1 \over 4}+{3 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}\\\\{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}+{3 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}\\\\{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}+{3 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}\\\\{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}-{1 \over 4}e^{-t\mu }}&{{1 \over 4}+{3 \over 4}e^{-t\mu }}\end{pmatrix}}

Distance between two sequences is given by

d=-{3 \over 4}\ln({1-{4 \over 3}p})

where $p$ is the proportion of sites that differ between the two sequences.

K80 model (Kimura, 1980)

The K80 model distinguishes between transitions (A <-> G, i.e. from purine to purine, or C <-> T, i.e. from pyrimidine to pyrimidine) and transversions (from purine to pyrimidine or vice versa) (α/β).

It also assumes equal base frequencies ( $\pi _{T}=\pi _{C}=\pi _{A}=\pi _{G}={1 \over 4}$ )

Rate matrix $Q={\begin{pmatrix}{*}&{\kappa }&{1}&{1}\\{\kappa }&{*}&{1}&{1}\\{1}&{1}&{*}&{\kappa }\\{1}&{1}&{\kappa }&{*}\end{pmatrix}}$

The Kimura two-parameter distance is given by:

d=-{1 \over 2}\ln(1-2p-q)-{1 \over 4}\ln(1-2q)

where $p$ is the proportion of sites that show transitional differences and $q$ is the proportion of sites that show transversional differences.

F81 model (Felsenstein 1981)

Unequal base frequencies ( $\pi _{T}\neq \pi _{C}\neq \pi _{A}\neq \pi _{G}$ )

Rate matrix $Q={\begin{pmatrix}{*}&{\pi _{T}}&{\pi _{T}}&{\pi _{T}}\\{\pi _{C}}&{*}&{\pi _{C}}&{\pi _{C}}\\{\pi _{A}}&{\pi _{A}}&{*}&{\pi _{A}}\\{\pi _{G}}&{\pi _{G}}&{\pi _{G}}&{*}\end{pmatrix}}$

HKY85 model (Hasegawa, Kishino and Yano 1985)

The HKY85 model distinguishes between transitions and transversions (α/β).

It allows unequal base frequencies ( $\pi _{T}\neq \pi _{C}\neq \pi _{A}\neq \pi _{G}$ ).

Rate matrix $Q={\begin{pmatrix}{*}&{\kappa \pi _{T}}&{\pi _{T}}&{\pi _{T}}\\{\kappa \pi _{C}}&{*}&{\pi _{C}}&{\pi _{C}}\\{\pi _{A}}&{\pi _{A}}&{*}&{\kappa \pi _{A}}\\{\pi _{G}}&{\pi _{G}}&{\kappa \pi _{G}}&{*}\end{pmatrix}}$

T92 model (Tamura 1992)

One frequency only $\pi _{GC}$

$\pi _{G}=\pi _{C}={\pi _{GC} \over 2}$

$\pi _{A}=\pi _{T}={(1-\pi _{GC}) \over 2}$

Rate matrix $Q={\begin{pmatrix}{*}&{\kappa (1-\pi _{GC})/2}&{(1-\pi _{GC})/2}&{(1-\pi _{GC})/2}\\{\kappa \pi _{GC}/2}&{*}&{\pi _{GC}/2}&{\pi _{GC}/2}\\{(1-\pi _{GC})/2}&{(1-\pi _{GC})/2}&{*}&{\kappa (1-\pi _{GC})/2}\\{\pi _{GC}/2}&{\pi _{GC}/2}&{\kappa \pi _{GC}/2}&{*}\end{pmatrix}}$

The evolutionary distance between two noncoding sequences according to this model is given by

d=-h\ln(1-{p \over h}-q)-{1 \over 2}(1-h)\ln(1-2q)

where $h=2\theta (1-\theta )$ where $\theta \in (0,1)$ is the GC content.

TN93 model (Tamura and Nei 1993)

The TN93 model distinguishes between the two different types of transition - i.e. (A <-> G) is allowed to have a different rate to (C<->T). Transversions are all assumed to occur at the same rate, but that rate is allowed to be different from both of the rates for transitions.

TN93 also allows unequal base frequencies ( $\pi _{T}\neq \pi _{C}\neq \pi _{A}\neq \pi _{G}$ ).

Rate matrix $Q={\begin{pmatrix}{*}&{\kappa _{1}\pi _{T}}&{\pi _{T}}&{\pi _{T}}\\{\kappa _{1}\pi _{C}}&{*}&{\pi _{C}}&{\pi _{C}}\\{\pi _{A}}&{\pi _{A}}&{*}&{\kappa _{2}\pi _{A}}\\{\pi _{G}}&{\pi _{G}}&{\kappa _{2}\pi _{G}}&{*}\end{pmatrix}}$

GTR: Generalised time reversible

GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986.

The GTR parameters consist of an equilibrium base frequency vector, $\Pi =(\pi _{1},\pi _{2},\pi _{3},\pi _{4})$ , giving the frequency at which each base occurs at each site, and the rate matrix

Q={\begin{pmatrix}{-(x_{1}+x_{2}+x_{3})}&{\pi _{1}x_{1} \over \pi _{2}}&{\pi _{1}x_{2} \over \pi _{3}}&{\pi _{1}x_{3} \over \pi _{4}}\\{x_{1}}&{-({\pi _{1}x_{1} \over \pi _{2}}+x_{4}+x_{5})}&{\pi _{2}x_{4} \over \pi _{3}}&{\pi _{2}x_{5} \over \pi _{4}}\\{x_{2}}&{x_{4}}&{-({\pi _{1}x_{2} \over \pi _{3}}+{\pi _{2}x_{4} \over \pi _{3}}+x_{6})}&{\pi _{3}x_{6} \over \pi _{4}}\\{x_{3}}&{x_{5}}&{x_{6}}&{-({\pi _{1}x_{3} \over \pi _{4}}+{\pi _{2}x_{5} \over \pi _{4}}+{\pi _{3}x_{6} \over \pi _{4}})}\end{pmatrix}}

Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. However, this is usually eliminated down to 9 parameters plus $\mu$ , the overall number of substitutions per unit time. When measuring time in substitutions ( $\mu$ =1) only 9 free parameters remain.

In general, to compute the number of parameters, you count the number of entries above the diagonal in the matrix, i.e. for n trait values per site ${{n^{2}-n} \over 2}$ , and then add n for the equilibrium base frequencies, and subtract 1 because $\mu$ is fixed. You get

{{n^{2}-n} \over 2}+n-1={1 \over 2}n^{2}+{1 \over 2}n-1.

For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), you would find there are 209 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are $4^{3}=64$ codons, but the rates for transitions between codons which differ by more than one base is assumed to be zero. Hence, there are ${{20\times 19\times 3} \over 2}+64-1=633$ parameters.

References

Jukes, T.H. and C.R. Cantor. (1969) Evolution of Protein Molecules, pp. 21-132. Academic Press, New York.
Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120.
Hasegawa, M., H. Kishino, and T. Yano. (1985) Dating of human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22, 160-174.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.
Gu, X. and W. Li. 1992. Higher rates of amino acid substitution in rodents than in man. Molecular Phylogenetics and Evolution, 1:211-214. [1]
Li, W.-H. D.L. Ellsworth, J. Krushkal, B.H.-J. Chang, and D. Hewett-Emmett. 1996. Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Molecular Phylogenetics and Evolution, 5:182-187. [2]
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Molecular Biology and Evolution 9:678-687. [3]
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10:512-526. [4]

External links

DAWG: DNA Assembly With Gaps — free software for simulating sequence evolution

v t e Molecular evolution
Natural selection	Balancing selection Directional selection Disruptive selection Negative selection Stabilizing selection Selective sweep
Models	Models of DNA evolution Models of nucleotide substitution Allele frequency Ka/Ks ratio Tajima's D Fay and Wu's H
Molecular processes	Gene conversion Gene duplication Silent mutation Synonymous substitution Nonsynonymous substitution

v t e Evolutionary biology
Introduction Outline Timeline of evolution History of life Index
Evolution	Abiogenesis Adaptation Adaptive radiation Altruism Cheating Reciprocal Baldwin effect Cladistics Coevolution Mutualism Common descent Convergence Divergence Earliest known life forms Evidence of evolution Evolutionary arms race Evolutionary pressure Exaptation Extinction Event Homology Last universal common ancestor Macroevolution Microevolution Mismatch Non-adaptive radiation Origin of life Panspermia Parallel evolution Signalling theory Handicap principle Speciation Species Species complex Taxonomy Unit of selection Gene-centered view of evolution
Population genetics	Artificial selection Biodiversity Evolutionarily stable strategy Fisher's principle Fitness Inclusive Gene flow Genetic drift Kin selection Parental investment Parent–offspring conflict Mutation Population Natural selection Sexual dimorphism Sexual selection Flowering plants Fungi Mate choice Social selection Trivers–Willard hypothesis Variation
Development	Canalisation Evolutionary developmental biology Genetic assimilation Inversion Modularity Phenotypic plasticity
Of taxa	Bacteria Birds origin Brachiopods Molluscs Cephalopods Dinosaurs Fish Fungi Insects butterflies Life Mammals cats canids wolves dogs hyenas dolphins and whales horses Kangaroos primates humans lemurs sea cows Plants pollinator-mediated Reptiles Spiders Tetrapods Viruses
Of organs	Cell DNA Flagella Eukaryotes symbiogenesis chromosome endomembrane system mitochondria nucleus plastids In animals eye hair auditory ossicle nervous system brain
Of processes	Aging Death Programmed cell death Avian flight Biological complexity Cooperation Color vision in primates Emotion Empathy Ethics Eusociality Immune system Metabolism Monogamy Morality Mosaic evolution Multicellularity Sexual reproduction Gamete differentiation/sexes Life cycles/nuclear phases Mating types Meiosis Sex-determination Snake venom
Tempo and modes	Gradualism/Punctuated equilibrium/Saltationism Micromutation/Macromutation Uniformitarianism/Catastrophism
Speciation	Allopatric Anagenesis Catagenesis Cladogenesis Cospeciation Ecological Hybrid Non-ecological Parapatric Peripatric Reinforcement Sympatric
History	Renaissance and Enlightenment Transmutation of species David Hume Dialogues Concerning Natural Religion Charles Darwin On the Origin of Species History of paleontology Transitional fossil Blending inheritance Mendelian inheritance The eclipse of Darwinism Neo-Darwinism Modern synthesis History of molecular evolution Extended evolutionary synthesis
Philosophy	Darwinism Alternatives Catastrophism Lamarckism Orthogenesis Mutationism Saltationism Structuralism Spandrel Theistic Vitalism Teleology in biology
Related	Biogeography Ecological genetics Evolutionary medicine Group selection Cultural evolution Cultural group selection Dual inheritance theory Hologenome theory of evolution Missing heritability problem Molecular evolution Astrobiology Phylogenetics Tree Polymorphism Protocell Systematics Transgenerational epigenetic inheritance
Category Portal