Markov models for the evolution of duplicate genes, and microsatellites

Stark, TL

doi:10.25959/100.00028366

thesis

posted on 2023-05-28, 09:45 authored by Stark, TL

Duplicate genes and microsatellites are two key types of sequence in the study of evolutionary genomics. Gene duplication has been identified as a central process driving functional change in genomes, since it creates functional redundancy and allows subsequent mutation to occur in the absence of selective pressure. Microsatellites are rapidly evolving sequences which can be studied over much smaller timescales than most other sequences, and are thus key to the study of population demographics and forensic science.

In this thesis we construct mathematical models for the evolution of duplicate genes and of microsatellites. We analyse the models in order to make scientific predictions, and derive the following novel results. We introduce and analyse a modified hazard function, which we use to investigate the preservation of gene duplicates. Further, we construct individual-level models, and present a framework for their extension to population-level models. We also construct mappings from mechanistically motivated, intuitive models for gene duplicate evolution to less intuitive models, which have smaller state spaces and hence are more computationally tractable. Throughout this analysis, we make scientific predictions based on the properties of the models. We find that the pattern of gene duplicate preservation is more consistent with subfunctionalization than with neofunctionalization. This result is of particular scientific interest, since it is the opposite of the conclusion reached in earlier work in the gene duplication literature.

Duplicate genes

Several biological models exist for the evolution of a pair of duplicate genes after a duplication event, and it is believed that gene duplicates can evolve in different ways, according to one process or a mix of processes. Subfunctionalization is a process under which the two duplicates can be preserved by dividing up the functions of the original gene between them.
Here, we find that subfunctionalization is highly consistent with the pattern of gene duplicate preservation, in contrast to previous analysis in the literature. Another process important to gene duplicate evolution is neofunctionalization, under which both duplicates can be preserved when one copy mutates so as to produce some new beneficial function. Our analysis of neofunctionalization suggests that this process is not a significant contributor to the preservation of duplicates over the timescales during which regulatory subfunctionalization is resolved. Instead, it is likely that neofunctionalization occurs after subfunctionalization, which acts to preserve copies over the longer time frames required for rare beneficial mutations to have any significant probability of occurring.

Analysis of genomic data using sub- and neofunctionalization models has thus far been relatively coarse-grained, with mathematical treatments usually focusing on the phenomenological features of gene duplicate evolution. In contrast, we develop mechanistically motivated Markov models, and fit them directly to duplicate preservation data. We introduce a modified-cause-specific hazard function to analyse the preservation of gene duplicates. In the context of gene duplication, we refer to this as the pseudogenization rate, owing to its biological interpretation. We analyse the properties of the modified-cause-specific hazard rate in detail, including limit analysis of the general case, and discuss the shape properties of the specific case of the pseudogenization rate. Further, we extend our model for the evolution of a pair of gene duplicates to model a population of duplicate pairs, by modelling the birth of such pairs as a homogeneous Poisson process. We show that the age distribution of preserved duplicates follows an inhomogeneous Poisson distribution, with its rate function depending on the individual-level model.
We then fit this distribution to count data of surviving duplicates in the genomes of four animal species. Additionally, we extend the individual-level model to a model that includes the process of neofunctionalization, and then to a model of subfunctionalization for families of gene duplicates. Finally, we map these intuitive models to less intuitive but more computationally tractable models, and discuss a number of related computational considerations.

Microsatellites

Microsatellites are repetitive regions of DNA where a short motif is repeated many times. Mutations in the number of repeat units occur frequently compared to point mutations and thus provide a useful source of genetic variation for studying recent events. Empirical studies have suggested that the rate of length-changing mutations due to slipped-strand mispairing may depend on the purity of the repeat units, i.e. how well they each match the motif. However, most studies that use microsatellite data are based on models that only track the number of repeat units. In order to address this gap, we introduce a series of models on a two-dimensional state space (level-dependent quasi-birth-and-death processes) that track the length of the sequence as the level variable, and the number of interruptions (purity) as the phase variable. Our models account for the biological process of point mutation, and its observed effect on the rate of slipped-strand mispairing.

We find that modelling microsatellite purity leads to some complications due to the nature of the available data. For the initial model, we discover what amounts to a state-dependent bias in the reporting of repeat sequences by Tandem Repeats Finder (or any similar software used to search whole genomes for microsatellite sequences).
Consequently, we construct a modified model such that all states fall into one of two categories: 'observable states', against which the reporting algorithm is unbiased, and 'unobservable states', which are never reported. We consider two approaches for treating the unobservable states: the first is to condition on the process being in the observable states, and the second is to treat the unobservable states as absorbing. Our initial analysis and underlying biological intuition suggest that transitions from the unobservable to the observable states are very rare, and thus we ultimately treat the unobservable states as absorbing.

Additionally, we extend the individual-level model to a population-level model by modelling the birth of microsatellites as a homogeneous Poisson process. We then derive the transient distribution of such a model in terms of the individual-level process. This distribution has an appropriate relative clock via the inclusion of point mutation. We fit this transient distribution to whole-genome-derived sequence data; however, we encounter some difficulties in the optimisation owing to the presence of many local optima.

The standard approach for microsatellite models is to assume that the empirical distribution is at equilibrium, and then to fit the stationary distribution to data. The key exception to this is the step-wise mutation model, which predicts infinite growth of the repeat number. Here we fit the above-mentioned transient distribution, and thus do not assume that the empirical distribution is at equilibrium. In contrast to the step-wise mutation model, our model does not predict infinite sequence lengths in the long run.
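
The population-level construction described in the abstract (births as a homogeneous Poisson process, with each individual then evolving independently) can be illustrated with a minimal Python sketch. This is not the thesis's code. The names `simulate_surviving_ages` and `expected_count` are hypothetical, and the exponential survival curve in the usage note is a placeholder assumption standing in for the survival probability of the individual-level Markov model. The sketch relies only on the standard thinning property of Poisson processes: if births occur at rate λ and an individual of age a survives with probability S(a), the ages of the survivors form an inhomogeneous Poisson process with rate λ·S(a).

```python
import math
import random

def simulate_surviving_ages(birth_rate, horizon, survival, rng):
    """Simulate births on [0, horizon] as a homogeneous Poisson process and
    return the ages (measured at time `horizon`) of the surviving individuals.

    `survival(age)` is an assumed placeholder for the probability that an
    individual is still preserved at the given age.
    """
    ages = []
    t = 0.0
    while True:
        t += rng.expovariate(birth_rate)  # exponential inter-birth times
        if t > horizon:
            break
        age = horizon - t
        if rng.random() < survival(age):  # independent thinning of each birth
            ages.append(age)
    return ages

def expected_count(birth_rate, survival, a, b, steps=10_000):
    """Expected number of survivors with age in [a, b]: the integral of
    birth_rate * survival(age), approximated with the midpoint rule."""
    h = (b - a) / steps
    return birth_rate * h * sum(survival(a + (i + 0.5) * h) for i in range(steps))
```

For example, with the assumed curve `survival = lambda a: math.exp(-0.5 * a)`, the average number of simulated survivors with age in a given interval, taken over many runs of `simulate_surviving_ages(50.0, 10.0, survival, random.Random(1))`, should be close to `expected_count(50.0, survival, a, b)` for that interval.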

History

Publication status: Unpublished

Repository status: Restricted
