# Rate matrices : a bridge between molecular biology and linear algebra : two way traffic between evolutionary sequence analysis and novel mathematical frameworks

Phylogenetics is the science of building phylogenetic trees, which are diagrams that depict evolutionary history. A rate matrix is a square matrix whose entries represent the rates at which items in one state change to another state. Rate matrices are used as part of the process of generating phylogenetic trees from biological sequence data. Both phylogenetic trees and rate matrices can be conceptualised from biological and mathematical standpoints. In this thesis, we explore ways in which mathematical ideas are inspired by biology and biological models are inspired by mathematical structures. This is achieved through two-way traffic between phylogenetic trees and rate matrices: not only do we explore generating phylogenetic trees from rate matrices, but we also derive rate matrices from phylogenetic trees.

In phylogenetics, it is sometimes assumed that a different evolutionary process has acted on each branch. When the same model acts on two branches with different parameter values, it is desirable for the “average” process over these two branches to belong to the same model. Previous research has shown that this property is achieved if the set of rate matrices for the model forms a matrix Lie algebra. That research also characterised a large range of DNA rate matrix models which form Lie algebras. Codon models are larger than DNA models, with 64 x 64 rate matrices where DNA rate matrices are 4 x 4, but unlike DNA models they can describe whether or not a point mutation will result in a change of amino acid in the corresponding protein. Typically codon models include the parameter ω, which controls the non-synonymous/synonymous mutation rate ratio, where a synonymous mutation does not result in an amino acid change whereas a non-synonymous mutation does. Chapter 3 of this thesis explores constructing codon models whose rate matrices form a Lie algebra and which include the ability to distinguish synonymous and non-synonymous rates. We found that any codon model which could distinguish synonymous from non-synonymous substitutions and whose rate matrices formed a Lie algebra was too large to be of practical use. Further investigations suggested that it is the breaking of symmetries forced by the structure of the genetic code that makes the problem difficult. To alleviate this unsatisfactory state of affairs, we instead propose relaxing the strict mathematical requirements so that a broader range of models can be explored. This is done by defining “linear” codon models, and different methods of generating these models are discussed.
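As a brief illustration of why the Lie algebra property matters (a sketch that is not taken from the thesis itself), the Baker-Campbell-Hausdorff formula expresses the product of two matrix exponentials through commutators. If the linear span of a model's rate matrices is closed under the commutator, every term in the exponent stays inside that span wherever the series converges. The two-state example below is an assumed toy case, using the convention that the rows of a rate matrix sum to zero; the symbols $L_1$, $L_2$, $t_1$, $t_2$ are introduced only for this illustration.

```latex
% Baker-Campbell-Hausdorff expansion for the process over two consecutive branches
e^{Q_1 t_1}\, e^{Q_2 t_2}
  = \exp\!\Big( Q_1 t_1 + Q_2 t_2 + \tfrac{1}{2}\, t_1 t_2\, [Q_1, Q_2] + \cdots \Big),
\qquad [Q_1, Q_2] = Q_1 Q_2 - Q_2 Q_1 ,
% where the omitted terms are nested commutators of Q_1 and Q_2.

% Toy case: the general two-state model is spanned by
L_1 = \begin{pmatrix} -1 & 1 \\ 0 & 0 \end{pmatrix},
\qquad
L_2 = \begin{pmatrix} 0 & 0 \\ 1 & -1 \end{pmatrix},

% and since L_1 L_2 = -L_1 and L_2 L_1 = -L_2, the commutator stays in the span:
[L_1, L_2] = L_1 L_2 - L_2 L_1 = L_2 - L_1 .
```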

In phylogenetics, a model is often assumed to be correct before its parameters are fitted. However, given that we are not aware of the true processes that influence genetic sequence evolution, it is likely that the assumed model was incorrect in the first place. If the incorrect model was fitted, can the fitted parameters still contain useful information about the evolutionary processes at play? In Chapter 4, using a toy model framework, we simulate evolutionary sequence data under one codon model and then fit another. We then determine through mathematical analysis whether the fitted parameters hold information about the original simulation parameter values. To follow on from Chapter 3, we conduct this analysis using both linear and non-linear codon models to test the performance differences.

Analyses that test origin-of-life hypotheses about how protein synthesis came to distinguish amino acids often suffer from tautologies: the same empirical amino acid substitution rate matrices are used both for sequence alignment and for calculating the likelihood of one sequence having evolved into another. In Chapter 5, phylogenetic tree structures are used in a novel way to generate sets of mechanistic amino acid rate matrix models. A phylogenetic tree structure is established in which the number of leaves is equal to the number of states, rates are assigned to the internal tree nodes and states are assigned to the leaves. The rate of change between two states is defined as the rate assigned to the internal node which is the most recent common ancestor of those two states. The structure of the trees is representative of properties of amino acids, where a split in the tree represents an increase in the complexity of the mechanics which select an amino acid to append to the polypeptide chain in protein synthesis. This novel approach allowed us to test hypotheses on amino acid distinction mechanics in cells without suffering from the same tautologies as previous research in this field. This research had a positive result: it was found that including both aminoacyl-tRNA synthetase (aaRS) class and polarity (as binary characteristics of amino acids) in the trees resulted in substitution matrices which were closer to empirical ones.

Chapter 6 builds on Chapter 5, where a method of creating rate matrices from phylogenetic trees was introduced. In Chapter 6, we investigate the mathematical properties of rate matrix sets which are generated by phylogenetic trees. This method of generating rate matrix sets has the potential to create mechanistic amino acid substitution models whose parameters could convey information about binary properties of amino acids such as polarity and aaRS class. It is established that any rate matrix set generated this way results in an abelian matrix algebra (a matrix vector space closed under matrix multiplication, in which all matrices commute). Further investigations explore cases where constraints are imposed on the rates: if two rates assigned to the internal nodes of a tree are set to be equal, does this constraint remain after two matrices generated from this tree are multiplied together? The answer was found to often be no. The conditions under which such a constraint is preserved after matrix multiplication within the set are defined.
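The commutativity and closure claims can be checked numerically on a small case. The sketch below is an assumed illustration only: it hard-codes the four-leaf tree ((0,1),(2,3)), with hypothetical rates `a` and `b` on the two cherries and `c` at the root, assumes rows sum to zero, and uses integer rates so that the comparisons are exact. It does not address the equal-rate constraint question, which depends on the tree.

```python
def q(a, b, c):
    """Rate matrix for the tree ((0,1),(2,3)): a and b on the cherries, c at the root."""
    return [
        [-(a + 2 * c), a, c, c],
        [a, -(a + 2 * c), c, c],
        [c, c, -(b + 2 * c), b],
        [c, c, b, -(b + 2 * c)],
    ]


def matmul(x, y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


Q1 = q(1, 2, 3)
Q2 = q(5, 7, 11)
P = matmul(Q1, Q2)

# Abelian: the two matrices commute.
commutes = P == matmul(Q2, Q1)

# Closure: off-diagonal entries of the product depend only on the MRCA
# of the two states, and each row still sums to zero.
cross = {P[i][j] for i in (0, 1) for j in (2, 3)} | \
        {P[j][i] for i in (0, 1) for j in (2, 3)}
same_pattern = P[0][1] == P[1][0] and P[2][3] == P[3][2] and len(cross) == 1
rows_zero = all(sum(row) == 0 for row in P)
```

Note that the product lies in the linear span of the tree's matrices but is not itself a rate matrix: some of its off-diagonal entries are negative here. This is consistent with the set being described as an algebra, a vector space, and not as a set of valid rate matrices.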

As previously discussed, there are advantages to a rate matrix set forming a Lie algebra, and methods have been established for defining rate matrix sets that are matrix Lie algebras. It is then advantageous to characterise the set of substitution matrices drawn from the corresponding Lie group. Typically in Lie theory, however, the starting point is a Lie group and the Lie algebra is derived from it afterwards; relatively little prior research has been conducted on the reverse procedure. Chapter 7 interrogates the problem of generating a Lie group from a Lie algebra. Complexities arise because, while each Lie group gives rise to only one Lie algebra, the converse is not true: many Lie groups can lead to the same Lie algebra. We investigate special cases of finding the smallest Lie group which generates a particular Lie algebra, discuss methods of deriving this group, and explore illuminating examples.

Concepts in linear algebra have been shown to have unexpected applications in phylogenetic modelling. In this thesis, mathematical structures have inspired ideas in how to generate phylogenetic rate matrices. We also see in this thesis that concepts in phylogenetics can inform linear algebra. This thesis explores the mutually beneficial relationship between rate matrices and phylogenetic trees as well as the one between biology and mathematics.

## Sub-type

- PhD Thesis