University of Tasmania

File(s) not publicly available

In Silico Detection and Characterisation of Biological Regulatory Elements

posted on 2023-05-26, 14:25 authored by Uren, PJ
This thesis concentrates upon the detection of biological transcriptional regulatory elements through computational methods. Current approaches are focused upon a representation of DNA which is essentially an abstraction to a string over a four-letter alphabet. This fails to make explicit a large amount of relevant information describing how the molecule functions in its cellular environment. A major contribution of this work is the exploration of existing higher-order physical and chemical properties as a representational scheme for DNA. Classification mechanisms based upon such a representation are evaluated on several tasks associated with the recognition and localisation of transcriptional control elements. The computational approaches used come from a variety of backgrounds, but the focus is primarily on machine learning methods.

It is shown that promoters can be effectively predicted using a representation based on higher-order physical and chemical properties. This representation also allows more explicit insight into the biological functioning of the promoter by highlighting which regions are important for classification with respect to each model. The physico-chemical representation is also shown to be effective in clustering the binding sites of a single transcription factor into sub-groups. These groups are used to construct weight matrices which demonstrate improved binding site classification over their original counterparts. The newly constructed composite matrices are also shown to produce fewer positive predictions but equivalent classification performance when used within a promoter prediction scheme.

Motif-based representations for characterising promoters are also prevalent. These have traditionally focused on a relatively small, often fixed, number of core promoter elements. While this is easily mapped into a supervised learning scenario, the more challenging task of using a variable number of motifs is considered within this work. An approach is presented to handle the scenario in which both the number of elements and their frequency of occurrence are not known a priori. This representation, handled via the multiple instance learning paradigm, is shown to be effective when combined with physico-chemical property-based promoter prediction.

Finally, comparative approaches also exist for the identification of regulatory elements and are often heavily reliant on a multiple sequence alignment algorithm. Such an algorithm, using simulated annealing to search for an optimal alignment ordering and based on a recent solution to the aligning alignments problem, is introduced within this work. This thesis explores the application of the new algorithm to problems involving both protein and nucleotide data.
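The weight matrices mentioned in the abstract can be illustrated with a minimal sketch of a generic log-odds position weight matrix. This is the standard textbook construction, not the composite-matrix method developed in the thesis. The function names (`build_pwm`, `score`, `best_hit`), the pseudocount, the uniform background of 0.25, and the toy binding sites are all assumptions made for illustration; none of them come from the thesis.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        # Log2 ratio of the smoothed base frequency to the background.
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        }
        pwm.append(column)
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    if len(window) != len(pwm):
        raise ValueError("window length must match matrix width")
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (offset, score) of the highest-scoring window in sequence."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    offsets = range(len(sequence) - width + 1)
    best = max(offsets, key=lambda i: score(pwm, sequence[i:i + width]))
    return best, score(pwm, sequence[best:best + width])


if __name__ == "__main__":
    # Toy, made-up sites used only to exercise the functions.
    toy_sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
    pwm = build_pwm(toy_sites)
    print(best_hit(pwm, "GGCGTATAAAGGC"))
```

In a sub-grouping scheme of the kind the abstract describes, one such matrix would presumably be built per cluster of sites rather than one for all sites of a factor. How the thesis combines the resulting matrices is not specified here.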





School of Information and Communication Technology



Repository Status

  • Restricted
