Application of multivariate statistical analysis and machine learning on the signals recorded in analytical chemistry in the metabolomic studies of natural products
Authentication and quality control of complex samples, such as natural products, present significant challenges in analytical chemistry. Traditionally, authentication and quality control are performed through targeted approaches (profiling), in which several quality marker compounds are determined for each natural product. In contrast, natural product metabolomic studies can use the whole signal recorded by the analytical instrumentation. This signal represents the fingerprint of the analysed natural product and contains information on its whole metabolome. Fingerprints can be recorded either with stand-alone detection systems, such as spectroscopy, mass spectrometry (MS) or nuclear magnetic resonance (NMR), or with hyphenated analytical techniques, such as gas chromatography hyphenated with MS or a flame ionisation detector (FID), or liquid chromatography hyphenated with MS or UV-Vis detection. Due to the complexity and high dimensionality of the signals recorded in analytical chemistry, multivariate statistical techniques are used to reveal patterns in them. While multivariate approaches applied to hyphenated chromatographic systems allow data mining and identification of the metabolites contributing to sample discrimination and quality, multivariate approaches applied solely to detection systems offer other advantages, such as fast quality control and authentication of natural products with high sample throughput. In addition, the application of spectroscopic techniques in metabolomic studies allows the development of miniaturised and customised devices for in-field and on-site analysis. This thesis explores different analytical techniques, multivariate approaches and machine learning algorithms with the aim of simplifying natural product metabolomic studies and increasing their prediction accuracy in the characterisation, authentication and quality control of natural products.
Through four experimental chapters, the following areas were investigated:
1. Development and comparison of different data reduction procedures for gas chromatography hyphenated with electron impact mass spectrometry (GC-EI-MS) data, to increase the performance of multivariate statistical approaches in essential oil (EO) authentication, quality control and biological activity prediction.
2. Application of Random Forests machine learning algorithms to the detection of adulterated and lower-quality natural products, based on analysis performed with GC-EI-MS and handheld Raman spectroscopy.
3. Characterisation, quality control and authentication of extremely complex samples, such as natural product blends and high-value perfumes, by determining the quality of the samples used in their creation.
4. Development of a partial least squares (PLS) model based on spectra recorded using an LED-based spectrophotometer for monitoring primary and secondary products in the Furacell™ process.
GC-EI-MS is the most common technique for the analysis of volatile natural products. A single sample analysed by GC-EI-MS produces a three-way data array, as a function of time, m/z and intensity. Analysing multiple samples creates four-way arrays. Most multivariate statistical tools cannot handle four-way data arrays, so further data reduction is required. The first experimental chapter of this thesis examines and compares three different GC-EI-MS data reduction procedures applied for the purpose of natural product authentication, quality control and prediction of biological activity. The first strategy, which is also the most commonly applied, sums all of the m/z fragments in a single mass scan and plots the sums against the time of the mass scan, creating a total ion current chromatogram (TICC), from which a chemical composition profile is obtained.
The second approach averages the summed response for each m/z fragment over the total number of scans, that is, over the whole analysis time, creating a total chromatogram average mass spectrum (TCAMS). In the third approach, the GC-EI-MS three-way data array is divided into time-dependent sub-windows, and the average mass spectrum (AMS) is calculated for each sub-window. Finally, the AMS of all windows are combined into a single data set, creating the segmented average mass spectrum (SAMS). In the first experimental chapter, these three strategies for GC-EI-MS data reduction were evaluated for the discrimination of ylang-ylang essential oils based on their distillation time and geographical origin. SAMS showed superior performance compared to the other two data reduction procedures in principal component analysis (PCA), partial least squares regression (PLS) and discriminant analysis (PLS-DA) for the prediction and discrimination of ylang-ylang distillation grades and geographical origins, respectively. TCAMS and SAMS were also utilised for fast quantification of the main compounds in ylang-ylang EOs without using internal standards. This enabled evaluation of the quality of ylang-ylang EOs through comparison with the corresponding ISO standard. In addition, a high-performance thin-layer chromatography (HPTLC) approach was utilised for the determination of radical scavenging activity (RSA) and identification of the compounds contributing to the RSA. It was shown that an increase in distillation time results in ylang-ylang EOs with higher RSA due to a higher content of the sesquiterpenes α-(E,E)-farnesene and germacrene D, which, together with eugenol, are the main contributors to the RSA. It was also shown that geographical origin has a great influence on the RSA of ylang-ylang EOs. The recorded HPTLC profiles allowed discrimination of ylang-ylang EOs based on their geographical origin in PCA, and prediction of distillation grade utilising PLS.
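The three GC-EI-MS data reduction procedures described above (TICC, TCAMS and SAMS) can be illustrated with a minimal sketch. It assumes that the raw data of one sample are already available as a matrix of intensities with one row per mass scan and one column per m/z channel, and that the SAMS sub-windows are of roughly equal length in scans; the function names, the toy data and the window scheme are illustrative assumptions, not the procedure used in the thesis.

```python
import numpy as np

def ticc(data):
    """Total ion current chromatogram: sum all m/z intensities in each scan."""
    return data.sum(axis=1)

def tcams(data):
    """Total chromatogram average mass spectrum: average each m/z over all scans."""
    return data.mean(axis=0)

def sams(data, n_windows):
    """Segmented average mass spectrum.

    Split the scans into n_windows consecutive sub-windows (assumed here to be
    of roughly equal length), compute the average mass spectrum of each
    sub-window, and concatenate the results into one vector.
    """
    windows = np.array_split(data, n_windows, axis=0)
    return np.concatenate([w.mean(axis=0) for w in windows])

# Toy data (hypothetical): 12 mass scans x 5 m/z channels.
rng = np.random.default_rng(0)
data = rng.random((12, 5))

print(ticc(data).shape)     # (12,)  one value per scan
print(tcams(data).shape)    # (5,)   one value per m/z channel
print(sams(data, 3).shape)  # (15,)  3 windows x 5 m/z channels
```

With a single window, SAMS reduces to TCAMS; increasing the number of windows retains more of the retention time information at the cost of a longer feature vector per sample.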
In the prediction of RSA based on the three different datasets created from GC-EI-MS data, the PLS model created on SAMS showed the lowest relative error of prediction (REP) and mean error of prediction (MEP). Data mining on the three datasets created from GC-MS raw data allowed identification of the compounds contributing to the RSA. In comparison to the GC-based datasets, ATR-FTIR showed higher accuracy, with lower REP and MEP, as well as lower root mean square error of prediction (RMSEP). It was also demonstrated that a PLS model created on spectra recorded on a smartphone-based handheld Raman spectrometer can be used for the determination of RSA. This is of great importance, since analysis and evaluation of the biological potency of EOs can be performed directly in-field without any sample pre-treatment. The second experimental chapter of this thesis illustrates a procedure for the fast quality control of natural products based on Random Forests machine learning algorithms. In the first part, the application of two GC-EI-MS data reduction procedures, TCAMS and SAMS, followed by Random Forests was explored for the classification of twenty different classes of EOs and the simultaneous detection of samples with lower quality. In this work, SAMS showed better performance, highlighting all EOs with lower quality through a calculated proximity matrix. Random Forests, PLS-DA and PLS applied to spectra recorded on the smartphone-based portable Raman device allowed discrimination of pure EOs from adulterated ones, with the adulterant quantified using the created PLS model. Random Forests also enabled identification of the adulterants. The third experimental chapter of this thesis examined quality control of blends and mixtures created from several natural products.
Application of TCAMS, together with multivariate curve resolution alternating least squares (MCR-ALS), allowed quality control of "extremely complex" samples, such as essential oil blends and perfume mixtures. Singular value decomposition (SVD) applied to TCAMS enabled determination of the number of natural products used in creating the blends. Applying a Random Forest model to the resolved TCAMS allowed identification of the natural products used in creating the blends. PCA on the resolved TCAMS also allowed determination of the distillation grade and geographical origin of the ylang-ylang essential oils used in creating high-value perfumes. In the last chapter, the application of UV-Vis and TCAMS combined with PLS for the monitoring of primary and secondary products in the Furacell™ industrial process was explored. UV-Vis was used as a fast spectroscopic technique with the potential for on-line or in-line measurements, while TCAMS was created using fast GC-EI-MS analysis as the off-line reference method. PLS models based on spectra recorded on benchtop UV-Vis spectrophotometers, a portable UV-Vis setup and the TCAMS dataset were compared for the monitoring of the Furacell™ industrial process. Moving window PLS (mwPLS) allowed identification of the UV intervals of interest, enabling development of an LED-based UV spectrophotometer. The resulting LED-based UV spectrophotometer showed performance comparable to the portable setup, as well as good accuracy in monitoring the Furacell™ pilot plant.
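The use of SVD to determine how many natural products contribute to a set of blends can be sketched as follows. The example assumes that each blend TCAMS is a noise-free linear combination of the TCAMS of its pure components, so the number of significant singular values equals the number of components; the synthetic spectra, the function name and the relative threshold are hypothetical, and real data would require a noise-aware criterion for deciding which singular values are significant.

```python
import numpy as np

def estimate_n_components(X, rel_tol=1e-8):
    """Count singular values of X larger than rel_tol times the largest one.

    X holds one TCAMS per row (blends x m/z channels). rel_tol is an
    illustrative threshold suited only to noise-free synthetic data.
    """
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return int(np.sum(s > rel_tol * s[0]))

# Synthetic example (hypothetical data): 3 pure spectra, 10 blends.
rng = np.random.default_rng(1)
pure = rng.random((3, 40))        # 3 pure-component TCAMS, 40 m/z channels
fractions = rng.random((10, 3))   # mixing fractions of each component per blend
blends = fractions @ pure         # each blend is a linear mixture

n_components = estimate_n_components(blends)
print(n_components)
```

The estimated number of components would then serve as the input for a curve resolution step such as MCR-ALS, which is not shown here.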
Copyright 2021 the author
Section 1.2 appears to be the equivalent of a post-print version of a published article. Material from: Lebanov, L., Tedone, L., Kaykhaii, M., Linford, M. R., Paull, B., Multidimensional gas chromatography in essential oil analysis. Part 1: Technical developments, Chromatographia, 82(1), 377-398, published 2019, SpringerLink.
Section 1.3 appears to be the equivalent of a post-print version of a published article. Material from: Lebanov, L., Tedone, L., Kaykhaii, M., Linford, M. R., Paull, B., Multidimensional gas chromatography in essential oil analysis. Part 2: Application to characterisation and identification, Chromatographia, 82(1), 399-414, published 2019, SpringerLink.
Section 1.4 appears to be the equivalent of a post-print version of an article published as: Lebanov, L., Ghiasvand, A., Paull, B., 2021. Data handling and data analysis in metabolomic studies of essential oils using GC-MS, Journal of Chromatography A, 1640, 461896.
Section 2.1 appears to be the equivalent of a pre-print version of an article published as: Lebanov, L., Chatterjee, S., Tedone, L., Chapman, S. C., Linford, M. R., Paull, B., 2020. Comprehensive characterisation of ylang-ylang essential oils according to distillation time, origin, and chemical composition using a multivariate approach applied to average mass spectra and segmented average mass spectral data, Journal of Chromatography A, 1618, 460853.
Section 2.2 appears to be the equivalent of a post-print version of an article published as: Lebanov, L., Lam, S. C., Tedone, L., Sostaric, T., Smith, J. A., Ghiasvand, A., Paull, B., 2021. Radical scavenging activity and metabolomic profiling study of ylang-ylang essential oils based on high-performance thin-layer chromatography and multivariate statistical analysis, Journal of Chromatography B, 1179, 122861.
Section 3.1 appears to be the equivalent of a pre-print version of an article published as: Lebanov, L., Tedone, L., Ghiasvand, A., Paull, B., 2020. Random forests machine learning applied to gas chromatography-mass spectrometry derived average mass spectrum data sets for classification and characterisation of essential oils, Talanta, 208, 120471.
Section 3.2 appears to be the equivalent of a post-print version of an article published as: Lebanov, L., Paull, B., 2021. Smartphone-based handheld Raman spectrometer and machine learning for essential oil quality evaluation, Analytical Methods, 13(36), 4055-4062.
Section 4.1 appears to be the equivalent of a pre-print version of an article published as: Lebanov, L., Tedone, L., Ghiasvand, A., Paull, B., 2020. Characterisation of complex perfume and essential oil blends using multivariate curve resolution-alternating least squares algorithms on average mass spectrum from GC-MS, Talanta, 219, 121208.